Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Compute Express Link

Episode Date: June 12, 2024

...

Transcript
Starting point is 00:00:00 Okay, then let's get started. Good morning everyone. Today's lecture is about Compute Express Link. First, let's have a look at where we are right now. In yesterday's lecture, we left the CPU and memory domain and talked about storage. We talked about PCIe and peripheral devices. And CXL is an interconnect technology that fits very well to PCIe because it is actually based on PCIe. When we communicate over PCIe, we are usually dealing with peripheral devices such as GPUs, network interface cards, disks, or FPGAs - so with PCIe we are not dealing much with DRAM,
Starting point is 00:00:46 because that is usually directly attached to the CPU. But with CXL, we are also talking about DRAM again, because this is a technology that allows moving the memory DIMMs away from the mainboard and attaching memory to peripheral devices. Regarding our timeline: after today's CXL lecture, in next week's first session on Tuesday, we will discuss the task two solution and then also introduce you to the third task,
Starting point is 00:01:19 which is about buffer management. Then we will have a networking session, and after that we will continue with GPU sessions, RDMA, and two FPGA sessions. In today's lecture, I first want to briefly discuss the current limitations of interconnects - specifically, the limitations of PCIe and DDR - because this is the motivation for CXL and what CXL tries to address. Then I will give you a brief overview of CXL: what is CXL in a nutshell, and what are the most important things that you should remember. I will go into detail about managing cache coherence, and I will explain some topologies that are possible with CXL-connected CPUs and devices.
Starting point is 00:02:11 I will very briefly talk about how you can actually program with CXL if you have the chance to work with some CXL memory devices. I will give you a brief insight into performance numbers that you can expect from such devices and also into available hardware - so what is already either announced by hardware vendors or even commercially available. So with that: what is Compute Express Link? It is a standardized interconnect technology between CPUs and devices. Such devices can be FPGAs, GPUs, but also storage devices such as SSDs, or ASICs. CXL allows CPUs and devices to access and cache data stored in each other's memory: a CPU can access memory that is potentially located on
Starting point is 00:03:05 the device cache-coherently, and also the other way around. CXL is responsible for maintaining cache coherence here, and it is based on the PCIe physical layer. Meaning, if you already have a system that supports the physical layer of PCIe 5, then CXL basically uses it and specifies alternative protocols over which the CPU and the devices can communicate on top of the PCIe physical layer. With the PCIe physical layer as the basis, it is also the limiting factor in terms of throughput: with PCIe 5, for example, if you have a x16 connection, meaning 16 lanes, you can achieve about 63 gigabytes per second per direction, and with PCIe 6 you could achieve about 121 gigabytes per second.
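As a back-of-the-envelope check of the ~63 GB/s figure (a small sketch, not part of the lecture; the 128b/130b factor is the standard PCIe 5.0 line encoding, and the PCIe 6.0 flit overhead is only hinted at in a comment):

```c
#include <stdio.h>

int main(void) {
    /* PCIe 5.0: 32 GT/s per lane, 128b/130b line encoding, 16 lanes */
    double raw_bits_per_lane = 32e9;
    double encoding          = 128.0 / 130.0;
    int    lanes             = 16;

    double gbytes = raw_bits_per_lane * encoding * lanes / 8.0 / 1e9;
    printf("PCIe 5.0 x16: ~%.0f GB/s per direction\n", gbytes);  /* ~63 GB/s */

    /* PCIe 6.0 doubles the per-lane rate to 64 GT/s (PAM4) and moves to
       flit-based encoding; after flit overhead this lands near the quoted
       ~121 GB/s for a x16 link. */
    return 0;
}
```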
Starting point is 00:03:54 So I mentioned memory and I mentioned peripheral devices - but we already have interconnects for those. So why do we need CXL? Peripheral devices are connected via PCIe, and if you want to access your memory from a CPU core,
Starting point is 00:04:19 then you usually communicate over the DDR interface on your mainboard. There are certain limitations. Let's have a look at the PCIe device side first. If you want to access the system's memory from a PCIe device, let's say from a GPU, there is no cache-coherent read or write supported, and PCIe does not allow caching system memory to exploit temporal or spatial locality. And on the other side, if we look at the CPU, the host cannot access a PCIe device's memory in a
Starting point is 00:04:56 cache-coherent way, so the device memory cannot be mapped into the cacheable system address space. Also, when you want to utilize accelerators, data structures usually need to move from the host memory to the accelerator for data processing, before the results are then moved back to main memory. And multiple devices cannot access parts of the same data structure simultaneously - one of the limitations. The second limitation is memory scalability.
Starting point is 00:05:30 We can observe that there is a high demand for memory capacity and memory bandwidth, and the demand is increasing with more data-intensive applications. DDR memory struggles to match this demand. To see this limitation, I invite you to have a look at the pins of DDR memory DIMMs and PCIe connectors. On the left side here, you can see a DDR5 DIMM - it has 288 pins - and on the right side you can see a PCIe 5 x16
Starting point is 00:06:13 connector, which has 82 pins. If we compare the throughput: the PCIe connector, again with 16 lanes and PCIe 5, supports about 63 GB/s per direction with only 82 pins, while this memory DIMM supports about 50 GB/s but has 288 pins, so the bandwidth per pin is significantly lower. Also note that this is a quite beefy memory DIMM with a transfer rate of 6.5 gigatransfers per second, while current CPUs - take, for example, AMD Genoa or Intel Sapphire Rapids - support 4,800 megatransfers per second. So the DIMM shown here is even faster than what those CPUs support, and still the efficiency per pin is quite low compared to PCIe pins.
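Putting the quoted numbers side by side (an illustrative calculation only; the 50 GB/s and the pin counts are simply the values from the slide):

```c
#include <stdio.h>

int main(void) {
    /* numbers quoted on the slide */
    double pcie_gbps = 63.0;   /* PCIe 5.0 x16, per direction */
    int    pcie_pins = 82;
    double ddr5_gbps = 50.0;   /* the DDR5 DIMM shown          */
    int    ddr5_pins = 288;

    printf("PCIe 5.0 x16: %.2f GB/s per pin\n", pcie_gbps / pcie_pins);  /* ~0.77 */
    printf("DDR5 DIMM   : %.2f GB/s per pin\n", ddr5_gbps / ddr5_pins);  /* ~0.17 */
    /* roughly a factor of four in bandwidth per pin in favor of PCIe */
    return 0;
}
```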
Starting point is 00:07:07 Also, if you want to have more memory on your mainboard, you need more memory slots. Over CPU generations, and mainboard generations accordingly, the number of slots has increased. But also temperature-wise, from a technical or electrical
Starting point is 00:07:38 engineering point of view, it is quite challenging to add more DRAM memory DIMMs. Another limitation of memory located on the mainboard, or directly attached via DDR, is the distance. I already mentioned the electrical engineering challenges: everything is packed tightly together, which also makes it hard to design. With PCIe, for example, you can have much larger distances between your CPU and the actual device. You can, for example, also add riser cards, as you can see in this visualization. So, for example, let's say this is your motherboard
Starting point is 00:08:22 and you want to attach a PCIe device here, but for some reason you need a little bit more space - your PCIe device is too big, so it does not fit in there. You can add riser cards so that you could attach it there. But let's say for some reason that is still not enough: then you can also add retimers, which are components that retransmit a fresh copy of the signal, and with that you can extend the distance between your motherboard and your PCIe devices - and then, for example, install an SSD array there. So basically, PCIe
Starting point is 00:09:00 allows increasing the distance, which would also help when it comes to provisioning volatile memory. Another limitation, or rather a factor that we see in today's data centers, is stranded resources. A stranded resource is when idle capacity remains while another resource is fully used. A popular example is CPU capacity versus memory capacity. Let's say you have a very memory-intensive application, for example an in-memory database system, and you utilize all the memory of your server, but not all the CPU cores might be fully utilized;
Starting point is 00:09:49 then, to provide more memory, you might scale out, use an additional system, and use the memory of the second system. But on both servers, you now do not fully utilize the CPU resources. And also the other way around: you might have a very compute-intensive application where you utilize all the CPU cores but only use a small fraction of the memory. To deal with this issue, servers are usually over-provisioned, so they are configured in a way that ideally the peak demand
Starting point is 00:10:30 of resources is provided - but this peak demand of an application is not the average demand. If the server is configured that way, it will tolerate the peak workloads and will not fail in that situation, or will not run out of memory if we focus on memory consumption here. But on average, there is now a lot of memory provisioned that is not used. One way of dealing with this problem is that we can separate individual hardware resources.
Starting point is 00:11:10 Let's say you have a server with CPU resources but just a small fraction of memory, and then you have some kind of disaggregated memory: memory that is located in a different server rack, or in a different chassis in your rack, and your server can now access that memory. This is a use case where CXL is getting more interesting. Another limitation or issue that we can see is data sharing. In a distributed system with different nodes, small fractions of data are often shared and communicated over the network. Think, for example, of a distributed key-value store: the nodes want to perform certain transactions, but they also have to decide in which order transactions should be executed, and for that they need to communicate with each other - this is also known as distributed consensus,
Starting point is 00:12:25 so they need to follow consensus protocols. There are often small fractions of data sent around over the network, and with such small data chunks, this communication is a typical delay in data centers and often dominates the waiting time for certain workloads. Just to summarize here, and also to point out one or two
Starting point is 00:12:55 additional limitations: PCIe does not allow coherent memory access, so we cannot exploit temporal and spatial locality of data accesses. With DDR, we have a memory bandwidth limitation, as mentioned before. One aspect that I have not mentioned yet is that we are very limited in the type of memory we can use if we only use DDR-attached memory. A single server usually supports only one type of memory, for example DDR4 or DDR5 - we cannot plug, let's say, DDR3 DIMMs into our mainboard. So we are very restricted here, and there are also alternative memory media types that are not common on mainboards.
Starting point is 00:13:51 We are definitely limited if we only use DDR-attached memory. Also, as mentioned with the stranded resources, we have a tight coupling of CPU and memory resources. And when it comes to memory sharing, PCIe does not support sharing memory across multiple systems while maintaining cache-coherent access. Compute Express Link tries to solve this, or at least to a certain degree. I will talk about the details in a few moments, but it basically is a specification that comes with three sub-protocols, which are communicated and negotiated over the PCIe physical layer, as mentioned. The first one is CXL.io. This is, I would say, the base protocol. It is used, for example, for device enumeration, initialization, and device registration. So
Starting point is 00:14:53 you need it for every device and for certain setup steps so that you are able to communicate via CXL. It is based on PCIe, and it also supports the non-coherent load and store semantics of PCIe. The more interesting ones are CXL.cache and CXL.mem. CXL.cache, as I tried to illustrate in the figure on the right - this is the brown part here, the brown arrow - allows devices to cache data stored in system memory. And CXL.mem is used for the other direction, so CPUs can access and cache data stored in CXL device memory. Then, based on the different protocols that a device supports, there are three different
Starting point is 00:15:45 device types. Note that the CXL.io protocol is mandatory, so every device needs to support it. A device could then, for example, only support the cache protocol in addition. An example use case here is a smart NIC with coherent access to system memory. In this case, the smart NIC can benefit from having a coherent cache to perform, for example, complex atomic operations, which are not trivial to do with PCIe. This kind of device is called, according to the specification, a type 1 device. So the names are quite easy here: type 1, type 2,
Starting point is 00:16:29 and type 3 devices. The type 2 device supports all three protocols. A use case here, or an example, is an accelerator with memory - this can be a GPU, this can be an FPGA. A CPU can write data into the device memory and also read results generated by the device from the device's attached memory.
Starting point is 00:16:56 So it has memory attached on the device, and it is also able to cache data that is stored in system memory. The third type only supports CXL.io and CXL.mem, and this type of device is a memory extension device. It also has memory attached, and the CPU can access that memory in a cache-coherent way; the use case here is mainly memory bandwidth and memory capacity extension.
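To keep the three device types apart, here is a small informal summary in code form (the type names and protocol split follow the specification as described above; the struct itself is just an illustration, not an API):

```c
#include <stdbool.h>
#include <stdio.h>

/* Which CXL sub-protocols each device type implements (CXL.io is always mandatory). */
typedef struct {
    const char *type;
    bool        cxl_io;     /* enumeration, initialization, non-coherent I/O */
    bool        cxl_cache;  /* device may cache host/system memory           */
    bool        cxl_mem;    /* host may access device-attached memory        */
    const char *example;
} cxl_device_type;

static const cxl_device_type types[] = {
    { "Type 1", true, true,  false, "smart NIC with coherent access to system memory" },
    { "Type 2", true, true,  true,  "accelerator (GPU/FPGA) with its own memory"      },
    { "Type 3", true, false, true,  "memory expansion device"                         },
};

int main(void) {
    for (size_t i = 0; i < sizeof types / sizeof types[0]; i++)
        printf("%s  io=%d cache=%d mem=%d  (%s)\n", types[i].type, types[i].cxl_io,
               types[i].cxl_cache, types[i].cxl_mem, types[i].example);
    return 0;
}
```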
Starting point is 00:17:46 As I mentioned before with the limitations of DDR, you can also use different memory media types here: you can use DDR memory on your PCIe device, but you do not have to - alternative memory media types are possible as well. There are also hardware proposals in which SSDs are used that support the memory protocol, meaning you could actually access the SSD via the CXL.mem protocol from your CPU and perform persistent writes to the SSD, even though from an application point of view you are writing to the virtual memory address space. Some more details about the type 3 devices, the memory extension devices: they can also be seen as a cost-, power-, and pin-efficient alternative to DDR, as mentioned before in the limitations part. They also offer flexibility in system topologies
Starting point is 00:18:50 due to the longer trace lengths, as shown in the PCIe extension card example. A special case is global fabric-attached memory. This is basically a type 3 device, but the scope is much larger: it can be connected to up to 4,096 endpoints. There are different CXL generations, which I will talk about in a minute.
Starting point is 00:19:21 This is rather a feature of a later revision - it is only available with CXL 3.0. With that, I briefly want to introduce you to the timeline: when did it start, where are we right now, and what revisions exist?
Starting point is 00:19:42 It is very important to know which revision we are talking about, because the feature sets are quite different. CXL is an open interconnect specification, and the first generation, 1.0, was published in 2019. It basically specifies the protocols that I showed you and the three main device types I introduced you to, and it only allows devices to be directly attached to the host system.
Starting point is 00:20:19 With 1.1, a few months later in 2019, some compliance-testing mechanisms were added, so no further features. CXL 2.0 then allows larger topologies: remember, with CXL 1.x you can only directly attach devices to your CPU, to your mainboard, but CXL 2.0 allows bigger topologies with one level of switching, and it also allows partitioning inside devices. So you can have one physical device with multiple logical memory partitions inside, which then, in a topology with a switch, can be allocated to different servers. CXL 3.0 was published in August 2022. Note here that this is the first
Starting point is 00:21:18 version that requires PCIe 6; the previous versions require PCIe 5. This revision allows even bigger topologies with multiple levels of switching. It adds an advanced routing approach that allows connecting up to 4,096 endpoints. And, quite importantly, it allows memory sharing across multiple hosts and devices.
Starting point is 00:21:54 So you can imagine one device with memory attached that is connected to multiple servers and each server can access that memory in a cache coherent way. And it also supports direct communication between devices which the previous generations or revisions do not support. And then in November last year there was another version published, CXL 3.1, which has some other features that are not listed here, but it also supports non-coherent memory access across different CXL domains.
Starting point is 00:22:30 Before, access was rather limited to a single CXL domain, and now communication across such domains is possible in a non-cache-coherent way. With that, I want to dive a little bit deeper into how cache coherence is managed with CXL. For this, I briefly want to introduce some more details about the CXL.cache protocol. As mentioned, this protocol allows devices to cache data stored in host memory, and it uses the MESI protocol with a 64-byte cache line size. CXL is an asymmetric coherence protocol. You might remember that a few lectures ago, when we talked about locking, we also introduced you to cache coherence, to snooping protocols and directory-based protocols,
Starting point is 00:23:27 and we also briefly mentioned the MESI protocol. CXL is using that one, on 64-byte cache line granularity, and it is an asymmetric coherence protocol, meaning that a CXL device which wants to cache system memory is not itself responsible for managing the coherence; the host is responsible for the management and also tracks the coherence for peer caches. Peer caches here can be caches of other devices: assume a topology where you have one server and, for example, two CXL devices with caches, and both devices can cache system memory - then the host tracks which memory each device is caching. This keeps the control logic simple at the device, since the coherence management logic is located on the CPU
Starting point is 00:24:28 side. The device never directly interacts with other caches - again, this is what the host is responsible for. The device only manages its own cache and sends requests to the host if it needs certain data.
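To recap what the host-side coherence logic does with the MESI states mentioned above, here is a minimal, illustrative state machine for a single 64-byte line; this is a teaching sketch of generic MESI snooping, not the actual CXL.cache message set:

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Snoop a peer cache because another agent wants to READ the line.
   Returns the peer's new state; *writeback is set if dirty data must be flushed. */
static mesi_state snoop_for_read(mesi_state peer, int *writeback) {
    *writeback = (peer == MODIFIED);              /* dirty copy must reach memory */
    return (peer == INVALID) ? INVALID : SHARED;  /* E/M/S all degrade to Shared  */
}

/* Snoop a peer cache because another agent wants OWNERSHIP of the line (to write). */
static mesi_state snoop_for_ownership(mesi_state peer, int *writeback) {
    *writeback = (peer == MODIFIED);
    return INVALID;                               /* all other copies get invalidated */
}

int main(void) {
    int wb;
    mesi_state s = snoop_for_read(EXCLUSIVE, &wb);
    printf("read snoop: EXCLUSIVE -> %d, writeback=%d\n", s, wb);  /* SHARED, 0  */
    s = snoop_for_ownership(MODIFIED, &wb);
    printf("own  snoop: MODIFIED  -> %d, writeback=%d\n", s, wb);  /* INVALID, 1 */
    return 0;
}
```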
Starting point is 00:25:10 The CXL.mem protocol, in turn, enables the device to expose device memory so that the CPU can access it. It basically supports simple reads and writes from the host to the device, and only the protocol is specified, not the memory media type - so DRAM, PMEM, high-bandwidth memory, or flash memory could be supported. It also allows a device to back-invalidate cache line copies. This is the part that is relevant for the scenario in which multiple servers share the same memory located on one device: from the servers' point of view, they are not in the same coherence domain, so we need some measure to invalidate cache lines. Let's say one server caches a certain cache line of our shared memory device
Starting point is 00:25:55 and then another server also accesses this memory and now writes to the data. We somehow need to communicate to the first server that this cache line should now be invalidated, and for this reason CXL 3.0 introduced back invalidation, meaning that the device itself can send back-invalidation requests to the connected servers. Also, when you have memory attached on the CXL device,
Starting point is 00:26:25 this is visible for the operating system. This will be exposed in the unified virtual memory address space. So from an application point of view, you're writing to virtual memory, which makes the interaction or integration of CXL memory quite easy from a programming point of view. Okay, how is CXL integrated into the CPU cache hierarchy? So if we look at the modern CPU
Starting point is 00:26:57 hierarchy, we have multiple levels of coherent caches. We have our level 1 caches, which are small, have low latencies, and have the highest bandwidth compared to the other levels. We have our level 2 caches with larger capacities, which might be shared between multiple cores. And we have our level 3 or last-level cache, which is shared across many cores. And now CXL: there are two sides here. At the bottom, we leverage the CXL.mem protocol to access memory, but we can also have the case that we use the CXL.cache protocol, which allows devices - indicated here
Starting point is 00:27:50 at the bottom in blue and greenish - to cache memory of the system. Such a device cache can have a size of, for example, one megabyte, and via the CXL.cache protocol - so there needs to be a controller implemented in the CPU - the device can now send requests that a certain cache line should be cached. Above our last-level cache, if we have a multi-socket server, we have our coherent CPU-to-CPU inter-socket interconnect. For Intel this would be the Ultra Path Interconnect, and on the AMD side this would be Infinity Fabric, also known as xGMI.
Starting point is 00:28:52 On our CPU, we also have our home agent, which on the one hand resolves conflicts of last-level caches trying to cache the same cache line, the same address, and which also fetches data from memory via the memory controllers. This memory controller can - if we do not use CXL - just be an integrated memory controller where a memory DIMM is connected to the DDR interface, or it can be a memory controller located on a peripheral device, on a CXL device, and in this case the home agent needs to communicate with this memory controller via the CXL.mem protocol. I want to show you some example protocol flows of how the CXL.cache and CXL.mem protocols work. And for this, we need to know some terminology.
Starting point is 00:29:47 So we saw the home agent before, but there's not only one home agent. There can also be multiple home agents depending on the address that's being addressed. So if I, for example, want to access memory that is connected or that one of these memory controllers is responsible for, then I have my home agent on that socket. If I want to access memory that is directly attached to the neighbor socket, then the
Starting point is 00:30:16 neighbor socket's home agent is responsible. Then we have peer caches. We look at peer caches from the perspective of a CXL.cache device: the caches of a different CXL device - the one marked here - and also the caches on my CPU are considered peer caches. Then I have my memory controller. As briefly mentioned before, this can be an integrated one, but it can also be a memory controller located on my CXL device. And also note that this memory controller could be located on a remote socket. Okay, if we want to write data, for example -
Starting point is 00:31:14 this is an example protocol flow here - note that we will use a meta value in this example flow. This is only mandatory for type 2 devices. Remember, type 2 devices support CXL.cache and CXL.mem, so they can cache system memory but also allow the host to access their own memory, and these metadata bits are necessary for this type of device because, for each memory access from the host side, the host encodes the required cache line state and with that exposes the state to the device. So the device now has further knowledge and can store the information about which cache lines are cached in one of the
Starting point is 00:32:11 host caches, or in the peer caches. This is relevant when it comes to the protocol or communication paths, because there is a differentiation between the case where the host has copies of the memory and the case where the host does not have any copies - but I will talk about that in a moment. Right, back to the write protocol flow: let's assume the host CPU wants to write data, so it sends a request to the memory controller. We assume a CXL type 3 device here, so it does not support the cache protocol. We send a memory write request with the metadata value. For type 3 devices this metadata value
Starting point is 00:33:02 is optional, so it is up to the host how to use it - it can be used, for example, for security attributes or compression attributes. The memory controller then sends the write command to the actual memory media - this can be a DDR DIMM in this case - again with the metadata. But the memory controller does not need to wait for the completion from the memory media.
Starting point is 00:33:39 It can already send a completion response back to the host once it can ensure that the written data will be visible when the next read request arrives. So this can be earlier than the completion from the memory media. This is a quite simple flow of how a CPU can use the memory protocol for a write. Let's also have a look at a read. In this case, we assume that the host wants to read a certain address and also updates the meta value that is stored for that address. Again, it first sends a request to the memory controller, which then sends the read request to the memory media.
Starting point is 00:34:35 The media sends the data back, and it also sends back the meta value that was stored before. Since we need to update the meta value, the memory controller needs to send another write with the updated meta value, which is also completed. But the memory controller can already send the data back to the host once the data came back as a response from the memory media; in this response to the host, the memory controller also includes the meta value that was stored before. In the second read case, we do not have any meta value update, so this
Starting point is 00:35:33 protocol flow is even easier: we just send the read to the memory controller and then to the memory media, which responds with the data and the meta value, and both are returned to the host. Okay, this was the CXL.mem side.
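As a very rough mental model of these two flows, here is a toy simulation in C. The function and field names are made up for illustration; the real request and response types are defined in the CXL specification, and the early-completion rule is only mimicked in a comment:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t data; uint8_t meta; } mem_line;

static mem_line media[16];   /* stand-in for the DRAM behind the device's controller */

/* Write flow: the controller may report completion to the host as soon as it can
   guarantee that a later read will observe the new data - it does not have to
   wait for the media's own completion. */
static void mem_write(uint64_t addr, uint64_t data, uint8_t meta) {
    media[addr % 16] = (mem_line){ data, meta };
    printf("completion -> host (write to %#llx ordered)\n", (unsigned long long)addr);
}

/* Read flow with metadata update: return old data and old meta to the host and
   write the new meta value back to the media (the second media access). */
static uint64_t mem_read_update_meta(uint64_t addr, uint8_t new_meta, uint8_t *old_meta) {
    mem_line *l = &media[addr % 16];
    *old_meta = l->meta;
    l->meta   = new_meta;
    return l->data;
}

int main(void) {
    uint8_t old_meta;
    mem_write(0x40, 42, 1);
    uint64_t d = mem_read_update_meta(0x40, 2, &old_meta);
    printf("host received data=%llu, old meta=%u\n", (unsigned long long)d, old_meta);
    return 0;
}
```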
Starting point is 00:36:47 Now let's have a look at the CXL.cache protocol. Remember that we have the home agent; which home agent is used depends on the memory address and on which socket the target memory is located. We have our different peer caches, meaning the caches of other CXL devices or of the CPU. In this case, we have a CXL device that wants to cache a certain cache line, which is first in state invalid, so it sends a read request to my home agent. The home agent is now responsible for managing the coherence, so it also needs to send snoop messages to the peer caches. There could be a peer cache that has the requested cache line in exclusive state, and since our CXL device will receive this cache line, the cache controller of that peer cache needs to change the state. Remember that the caches and cache controllers usually maintain a finite state machine. The peer cache will then send a response that the cache line state was changed accordingly. In parallel, already after the home agent received the request from the CXL device, it can contact the corresponding memory controller, sending a memory read request, which will then respond with the data. The home agent then sends two messages back to the CXL device that requested the cache line.
Starting point is 00:38:00 First, there is a global observed event, meaning that globally - and this can also be multiple peer caches, it does not need to be only one - the cache states for all peer caches have been managed; the home agent sends this event once the cache lines are managed coherently, and the data will also be sent to the CXL device. When this coherence approval, the global observation, arrives, the CXL device
Starting point is 00:38:38 can already update its own cache line state, and it will also receive the data. The write flow looks a little different - it is a little more complicated - but if we compare it with the read flow, the first half is very similar. First we have our request, the home agent manages the coherence of the other peer caches, the peer caches respond accordingly, the home agent fetches the data from the memory controller, and it will send a global observed event. In this case, we request the cache line for ownership. Before, we only wanted to read the cache line, and we had one peer cache that held the cache line in exclusive state; since we only wanted to read it, a copy of this cache line may afterwards exist in multiple
Starting point is 00:39:40 caches. So we need to change that exclusive state to shared, and when the CXL device has the data, it also holds this cache line in shared state, since multiple caches have a valid copy of it. Now, in this case, we want to write on the CXL device side, so we need the cache line in exclusive state. Therefore, the home agent needs to invalidate the other copies of the cache line, and then, similar to before, it sends a global observed event back to the CXL device together with the data. At some point there will be a silent write. Silent write here means that there is no immediate write-through.
Starting point is 00:40:26 So the device can write the data without any communication to the home agent or to other caches. At some point, the cache line will be evicted - for example, because some other data needs to be fetched into the cache, or, depending on how the cache and memory hierarchy is implemented for the specific CPU, from a lower-level cache into the last-level cache - but at some point an eviction of this cache line can be triggered, and at this point we need to make sure that the data is written back. Therefore, the device sends a dirty-evict event to the home agent. The home agent sends a write-pull message to the CXL device, and with that the data is sent to the home agent. The cache line state is updated on the device side,
Starting point is 00:41:42 and the memory write happens, triggered by the home agent: the home agent communicates with the correct memory controller, again depending on the address, and the memory controller will at some point send a completion event. Okay, these were the easier examples. There are different ways of how memory that is located on a device is handled. Remember, as shown here on the right side, we have type 2 and type 3 devices, and both have memory attached to the device. The memory that is exposed to the CPU, or the parts of it that are, is called host-managed device memory, or HDM for short. The CPU's caching agent interacts with this HDM via the CXL.mem protocol, as also visualized in the CPU architecture before.
Starting point is 00:42:51 With that, it is integrated into the host's coherence domain. There are different management options for this type of exposed memory. The first one is the easiest: host memory expansion with host-only coherent memory. This is valid for a type 3 device; the memory only needs to be coherent for the host, and there is no cache on the device that needs to be maintained, so this is the easier case. Then there are two more interesting ones: accelerator memory exposed to the host as device-coherent memory, which is applicable for type 2 devices, and device-coherent memory using back invalidation. I briefly touched on back invalidation before, which allows memory sharing across multiple servers, and with that the
Starting point is 00:43:47 way memory accesses and cache coherence are managed differs from the other approaches. I briefly mentioned that the first one is the easiest; I will not go into more detail about it. Let's have a four-minute break and then look into the more interesting ways of how to manage cache coherence. I will briefly go back to this slide. Note that we saw that when the CXL device requests a certain cache line, it needs to communicate with the home agent,
Starting point is 00:44:31 which then resolves coherence issues, manages the coherence, and communicates with the peer caches accordingly. We definitely see this flow from the device to the home agent. But it could also be the case that the CXL device wants to access its own memory, since it has memory attached in the case of a type 2 device.
Starting point is 00:45:01 And this is something that we want to look into right now. Right, so this is the device-coherent memory case: a type 2 device can access its own memory without violating coherence and also without communicating with the host. I mentioned before that the host is responsible for tracking the cache states of the peer caches, but there can also be some tracking logic implemented in type 2 devices. In this case, the device contains a device-internal caching agent, which in turn is responsible for tracking which cache lines are cached on the host
Starting point is 00:45:55 side - that is, for each cache line of the device's memory: is that line cached in one of the host's caches? For this, the device can implement a bias table, and this bias table stores one bit per page. This bit indicates whether the host may have a copy of the corresponding cache line. There are two states: a page can be in device bias - in this case the host does not have a copy of the cache line - or it can be in host bias, in which case the host may have a copy of the related cache line, and a read goes through the host, as we saw in the example before. But if the cache line is in device bias state, then the device can access its memory directly, without any communication to the host.
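A minimal sketch of such a per-page bias table, assuming 4 KiB pages and a simple bitmap (the names, sizes, and decision strings are illustrative, not taken from the CXL specification):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* assumed 4 KiB pages            */
#define NUM_PAGES  (1u << 20)               /* covers 4 GiB of device memory  */

static uint64_t bias_bits[NUM_PAGES / 64];  /* 0 = device bias, 1 = host bias */

static bool in_host_bias(uint64_t dev_addr) {
    uint64_t page = dev_addr >> PAGE_SHIFT;
    return (bias_bits[page / 64] >> (page % 64)) & 1;
}

static void set_host_bias(uint64_t dev_addr, bool host) {
    uint64_t page = dev_addr >> PAGE_SHIFT;
    if (host) bias_bits[page / 64] |=  (1ull << (page % 64));
    else      bias_bits[page / 64] &= ~(1ull << (page % 64));
}

/* The decision the device-internal caching agent makes on an access to its own memory. */
static const char *access_path(uint64_t dev_addr) {
    return in_host_bias(dev_addr)
        ? "host bias: resolve coherence via the home agent first"
        : "device bias: read local device memory directly";
}

int main(void) {
    set_host_bias(0x200000, true);                      /* host touched this page */
    printf("0x100000 -> %s\n", access_path(0x100000));  /* device bias            */
    printf("0x200000 -> %s\n", access_path(0x200000));  /* host bias              */
    return 0;
}
```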
Starting point is 00:47:10 Again, the host tracks which peer caches have copies - the type 2 device does not hold further cache state information about other peer caches. And this protocol flow diagram shows how such a bias flow happens. Before, we looked at type 3 devices and how memory access happens there; now we focus on type 2 devices. This device has memory, so it also needs to have an internal memory controller, and the CXL device in the diagram basically represents
Starting point is 00:47:50 the cache of the device. Let's assume that the device wants to cache its own memory, but based on its bias table it knows that the state for this cache line is host bias. Therefore, it cannot simply read from its own memory with its memory controller; it needs to communicate with the home agent, because there could be some peer caches that hold this cache line in valid states - shared, exclusive, modified. So it needs to send a read-own request to the home agent. Again, the home agent resolves coherence, and this can already happen while the CXL device is reading its own memory. This only works because there is
Starting point is 00:48:49 something called a memory read forward message - I will come to this in a second. But keep in mind: first, the CXL device needs to communicate with the home agent; then, as shown before, the home agent needs to resolve the cache coherence states by communicating with the other peer caches, and it can then send a memory read forward message, which allows this memory read at the upper part here. What would be the alternative? This read forward basically allows the device to read its own memory instead of requesting the data from the host. If we looked at the cache protocol and the memory
Starting point is 00:49:39 protocol separately: in terms of the CXL.cache protocol, we request certain memory from the host with the cache protocol, while the memory protocol is there to get certain data from memory. If we handled them separately, with no bias flow optimization, the home agent would now need to communicate with the CXL device via the memory protocol, get the data, and only then could the CXL device retrieve it. But this would be communication back and forth - some kind of communication ping-pong - and this can be avoided by the home agent also supporting this read forward response, which allows the type 2 device, instead of asking the home agent again to
Starting point is 00:51:07 fetch the data, to fetch the data by itself. The internal memory controller then returns the data to the cache of the CXL device and also updates the cache line state from invalid to exclusive. At this moment, the device also has enough information to know that all other copies of the same cache line have now been invalidated in the peer caches, so we now have device bias state for that specific cache line. Are there any questions about that flow? Because I think this was, towards the end, a little bit more complicated with the data ping-pong. Yes?
Starting point is 00:52:01 What is the purpose of the first memory read of the device to its own memory? - We somehow need to read it. The idea of this bias flow is that we do not need to fetch the data from the home agent; instead, the caching agent of the CXL device can get the data directly from its integrated memory controller, and for that we need to request it somehow - otherwise we will not get the data response. And if you look at this arrow here, maybe this was not completely clear: the message from the home agent at that point, after managing the cache coherence in this part,
Starting point is 00:52:49 goes directly to the memory controller - basically to the CXL device - instead of the data being sent back to the home agent. So here is the follow-up question: could we actually avoid this and have sufficient information in the memory read forward so that it then goes back directly? Yeah, good point. I can only tell you
Starting point is 00:53:40 what the protocol specifies; I don't know if there is a certain reason why. I assume there is a reason, but I don't know the details right now why this is required here. But good catch. Okay, now let's have a look at the other scenario, in which the cache line that the CXL device wants to access is in device bias state. In this case, no other peer cache has a valid copy of that cache line, and the device can simply send a read request to the memory controller, and the memory
Starting point is 00:54:22 controller responds with the data - very straightforward. And now the last way of managing cache coherence is with back invalidation. Remember, this is a feature of CXL 3.0, and in this case it is valid for type 2 and type 3 devices: those devices can send back-invalidate snoop messages via the CXL.mem protocol to the host. When a device sends this back-invalidation message, the host then manages the coherence by invalidating the copies of the device-memory cache lines, and this also allows the device to implement an inclusive
Starting point is 00:55:15 snoop filter - the slide doesn't show it... and now it does. An inclusive snoop filter is basically logic on the device that tracks the host's caching at cache line granularity, so the device always knows which cache lines of its own memory are cached on the host side. To enable this communication and the back invalidation, each device that supports this kind of back invalidation needs to have a caching agent. Note that type 3 devices do not support the CXL.cache protocol; but this back invalidation, which is a way of
Starting point is 00:56:08 managing cache coherence, is implemented in the memory protocol. So even though memory devices do not implement CXL.cache, this feature sits on the memory protocol side, and therefore memory extension devices - type 3 devices - need to support it as well if they want to back-invalidate cache lines. If we go back to the example where we have a device with memory attached and multiple servers can access it in a cache-coherent way, it definitely makes sense, I think, that even though the device is not
Starting point is 00:56:42 able to cache any of the servers' memory, we need this feature of back-invalidating the caches so that no inconsistent memory access or write happens to a fraction of shared memory. Okay, with that I want to talk a little bit more about topologies and about the topology scope. I introduced you to the different revisions 1.0, 1.1, 2.0, 3.0, and 3.1. The first major versions are in the single-server scope: we do not have bigger topologies with multiple levels of switches, not even a single level of switching - devices are really directly connected to a server. With CXL 2.0, we are expanding the scope to a rack level
Starting point is 00:57:40 with a single level of switching. With CXL 3.0 and 3.1, we are still at the rack level, or one might also consider a multi-rack level in which we have multiple switches - for example, servers in one rack, and at the neighboring racks there might be only memory devices and switching infrastructure, so that the different servers are connected to these memory-only devices. Architecturally, it is theoretically possible to connect 4,096 endpoints. There is a quite comprehensive introduction to CXL by the CXL Consortium, which is driving this technology, and they also mention that even though it is theoretically possible, they would rather expect a scale of hundreds of endpoints than 4,000 endpoints. Just to let you know. Okay.
Starting point is 00:58:49 When we look at a single device, there are also differences between the generations. I want to talk about multiple logical devices, which were introduced with CXL 2.0. With CXL 1.1, we can only have one logical device per physical device - each physical device represents one logical device. But with CXL 2.0, memory devices, so type 3 devices, can be partitioned: one physical device can be partitioned into multiple logical devices. You can think about it in the way that our memory is partitioned, in this case into four partitions,
Starting point is 00:59:37 and each individual partition can then be assigned to a different server. With CXL 2.0, such a device supports up to 16 domains, so you can assume 16 partitions that can theoretically be connected to 16 different servers. Then I want to talk about resource pooling, also a feature that CXL 2.0 introduces. You can assume a topology in which you have switch nodes; with CXL 2.0,
Starting point is 01:00:09 it is only a single level of switching, and there is a fabric manager that manages the assignment - which device is assigned to which host, and when it is released and maybe reassigned to another host. These topologies are possible, on the left side only with single logical devices, but you can also combine it with multiple logical devices so that you have both in your topology. But also note that switches add latency. And if access latency is a high priority for your application and you definitely, for some reason, want to avoid the additional latency of switched
Starting point is 01:01:01 topologies, you could also use direct connections, in which you connect the servers to different devices. But note that one device then has multiple connections, and this is of course only possible when the physical device also supports multiple connectors, multiple slots, into which you can plug your PCIe/CXL cables. With CXL 3.0, the scope is even larger with something called CXL Fabric. CXL Fabric extends the topology scale from a node to a rack or multi-rack level,
Starting point is 01:01:49 as shown before in the visualization with multiple levels of switches. Before 3.0, CXL only supported tree topologies; now this is not a restriction anymore, so non-tree topologies are supported. And if you have wondered where the limit of 4,096 supported endpoints comes from: CXL 3.0 implemented something called port-based routing, and it uses 12 bits as an identifier for each individual port. That gives a maximum of
Starting point is 01:02:32 2^12 = 4,096 ports. Right, okay. CXL Fabric also specifies the domain in which we can access different devices: a CXL domain scopes multiple hosts, ports, and devices within a physical address space. So you can have a large topology, but you can have different CXL domains within this large topology. Let's say, as in the picture before, that depending on which devices
Starting point is 01:03:28 and hosts are connected together, they form a CXL domain. The fabric basically connects one or more host ports to devices within each domain. Therefore, in a topology with multiple switches, multiple servers, and multiple devices, we can have multiple CXL domains. In terms of CXL Fabric, there are some additional terms that you might find if you want to read more about CXL.
Starting point is 01:04:11 One is fabric-attached memory. This is basically a memory resource connected to multiple hosts. There are different ways in which this memory can be connected: it can be pooled or shared. In the pooled scenario, a memory region is only assigned to a single host, and, as you can imagine, in the shared scenario a memory region is assigned to multiple hosts. You can also see this in this visualization - again, we are talking about CXL 3 here. Pooled fabric-attached memory is, for example, the dark one - okay, we only have one such case here.
Starting point is 01:05:05 The orange one is only attached to host 3 in that case, but we can also see the bright green one: this is shared fabric-attached memory, which is assigned to hosts 3, 4, and 5. All of them can coherently access it, as you have learned before with the back-invalidation technique of cache coherence management. There is another term called logical device fabric-attached memory (LD-FAM),
Starting point is 01:05:37 and such memory is basically a type 3 device with up to 16 logical devices. Here, we do not use CXL Fabric, so we do not use port-based routing, which is essential for CXL Fabric. Then we have global fabric-attached memory (GFAM), which is the opposite of LD-FAM: fabric-attached memory exposed using port-based routing links. As shown before, this global fabric-attached memory can be shared across multiple hosts or peer devices. If we now have this kind of memory exposed to multiple servers and potentially devices, the memory resources are located somewhere on a certain device - this is then a global fabric-attached memory device, a GFAM device.
Starting point is 01:06:49 This device has its own physical address space, and if we now want to access the memory located on that device from our server, the device is responsible for translating the host physical address to the device physical address. And if we have multiple servers accessing this global fabric-attached memory, then the device needs to maintain multiple translation tables for the different physical address spaces of the servers and the device's address space.
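To make that translation step concrete, here is an illustrative sketch that assumes one contiguous address window per connected host; the structure and names are invented for this example and greatly simplify what a real GFAM device would implement:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t hpa_base;   /* where this host maps the shared window (host physical address)    */
    uint64_t dpa_base;   /* where the window lives in device memory (device physical address) */
    uint64_t size;
} gfam_window;

static gfam_window windows[16];  /* one entry per connected host (assumption) */

/* Translate a host physical address of a given host into a device physical address. */
static int64_t hpa_to_dpa(int host_id, uint64_t hpa) {
    const gfam_window *w = &windows[host_id];
    if (hpa < w->hpa_base || hpa >= w->hpa_base + w->size)
        return -1;   /* address not backed by this device */
    return (int64_t)(w->dpa_base + (hpa - w->hpa_base));
}

int main(void) {
    /* hosts 3 and 4 reach the same device memory through different HPA windows */
    windows[3] = (gfam_window){ 0x100000000ull, 0x0, 1ull << 30 };
    windows[4] = (gfam_window){ 0x200000000ull, 0x0, 1ull << 30 };
    printf("host 3: hpa 0x100000040 -> dpa %#llx\n", (unsigned long long)hpa_to_dpa(3, 0x100000040ull));
    printf("host 4: hpa 0x200000040 -> dpa %#llx\n", (unsigned long long)hpa_to_dpa(4, 0x200000040ull));
    return 0;
}
```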
Starting point is 01:07:48 Also shown in several of the topology pictures before, we have this fabric manager. This is logic, separate from the switch, that performs tasks related to resource binding: it allows allocating resources - say, memory resources - to a server, releasing them, and reassigning them. In all the pictures before, and also visualized in this one, it is located in the switch, but this is only one option. It does not necessarily need to be implemented physically in the switch; it is basically a program, or some logic, that can be located in the switch, but you could also imagine a separate device, or it could even be implemented on the server side, the host side. However it is implemented, the fabric manager needs to have some connection to the switch for
Starting point is 01:08:39 binding the resources and managing the ports and devices. As I briefly mentioned, it allows dynamic resource assignment and reassignment, and there are different ways to communicate with it - the API is media-agnostic; this can be, for example, Ethernet, BMC, or USB. Right, so with that I wanted to show you the most, in my opinion, interesting features of CXL 2 and 3 in the last slides. Do you have any questions - for example about the switches: are they something completely new, or are there similarities to existing switches? It is very similar: there are PCIe switches
Starting point is 01:09:52 existing without CXL support, and CXL switches are similar to those PCIe switches, with additional support for the CXL protocols. And even though I mentioned that the fabric manager does not necessarily need to be implemented in the switch, we can assume that it will be - we also talked to some hardware vendors who mentioned that this is what is expected. Latency-wise, maybe I can briefly jump to this right now: there is existing work from last year, a paper called Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, and one of the interesting parts is its latency estimations. They compare local DRAM - with the latency that happens on the CPU and fabric side and at the memory controller and DRAM side - against CXL configurations, for example with a switch in the
Starting point is 01:11:05 very bottom case here. Just to give you some numbers: based on this estimation, a memory access to a directly attached device would take about 85 nanoseconds, and a level of switching would add an additional 70 nanoseconds in this case. You can also see that there is additional latency elsewhere - I will talk about it again shortly: every CXL port adds latency, and we might have retimers; retimers I showed you in the beginning on the PCIe slide, where we can extend the distance between a core and, in this case, memory. So, just to give you a number that you can compare to local memory access
Starting point is 01:11:56 latencies. Okay, I very briefly want to talk about programming with CXL device memory. I mentioned before that device memory is exposed in the unified virtual address space, and this allows us to use libraries that already exist and have been used for virtual memory, for local memory, and also for remote-socket memory. So we can basically apply what we used in NUMA-aware data management, let's call it that, to CXL device memory. What we see when we have a type 3 device attached to our server is an additional NUMA node. Let's say we have a single-socket server without any CXL devices:
Starting point is 01:12:51 with modern CPUs, we might have one NUMA node - even these memory regions could be subdivided, so that on a single CPU you still have multiple NUMA regions, in which case the memory controllers and the cores are grouped into NUMA regions - but you can think about it as one NUMA node. A type 3 device can be exposed to the operating system as a NUMA node without any CPU resources, so a memory-only NUMA node. Then, assuming a Linux operating system, you can use a system call such as mbind to set a memory
Starting point is 01:13:34 allocation policy for a specific memory region that you have allocated, or set_mempolicy to change the memory allocation policy for your thread without limiting it to a given memory region, or move_pages to move memory pages from one NUMA memory region to another - meaning also from local memory to CXL memory or in the other direction.
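As a minimal sketch of what that looks like in practice (assuming Linux with the libnuma headers installed, and assuming the CXL type 3 device shows up as the CPU-less NUMA node 2 - the node id depends on the actual system):

```c
#define _GNU_SOURCE
#include <numaif.h>      /* mbind, MPOL_BIND; link with -lnuma */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    const int    cxl_node = 2;          /* assumed memory-only NUMA node of the CXL device */
    const size_t size     = 1ull << 30; /* 1 GiB */

    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bind the region to the CXL node; pages are placed there on first touch. */
    unsigned long nodemask = 1ul << cxl_node;
    if (mbind(buf, size, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) != 0) {
        perror("mbind");
        return 1;
    }

    memset(buf, 0, size);               /* first touch: pages now live in CXL memory */
    printf("1 GiB bound to NUMA node %d\n", cxl_node);
    munmap(buf, size);
    return 0;
}
```

Alternatively, a tool like numactl (for example, `numactl --membind=2 ./app`) can achieve the same placement without code changes.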
Starting point is 01:14:22 I briefly want to talk about some performance numbers that we can actually find in CXL work in the field of data management. There was one paper published last year at MICRO called Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices - it has a long list of authors - and they provide a CXL device memory benchmark and analysis. In their work, they have three CXL memory devices: two ASIC-based ones and one FPGA-based one. They basically perform a set of microbenchmarks and end-to-end benchmarks, and in the top-right figure here they also show random memory access latencies for the different devices. As benchmark tools they used, first,
Starting point is 01:15:02 the Intel Memory Latency Checker - if you ever want to do some memory access latency measurements, this is a widely used tool provided by Intel - but they also use Memo, which is the benchmark tool they developed and used for their work and also provide as an open-source contribution. In this figure, you can see on the x-axis the different workloads.
Starting point is 01:15:29 Let's focus on the Memo workloads first - on the very left is the Memory Latency Checker - and with their Memo tool they have a load, a non-temporal load, a store, and a non-temporal store instruction. Just to get you on the same page: non-temporal basically means that you bypass the cache, so you can read data from memory without propagating it into your cache hierarchy, and also the other way around for stores.
Starting point is 01:16:22 On the y-axis, they have the access latency normalized to local memory; the other options are, in blue, the different CXL devices, and in yellow a bar that represents the access latency for memory located on a remote socket. We do not have absolute numbers here, but let's assume that the local case is about 100 nanoseconds - it can be a little lower or higher depending on the system, but as a rule of thumb we can calculate with 100 nanoseconds. Then we can see that all of these accesses are within the range of 100 to 500 nanoseconds, and, interestingly, the non-temporal stores perform quite well compared to the other access operations here.
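For reference, this is roughly what a non-temporal (streaming) store looks like in code compared to a regular store - an x86-only sketch using SSE2 intrinsics, not the benchmark code from the paper:

```c
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence (SSE2, x86-64) */
#include <stdint.h>
#include <stdlib.h>

void fill_regular(int64_t *dst, size_t n, int64_t v) {
    for (size_t i = 0; i < n; i++)
        dst[i] = v;                                   /* goes through the cache hierarchy */
}

void fill_nontemporal(int64_t *dst, size_t n, int64_t v) {
    for (size_t i = 0; i < n; i++)
        _mm_stream_si64((long long *)&dst[i], v);     /* bypasses the caches */
    _mm_sfence();                                     /* make streaming stores globally visible */
}

int main(void) {
    size_t   n   = 1u << 20;
    int64_t *buf = aligned_alloc(64, n * sizeof(int64_t));
    if (!buf) return 1;
    fill_regular(buf, n, 1);
    fill_nontemporal(buf, n, 2);
    free(buf);
    return 0;
}
```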
Starting point is 01:17:17 call them CXLA, B and C here there's an additional table in the paper that you can see in the right bottom here A A and B are the hard IP, meaning this is ASICs-based, while the last one is soft IP, meaning FPGA-based. And I mentioned before that we do not have memory media type restrictions. But in this case, the authors used devices with DDR memory attached. In two cases, it was DDR4. In one case, DDR5 with a different speed here that you can see at the end of this description. And also the resulting memory bandwidth shown here.
Starting point is 01:18:00 So with these devices, we are in a range between 16 and 38 gigabytes per second. And as I already briefly showed, there is another work called Pond, CXL-Based Memory Pooling Systems for Cloud Platforms, which was also published last year. They also have different analyses in their paper, for example workload sensitivity to memory latency, or the effectiveness and latency of different CXL memory pool sizes.
Starting point is 01:18:37 But what I find quite interesting is this figure with a prediction. One of their contributions is a prediction model for latency and resource management at data center scale. They evaluated it with simulated memory, so they did not have real CXL hardware, but I think this figure that they provide is quite insightful, showing that your memory access latency really depends on your topology. You cannot definitively say that memory access has a latency of 500 nanoseconds, because,
Starting point is 01:19:16 even if you have the same device, it depends on the topology, on how the device is attached to your CPU, how long it takes to actually access it. As you can see here, it ranges from 85 nanoseconds in the local case to more than 270 nanoseconds in the case where we have a large memory pool. In that case, we first have the CXL port latency, which adds 25 nanoseconds, then a retimer, which adds 30 nanoseconds, then a switch, again with two ports, each adding 25 nanoseconds. We might also have some network-on-chip latency or some switch arbitration latency, which is basically the process of managing and coordinating access to a shared communication medium inside the switch, and then again retimer latency. One possible way these numbers could add up is sketched below.
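Just as a back-of-envelope illustration (the split into two ports and two retimers per path and the roughly 30 nanoseconds for arbitration are my assumptions, not values read off the paper's figure), the hop latencies mentioned above could accumulate roughly like this:

\[
\underbrace{85}_{\text{local DRAM}}
+ \underbrace{2 \times 25}_{\text{CXL ports}}
+ \underbrace{2 \times 30}_{\text{retimers}}
+ \underbrace{2 \times 25}_{\text{switch ports}}
+ \underbrace{\approx 30}_{\text{arbitration/NoC}}
\approx 275\ \text{ns},
\]

which is consistent with the "more than 270 nanoseconds" shown for the large-pool topology.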
Starting point is 01:20:24 So it really depends on what the topology looks like and what the quality and characteristics of the device are. And even if we assume we have a CXL device directly attached to our CPU, with a given PCIe throughput or PCIe bandwidth, then theoretically there could also be another bottleneck on the device. In commercially available products, I would assume, maybe I'm not saying too much here, but I would assume that we do not have a limitation on the device, but that rather the PCIe connection
Starting point is 01:21:05 is the limiting factor. But theoretically, as also shown here in this figure, in these prototypes that the authors used, they have DDR memory DIMMs attached on the CXL device, and the limitations that you can see here are rather based on the maximum bandwidth that you can achieve with these attached DDR DIMMs. A rough comparison of these two limits is sketched below.
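As a rough, illustrative comparison (the exact DIMM speeds of the three devices are not repeated here, so treat these as typical values rather than the paper's exact configuration), a single DDR channel's peak bandwidth is its transfer rate times the 8-byte bus width:

\[
\text{DDR4-2400: } 2400\,\text{MT/s} \times 8\,\text{B} \approx 19.2\ \text{GB/s}, \quad
\text{DDR4-3200: } 3200 \times 8 \approx 25.6\ \text{GB/s}, \quad
\text{DDR5-4800: } 4800 \times 8 \approx 38.4\ \text{GB/s}
\]

These values sit in the same 16 to 38 gigabytes per second range reported for the devices and below what a PCIe 5.0 x16 link can deliver, so the attached DIMMs, not the CXL link, plausibly set the ceiling here.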
Starting point is 01:21:34 Alright, I want to finish up with some available hardware. I will be quick since we are running out of time. There are already CPUs available that support CXL, for example the server CPUs from Intel: Sapphire Rapids, and also the later generation, Emerald Rapids, both support CXL 1.1. And there are also Intel FPGAs available, the Agilex 7 series, which also support CXL 1.1.
Starting point is 01:22:03 Regarding Sapphire Rapids, you can see a visualization here. I talked about NUMA nodes and also NUMA sub-regions; this is an example of a CPU that can be exposed as multiple NUMA nodes. This CPU physically consists of four dies, or four CPU chips, in light blue, which are then connected, this is the purple part here, with an inter-die interconnect, and basically each of these quarters can be exposed as one NUMA node with its
Starting point is 01:22:42 integrated memory controllers. The red parts here are the DDR memory interfaces, and if you want to use CXL, you basically have to use the orange parts: the PCIe/CXL ports. Each of these ports supports 16 lanes, so per die that would be 32, and in total you have 128 PCIe/CXL lanes that you can use for peripherals, either plain PCIe devices or CXL 1.1 devices. AMD also has CXL support, for example the AMD Genoa CPUs also support CXL 1.1, and there is the AMD 400G Adaptive SmartNIC, not a CPU but a smart network card, which supports CXL 2.0. And also recently ARM announced that with Neoverse V3,
Starting point is 01:23:53 there will be up to 128 cores, and they support CXL 3.0 and High Bandwidth Memory 3 (HBM3). When we want to look at devices, this is just an example of Samsung devices. They recently announced memory modules that are connected via PCIe 5 with 8-lane connections. They have 256 gigabytes of capacity, a bandwidth of up to 28 gigabytes per second, and 520 nanoseconds of average access latency. They call it the Samsung CMM-D, short for CXL Memory Module D. And if you have multiple of them, then you have a CMM-B, which is basically a module that can host up to eight of the above ones.
Starting point is 01:24:51 I was briefly confused by the numbers here, but that then adds up to a maximum of two terabytes. I haven't looked into the details about the bandwidth here; Samsung claims 60 gigabytes per second, and I definitely want to look into that, because each individual module supports 28 gigabytes per second, so I wonder what the limitation is here. But that is at least what they claim for the second device, together with access latencies lower than 600 nanoseconds, and it should be CXL 1.1 and 2.0 compliant. A short sanity check of these numbers follows below.
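A quick, hedged sanity check of these vendor numbers (interpreting the 60 gigabytes per second as a limit of the box's host interface is my assumption, not something stated in the material shown here):

\[
8 \times 256\ \text{GB} = 2\ \text{TB}, \qquad 8 \times 28\ \text{GB/s} = 224\ \text{GB/s} \gg 60\ \text{GB/s}
\]

So the two-terabyte capacity follows directly from eight 256-gigabyte modules, while the claimed 60 gigabytes per second is far below the aggregate module bandwidth, which suggests the bottleneck is the box's own host connection rather than the individual modules.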
Starting point is 01:25:47 One feature of the CXL 2.0 specification is the support for pooling of single logical devices. In summary, if your friends and family ask you tonight what CXL is, I hope you can tell them that CXL tries to address different limitations of the PCIe and DDR interconnects. These interconnects do not provide coherent memory accesses, they have memory bandwidth limitations, they only support homogeneous memory media types, they tightly couple CPUs and memory together, and they do not allow memory sharing, all of which CXL is trying to address. We talked about the different protocols and device types that the CXL specification defines.
Starting point is 01:26:31 We looked into managing cache coherence, how CXL is integrated into a modern CPU, and how the CPU and the device communicate via CXL.cache and CXL.mem. We briefly talked about the different revisions, about the feature sets of these revisions, and also how the revisions determine the hardware scope. And I also showed you some example topologies, how topologies can be built, and what the consequences of different topologies are, for example in terms of memory access latencies.
Starting point is 01:27:11 With that, we are at the end of this session. Thanks for your attention. Are there any questions? All right, everyone wants to go for lunch. The next topic will be networking; there will be a session about it next Wednesday. But the next session, next Tuesday, will be another task session in which, as I mentioned in the beginning, we present the solution of task 2 and also introduce you to task 3, which is about buffer management. Thank you.
