Storage Developer Conference - #146: Understanding Compute Express Link

Episode Date: May 25, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 146. My name is Debendra Das Sharma. I lead the IO Technologies and Standards Group at Intel. I'm a co-chair of the CXL Technical Task Force and have been leading CXL since its inception. I will be delving into Compute Express Link, which is also abbreviated as CXL.
Starting point is 00:01:01 So when we look at the industry landscape today, we see some very clear mega trends emerge. Cloud computing has become ubiquitous. Networking and edge computing are using the cloud infrastructure and have also become ubiquitous. AI and analytics to process the data are driving a lot of innovations across the board. All of these are driving the demand for faster data processing. We see increasing demand for heterogeneous processing, as people want to deploy different types of compute for different applications, whether it is general purpose CPUs, GPGPUs, custom ASICs, or FPGAs. Each is important and best suited to solve some class of problems. There is enough volume to drive customized solutions in each of these segments.
Starting point is 00:01:56 And increasingly, we see people deploying a combination of these different types of compute in their platform depending on their needs. In addition to the demand for heterogeneous computing, we also see the need for increased memory capacity and memory bandwidth in our platforms. Significant technical innovations in storage class memories have also resulted in those memories approaching DRAM-like latency and bandwidth characteristics while maintaining non-volatility, and they also have larger capacity. So we have a class of memory between DRAM and SSD that needs to be thought of as a separate memory tier. This tier offers a compelling value proposition due to its performance, capacity, and persistence. So, for example, now we
Starting point is 00:02:46 could store an entire database in this new memory, and that makes search faster, and we could do a lot more AI and analytics types of applications. So these mega trends that we are discussing here take advantage of these types of memory, in addition to the evolution that we are going to see in the traditional DRAM memory and storage, plus the heterogeneous computing. Compute Express Link is defined ground up to address these challenges in this evolving landscape by making the heterogeneous computing as well as these different types of memory efficient, and it has been designed to sustain the needs of the different compute platforms for many years to come. So the question is: why do we need a new class of interconnect?
Starting point is 00:03:45 If you look at the picture on the top, this represents a typical system today. You have got a CPU, and you have got memory that is attached to the CPU, which is typically DRAM, and those get mapped as coherent memory. Now, for coherent memory, data consistency is guaranteed by hardware. The memory that is attached to a PCI Express device, on the other hand, is also mapped into the system memory, but it is mapped as uncached memory or memory-mapped IO space. So the memory attached to the CPU is different in its semantics than the memory that is attached to an IO device. When the CPU wants to access its DRAM, it simply caches the data,
Starting point is 00:04:24 accesses the data from its local cache, and later on does a write back if it has updated the data. On the other hand, the memory that is attached to the IO device cannot be cached. It's uncached memory. So when a CPU has to access it, it does so using load-store semantics, but those load-store semantics always have to traverse the entire hierarchy and access the device for every access.
Starting point is 00:04:53 Similarly, when an IO device wants to do a read or a write from the system memory, it does that through what is known as the PCI Express DMA mechanism. So if you are an IO device, typically you are connected into the CPU. The CPU would have a write cache. And whenever it gets a DMA write from the IO device, it basically goes ahead and does the fetching of that cache line using the caching semantics. And it does the protocol translation between the producer-consumer ordering model and the ordering model that exists in the system memory. So it does the merge of the data, and then it can do the writeback.
Starting point is 00:05:33 The IO device is not allowed to cache any of the system memory. It always has to issue explicit reads and writes to the host in order to get access to that memory. On a read, the root port simply goes ahead and asks for a coherent snapshot of the data and provides it to the device. Now, this is the producer-consumer ordering model. It really works well for a wide range of IO devices,
Starting point is 00:06:00 and has worked really well over the last two-plus decades. And it's really good when performing bulk transfers, such as those involved with traditional storage or moving data in and out of a networking interface. It's very efficient, works really well. And we definitely want to preserve that kind of DMA model going forward for those usages.
Starting point is 00:06:28 However, things like accelerators, for example, want to do fine-grained sharing of data with the processor. In that case, the PCI Express mechanism needs to be augmented with some extra semantics that will allow these devices to be able to cache the data from the CPU. And also, by the same token, if they have memory attached to them, to be able to map all or some of that memory into the system's coherency space. And that's basically what CXL allows. So if we fix those, effectively all of these things that we had in red here will, of course, still be prevalent. They will continue to exist.
Starting point is 00:07:10 But in addition to that, we enable write-back memory on the device side. We enable direct memory load-store on behalf of the device, where the device can cache the data. And similarly, your PCIe DMA, if you're a pure PCI device doing PCI DMA, whatever that looks like, is also going to look very similar if you are doing DMA directly into the memory that is attached to this new CXL-enabled environment. So in summary, this allows us to do efficient resource sharing. We can share memory pools across different devices across the system, and different types of compute elements can work on problems collaboratively
Starting point is 00:07:52 without having to effectively transfer the entire data and then a flag to tell whoever they're partnering with that they're done with the processing. They can just work through all of these accesses in a very seamless manner. But for this to happen, what we really need is an open standard, so that the industry can innovate on this common set of standards.
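As a concrete point of reference for the handoff pattern just described, here is a minimal C sketch of the traditional data-plus-flag exchange: a producer fills a buffer and then sets a completion flag, and the consumer polls the flag before touching the data. This is the pattern that bulk DMA with producer-consumer ordering serves well, and that CXL's fine-grained coherent sharing lets collaborating compute elements avoid. The buffer, flag, and thread structure are illustrative assumptions for this sketch, not anything defined by CXL.

```c
/* Minimal sketch (illustrative only): the classic data-plus-flag handoff.
 * Build with: cc -std=c11 -pthread handoff.c -o handoff */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { N = 1024 };
static int buffer[N];            /* the bulk data being handed off      */
static atomic_int ready;         /* the "I'm done processing" flag      */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        buffer[i] = i * 2;       /* produce the entire payload first    */
    /* Release store: all buffer writes become visible before the flag. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* Poll the flag; only then is it safe to read the payload.         */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += buffer[i];
    printf("consumer saw payload, sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```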
Starting point is 00:08:17 And together we should be able to get the best performance, power-efficient performance, in the compute systems that we have. So that's basically the proposition of Compute Express Link. Now, this is an overview of the CXL Consortium. If we look back to March of 2019, when we went public with the CXL Consortium, to now, we have more than 100 members; I believe it is more than 123 members now, and it's growing rapidly. This entire membership reflects the breadth of the industry and the depth of the industry, and it's very essential for us to create this vibrant ecosystem. And Compute Express Link is not a one-time thing. It is going to evolve. And there are work groups,
Starting point is 00:09:12 five technical work groups, that are developing the next generation CXL 2.0 specification in a backwards-compatible manner. This journey continues: as more and more problems come through, we are going to go ahead and address them. And of course, we are going to go through the speeds-and-feeds adjustments in order to meet the demand. So that's, in a nutshell, what CXL is. These are the board of directors, and you can go to the Compute Express Link website to get more details. So now let's get into an overview of what Compute Express Link is all about. This is looking at the system outside in.
Starting point is 00:10:02 So you can see that this is a data center, right? There are a bunch of networking connections. You have a whole lot of racks in a data center. Within each rack, you've got a lot of chassis. And in each chassis there is a system, which is one or more CPUs connected through a symmetric cache coherency protocol. They have their own memory. There will be IO devices.
Starting point is 00:10:24 This is where CXL fits in. It is defining new protocol semantics to work at this level, tightly coupled with the CPUs, between accelerators and memory. And as I said, we'll see that this leverages PCI Express and is targeted for AI, machine learning, HPC, comms, and a variety of emerging applications. So what is CXL? CXL is basically based on PCI Express infrastructure. So if you think of this as a processor, you've got an IO link, a PCI Express link.
Starting point is 00:11:03 This is PCI Express 5.0 going through an x16 PCI Express connector, and you can plug in either an x16 PCI Express card or a CXL card; it will work either way. PCI Express is the ubiquitous interconnect technology across the entire compute continuum. It spans everything from your handheld, laptop, desktop, server, comms, you name it. So that's the reason it's present everywhere. PCI Express 5.0 defined the alternate protocol mechanism, and Compute Express Link sits on top of that.
Starting point is 00:11:41 So this way, what happens is you get a flexible port. Let's say in my system I have five x16 slots. I could choose to put in all five as x16 PCIe cards, or I could choose to put in all five as x16 CXL cards, or I could choose any combination thereof, depending on the user's needs. So by keeping it in this particular way, where it is completely interoperable, you are offering users the choice. Otherwise, you had to give them
Starting point is 00:12:14 dedicated slots, and that costs more real estate, more power, more area, and more pins on the CPU, which is not the best way to make progress, right? So we truly believe in this plug and play. And PCI Express, as I said, has the alternate protocol mechanism. Fundamentally, what happens is when the link comes up, we run at 8b/10b encoding at the PCI Express 2.5 GT/s rate. Very early on, the CPU is going to query the device, saying, I support CXL, do you support CXL? If the device says, yes, I support CXL, we are going to talk CXL. If the device says, I don't know what that is, in which case it's not going to respond, we will simply go ahead and proceed with PCI Express. So by the time the link even comes up
Starting point is 00:13:05 through just the Gen 1 rate, very early during the training, it is decided whether it is PCI Express or whether it is CXL, and it is done dynamically. So you could have a slot, and the same slot can have a PCI Express card today, or in somebody else's system, the same slot can have CXL.
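To picture that decision point, here is a tiny C model of the negotiation being described. The real handshake happens in hardware, very early in link training, via the PCI Express alternate protocol mechanism; the types and function names below are purely illustrative assumptions.

```c
/* Illustrative model only: the CXL-versus-PCIe decision made during
 * early link training. Real hardware does this via alternate protocol
 * negotiation at the 2.5 GT/s rate; nothing here is spec-defined code. */
#include <stdbool.h>
#include <stdio.h>

enum link_protocol { PROTO_PCIE, PROTO_CXL };

struct port_caps {
    bool supports_cxl;   /* does this side advertise CXL support? */
};

static enum link_protocol negotiate(struct port_caps host, struct port_caps dev)
{
    /* Host asks "do you support CXL?". If the device does not say yes,
     * the link simply continues as plain PCI Express. */
    if (host.supports_cxl && dev.supports_cxl)
        return PROTO_CXL;
    return PROTO_PCIE;
}

int main(void)
{
    struct port_caps host      = { .supports_cxl = true };
    struct port_caps cxl_card  = { .supports_cxl = true };
    struct port_caps pcie_card = { .supports_cxl = false };

    printf("CXL card in slot  -> %s\n",
           negotiate(host, cxl_card) == PROTO_CXL ? "run CXL" : "run PCIe");
    printf("PCIe card in slot -> %s\n",
           negotiate(host, pcie_card) == PROTO_CXL ? "run CXL" : "run PCIe");
    return 0;
}
```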
Starting point is 00:13:22 It will work plug and play. And CXL usages are expected to be a key driver for more data rate in PCI Express, and we believe this will be one of the lead
Starting point is 00:13:37 usage models for the PCIe 6.0 data rate transition. So we basically reuse everything on the PCI Express side. We reuse the PCI Express link training. We use the circuits, the channels, everything. And we'll see a more detailed picture of this coming up. So this one talks about how the different CXL protocols coexist.
Starting point is 00:14:09 So if you look into this picture here, we have three types of protocols that run with CXL, and they all run on top of the PCI Express infrastructure. So CXL consists of these three protocols: CXL.io, CXL.cache, and CXL.memory. CXL.io is the IO part of the stack. This is almost identical to PCI Express. We use it for discovery, configuration, register access, interrupts, virtualization, and most importantly, the bulk DMA with producer-consumer semantics. It is more or less identical to PCI Express, and it is mandatory in CXL. So that part is there, right? As we said, on the PCIe or CXL logical PHY there are some modifications that are needed to do the alternate protocol negotiation that I talked about, and the same thing on the other side. Now let's look into the other two protocols. So you've got CXL.cache. This is optional for a device.
Starting point is 00:15:19 It allows a device to be able to cache the system memory. So memory that is attached to the system, host memory, this is the coherent memory. CXL.cache will allow the device to access that memory, store it in its local cache, and effectively have the same kind of caching agent functionality that a CPU core might have. The CXL.memory protocol is also optional for a device. What it does is, if the device has memory attached to it, the device can choose to allocate a portion of it, or all of it, or none of it, to be mapped into the system memory space as coherent memory. So that way the CPU can access that part of the memory
Starting point is 00:16:07 using the same types of semantics that it would if it were accessing host memory. So CXL.cache, CXL.mem, and CXL.io all go through their own independent stacks, but they get multiplexed at the flit level. The CXL specification defines flits as the basic unit of transfer, and it dynamically multiplexes these three protocols. The basic unit of transfer for CXL is 528 bits. So what this allows us to do is, if a device is, let's say, in the middle of a large CXL.io DMA transfer and the host needs to send it a latency-critical snoop, we don't want that snoop to be stuck behind a 512-byte payload, because our flit sizes are small.
Starting point is 00:17:11 They are, as I said, 528 bits, which is 66 bytes. At that flit boundary, we can pause the DMA transfer, send the snoop to the device, and then resume the DMA bulk transfer that we were doing. So all of these are provided in CXL. The flits are protected by a 16-bit CRC. And on the IO side, of course, we have the same link-level CRC on top of that.
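Here is a small, purely conceptual C sketch of the benefit being described: with a fixed 66-byte flit, a latency-critical CXL.cache message waits at most until the next flit boundary instead of sitting behind an entire 512-byte CXL.io payload. The queueing and the point at which the snoop shows up are made-up assumptions for illustration; this is not how the actual arbiter is specified.

```c
/* Conceptual sketch: dynamic multiplexing at flit granularity.
 * A 512-byte CXL.io payload is carried over several 66-byte flits; a
 * CXL.cache snoop that arrives mid-transfer is injected at the very
 * next flit boundary instead of waiting for the whole payload. */
#include <stdio.h>

enum { FLIT_BYTES = 66, IO_PAYLOAD_BYTES = 512, SNOOP_ARRIVES_AT_FLIT = 3 };

int main(void)
{
    int io_remaining = IO_PAYLOAD_BYTES;
    int snoop_pending = 0;
    int flit = 0;

    while (io_remaining > 0 || snoop_pending) {
        flit++;
        if (flit == SNOOP_ARRIVES_AT_FLIT)
            snoop_pending = 1;                /* snoop shows up mid-transfer */

        if (snoop_pending) {
            /* Latency-critical traffic wins at the flit boundary. */
            printf("flit %2d: CXL.cache snoop (preempts bulk data)\n", flit);
            snoop_pending = 0;
            continue;                         /* bulk transfer resumes next flit */
        }
        printf("flit %2d: CXL.io data, %3d bytes still queued\n",
               flit, io_remaining);
        io_remaining -= FLIT_BYTES;           /* ignoring header/CRC details */
        if (io_remaining < 0)
            io_remaining = 0;
    }
    return 0;
}
```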
Starting point is 00:17:53 So now let's talk a little bit about CXL features and the benefits that it offers. CXL has been designed ground up for low latency, and we'll talk through this in terms of the stack. Each of these three critical usages has its own latency-critical elements to it. If it is the IO part of the stack, it's identical to PCI Express; we made some enhancements to the link layer and the transaction layer so that it can operate in a CXL.io environment. These are fairly small changes.
Starting point is 00:18:25 Things like: you need to be able to pause things, you need to be able to break things on a flit boundary, all of those. CXL.cache and CXL.mem are optimized for latency, and the fundamental unit is the flit size in the CXL protocol. These transactions, as you can see, we multiplex just before we hit the logical PHY. So we talked about how we can interrupt a 512-byte or other large payload for a performance-critical one, and that helps us with the latency. Now let's look into an alternate approach that we could have taken, which we didn't.
Starting point is 00:19:13 We could have done the multiplexing between these different protocols at the transaction layer level and gone through the PCIe link layer. There are two issues with that from a latency point of view. First of all, for the 512-byte example that I gave, if I have a snoop, I have to wait for that particular transaction to end before I can send my transaction. Whereas here, I don't have to do that.
Starting point is 00:19:38 I just have to wait till the flit boundary, and the flit boundary is a much smaller granularity, 64 bytes, or 66 bytes with the two bytes for the CRC. The other thing is that PCIe deals with variable packet sizes in the link layer: you can have up to four TLPs in a given clock cycle, assuming it's one gigahertz at the Gen 5 data rate, x16, and you can have a TLP that goes across multiple cycles. So the CRC logic, for example, has to be able to deal with all of those;
Starting point is 00:20:14 consequently it has a lot of pipeline stages, and that's built in throughout the whole thing. It works really well for PCIe; it is really well suited and optimized for that. But coherent traffic works on a cache-line basis. It's more or less small packets, and it's not very efficient from that point of view. With CXL flits, the packet sizes are fixed and the CRC size is fixed, so that gives us a lot of efficiency in terms of how quickly we can get to the latency numbers that we want. So the question is, how much latency are we targeting? What is good enough for us?
Starting point is 00:20:58 And the answer is very simple. We want the CXL.cache and CXL.mem latencies to be in the same ballpark as what we would have if we were doing a symmetric cache coherency protocol. So we gave some guidance in the CXL specification. For example, if you are a device and you get a snoop request on the pin, then on a snoop miss we expect you to return the response for that snoop, basically, "I don't have the line," the invalid response, within 50 nanoseconds, five zero, pin to pin. So that's fairly aggressive. And we can do that because
Starting point is 00:21:48 we have this structure here. Similarly, if I'm giving a memory read request, and the device has got either HBM memory or DDR memory, it is supposed to respond with the data starting at 80 nanoseconds. So a pin-to-pin latency of 80 nanoseconds is what is given in the spec. Now, you can say, hey, what about storage class memories? Because those have higher latencies. And that's fine, that's not an issue. We have a reporting mechanism, an HMAT kind of table, where you will say what kind of memory you have, so that the system software is aware of what kind of device you have and what kind of latency you have, and can map the memory into your space accordingly.
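To make the reporting idea a bit more tangible, here is a hypothetical C struct in the spirit of what an HMAT-style table conveys to system software: where a memory range lives and roughly how it performs, so the OS can place data accordingly. The field names, units, and values are illustrative assumptions, not the actual ACPI HMAT layout.

```c
/* Hypothetical sketch only: the kind of attributes a platform might
 * report for a CXL-attached memory range so software can tier it.
 * This is not the real ACPI HMAT format; names and values are made up. */
#include <stdio.h>

struct memory_range_attrs {
    unsigned long long base;         /* system physical address            */
    unsigned long long size;         /* bytes                              */
    unsigned int read_latency_ns;    /* typical read latency               */
    unsigned int write_latency_ns;   /* typical write latency              */
    unsigned int bandwidth_mb_s;     /* sustainable bandwidth              */
    int persistent;                  /* storage-class (persistent) memory? */
};

int main(void)
{
    /* Example entry for a storage-class-memory expander (made-up numbers). */
    struct memory_range_attrs scm = {
        .base = 0x200000000ULL,
        .size = 512ULL << 30,
        .read_latency_ns = 350,
        .write_latency_ns = 400,
        .bandwidth_mb_s = 30000,
        .persistent = 1,
    };
    printf("range at %#llx: read %u ns, write %u ns, persistent=%d\n",
           scm.base, scm.read_latency_ns, scm.write_latency_ns, scm.persistent);
    return 0;
}
```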
Starting point is 00:22:22 CXL is an asymmetric protocol, so the protocol flows and the message classes are different between the host processor side and the device side. This has been a conscious decision to keep the protocol simple and the implementation easy, and I'll go through that reasoning. We have experience with enabling the industry with symmetric cache coherency protocols. Invariably, you will find a lot of excitement initially, and then the vast majority of them, 90-plus percent in my experience, basically just do not make it
Starting point is 00:23:18 to the finish line, because of the complexity, because of the huge design effort and validation effort. And most importantly, what happens is symmetric cache coherency protocols change over time, so doing something that is backwards compatible becomes a challenge. Now let's look into each component. Why is that the case? The host processor has a mechanism to orchestrate cache coherency between its multiple caching agents. It has got cores, it has got home agents, it has got (I've not shown it here, but) PCI Express root ports, all of that. It might have peer CPUs that it is connected to
Starting point is 00:23:59 using a symmetric cache coherency protocol link. So it has to deal with orchestrating, and I want to emphasize the word orchestrating, cache coherency between multiple caching agents. Typically this involves taking the request, resolving any conflicts, and tracking cache lines.
Starting point is 00:24:23 This is what is known as home agent functionality, and that's what is represented here in this diagram. That's the complicated piece. That's the one that is very tied to the individual microarchitecture. That's the one that changes from generation to generation. The consumption side, which is the caching agent, is relatively straightforward. I've worked in two companies for more than two decades, and I have not seen very many, actually I have not seen any, CPU work with its predecessor CPU using its cache coherency link, and there are very good technical reasons why it doesn't. On the other hand, from a device perspective, it needs to cache something because it has a need, a performance benefit that it
Starting point is 00:25:12 can gain because it caches something, but it really is not in the business of orchestrating cache coherency between different cores, or between other caching agents or other accelerators that might be there. So there is no need for the device side to get bogged down in orchestrating cache coherency by having home agent functionality, which is complex and, as I said, changes across generations. So what the CXL specification does is abstract that away. It says that home agent functionality, you, the host processor, are dealing with it anyway, so use whatever you do; I'm just going to provide you with a simple set of abstracted cache coherency commands, and those are similar to what you would find in a MESI protocol.
Starting point is 00:26:05 So things like: I want to read this cache line; I want to read it shared; I want to read it private; all of those. If I updated a cache line, I want to do a write back. And occasionally I might get a snoop, in which case I give a snoop response. That's pretty much what we need. We don't need any of these
Starting point is 00:26:26 questions of how do I resolve conflicts across multiple caching agents and all of those things. So that basically keeps it simple. By the same token, if I have memory, I really am not in the business, or I should not be in the business,
Starting point is 00:26:41 of orchestrating cache coherency. That's done on behalf of the CPU. All I do is: there is a read to that memory location, I provide the data; you want to write something to the memory location, I write the data. I'm just trying to do the best I can to manage the memory and deliver the lowest latency possible.
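To give a feel for how small that abstracted command set is, here is a hypothetical C rendering of the categories just described: a handful of cacheable read flavors, a write-back, a few snoop responses, and plain memory reads and writes on the CXL.mem side. The identifiers are illustrative; they are not the literal opcode names from the CXL specification.

```c
/* Illustrative only: a MESI-flavored view of the small, abstracted
 * command set a CXL device deals with. These are not the literal
 * opcode names defined in the CXL specification. */
#include <stdio.h>

/* Device-to-host requests on CXL.cache */
enum d2h_request {
    REQ_READ_SHARED,    /* "I want to read this line shared"        */
    REQ_READ_OWNED,     /* "I want it private so I can modify it"   */
    REQ_WRITEBACK,      /* "I updated the line; here is the data"   */
};

/* Device responses to host snoops on CXL.cache */
enum d2h_snoop_response {
    RSP_INVALID,        /* "I don't have the line" (snoop miss)     */
    RSP_SHARED,         /* "I kept a shared copy"                   */
    RSP_DIRTY_DATA,     /* "here is my modified copy"               */
};

/* Host-to-device accesses on CXL.mem: the device just services them */
enum m2s_request {
    MEM_READ,
    MEM_WRITE,
};

int main(void)
{
    /* On a snoop miss, the device answers "invalid"; the guidance
     * discussed above is to do that within about 50 ns, pin to pin. */
    enum d2h_snoop_response rsp = RSP_INVALID;
    enum m2s_request next = MEM_READ;
    printf("snoop response = %d, next mem op = %d\n", (int)rsp, (int)next);
    return 0;
}
```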
Starting point is 00:27:00 So contrast that with a symmetric cache coherency protocol, where every side is a peer and they have all got caching agents. So this is, as I said, a very deliberate decision. It's to keep things simple,
Starting point is 00:27:16 keep things simple for the developers, keep things simple for the accelerator providers and for the memory expansion providers. And in any case, the CPU needs to orchestrate cache coherency, as we talked about. So keep things where they really belong.
Starting point is 00:27:36 Now, I did mention that CXL.memory allows for memory on the device side. Host memory is mapped mostly into the coherent memory space. But in the case of a device, its memory can be mapped either for its own local usage, or part or all of it can be mapped into the system.
Starting point is 00:27:56 So there are two different views of how things can be. And as I said, you can have a part of the memory that is one type and the other part that is the other type. It's really up to the device how much it needs to map into the system memory space.
Starting point is 00:28:15 Now, if the device owns the memory and it is not mapped into the coherent memory space in the system, that is what is known as device bias. In device bias, the memory is still mapped into the system memory space, but it is mapped as memory-mapped IO, or uncached. So anytime the CPU wants to access that memory, it does that using an uncached flow, just like it does today in PCI Express.
Starting point is 00:28:46 In the picture on the right, even though the memory physically resides with the device, it's really mapped into the system memory space. So the home agent in the host processor is in charge of that memory. Even if the device wants to access its own memory, it has to go through the home agent, and then it can come back and take the data. And that is because you want to make sure that when you are accessing a location, nobody else has it; you have the right coherency semantics built into it.
Starting point is 00:29:16 And you can flip the biases for a given location between these two, and even if you didn't get it right, by construction the CXL protocol is such that it is going to work properly, in the sense that data consistency will still be guaranteed. Now let's talk a little bit about some of the use cases of CXL. What you see in this picture are three types of devices. The leftmost one is a type 1 device. Type 1 devices use the CXL.io and CXL.cache semantics, and we have provided some example usages, like a smart NIC that can benefit from caching. Now, if the smart NIC implements a partitioned global address space, PGAS, it needs to ensure that the ordering model is preserved.
Starting point is 00:30:10 Now, note that PCIe has got a producer-consumer ordering model. That mandates that writes be able to bypass prior reads to avoid deadlocks in the PCIe hierarchy. This is known as posted transactions bypassing non-posted transactions, and that can cause a problem with the PGAS model, where two strongly ordered transactions cannot complete out of order. So the way to work around this ordering issue is to serialize accesses,
Starting point is 00:30:40 which can result in performance implications. Now, if you have CXL.cache, the NIC can simply cache these locations and prefetch them, because the beauty of cache coherence is that you can prefetch the locations, and when you retire the transactions, or when you do the write, as long as the line has not been snooped out, you're guaranteed that the data is still there with you. In that case, it just needs to complete the transactions in order in its local cache. Now, another type of usage for a type 1 device is around atomics.
Starting point is 00:31:21 Now, increasingly, applications are using advanced atomic semantics involving floating point operations. Any standard IO protocol, such as PCI Express, can be enhanced to support these natively; in fact, PCI Express has got some amount of atomic semantics. But the usage of these atomic semantics evolves so fast that by the time you specify them and get to implementations on the CPU side and on the device side, it can take years,
Starting point is 00:31:45 which slows down innovation. On the other hand, if you have CXL.cache, you can simply get ownership of that line and perform any complex atomic semantics that you want by keeping ownership of that data, and then you are done. You don't really need to wait for somebody else to implement atomics the way you envision them.
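As a software analogy for what ownership of a line buys you, here is a minimal C11 sketch: an arbitrary read-modify-write (a floating-point add) built out of nothing more than holding onto the current value until the update lands, with a compare-and-swap loop standing in for keeping ownership of the cache line. This is only an analogy to illustrate the point; it is not CXL protocol code.

```c
/* Software analogy only: an arbitrary read-modify-write (a floating-
 * point add) built from plain ownership of the data, rather than from
 * a natively specified interconnect atomic. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static _Atomic uint32_t shared_bits;     /* bit pattern of a shared float */

static void atomic_add_float(float delta)
{
    uint32_t old_bits = atomic_load(&shared_bits);
    for (;;) {
        float oldf, newf;
        memcpy(&oldf, &old_bits, sizeof oldf);
        newf = oldf + delta;             /* any complex operation goes here */
        uint32_t new_bits;
        memcpy(&new_bits, &newf, sizeof new_bits);
        /* Succeeds only if we still "own" the current value. */
        if (atomic_compare_exchange_weak(&shared_bits, &old_bits, new_bits))
            return;
        /* old_bits was refreshed by the failed CAS; just retry. */
    }
}

int main(void)
{
    float init = 1.5f, result;
    uint32_t bits;
    memcpy(&bits, &init, sizeof bits);
    atomic_store(&shared_bits, bits);

    atomic_add_float(2.25f);

    bits = atomic_load(&shared_bits);
    memcpy(&result, &bits, sizeof result);
    printf("result = %.2f\n", result);   /* prints 3.75 */
    return 0;
}
```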
Starting point is 00:32:03 The middle type of device is what is known as a type 2 device. The typical usages are your GPGPUs and FPGAs for dense computing. Now, these devices may have some amount of local memory attached to them that is used for their computation, but it can also be mapped into the system space, and we expect these type 2 devices to implement all three protocols. The caching and memory semantics would be used to populate and pass operands and results back and forth between the computing entities with very low latency and high efficiency.
Starting point is 00:32:51 So this is where you would use the bias flipping: you'd crunch on some data, and then you would just tell the CPU to come pick it up, without having to send an entire set of data and a flag and all of that. It's just local. The same way, when the CPU crunches something, you basically get unfettered access without having to go through explicit synchronization of the "hey, here is the data, here is the flag" kind.
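Here is a heavily simplified C sketch of the handoff pattern a type 2 device enables: the host stages operands in the device-attached memory, the accelerator crunches on it while the memory is in device bias, and the host then just loads the result, with the hardware (not this code) keeping everything coherent. The structure and function names are hypothetical, made up purely to illustrate the flow.

```c
/* Toy model only: the operand/result flow a type 2 device enables.
 * In real hardware the bias flips and coherence are handled by CXL;
 * here they are just a field and some prints. All names are made up. */
#include <stdio.h>

enum bias { HOST_BIAS, DEVICE_BIAS };

struct dev_memory {
    enum bias bias;         /* who "owns" the region right now       */
    double operands[4];     /* staged by the host                    */
    double result;          /* produced in place by the accelerator  */
};

static void host_stage_operands(struct dev_memory *m)
{
    m->bias = HOST_BIAS;                 /* host populates operands   */
    for (int i = 0; i < 4; i++)
        m->operands[i] = i + 1.0;
    printf("host: operands staged in device-attached memory\n");
}

static void device_compute(struct dev_memory *m)
{
    m->bias = DEVICE_BIAS;               /* device works at full speed */
    m->result = 0.0;
    for (int i = 0; i < 4; i++)
        m->result += m->operands[i] * m->operands[i];
    printf("device: result computed in place\n");
}

static void host_consume_result(struct dev_memory *m)
{
    /* Flip back to host bias and just load the result: no bulk copy
     * back, no explicit data-plus-flag synchronization. */
    m->bias = HOST_BIAS;
    printf("host: result = %.1f\n", m->result);
}

int main(void)
{
    struct dev_memory m;
    host_stage_operands(&m);
    device_compute(&m);
    host_consume_result(&m);
    return 0;
}
```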
Starting point is 00:33:29 The third type, on the right, is a type 3 device. The usages would be memory bandwidth expansion, memory capacity expansion, and storage class memory. These only need to implement the CXL.io and CXL.memory semantics. This memory will be mapped into system memory as coherent memory, but the host processor orchestrates the cache coherency, as we mentioned earlier. The device doesn't have to know anything about coherency flows. All it does is implement the memory semantics, which is a set of reads and writes. It doesn't even need to implement CXL.cache.
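A rough way to picture a type 3 device is as extra system memory sitting behind an address decoder: the host maps the device's memory into its physical address space and simply routes coherent reads and writes to it. The sketch below is a toy decoder with made-up address ranges; it is not how a real host bridge or the CXL host-managed memory decoders are implemented.

```c
/* Toy sketch with made-up ranges: local DRAM below 4 GB, a CXL type 3
 * expander's memory mapped coherently right above it. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define DRAM_BASE  0x000000000ULL
#define DRAM_SIZE  0x100000000ULL             /* 4 GB of local DRAM     */
#define CXL_BASE   (DRAM_BASE + DRAM_SIZE)    /* expander window starts */
#define CXL_SIZE   0x200000000ULL             /* 8 GB of expansion      */

static const char *route(uint64_t pa)
{
    if (pa < DRAM_BASE + DRAM_SIZE)
        return "local DRAM (CPU memory controller)";
    if (pa < CXL_BASE + CXL_SIZE)
        return "CXL type 3 device (plain CXL.mem read/write)";
    return "unmapped";
}

int main(void)
{
    uint64_t samples[] = { 0x080000000ULL, 0x180000000ULL, 0x400000000ULL };
    for (int i = 0; i < 3; i++)
        printf("%#011llx -> %s\n",
               (unsigned long long)samples[i], route(samples[i]));
    return 0;
}
```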
Starting point is 00:33:59 So what we are doing with CXL is enabling a lot of capabilities with some very simple extensions on top of PCI Express. And the purpose of this is to make sure that the ecosystem can really innovate and build some really good accelerators or memory expansion devices or whatever. These are very powerful constructs, yet very simple constructs. And the barrier to entry is fairly
Starting point is 00:34:26 minimal, because all the main things, like the PCI Express PHY and all of those, are already taken care of; you have that anyway. On top of that, there is a handful of semantics that you need to implement to get the benefit, and you don't have to orchestrate cache coherency and all of that. So, for example, I can imagine that somebody will build in-memory or near-memory processing by doing a type 2 device. You've got a bunch of memory attached to the device, you want to do some kind of search or whatever it is on that memory, and then you can work very collaboratively with the processor, because you've got very high bandwidth access to that type of memory,
Starting point is 00:35:15 and all of that doesn't need to really cross the wire. You can provide a very valuable solution to a particular problem that you're trying to solve. You can speed up database accesses, you can do a bunch of things, because you've got caching semantics built into the protocol along with the memory semantics. So it's a very powerful construct that we are enabling the industry with. So now let's look into what we started off by saying: why a new interconnect, and see how well we did.
Starting point is 00:35:56 Effectively, I hope I convinced you that the memory that is attached to the device is write-back memory, because you use CXL.mem semantics to map it. Because of CXL.cache, you can do memory loads and stores just like a CPU core would. And PCIe DMA, just like it happens to the memory attached to the CPU, if you
Starting point is 00:36:23 were to do peer-to-peer access, looks very similar. So, in addition to the existing load-store semantics, we have been able to bring these into the picture. And this will result in, if I'm operating between the different sides, being able to do efficient population of operands and results. You could borrow memory resources when you need to work on something,
Starting point is 00:36:48 and you could do a bunch of things like user- and kernel-level data access and data movement. And all of these are very low latency, as we talked about. Extremely low latency. The latency is similar to a cache-coherent symmetric multiprocessing system, which is, by the way,
Starting point is 00:37:03 much smaller than a PCI load-store type of latency. PCI load-store latency is itself small, but these are much, much smaller in terms of the latency access characteristics, and we talked about some of the numbers. So, in summary, CXL has the right features and the right architecture, the right level of abstraction, and most importantly, a very low barrier to entry, to enable a broad and open ecosystem. Both are important: a broad ecosystem and an open ecosystem.
Starting point is 00:37:40 And that will enable us to do heterogeneous computing, allow for a bunch of different memories to be put in the system, and also provide the right level of abstraction so that you can classify different memories differently and optimize your performance accordingly, and do server disaggregation. So: coherent interfaces. We leverage PCIe, and again, we only innovate where it makes sense. We wanted to piggyback on PCI Express because it allowed us the plug and play and all of those things that I talked about. We don't have to do any of the heavy lifting in terms of the PHY, in terms of the channels, in terms of even discovery for IO; all of those are completely leveraged from PCI Express. On top of that, we built a low-latency approach for .cache and .mem, targeted at near-CPU cache-coherent latency.
Starting point is 00:38:38 We expect it to be in the same ballpark. So pick your favorite vendor; I can talk about our Intel CPUs. For example, whatever latency you see on our UPI link, we expect CXL.cache and CXL.mem to have similar latency characteristics. We are talking about asymmetric complexity, which eases the burden of cache-coherent interface design, because as an accelerator or memory expansion developer, you only need to worry about a handful of things. You really are not in the business
Starting point is 00:39:14 of orchestrating cache coherency across a wide plethora of designs. That asymmetric nature of CXL is extremely valuable: it keeps latency low and makes sure that backwards compatibility is going to work really well. Last but not least, and very important, it's an open industry standard with growing, broad industry support. If you look at it, all the CPU vendors are in CXL and committed to it, all the memory vendors are in CXL, all the FPGA vendors are in CXL, all the GPU vendors are
Starting point is 00:39:53 in CXL, the cloud service providers are there in CXL, the OEMs, the comms service providers; you saw the list. It's a very impressive list of companies that are fully committed to CXL. And we are not resting on our laurels with CXL 1.1. We published 1.0 in March of 2019, and 1.1, with compliance added to it, in Q2 of 2019. And we are doing the next generation CXL 2.0, which, it looks like, is on the final stretch. I'm keeping my fingers crossed.
Starting point is 00:40:36 So this journey will continue, looking at the types of companies that are there, the investments that we are getting, and the interest that is there. It's an ongoing journey. My take on it is, if you are not a member, please consider becoming a member. And thank you all for attending. I'm really glad to have you all as co-travelers on this journey. Stay safe, my friends. Thank you. Thanks for listening. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
Starting point is 00:41:26 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
