Storage Developer Conference - #157: Compute Express Link 2.0: A High-Performance Interconnect for Memory Pooling

Episode Date: December 3, 2021

...

Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 157. Hi, welcome. My name is Andy Rudoff. I'm the persistent memory software architect at Intel, but here I'm really doing this presentation as a member of the Compute Express Link, or CXL, consortium, and somewhat as a contributor to the CXL 2.0 spec around things like memory pooling and persistent memory. And you'll see how these things are all related as I go through it. First, I'm going to just give you a brief background on CXL and just make sure we're all sort of starting at the same
spot here. So what is CXL? It's obviously this open industry standard for a cache-coherent interconnect. It's put together by a consortium of more than 160 member companies, and you can see we put the logos of the board of directors here on the slide for you, so you get an idea of who's at the helm here. But let's go into some of the details. Now, I pulled this slide right off the CXL website, computeexpresslink.org; you can see the URL there on the slide. And this is really the highlights slide. Again, this is an open industry standard for a high-bandwidth, low-latency interconnect. But what I really want to do is draw your attention down to this bullet about PCIe, because this will come up a few times in this talk.
Starting point is 00:02:17 CXL is based off of the PCIe electricals. A CXL slot is a PCIe slot that's able to run these additional protocols. And it has these three protocols, CXL.io, which is essentially the PCIe protocol itself, CXL.cache for caching and CXL.memory for memory semantics. And we're going to spend a lot of our time emphasizing this CXL memory aspect. The CXL 1.1 spec was published in June of 2019. Though what we're here to talk about is the 2.0 spec, which was published last November.
And what we added to 2.0, well, it's a lot of stuff. The spec got huge in comparison to the 1.1 spec. It is backward compatible with 1.1, and we added a whole bunch of features. Among them are things like support for memory pooling and support for persistent memory. And you're going to see, as I start to go through this, that these things are all related. And there's a lot of sort of generic memory device support that got added to handle a lot of these different things. Again, as some background from the CXL website, there are these representative use cases known as a type 1 device, a type 2 device, or a type 3 device. And you can see on
the type 1 and type 2 device, they're accelerators, right, of different types. And we're not really going to talk about those much today. We're going to concentrate on the type 3 device, which is about adding more memory to the system. And you can see it draws a little picture here of a processor having a CXL bus on it, and then one of these memory buffer devices with what is looked at as system main memory, available to general-purpose programs running on the processor. So we're going to spend a lot of time kind of digging into that.
Now in CXL 2.0, like I said, support for a lot of stuff has been added. Switches are one of them. And as part of switching, you get this pooling configuration. So I pulled this slide out of a pooling webinar that the CXL Consortium did a while back, and you can see there are a couple of good examples here. One is what's called the single logical device model, where you have these memory devices shown down on the bottom of the picture. Through a switch, they get attached to different hosts, and the color coding kind of shows you how this works. These hosts see these devices, essentially these type 3 CXL devices, as if they were attached directly to them, and the switch is what makes this kind of flexibility of changing which host you're connected to possible. That's what gives you pooling. On the right is maybe a more flexible, more powerful version of pooling, where you have these things called multiple logical devices, MLDs, where you can take some of their capacity and assign it to hosts. So now you see you have kind of a finer granularity of what capacity can be given to a host. And this is all made possible by this thing called the CXL Fabric Manager, which is another part of the CXL 2.0 spec. So the Fabric Manager is the one
that kind of controls what goes where and how this all gets configured and set up. Now, pooling is one thing that got added, and as I said, persistent memory is the other. And you're going to see in a minute why I'm showing these both together and how closely related these things actually are. Persistent memory on CXL is really a great match of technologies. You can see it in this latency ladder that we draw all the time. We talk about this gap between the latency of DRAM and the latency of storage.
And creating something like persistent memory on CXL fills that gap perfectly. The numbers just work out, right? It fills the need for something whose latency doesn't have to be quite as fast as DRAM, but needs to be much faster than storage. Also, unlike persistent memory today, this moves persistent memory off the memory controller and onto the CXL bus.
Starting point is 00:06:34 And that means that your persistent memory, when you plug it into the system, won't impact your DDR slots like they do on current systems. So that's pretty exciting. And as you'll see when I get into the details, we've enabled a standardized management of memory devices in general, including persistent memory devices. And of course, these devices come in a wide variety of industry standard form factors. So with those kind of seed technologies, pooling and persistent memory, let's talk about what we changed in the 2.0 spec to add support for this. To do that, I need to
just give you a little background, a little reminder of how persistent memory works today, because that drove a lot of the decisions that we made. So I drew a little picture. Now, you know, I work at Intel, so I drew kind of an Intel picture of persistent memory, but most of what I talk about here is going to be true for non-Intel persistent memory too. Here are these little Optane persistent memory modules plugged into a system today in the DDR slots. And the Intel product has a couple of modes that you can use. On the left is this volatile mode where the memory, even though it's persistent, you ignore that fact and you use it just to make a big memory system. The persistent memory is the capacity and your DDR
DRAM is a cache to make it perform better. Or you use it on the right-hand path, which on our product we call the App Direct path. This is the persistent memory programming model on the right, where applications get direct access. And I drew a little picture of it over here so you can kind of see what I'm talking about. An application is running and, using the SNIA programming model, is able to get access directly to the persistent memory. And this allows applications to do very cool things with loads and stores, you know, memory access. You can load a value from your persistent memory and get that persistent value in about 350 nanoseconds. That's really not possible with storage today, so it's actually a very, very powerful model. And it's all based off of this SNIA programming model that we did a few years back. So I grabbed one of those programming model slides that I've shown for years from a previous deck, just to show you again what it looks like here. And I won't go through the details, because that part is pretty well explained and documented by now. But the whole point is to give these applications direct access to their persistent memory.
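To make that concrete, here is a minimal sketch of what an application using the SNIA programming model can look like with the libpmem library from PMDK. The file path and sizes are made up for illustration; the point is simply that the application reaches persistence with ordinary stores plus a flush, with no read or write system calls in the data path.

```c
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a file on a DAX-mounted persistent memory filesystem
     * (the path here is hypothetical). */
    char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Ordinary store instructions; no read()/write() in the data path. */
    strcpy(addr, "hello, persistent memory");

    /* Flush the stores to the persistence domain.  On real persistent
     * memory libpmem uses CPU cache-flush instructions; otherwise it
     * falls back to msync(). */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len);

    pmem_unmap(addr, mapped_len);
    return 0;
}
```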
When we show this picture, we usually stop short of describing how the OS knows that there's persistent memory in the system. We just kind of assume that it's there, but today that's going to be a much bigger focus of our talk. So the way that the operating system knows that persistent memory is in the system is through a table that we added to ACPI. We added it to ACPI 6.0, called the NVDIMM Firmware Interface Table, or NFIT. The existence of this table causes the OS to load up its driver stack, whatever those generic NVDIMM drivers are. And I keep calling it the OS because this is true for all the OSes that support persistent memory. Windows, Linux, and other OSes like VMware ESXi all do this. They look for the NFIT and they load the appropriate drivers. And when those generic drivers need to access persistent memory, of course, they could just use loads and stores.
Starting point is 00:10:05 But when they need to do something that's kind of more of a management flavor, like check its health or configure it or unlock it, things like that, they do that by calling back into these device-specific methods that are provided by the platform. They're usually written with this kind of funny underscore DSM syntax. And so, you know, on an Intel platform, it's the BIOS that builds the NFIT and supplies these device-specific methods. And the operating system has a nice generic driver here, doesn't really know about the hardware details of the persistent memory. And that actually worked really well for us.
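Just to make that discovery step concrete, here is a small user-space sketch of checking whether the platform firmware published an NFIT at all. On a typical Linux system the raw ACPI tables show up under /sys/firmware/acpi/tables, so the presence of a file named NFIT there is a rough proxy for what triggers the generic NVDIMM driver stack at boot; this is only a demonstration of the idea, not how the kernel itself consumes the table.

```c
#include <stdio.h>

/* Check whether the platform firmware published an NFIT.  On Linux the
 * raw ACPI tables are exposed under /sys/firmware/acpi/tables, so this
 * is a user-space approximation of the check the OS does at boot
 * (reading the file may require root). */
int main(void)
{
    FILE *f = fopen("/sys/firmware/acpi/tables/NFIT", "rb");
    if (f == NULL) {
        printf("no NFIT: firmware did not describe any NVDIMMs\n");
        return 1;
    }

    /* The first 4 bytes of any ACPI table are its signature. */
    char sig[5] = { 0 };
    if (fread(sig, 1, 4, f) == 4)
        printf("found ACPI table with signature %s\n", sig);

    fclose(f);
    return 0;
}
```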
And so I'm going to tell you some of the positives that we learned from doing this, but also some of the negatives. And that's going to drive a lot of the decisions of how we made it work for CXL. So, some of the positives. Well, first of all, the NFIT table worked out great. It described not only Intel's PMem, but also the NVDIMM-N products that are in the market. Because that allowed us to create generic kernel drivers, we were able to get these things upstream very early into the various kernels. The DSMs are what allowed it to be generic and kind of abstracted away all the hardware details. And I have to say, this actually has evolved fairly gracefully. We've made a few
small additions to this, but nothing earth-shattering. It really is the same framework that we originally put into ACPI 6.0, and it's been working successfully for a number of years now. But on the con side, well, as we look forward to buses like CXL, they're much more expandable. They've got switches and a lot more slots and hot plug and things like that. And ACPI doesn't really have great support for that level of dynamic scaling; it's really meant for kind of a small number of empty sockets, if any. And the DSMs, well, if we had a bug in a DSM, it's kind of a logistical challenge to get the bug fixed and rolled out through a new BIOS. Since the DSMs have all this vendor-specific information in them, it's nearly impossible to make a BIOS that supports
multiple persistent memory products from different vendors, because it's the vendor who gives you the BIOS, and no vendor has a BIOS that has all of the other vendors' changes in it either. So that's been kind of a pain point for us. So taking all that, the pros and the cons, and taking what we learned, I'm going to talk about what we did to add persistent memory support to CXL. And you're going to see that this isn't really just about persistent memory. It's about memory in general, and it applies to volatile memory and persistent memory and memory pooling as well. So when we started out making these changes, we realized
we shouldn't just go add a bunch of persistent-memory-specific stuff to CXL. Anything that actually makes sense for other types of memory, we should make sure is specified that way, generically, and only do persistent-memory-specific changes when absolutely necessary. One very early example of this is how you find the devices. Remember that NFIT table that I told you about? It really doesn't apply to CXL; it's really meant for those DDR-attached NVDIMMs. On the other hand, CXL includes the PCIe protocol,
and PCIe has a very mature framework in pretty much every OS these days for locating devices, configuring them, and handling things like hot plug; these are very, very well-established frameworks. So we shouldn't throw all that away and try to force some old ACPI table into that space. Instead, we want to leverage all that, and that's exactly what we did. The OS finds these CXL 2.0 memory devices by looking through PCIe, by doing PCIe enumeration. When it finds these devices, it maps a bunch of additional registers into MMIO space, memory-mapped I/O space, and that gives it a mailbox interface that you're going to see a lot more about in a moment. And that gives us a command interface where we can submit commands to the device.
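To give a feel for how that command interface works, here is a rough sketch of a mailbox submission loop. The register offsets and field positions are paraphrased from my reading of the CXL 2.0 mailbox register layout and should be treated as illustrative rather than authoritative, and the accessor helpers and the base pointer are stand-ins for however a real driver maps and touches the device's register block.

```c
#include <stdint.h>

/* Minimal MMIO-style accessors over a mapped register block.  A real
 * driver would use its OS's MMIO primitives; these stand-ins keep the
 * sketch self-contained. */
static inline uint32_t reg_read32(volatile uint8_t *base, uint32_t off)
{
    return *(volatile uint32_t *)(base + off);
}
static inline uint64_t reg_read64(volatile uint8_t *base, uint32_t off)
{
    return *(volatile uint64_t *)(base + off);
}
static inline void reg_write32(volatile uint8_t *base, uint32_t off, uint32_t v)
{
    *(volatile uint32_t *)(base + off) = v;
}
static inline void reg_write64(volatile uint8_t *base, uint32_t off, uint64_t v)
{
    *(volatile uint64_t *)(base + off) = v;
}

/* Illustrative offsets within the device's mailbox register block,
 * paraphrased from memory of the CXL 2.0 layout -- verify against the
 * published spec before relying on them. */
#define MBOX_CTRL          0x04u  /* bit 0: doorbell                        */
#define MBOX_CMD           0x08u  /* bits 15:0 opcode, 36:16 payload length */
#define MBOX_STATUS        0x10u  /* bits 47:32 return code                 */
#define MBOX_PAYLOAD       0x20u  /* command payload registers              */
#define MBOX_CTRL_DOORBELL 0x1u

int mbox_send(volatile uint8_t *mbox, uint16_t opcode,
              const void *payload, uint32_t payload_len)
{
    /* Wait until the mailbox is idle (doorbell clear). */
    while (reg_read32(mbox, MBOX_CTRL) & MBOX_CTRL_DOORBELL)
        ;

    /* Copy any input payload into the payload registers. */
    for (uint32_t i = 0; i < payload_len; i++)
        *(volatile uint8_t *)(mbox + MBOX_PAYLOAD + i) =
            ((const uint8_t *)payload)[i];

    /* Program the opcode and payload length, then ring the doorbell. */
    reg_write64(mbox, MBOX_CMD,
                (uint64_t)opcode | ((uint64_t)payload_len << 16));
    reg_write32(mbox, MBOX_CTRL, MBOX_CTRL_DOORBELL);

    /* Poll for completion; a real driver could use an interrupt instead. */
    while (reg_read32(mbox, MBOX_CTRL) & MBOX_CTRL_DOORBELL)
        ;

    /* The command's return code lives in the status register. */
    return (int)((reg_read64(mbox, MBOX_STATUS) >> 32) & 0xffff);
}
```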
That mailbox interface is part of the spec, part of the CXL 2.0 spec, so it's no longer a vendor-private bunch of code for interfacing with a device. And no longer do you have this problem of, is this a driver for this vendor's device or that vendor's device? You'll see some examples of this in a minute. As we threw all this stuff into the spec, we started building up this tome of information about how this is meant to work, what the flows are, and so on. So one of our contributors, Chet Douglas, wrote this software guide for driver
writers. It's got a lot more expansive information there. The URL for it is at the bottom of this slide, and the slides are going to be published on the SDC website. Sorry, that URL is big, long, and hairy, so I typed it into a TinyURL, and you can easily pause the video and type that one in much more easily than this big, ugly thing. But these URLs should get you to the same document. So let's talk a little bit about those mailbox commands. As we started looking at these mailbox commands, for some of them we started thinking, this would be useful to any CXL device, not just a memory device. So the first thing we did was say, let's come up with a list of CXL device opcodes that could apply to any device. And remember, for PM this stuff was all vendor private; now it's all standardized. I pulled this table right out of the spec.
And I know it might be a little small and too much detail to read while you're watching this video, but you can easily go to the public spec and click on these and read the details of all of them. Standards are a bit of a double-edged sword. It's great: now we've got these generic drivers, just like NVM Express. There's a generic NVM Express driver, and you can get NVM Express drives from all these different vendors, but that one driver works. The same thing is now true, or going to be true, for CXL Type 3 memory devices. On the other hand, as a device manufacturer, every time we decide we need to add a new command, we now have to go back to the committee, because it needs to be in the spec for the generic driver to have it. And that might sound like a bad thing. I mean, it is overhead,
but I have to say the committee visit is usually a positive thing. The proposal gets a lot more scrutiny, and the committee helps turn a lot of these very specific proposals into more generically useful ones. I think everybody actually wins. In particular, the ecosystem as a whole wins, because they get these generic drivers and a lot less vendor lock-in on these products. We really learned a lot from doing the NVDIMMs, and we're trying to leverage what worked and fix some of the pain points. And we started with some of the commands that we had done for NVDIMMs but now realize are generic.
Starting point is 00:17:27 These are things like looking up event records or logs or interrupts, you know, with timestamps. So this is all very generic stuff that's now in one table of the spec. And then we took the commands that were more specific to memory devices and we put them in a table just for that. So most of the commands that we added are here because they do apply to memory devices. But keep in mind, again, we were trying to make sure
that we didn't just put system memory support in. We put in stuff that was useful to all sorts of memory devices. So here's the table of those commands, again probably too small for you to read right now, but you get the general idea. Many of these are marked as mandatory, where the spec requires them. And then many of them are marked as PM, meaning mandatory for persistent memory (PMEM). So those commands apply to persistent memory, but the other commands apply to all types of memory. So even just a CXL device that's got basic DRAM on it,
and you put it in as a type 3 device, it will support these mandatory device commands, and that means the same driver and the same management tools can be used for all those devices, which is really great. So DSMs are gone, right? We don't have all that complexity anymore; we ripped all that code out of the BIOS. Or we will have, once we move to CXL. And that's really great, because now the OSes can just use these mailbox commands directly, and the BIOS can too, and they can have generic implementations of their CXL support. But another way to look at it, maybe, is that we've removed complexity from the BIOS but moved it to the OS. And that's certainly true. It does allow us to manage it a little better, though, and it allows these implementations to be generic. So let me just give you an idea of what one of these commands might look like. Just pick the first one, the Identify Memory Device command. Here's what the payload that software gets back looks like. You can see the spec defines very concisely what's in the payload. I didn't even put the whole thing here, but it gives you an idea. It shows you the total capacity of the device, how much can only be used as volatile capacity, how much can only be used as persistent, and, if you can partition between the two, how that's done. So you can see this isn't a PMEM-specific command. It's not a DRAM-specific command. It's a CXL memory device command. And by unifying things around that CXL memory device, this command can be used to show you the capacity of a volatile device, or a persistent device, or a device that's been given to you from a pool, for example. So you can see how all of this sort of comes together on all these new features that we put in CXL.
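For illustration, the leading fields of that payload could be modeled with a struct something like the one below. I'm only sketching the fields the talk mentions (total, volatile-only, and persistent-only capacity, plus the partition alignment); the exact field list, sizes, and units are spelled out in the spec, so treat this as a paraphrase rather than the normative layout.

```c
#include <stdint.h>

/* A paraphrase of the beginning of the Identify Memory Device output
 * payload (opcode 0x4000 in my reading of the CXL 2.0 spec -- verify
 * against the published table).  Capacities are reported in large
 * fixed-size multiples rather than bytes. */
#pragma pack(push, 1)
struct cxl_identify_memdev {
    char     fw_revision[16];          /* firmware revision string        */
    uint64_t total_capacity;           /* everything the device offers    */
    uint64_t volatile_only_capacity;   /* usable only as volatile memory  */
    uint64_t persistent_only_capacity; /* usable only as persistent       */
    uint64_t partition_alignment;      /* granularity for splitting the
                                          remaining capacity between the
                                          two, if partitioning is allowed */
    /* ... event log sizes, label storage area size, poison handling
     * capabilities, and more follow in the real payload ... */
};
#pragma pack(pop)
```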
We talked a little bit about Chet's software guide already, and it really is a must-read if you want to understand thoroughly how these things work, especially if you're going to write a driver. This is way too small for you to read, but I just put it on the slide because I love these tables that show the flows that Chet draws. They're really very powerful. They show you different phases of the boot, for example, and what gets programmed where, and whose responsibility it is to use different registers and things. It really draws together the terse spec-language description that's in the CXL spec and shows you how all these things work together.
Starting point is 00:20:47 The full document is available at the URL I had a couple of slides ago. I really recommend reading that for a full understanding of how CXL memory devices work in the system. Let's talk about some of the more interesting details. One of my favorite topics is interleaving. So, you know, interleaving is something, of course, you do between memory devices like DIMMs. And now that we have memory devices on CXL, you might think, well, we probably want to interleave there. Absolutely, especially for performance. I think interleaving is critical.
That's achieved through these registers that are part of the CXL spec called HDM decoders. HDM stands for host-managed device memory, and these decoders allow interleaving across devices. Now, that's something that I haven't seen on PCIe before. Until now, there's been no notion of a bunch of different PCIe devices presenting a resource that's interleaved across them. So this is really a new concept, and it's an important concept for memory. It's especially important for persistent memory. You know, for volatile memory, if you change how you interleave across devices every time you, say, boot the system, you might get different performance characteristics, but otherwise it's still going to work correctly.
But for persistent memory, if you change the way you interleave, you lose your data. So we had to look into how we configure the interleave for persistent memory in a way that is itself persistent, so you get the same interleave every time. And we did that by adding this idea called the label storage area. It's defined in the 2.0 spec, and it provides a definition of the interleave sets, which in the spec we call regions, and of namespaces, which are little subdivisions of those regions into devices. Just to give you an example, here's a picture from the spec to show you how it works. Here, if you had these eight devices, which I show as the circles, and they're persistent memory devices, you can see we've got them interleaved so that the first block you access is on device zero, the next one is on device one, and so on. If you get past block seven, then block eight comes back to device zero, and so on. So that interleaving needs to happen the same way every time the system boots. And so each device has this little label in the label storage area that records how many devices are in the interleave set that it's a member of and what its position in the interleave set is. With this information, the OS or the BIOS, whichever one sets up the interleave set, is able to reconstruct it the same way every single time.
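Here is the arithmetic from that picture written out as code, a deliberately simplified model where the region is chopped into fixed-size chunks at the interleave granularity and dealt out round-robin across the members of the set. Real HDM decoders select target devices from physical address bits and support various ways and granularities, so this shows the concept, not the hardware algorithm.

```c
#include <stdint.h>

/* Simplified model of an N-way interleave set: the region is cut into
 * chunks of 'granularity' bytes and dealt out round-robin across the
 * member devices, like the 8-device picture in the talk. */
struct chunk_location {
    unsigned device;        /* position in the interleave set (0..ways-1) */
    uint64_t device_offset; /* byte offset within that device             */
};

struct chunk_location locate(uint64_t region_offset,
                             unsigned ways, uint64_t granularity)
{
    uint64_t chunk  = region_offset / granularity;  /* "block" number   */
    uint64_t within = region_offset % granularity;  /* offset in chunk  */

    struct chunk_location loc = {
        .device = (unsigned)(chunk % ways),
        .device_offset = (chunk / ways) * granularity + within,
    };
    return loc;
}

/* With ways == 8: block 0 lands on device 0, block 1 on device 1, ...
 * block 7 on device 7, and block 8 wraps back around to device 0 --
 * which is why every device's label records the set size and its own
 * position, so the same layout can be rebuilt on every boot. */
```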
So that gives you an example of the kind of stuff that's in the label storage area. And more than that, it also helps determine if something's missing or misconfigured. So that's one example of a problem that got hard and that we had to solve in CXL. Another was hot plug. CXL is a hot-plug bus: the PCIe electricals are hot-pluggable, so CXL is hot-pluggable. And we wanted to make sure that an OS could handle mapping these new memory devices when they get plugged in without some really complex flow of calling back into the BIOS and the BIOS giving the OS some information, back and forth. We wanted to make it as simple as possible.
And so the way we did it is we created these windows, which are called the CFMWS windows; you can see the acronym right there. These windows are created by the preboot environment, by the BIOS, and they give the OS these pre-programmed interleaving areas in main memory. Then, when the OS sees a new device, it can just map the memory somewhere into these windows, and it never has to talk to the BIOS; it can handle a hot add completely on its own. This actually turned out to simplify things a lot in our design. And it worked so well that we just decided to use it for all PMEM. All PMEM now is just configured this way: the BIOS typically ignores all the persistent memory and just leaves the interleaving of it up to the OS, with one exception being if you need to boot from it, in which case the UEFI driver will configure that device. So it greatly simplified a lot of things and gave the OS a lot of power over how to set things up.
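To show the shape of the idea, here is a purely hypothetical sketch of "place a newly hot-added device into one of the pre-programmed windows." The structures and the helper are invented for illustration and are not the actual CFMWS layout or any OS's real code; the real structure carries the window's base address, size, interleave parameters, and restriction flags that say what kinds of memory it may hold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented-for-illustration model of a fixed memory window the BIOS
 * handed to the OS: a host physical address range plus a restriction
 * saying what may be placed there. */
struct fixed_window {
    uint64_t base;        /* host physical address of the window   */
    uint64_t size;        /* total size of the window              */
    uint64_t used;        /* how much the OS has handed out so far */
    bool     allows_pmem; /* restriction: persistent memory OK?    */
};

/* When a device is hot-added, the OS just carves its capacity out of a
 * suitable window -- no call back into the BIOS required.  Returns the
 * assigned host physical address, or 0 if nothing fits. */
uint64_t place_device(struct fixed_window *windows, int nwindows,
                      uint64_t capacity, bool is_pmem)
{
    for (int i = 0; i < nwindows; i++) {
        struct fixed_window *w = &windows[i];
        if (is_pmem && !w->allows_pmem)
            continue;
        if (w->size - w->used < capacity)
            continue;
        uint64_t hpa = w->base + w->used;
        w->used += capacity;
        return hpa;
    }
    return 0; /* no window can hold this device */
}
```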
Another kind of interesting aspect of CXL that I just wanted to touch on, because it applies to all these features we're talking about, is this thing called global persistent flush. Global persistent flush, as you can see from the complex flow here from Chet's document, is a pretty complicated feature, but it's analogous to what we used to call ADR or eADR with persistent memory on the DDR bus, where things are flushed automatically if the system loses power or it crashes.
And, you know, like a lot of these mechanisms, it's simple when it works, but it's complex when it fails. And when it fails, we had to define what happens, and that's something we learned a lot about while we were doing the initial NVDIMM support. So we took a lot of those learnings and brought them forward into CXL. We were able to leverage a lot of this stuff, so much so that the way this works, even when it fails, looks the same to applications, whether they're on DDR bus-based persistent memory or on CXL-based persistent memory. The applications are the same binaries. They don't need rebuilding. They don't need changing or anything like that. It's very important.
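One way to see what "same binary" means in practice: the user-space persistence sequence is just store, cache-line write-back, and fence, and nothing in it knows or cares whether the persistent memory behind the mapping is DDR-attached or CXL-attached. Here is a minimal sketch using x86 intrinsics (compile with CLWB support, for example -mclwb); libpmem's pmem_persist does essentially this under the hood and picks the right instructions at runtime.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Flush a range of persistent memory toward the persistence domain.
 * The exact same code runs whether the mapping is backed by
 * DDR-attached or CXL-attached persistent memory -- what differs
 * underneath is the platform's flush machinery, not the application. */
static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);   /* write back each dirty cache line */

    _mm_sfence();              /* order the write-backs before continuing */
}
```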
So let's talk a little bit about CXL software enabling, then. With everything we've been talking about, there is enabling going on in all the major OSes, but because I don't work at the companies behind most of those OSes, I can really only tell you what's going on in Linux. Linux actually already has a type 3 CXL driver upstream. It's still getting features added to it, but you can see that it's already made some great progress. The maintainer of the NVDIMM framework, Dan Williams, is also the maintainer of this driver and has really done a great job getting this moving and making sure that the driver is going to be available when the ecosystem needs it. In order to test things before there are devices available,
Starting point is 00:27:30 we have Ben in our Linux team who produced patches to QEMU to emulate a Type 3 device, and that's how they tested the driver code ahead of time. So all that's public and available to people to play with right now. And I have to say, CXL has had so much energy and excitement behind it that there's an immense amount of cross-company and cross-committee collaboration on these specs. ACPI guys collaborating with the CXL guys and so on. And so I'm really highly confident that we've avoided a lot of collisions where a bunch of features come together and don't work together.
Starting point is 00:28:12 We talk about a lot of different features like persistent memory and memory pooling and things like this. We talk about these things constantly in the weekly CXL meetings and they're heavily attended by a large number of companies. So that kind of excitement is really fun,
Starting point is 00:28:32 makes the whole thing actually quite fun to work on, I have to say. So in summary, I just want to point out the persistent memory programming model that I showed at the beginning remains the same. If you've written an application to use persistent memory, when persistent memory shows up on CXL, that same binary will run, it will work, it does not need changes. That was a very important key aspect of the programming model. We don't want to force people to sort of chase a changing spec all the time, right? So we add new features and new interesting things,
but we don't break the old stuff; it still works. CXL, when it arrives, is going to offer moving persistent memory off the memory bus so it doesn't impact the performance of other memory devices. It offers scalability for all types of memory, persistent and otherwise, and it offers the flexibility of things like memory pooling.
Starting point is 00:29:26 So that's another reason why we're so excited about what CXL has to offer. All of this stuff was published last November, and the OS enabling, as I said, is coming along. It's emerging. So that's my talk. Don't forget to rate the session. I'm going to stop sharing here just for a moment and kind of wind things up. If you have any questions during the session, you know, be sure to ask them.
But also, there are a lot of us on the CXL committee, myself included, that are happy to field questions about the CXL Type 3 memory device support anytime. We think it actually has a very exciting future. So thanks for attending. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
