Storage Developer Conference - #191: A Persistent CXL Memory Module with DRAM Performance

Episode Date: May 15, 2023

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 191. I'm Bill Gervasi, the Principal Systems Architect for Nantero. And I'm going to go, if you're not familiar with Nantero, I'll go a little bit into the technology. But really, the purpose of this talk was to explore CXL in the context of persistent memory.
Starting point is 00:01:01 And what are the things that putting persistent memory on that channel can do for us? So what I want to do is just start with a look at CXL as that growing trend. The way I think of it, it's like the Bluetooth of fabrics, because we had these big fabric wars going on. We had the CCIX guys, we had Gen-Z, we had a ton of stuff going on. And it's just amazing to me how quickly the industry consolidated around this interface when it came out. However, despite all the marketing hype, I can recognize, I know some of the guys in this room are engineers, maybe all of you are engineers, and you recognize that nothing is as easy as it looks from a
Starting point is 00:01:51 marketing brochure. And so I wanted to focus this talk on what are those things that people are worried about? What are the fears that people have? And what are the things that may slow adoption if we don't address them. Because then I want to brag about Nantero NRIN, which is a persistent memory technology, which is going to help solve some of those concerns about the adoption of CXL for systems users. And so this is an exploration. How different is persistent memory? What does persistent memory bring to the party? And how does it solve the problems? And then where are we going in the future as this technology evolves? So CXL is a rising technology.
Starting point is 00:02:38 Like I said, like the Bluetooth of fabrics, it established this baseline by which you could take lots and lots of processing units, lots and lots of storage, lots and lots of memory, and bolt them together in ways that are configurable by the end user, which is the most important thing. You want to be able to have a platform which can be reconfigured in many ways just because no two users' requirements are exactly the same. But we still have work ahead of us if we want to go and explore this potential. So I made a few assumptions about the architectures, and this is partially based on instinct and partially based on speaking to customers. The first of which is, if you look at this fabric, the first thing is that CXL expansion memory is not going to replace the direct-attached DIMM.
Starting point is 00:03:36 Those memory modules are still going to offer the lowest latency, highest performance for bulk memory storage, and they're not going to go away even if you have lots of memory on expansion buses. Similarly, you're not going to see the SSD and the NVMe drives go away. This is going to be something that is going to continue to be in the server platform itself in addition to the large bulk storage remotely. And that is going to be predicated by the fact that we're 50 years into an industry
Starting point is 00:04:16 where non-volatile main memory has forced you to do checkpointing. And as long as that checkpointing requirement is there, we're still going to need that local SSD to save data away. Finally, the CXL allows the concept of pooled memory. And pooled memory
Starting point is 00:04:37 sounds pretty cool from the marketing brochure, but there are a lot of implications of having pooled memory resource, including the fact that now you're going to have a large number of processors, each of which has many, many, many cores accessing that pooled memory. And so with a DIMM and talking to a CPU directly, the CPU can make a lot of assumptions about patterns. They can do reordering and so forth. When it goes out over this fabric, you've lost that.
Starting point is 00:05:15 You're no longer going to have a processor that can do things like order a bunch of reads and writes in the same page because they know what's up for that memory. Similarly, when they need to schedule a refresh, the CPU knows how to do that. CXL loses all concept of scheduling around refresh. So these are the assumptions that I went into this discussion with. Now, one of the things that is also obvious is that we're kind of at a crossroads with memory expansion. And the thing that affected memory expansion is, and I'm sure any of you with a hair color similar to mine, you remember that we started out with DDR1 had allowed four DIMMs per channel. Why did it allow four DIMMs per channel? It's because it was only going up to 400 megahertz. And even back then, we knew how to terminate buses
Starting point is 00:06:14 so that we could have these big flyby buses with four sockets on there, two ranks per channel, all those electrical disruptions at that speed, we could ignore. And we had a 256 megabit memory back then. So that gave us about four gigabytes per server back then. Then we went to DDR2. Well, now DDR2 doubled that speed, went up to 800 megahertz, 800 megatransfers per second, but we'll ignore that subtlety difference. So at that speed went up to 800 megahertz. 800 mega transfers per second, but we'll ignore that subtlety difference. So at that speed you could no longer do four DIMMs per channel. We lost a socket in each channel. However, that generation of controllers was able to go to two memory channels. So that gave us six sockets over the four of the previous generation
Starting point is 00:07:06 at the higher speed. Plus, we were still on that really nice ramp of about a doubling of memory density every 18 months back then. And it timed pretty well that we could step on up to the gigabit chip in that generation and get a whopping increase of 6x capacity from DDR1 to DDR2.
Starting point is 00:07:31 DDR3 came along. We lost another socket. Because performance now was going up to, what, 3200 megabits per second? No, DDR3, 1600 megabits per second. And it was just hard to close those budgets. So we lost another socket, but it was getting easier to make BGA CPUs. So it wasn't as bad to have three channels on that CPU. So it looks like things are going in the right direction.
Starting point is 00:08:02 Unfortunately, things started to slow down in terms of the ramping up of the memory density. We still got a 4x improvement up to the 4 gigabit level, but all that put together, that 96 gigabytes was about a 4x increase over the previous generation. DDR4, somehow we magically kept two DIMMs per channel, so that was cool. CPUs went to four channels, and the 16 gigabit chip came out. So we basically
Starting point is 00:08:35 were able to get this nice 5x improvement. So this roadmap was looking really, really good until DDR5. DDR5 is the weak link. The problem there is to hit 6400 megabits per second per pin, we no longer could support two DIMMs per channel. And adding 300 pins per channel made it essentially put a limit
Starting point is 00:09:06 on how many channels you could put on the CPU. Generally speaking, a lot of them are at 8 channels. Some are looking at going to 12 channels in order to try to get some improvement in capacity. And at the same time, the DRAM hit an asymptote.
Starting point is 00:09:23 The DRAM only went to the state at 16 gigabit with probably 32 gigabit being mainstream by the end of life for DDR5. So now instead of seeing 6X, 4X, 5X improvement, we're seeing the same memory capacity or maybe a doubling of memory capacity. And we're throwing 2400 pins on
Starting point is 00:09:49 the cpu at memory interfaces this is why things had to change so the slowdown in capacity improvements for end users at the same time that new applications were coming in, like artificial intelligence, machine learning, data mining, all of these things wanted more capacity, and they were being throttled by DDR5. So this was going to be bad news, and the emergence of CXL as a solution
Starting point is 00:10:21 could not have come at a better time because this allowed for still having that baseline of the DIMMs, but then for the memory starved applications, gave you headroom to expand that memory. So what's not to love? There's a nice simplified picture of what it looks like. You have a CPU, it has a CXL interface. You go out over this high-speed PCIe bus, and now you have a CXL controller
Starting point is 00:10:51 and a bunch of DRAMs behind it. Sounds great, right? Now it's down to 40 or 50 pins instead of 300 pins. A lot of goodness to this, but it has some problems. Number one, that conversion from a parallel bus to serial and back to parallel eats into your time. You cannot escape the fact that it's going to add latency to every access that you make. You're going to have to do much more sophisticated levels of pipelining to overcome that. Let me set this up so I can see what time it is, by the way.
Starting point is 00:11:34 Okay, next thing. Raw performance. Obviously DDR4 is still a very attractive option, so people are looking at putting DDR4 behind that CXL controller. Well, that has its own problems as well. For example, it limits your bandwidth. If you want to keep up with the bandwidth of a PCIe 5 interface, you need to be running that DDR4 at at its maximum speed, 3200 megabits per second, and a full 72 bits wide to get data and ECC. So that is 300 pins. Now, how many pins do you now want to start throwing at your CXL controller? So you start getting limited in terms of performance by that, and you can start hitting bottlenecks pretty quickly next one is power you don't do a CERTIs for free the CERTIs is going to
Starting point is 00:12:33 basically be your coffee warmer you're going to want to rest your coffee mug on your CXL controllers because these guys are going to be burning in the order of six watts in order to satisfy that channel. So you have to take the power consumption into account when you do your overall total cost of ownership analysis. Next one is, okay, it's a DRAM. Bad enough that your main memory DRAMs are volatile. But if you're shutting off your CPU,
Starting point is 00:13:09 the fact that the memory goes away is not the end of the world. But now we're in a different world. Now we're talking about fabric memory. And if fabric memory goes away, it may or may not be associated with CPUs being shut down. It may be shared by 20 different CPUs. And so the power management
Starting point is 00:13:30 and the volatility of DRAM starts to become a much bigger problem than it was at the individual server level. And then finally is capacity. We know that DIMMs are limited to 40 packages per DIMM.
Starting point is 00:13:48 But how much can you put into a CXL solution and offer up? So capacity is going to be a big concern. So that's kind of the range of stuff that I look at when I hear about customers being unsure as to how quick their CXL rollout is going to be. So now I get to brag about NREM. Nantero's NREM is a DDR5 replacement with non-volatility. It has the same speed as a DRAM. It actually delivers more data at the same clock rate than a DRAM, has non-volatility, scalable beyond a DRAM,
Starting point is 00:14:33 and lower power than a DRAM. So it's kind of a holy grail of memories that we've been looking for, and especially in the context of CXL, this set of features turns out to be really, really valuable. I recognize some of the people in the room, some of you have heard about what NRAM is, but I put a couple of slides in here just for people that have not heard my talks before. NRAM uses carbon nanotubes as switching elements. And the way this works is that it uses the Van der Waals barrier to do a switching mechanism that bonds adjacent carbon atoms.
Starting point is 00:15:13 So they stay bonded forever. There's no dielectric to get worn out or cracked. There's nothing in there. There's just bonding of carbon nanotubes. And then you use the opposite to break the connections, and essentially what you're doing is changing the overall resistance of the network, which we can detect as a memory cell. We take that and arrange those into cross point arrays, very similar to, rest in peace, the Optane.
Starting point is 00:15:48 And the Optane had the rows and columns and rows and columns, the word lines and bit lines alternating. We do the same thing. But instead of having a phase change memory cell between, it's a resistive element based on the carbon nanotube switching mechanism I just showed. So what you can do is apply a voltage on a word line, ground the appropriate bit line, and that cell gets the memory. When you want to read, you put a half voltage across, and it comes through the resistive array,
Starting point is 00:16:20 and you read that resistive value. So all of the stuff in orange, looks like orange on my screen, all that stuff is the cross point arrays. The stuff in blue is a standard DDR5 interface. So it follows the standard DDR5 protocol of rows and columns and banks and bank groups and all that other stuff, but it translates all that. It just thinks of them as a bunch of address bits, translates all that to the internal cross point addressing, delivers the data, drops right into a DDR5 solution, but provides a non-volatile solution in that array. So what does having a non-volatile core mean? Because now I want to
Starting point is 00:17:05 start connecting the dots to what the N-RAM is going to do and what the CXL limitations are. No refresh or self-refresh. That's where the added data comes from. The fact that
Starting point is 00:17:21 you no longer have to spend 11% of your time refreshing cells means you can have 11% more data movement at the same clock frequency. Pretty cool how that works out, huh? Zero power standby is a big ticket item. Maybe not quite as much for servers. For cell phones and automotive, this is going to be huge. But even for servers, there are times when, like that model where people rent out footprints, they may have servers that they want to put in standby, and so this zero power standby could be helpful. No pre-charge. And I'm going to go into some painful levels of detail on this, because it
Starting point is 00:18:03 turns out to be really, really interesting in solving the CXL problems. And tied to the fact that there's no pre-charge is there's no wasted data access. Where DRAM accesses 8 kilobits in order to read or write 64 bits, that's 99.2% inefficient. We're going to take that inefficiency away. And finally, from a security standpoint, DRAMs have what's called a Rohammer sensitivity from repetitive activations to rows. And the architecture of a crosspoint is not susceptible to Rohammer because of some of its other characteristics.
Starting point is 00:18:42 So it's a more secure solution as well. So let's talk about those problems. Addressing latency. You don't get away from latency. This additive hit is going to appear on every cycle. So we don't solve that problem. NRAM is not going to solve that problem, except that there's potential for a quicker access time that may compensate for some of that loss. But that loss is going to be there. It does mean CPUs are going to just have to get smarter
Starting point is 00:19:17 about scheduling operations and keeping the channel busy. Addressing performance. So mostly the media timing is going to determine your performance. Like I said, if you put DDR4 behind there, you're basically at the PCIe 5 speed. That 32 gigabytes per second completely aligns and you've saturated your channel. Then you might want some headroom. You might want to be doing parallel operations because CXL is full duplex. You can do reads and writes at the same time. So now you want two channels. Now you're up to 600 pins for those interfaces. And again, you're kind of bottlenecked in terms of how many operations you can be doing, what you can be doing with those interfaces. And again, you're kind of bottlenecked in terms of how many operations you can be doing, what you can be doing with those interfaces. So basically, the media is going to
Starting point is 00:20:09 define one big aspect of it. But so does refresh. Like I said before, the CPUs, when they are talking to a DIMM that's directly attached to the CPU, they know when the memory is not so busy. And they can schedule refreshes in the part of the memory that's not busy. You lose contact with that when you're on CXL. So this means that what's going to happen on CXL is that 11% of the time, a request is going to come from the CPU and over on the host side, it gets that request, checks the DRAM status.
Starting point is 00:20:49 89% of the time, everything's fine and it just returns and does the data operation. But that other 11% of the time, you're going to take that hit. The DRAM is going to be busy at the time you need it and that up to 350 nanoseconds of additive latency is going to be busy at the time you need it, and that up to 350 nanoseconds of additive latency is going to be pegged onto your access time for that memory. That asymmetry is going to get increasingly painful as we see the deployment of DRAM controllers on CXL. But I said, what if you can get rid of refresh? Well, first of all, you get rid of that hit on your access latency,
Starting point is 00:21:32 and you get 11% more data. That has an interesting impact. So this means that if I compare N-RAM on CXL, take a persistent memory on CXL and compare it to a DRAM attached DIMM, you start getting some interesting characteristics. If it were on a DIMM, you'd just get that 11% improvement. But now put it behind CXL
Starting point is 00:22:02 and you are going to take a performance hit. But look at that number. And that's the green in the bottom, in the middle there. 96% of DRAM DIM performance using a persistent memory on CXL. Pretty darn close. So we're getting there.
Starting point is 00:22:25 And then finally, the industry is in the process of developing CXL to DRAM controllers. And so when we go that route, we're back to that 11% advantage with a persistent memory versus a non-volatile memory versus a volatile memory in the same socket. What about power? Well, power is relative,
Starting point is 00:22:56 right? You have to compare it to something. So what I did was I took an LR DIMM. Now I recognize the LR DIMM is kind of obsolete in the industry. It's being in the process of being replaced, but still gives us an envelope to compare things. People forget that the LRDM is not just DRAM. It also has a register and data buffers. And that register and data buffers are already at two watts. And when we go up to DDR5-6400, they may even hit as much as three watts in addition to the DRAM power. So that is the metric that we're going to use. So the numbers I came up with for this is roughly 13 watts per DIMM. So now let's do some comparisons. First comparison is, what if you just put DRAM behind CXL controller. So now instead you have 6 watts instead of 2 watts
Starting point is 00:23:48 plus you have the DRAM power so instead of 13 watts you'd be at 17 watts for the same amount of memory. Over on CXL there are two most popular form factors. One of them is the E1 form factor.
Starting point is 00:24:07 Now, the E1 form factor is kind of interesting in that it only fits 20 chips, half as much as a DIMM. And you're going to throw six watts of CXL control on top of it. As a result, you take a nearly 50% hit in terms of a new metric that I'm inventing called gigabytes per second per watt. In other words, how much data do you get to process for every watt you spend?
Starting point is 00:24:36 And that's where you're at 56% efficiency with that form factor. I think the number of systems that adopt this may be impacted by that formula. However, the bigger E3 form factor is different. The big E3 form factor, you can pretty easily fit 40 chips in it, and now you're talking about at 17 watts versus 13 watts, so you're around 77% of the efficiency using that gigabytes per second per watt number I came up with. However, with careful layout, you can pack 80 chips into the E3 form factor, and now you're up to essentially the same amount of power for CXL versus a DIMM, because that's equivalent to two DIMMs worth of data.
Starting point is 00:25:31 Now, like I said, I wanted to get really nerdy and talk about non-destructive activation, because this is something that most of you guys have never had to deal with. You've been in the DRAM world where, as you know, when you activate a DRAM, that's grabbing a row of data out of the DRAM core and bringing it up to the sense amplifiers. That's the activation process. It's destructive. When you read it out to the sense amps, it drains the contents of the capacitors.
Starting point is 00:26:10 So you no longer have the data that you thought you had when you started out. Then you do, say, a read operation of 64 bits out of that 8 kilobits that you activated. But your core is now invalid. So that's where an operation called pre-charge comes in. Pre-charge says, oh, I better take the contents of the sense amplifiers, write it back into the core. So you've now done 16 kilobits of access, 8 kilobit read, 8 kilobit write, in order to read 64 bits out of that core. Now you're down to
Starting point is 00:26:46 99.5% wasted data access. Writes are pretty much the same. You activate, you write, modify the data in the sense amps. You need that pre-charge operation to send it back to the core. So the bottom line message You need that pre-charge operation to send it back to the core. So the bottom line message here is that pre-charge operation is needed
Starting point is 00:27:09 even if all you're doing is reading data. Now think about what a non-volatile memory does. An activation is just you select a row and leave it in the core. A read comes directly from the core. So now a read is only a 64-bit operation. And you're done because the data is still in the core. There is no need for a pre-charge to rewrite that data.
Starting point is 00:27:41 Pretty exciting, huh? And then, same thing on writes. You can write directly into the core. Writes are now a 64-bit operation and activates burn no power whatsoever. So with this idea of, oops, this idea of activating place, you eliminate the need for a pre-charge and you just deal with the 64 bits plus ECC that you cared about accessing in the first place. So trying to show you the kinds of evolutions that I went through in analyzing how this stuff was going to work. So what does that do for power?
Starting point is 00:28:19 Well, that's pretty cool that you're not going to be doing those precharges. Turns out precharge is 21% of DRAM power. So right off the top, the elimination of pre-charge buys you that 21%. And then I thought, well, but I'm not comparing apples to apples. If I'm going to take advantage of the fact that I'm not doing refresh, I'm going to be doing reads and writes during that time. So I better add that power back in. Turns out it didn't change the formula much at all. A couple of tenths of a percentage point, it still comes in at 21% power savings by eliminating pre-charge. And then do a little
Starting point is 00:28:56 fancy math because this is kind of a marketing pitch, right? So if you take that 11% power data improvement and your 21% power reduction, you multiply those out, you get a 34% better gigabyte per second per watt. So with NREM, things go a little differently. So now we talk about if you normalize to a couple of DIMMs attached to the CPU, a DDR5 CXL controller is 8% more power. But now with the power reduction by packing 80 chips into a CXL module, now we're coming in at lower power than equal amount of memory in a DIMM.
Starting point is 00:29:50 And it's non-volatile. So if you can eliminate the checkpointing, if you can eliminate the traffic to the SSD, if you can eliminate the SSD, now we're talking about another 100 watts per server that can be taken out of the equation. So, there are lots of ways to address volatility. We've been living in a volatile world for, like I said, 50 years, ever since the DRAM was introduced. And what do you do?
Starting point is 00:30:23 Checkpointing means periodically you save the state of your machine off to the SSD. The CXL world with DRAM is going to do the same thing. You're periodically going to have to do CXL copying out to the SSD. Now, the good news is the CXL3 protocol enhances the use of peer-to-peer, so the CPU doesn't have to get in the loop. So that's good. They are addressing the volatility challenge in some ways. Some companies, like the gentleman in the pink shirt in the back corner there, take a supercapacitor and attach a supercapacitor to their memories, something called an NBDM.
Starting point is 00:31:03 And this is a great solution as well as long as you can fit those super capacitors in your system. Or you use non-volatile memory. This is where a non-volatile CXL controller in a module that can be user-friendly, plugged into a system chassis, and provide that non-volatile data service is a pretty powerful addition. I'll talk a little bit more about the N-RAM. The initial products, 16 gigabit and 32 gigabit, SD-RAM compatible. And for any of you that are involved with JEDEC, you may know that I'm writing a JEDEC specification called the DDR5 NVRAM standard
Starting point is 00:31:48 that says it's like a DDR5 SDRAM, and it has these additional features because it's non-volatile. So the elimination of refresh and all that stuff is in that spec. So now, do a little bit of math. Here's where we see the market initially being. As you know, CXL3, I call it CXL Service Pack 3.
Starting point is 00:32:12 Service Pack 3 says that PCIe is going to go to 6.0, doubling the speed. Reality is, I don't think any systems guys are going to be ready for that for a few years. So I envision the initial rollout as being a PCIe 5-based solution at 32 gigabytes per second for duplex. And then probably four channels of NRAMs behind that. The 25.6 gigabytes per second pretty much lines up with your 32 gigabytes per second. The fact that you have four sub-channels allows you to ping-pong between them, and as a power management thing, put the rest of them in standby. Can you fit 80 chips? Turns out, yes, you can. I've done the layout. So what does that look like? Well, this is our massive memory expansion option. So we're going to be looking at supplying anywhere from 64 gigabytes to 8 terabytes per E3
Starting point is 00:33:15 module with this non-volatile DDR5 replacement device. The E1 form factor, you just basically strip away all those channels that you can't fit. This is still going to be your commodity memory expansion. If people have to expand memory, they're willing to take that additional power hit, well this is going to be a solution for them as a way to expand their system environment. And then a little bit about the DAX solution. The DAX solution is really essential for making this transition from a volatile world to a non-volatile world.
Starting point is 00:34:01 The DAX environment, it's a roadmap, right? It allows you to have your traditional applications that are all file system based banging your SSDs but then you can take that and migrate from those mounted drives from SSDs
Starting point is 00:34:20 to mounted RAM drives minor software change or no software change at all, and then eventually rewrite software or new software can be developed that can go to direct access mode. So this roadmap is really important for seeing the expansion of CXL expansion memory. And then there's a snapshot of the specification that I'm talking about. But there's one more thing
Starting point is 00:34:48 that needs to happen. And that is, ideally, the world that we'd love to live in is data persistence end-to-end. That power can fail at any time and you don't lose any data.
Starting point is 00:35:03 And how are we going to get there? And that's one of the final notes that I wanted to make was that having persistent memory moving into CXL, moving it into the DIMM channel eventually, and maybe even someday right onto the CPU in the form of HBM and RAM, that you can have an environment where power can fail at any time and you can eliminate this checkpointing. Just let the data stay in place because it's all persistent. This is going to lead us to that holy grail that I think is instant on. And it's going to take that we go non- volatile memory across the entire spectrum of the computer system design
Starting point is 00:35:51 to get there. So, recap what we just looked at. The CXL for memory expansion is great, has a lot of momentum, but it has some warts. And persistent memory is a technology that addresses a lot of the limitations of that CXL and will help people feel comfortable adopting this technology. Showed you a little bit of a hint about DDR5 NRAM from my company. A little bit about how the DAX model is going to be the roadmap for this adoption. But eventually, the fact is we want to be at instant on the holy grail of system architectures. With that, I can close it out and take some questions. I knew that the guy in the front row would be the first one to ask a question. When can I get our sample?
Starting point is 00:36:49 Well, let's exchange cards. It's possible you could have it today, but it would be a much smaller device than what I just described. But yeah, we'll exchange cards. But the answer is for the 16 and 32 gigabit, that is aligned to roll out with the early introduction of CXL in data centers. Any other questions?
Starting point is 00:37:17 Back there. So for DAX mode, would you recommend the application developers or would this be something that you'd be able to take with? Yeah, that CXL, I mean, that DAX migration is, every customer is going to have a different problem to solve. So the idea of moving from SSD to mounted RAM drive is where I think a lot of customers are going to go initially
Starting point is 00:37:45 because it's the kind of thing that can be done in a makefile or a shell script where they point to drive F instead of drive E. And done. The software just runs faster. Are they getting the best performance? Not even close. So I know Intel has been very aggressive about
Starting point is 00:38:05 talking with developers about changing from F open, F close, F read, F write, to a memory match file. And then eventually to direct access.
Starting point is 00:38:21 And frankly, the number of software companies that will be willing to do those rewrites it's going to be relatively small it's probably going to be a more like they will rewrite modules where they're spending 90% of their time so I see that happening but the holy grail of course of being applications written for direct access to persistent memory, and that's basically going to be the new generation of software. Now the good news is artificial intelligence is kind of at
Starting point is 00:38:56 that point on the curve. Today the relative volumes of, say, CPUs or GPUs for artificial intelligence is mice nuts compared to CPUs, right? It's measured in the hundreds of thousands or low millions of units. And as that ramps up, the availability of persistent main memory can be more real. People will feel more comfortable with it. And I think that's where we will see that a lot of new applications like AI and ML will take advantage of persistent main memory. It's going to be on a case-by-case basis.
Starting point is 00:39:35 Sir? Can you talk about the temperature and the endurance of the system? The endurance of what? Upgrading temperature and endurance. Sure. The endurance of what? Sure. Of the NRAM, I assume you mean, right? The NRAM was tested on the space shuttle.
Starting point is 00:39:59 So it's in space now. It's circling Jupiter right now. It operates from... We tested it from minus 55 to plus 300. Other customers have tested it at four degrees Kelvin and at 600 degrees centigrade. The temperature range is ridiculously wide. The endurance is a work in progress based on how many chips, how many months of cycles we've been able to run. So that's a growing story. We're pretty comfortable with our endurance numbers so far. Is there a CDA to work with for object-based part-time?
Starting point is 00:40:39 Do you have a bar to see today? Nope. It's treated like a DRAM that just happens to hold its data. Do reads and writes, and then the data is just there. Yes, sir. I'd like to get back to the problem you had
Starting point is 00:40:55 with the EMDR issue, and the attachment, and the collection, and the end rate. We think we all know the grid is not quite, and it's just, oh, maybeency is still going to be a challenge. Yeah. And that's, but if you think about it,
Starting point is 00:41:28 putting the guards on the coherency and the flushing and checking for completion... By the way, the time to completion for NRAM, 60 nanoseconds. It's not very long. So, 60 nanoseconds. It's not very long. So, 60 nanoseconds. So, once it's gone onto the chip, everything's good. But you're right, there's going to be write buffers,
Starting point is 00:41:56 there's going to be data in flight on the channel, there's a whole bunch of other stuff. CPUs are going to have to be sensitive to that. But compare that to checkpointing to an SSD. Yeah, not even in the same ballpark. I'm fairly comfortable with that. We're 12 minutes after. We're still doing great for time, right?
Starting point is 00:42:16 Next talk is eight more minutes? Okay, there's some. Thanks for the talk. Do you need to do less garbage? Not yet. Not that we've seen. Just read it. Kind of like an NVDIMM.
Starting point is 00:42:36 The NVDIMM, when power fails, it disconnects itself from the bus, uses that supercapacitor to copy the contents out to flash. It's kind of the same thing, but just imagine that that is 60 nanoseconds instead of two minutes to do the save and restore. Yeah. You mean like where the benchmarks came from? Yeah. Yeah. Yeah, I'd like to say it was a very, very sophisticated model.
Starting point is 00:43:34 It was not. It was an Excel spreadsheet, and it's based on a handful of assumptions. But I have reviewed that handful of assumptions with at least five major cloud providers, and they're comfortable that we made the right set of assumptions but it's not a real sophisticated model it's Excel so that we can quickly plug in new stuff and see what happens No, not yet. Any other questions? No, 60 nanoseconds is the right recovery time for ensuring that once the data has entered the N-RAM
Starting point is 00:44:27 that it's committed to the carbon nanotube cells. 60 nanoseconds is startlingly good. Yeah, and power could never fail in less than 16 nanoseconds, so it's intrinsically insensitive to power failure. Is it an issue for the CXL latency? This is only once it reaches the pins of the NREM. The CXL latency, for example, one of the things that's under consideration is whether you want right buffers on the CXL controller. And why would you want right buffers on it? It's not even clear that it really helps because, as you can see, we're balancing the front end versus the back end. But if we end up going that route, then you would have to have an energy source
Starting point is 00:45:28 long enough to flush what's going on in the CXL controller, similar to what the Optane modules do. I think we have time for a couple more. Any other questions out there? Yeah. Yeah. more. Any other questions out there? Yeah. Yeah. Yeah, of course yes so do you see a benefit of flash
Starting point is 00:46:07 in terms of if you might do a few years dude you should have been at my flash memory summit talk before covid because that was actually the exact topic that I covered was an entire
Starting point is 00:46:23 segment on SSD design. It's such a cool topic. I'm glad you asked it because if you think about taking and putting a persistent memory onto a SSD controller, today they currently use DRAM, which means that they have to have a bunch of energy source on every SSD that in some cases consumes up to 25% of the board space of an SSD because supercapacitors aren't good enough. They all go to tantalum capacitors and you can't get 10 farad tantalum capacitors that are in an 06-02 form factor. So what are you going to do? The elimination of that
Starting point is 00:47:10 need to save the content of your DRAM cache on an SSD is huge. Okay, so you get 25% additional board space back where you can now fit another flash chip. So now you've also increased the capacity of your module by maybe 20%. But one other thing that's more subtle and it's not obvious until you think about it is the general guideline for SSD design is you want a gigabyte of cache for every terabyte of flash. Why? Well that's kind of where the sweet spot is of performance balancing using that cache. But what if your analysis said, oh, but I get an extra 20% performance if I double that cache to 2 gigabytes for every terabyte. Well, now what you've just done is you've doubled your need for capacitors
Starting point is 00:48:11 to keep the system alive while it flushes that much cache. And that's the less obvious win, is that now you've decoupled your cache size from your storage size. And it now just becomes a design trade-off that you make based on your performance analysis instead of how many 10-ohm capacitors you can fit on the module. And with that,
Starting point is 00:48:41 I show about 60 more seconds. We have one quick question. Or we can wave at each other and say goodbye. Can you compare the endurance between a D-RAM and, let's say, a 50? Oh, it's much better than a flash device. Yeah, but we're not currently to D-RAM levels, but partly that's because we just haven't had enough test chips and test systems to run enough cycles.
Starting point is 00:49:11 When we get the mass production devices, we have to restart anyway, and our endurance testing will begin that new generation of parts. Okay, well, thank you very much. Thanks for listening. If you have questions about the material presented in this podcast, Okay, well thank you very much. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
