Storage Developer Conference - #91: Memory Class Storage and its Impact
Episode Date: April 8, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 91. Well, I'm going to introduce
this term, and I'm going to try to back it up with some logic as to why the industry needs yet another three-letter acronym.
What I'm going to do is I'm going to talk about the technology that's developed by my company, Nantero, and it's carbon nanotube-based memory technology. What I'm going to show you are some basics about how carbon nanotubes work
and how you can apply this function of carbon nanotubes to memory technology.
Now, this is not the first in this series of talks that I've been giving all year.
I've been speaking at Persistent Memory Summit, Flash Memory Summit,
Hot Chips, and so forth.
So the first part of this is actually going to be a little bit on the sketchy side.
It's not the level of detail that I've disclosed at the previous conferences.
So by all means, grab my card if you would like to follow up on more levels of detail about the CNT technology itself.
I'm just going to sketch that.
What I wanted to do for this particular conference, and what I negotiated with the people coordinating the conference, is to now take that idea of how
the carbon nanotube memory is going to work and start applying it to real applications
that apply directly to the topic of the Storage Developer Conference. So let's talk a little bit about the carbon nanotube
basics. So carbon nanotubes are these cool little magical things. They're strands of
carbon in a circular form and they have some really great properties. They're
resistant to heat. They're great conductors of heat. Heat doesn't bother them.
They're very strong.
And they have nice, predictable resistive properties.
So what you're going to be seeing here is I'm essentially going to walk you through
how we're going to turn this into a version of a resistive ram.
So what happens is with these carbon nanotubes,
the effect that we're going to take advantage of
is something that you may remember from your physics class.
It's the Van der Waals barrier.
The Van der Waals barrier is that energy barrier
that keeps atoms apart when the atoms are apart.
And then once you cross the Van der Waals barrier,
it's what keeps atoms together once they're together.
That distance for carbon atoms happens to be 1.7 angstroms.
And so the idea is that you generate energy to cross over the barrier,
and then it requires absolutely no energy whatsoever
to maintain that connection,
and in fact requires an injected energy
to break it back over the barrier to separate them.
As a result, what we end up with is a connection
that is going to last literally for thousands of years.
Our data retention analysis shows that at normal temperatures,
we're looking at 12,000 years of data retention from a media using carbon nanotubes for that connectivity.
Now, granted, this is also a weakness because our studies also show that that data retention
reduces to 300 years at 300 degrees C. So if you have systems that require
higher than 300 degrees C operation, let me know and I'll laugh at you,
because the screws are going to fall apart before the carbon nanotubes do.
We've also seen no wear-out mechanism, and it's for the same reason.
One of the things that we do is have lots and lots of these tubes switching,
and once these things are switched, they stay switched.
The endurance of this has turned out to be really great too,
because these things are going to be connecting and disconnecting
in a void. There's no substrate.
So it's not like filament growing or anything like that that's going to be done in a substrate.
This is done just out in a void.
And so these connections can be made and broken an infinite number of times.
So far, I have to admit, we've only built enough test devices and run enough cycles
to talk about numbers like 10 to the 12th, 10 to the 13th for now. But that's just because we haven't had time to run the additional cycles
to prove it all the way out to the unlimited number, which people generally consider to be
10 to the 15th. So we've talked about how the cells work, the individual carbon nanotubes.
Let's now build
something with this. And so what we're going to do is we're going to take a bunch of these
carbon nanotubes and put them into a memory cell. So it's not a single connection of one nanotube.
It's literally going to be hundreds to thousands of carbon nanotubes that are mechanically wiggling
around inside the cell, making contact and breaking contact. So the idea is that when
you have a random array of these guys, you end up with this stochastic connection of a random
number of tubes. And as long as we get a minimum of 100 tubes switching from connected to disconnected,
we can detect that resistance change
and do it in a way that's one setting
for an entire boatload of wafers.
It's not something that has to be tuned
down at the individual level
because what we're getting out of this
is as we do sets and resets on these cells,
we're getting between 10x and 20x resistance change.
And that's plenty for us to be able to sense in a reliable way between the set and the reset states.
Some of the other random data I threw together on this slide here covers things like,
what's the performance of this? Well, it's only a few angstroms of movement. The fundamental cell
performance is five nanoseconds, and it's a balanced read and write at five nanoseconds.
So it's a persistent memory cell, runs at five nanoseconds. What could you build with that?
And the other thing that I threw in here was the bit about the
temperature. You can see that this thing operates identically across temperature extremes. Now,
keep in mind that's just the cell. You do have to attach that to logic, but the logic
is going to be the weak link. It's not the memory cell itself. So what are we going to do with this? Well, lots of stuff. First thing that we can
do is just layer this stuff onto a piece of logic, which is absolutely process agnostic.
This can be done on a logic process. It could be done on a memory process. It doesn't even
have to be silicon. It could be silicon, germanium, or whatever else, underneath.
As long as you expose metal contacts on top of that logic,
you can connect these carbon nanotube arrays into that and do cool stuff with it.
So what are the things that we can do? Well, at 28 nanometers, we're fitting 4 gigabits of carbon nanotube cells in 100 square millimeters of die space.
You can do the math yourself pretty quickly to figure out what you could do with that.
But also, this can be taken three-dimensionally.
We can layer multiple layers of carbon nanotubes up on top of one another,
alternate word lines and bit lines,
and turn this into a 3D cross-point array.
So that's one way that we can expand this,
and that takes us with four layers of carbon nanotubes
from 4 gigabits to 16 gigabits per chip.
However, there are other things you can do,
especially like with the DDR4 protocol.
DDR4 has chip IDs, so you can stack multiple die together.
And so we support that as well,
and you'll see that that allows us
to go to 128 gigabits per device. But there's other things that we can do in the future.
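The layer and stacking arithmetic above can be sketched as a quick calculation. The per-layer density, layer count, and stack height are the figures quoted in the talk:

```python
# Density arithmetic quoted in the talk: 4 Gbit per CNT layer at 28 nm,
# four layers per die, and an 8-die stack using DDR4 chip IDs.
GBIT_PER_LAYER = 4
LAYERS_PER_DIE = 4
DIES_PER_STACK = 8          # DDR4 chip-ID (3DS-style) stacking

gbit_per_die = GBIT_PER_LAYER * LAYERS_PER_DIE      # 16 Gbit per chip
gbit_per_device = gbit_per_die * DIES_PER_STACK     # 128 Gbit per device

print(gbit_per_die, gbit_per_device)  # 16 128
```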
I already told you that this is on a 28 nanometer process. At 28 nanometers, we have some 2,500
carbon nanotubes switching per bit.
But I said all we need is 100.
So if you do the testing or the modeling down to the finer geometries,
where does 100 nanotubes land?
Turns out that's a one nanometer process.
This is very, very scalable, well into the future.
And we could probably get below one nanometer by dialing in the size of the carbon nanotubes to match the process geometries. So we're very,
very comfortable that in the DDR5 timeframe, seven nanometer process
will be commonplace. And with that, we'll be at
512 gigabits per die
or a terabit in a stack of
memories.
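The scaling claim can be checked with a rough calculation. The key assumption (not spelled out in the talk) is that the number of switching tubes per cell scales linearly with the process geometry; that is what makes the quoted numbers line up:

```python
# Rough check of the scaling claim: ~2,500 tubes per bit at 28 nm, with a
# floor of 100 switching tubes needed for reliable sensing. Assumes tube
# count per cell scales linearly with feature size (an assumption).
TUBES_AT_28NM = 2500
MIN_TUBES = 100

def tubes_per_bit(node_nm: float) -> float:
    return TUBES_AT_28NM * (node_nm / 28.0)

# Node at which only the minimum 100 tubes remain per cell:
limit_nm = 28.0 * MIN_TUBES / TUBES_AT_28NM
print(round(limit_nm, 2))  # 1.12 -> consistent with the "one nanometer" claim
```

At 7 nm this model still leaves well over 600 tubes per bit, which is why the DDR5-era claim has plenty of margin.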
This is
just using the carbon nanotubes as a single
bit. A piece of data that I didn't
really show you was that we also
have tested this thing for multi-level
cell functionality. Turns out
that there's a nice linear response
of the injected voltage versus the resistance of the cell.
Basically, the higher the voltage,
the more carbon nanotubes spring up.
And so we can adjust that voltage
to get multi-level cell operation as well.
So we clearly could be in the multiple terabits per die range
in the very near future.
We're pretty excited about that.
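The multi-level cell math is simple to sketch. The 512 gigabit single-bit figure is the one quoted for a 7 nm DDR5-era die; the number of resistance levels below is illustrative, not a product spec:

```python
import math

# Sketch: how multi-level cell (MLC) operation multiplies per-die capacity.
# SLC_GBIT is the single-bit-per-cell figure quoted in the talk; the level
# counts are illustrative assumptions.
SLC_GBIT = 512

def mlc_gbit(levels: int) -> float:
    bits_per_cell = math.log2(levels)
    return SLC_GBIT * bits_per_cell

print(mlc_gbit(4))   # 4 resistance levels -> 2 bits/cell -> 1024 Gbit (1 Tbit)
```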
But again, what can you do with this?
You can do lots of things.
Except change the screen.
I need that abracadabra word.
So one of the ways you can use this thing
is just build a simple transistor for every resistor.
And it's pretty obvious how you would build that.
You just deposit the carbon nanotubes
right on top of the drain of your transistor,
and bang, you have a memory cell.
Very, very simple.
But it kind of consumes a lot of space
because now that means you need a transistor for every cell,
and the layout of your transistors becomes your gating item for how tight you can pack things.
So generally, we save this approach for when customers want megabits to low gigabits kind of stuff.
This is not going to give us the 16 gigabit device that we want to take to market.
That's going to require something more like this.
And this is the cross point that I was describing.
So what you're seeing here
is that you alternate vertical and horizontal lines
and put carbon nanotube structures
in between each of those lines.
And without a selector,
you're able to address any of those four resistors
simply by the assertion of the appropriate word lines and bit lines.
So it's pretty powerful.
This is what's allowing us to get that four gigabits per layer.
And the other interesting thing about this is if you think about it,
this is what determines your access time.
If you have a 5 nanosecond core,
that doesn't mean you have a 5 nanosecond memory.
Remember, you have to send energy down a word line.
You have to wait for it to settle.
You have to get the energy off of the bit line.
Then you have to put it into a sense amplifier. Then you have to put it into a FIFO. Then you
have to go through the IO drivers at the pins. And so that's what turns a five nanosecond core
into something like a 56 nanosecond cycle time for a DDR4 DRAM, or in this case, a DDR4 NRAM.
And that's exactly where I'm taking this line of reasoning.
Any questions?
Oh, there's more stuff.
Oh.
Okay, so now that you have this core,
what can you do with this core?
Well, we're just going to throw this array of carbon nanotubes
into the background,
and then what we can do is attach any PHY that we need to it
for the front end.
Again, it's a 5-nanosecond persistent core.
You can put a DDR4 or a DDR5 interface on that.
HBM. You can do LPDDR or GDDR. It doesn't matter what the PHY is on the front end. That's the only difference between these basic memory
technologies. So what I'm saying is that we're going to be able to offer multi-gigabit level
devices in all of these device categories. And that should get you guys a little bit more excited.
So that's how I justify the term memory class storage.
Storage class memory terminology has been around for a while.
I know SNIA has tried to change the industry terminology to persistent memory, and that's fine.
But that doesn't really capture any difference.
That's just a semantic change, at least to me.
And so the problem that I had with storage class memory is that those are really those devices
that sit in that wasteland between DRAM and Flash.
And sure enough, you know, there's, what,
a 10,000x difference in performance between DRAM and flash.
So that's fine that there's a wasteland, but I sure as hell don't want my parts to be caught in that trap of being in the wasteland.
Because the devices in the wasteland have a number of problems.
They don't meet DDR timings.
Therefore, they can't plug into the socket
that a DDR chip plugs into.
They don't have unlimited write endurance.
And that's really the big ticket item.
If you don't have unlimited write endurance,
you can't be a DRAM replacement
because a DRAM is a fully deterministic interface.
It gets an address, it expects data,
and that data has to be good.
If you have a device that has endurance issues,
it has to go offline periodically for housekeeping,
wear leveling, and so forth.
That's the biggest difference.
So what I'm suggesting is that we need another term
that I've called memory class storage
to capture that device
that meets all of the timings of a DDR device,
has unlimited write endurance,
and therefore never has to go offline for housekeeping.
And that's why I think we need these terminologies.
Now, does that mean we will not have
other versions of this N-RAM technology
that will fit into the storage class memory wasteland? No, we can do that too. Like,
let's say we stacked 128 die high. We're not going to meet DDR4 or DDR5 timings,
but we can give pretty close to DRAM timings and get to flash levels of density.
So there's another future there, but that's not what I'm going to focus on today.
What I want to focus on today
is what you can do with this technology.
So here's our DDR4 and DDR5 devices.
They're basically the same thing with slightly different PHYs.
What we do is we take that DDR
physical interface, and then
we're going to translate that.
So, we do
have impurities in our carbon.
Impurities that are in the parts
per billion, but they are impurities.
So we do need to deal with bit fallout
in the big arrays.
Two ways that we deal
with that. One is we just do post-package repair
and map around bad blocks.
But the other one is that we do accommodate
that we might miss something in testing.
So every one of our designs
does a full single-bit correct,
double-bit detect error correction scheme
on the incoming data.
So we bring the data in,
translate 64 bits to 72 bits, and we store all
72. When we do a read cycle, we collapse that 72 back down to 64 and ship it out on the channel.
That was not in DDR4. It is in DDR5. DDR4e was a device that had been proposed in the industry
that would have added this ECC.
We've just taken it one step further
and incorporated it right into our DDR4 interface as well.
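The 64-to-72-bit single-bit-correct, double-bit-detect scheme described above is the classic extended Hamming SECDED construction. Here is a minimal sketch of that textbook code; it illustrates the "store 72, ship 64" flow but is not Nantero's actual implementation:

```python
# Minimal (72,64) SECDED sketch: 64 data bits -> 7 Hamming parity bits
# plus 1 overall parity bit = 72 stored bits. Textbook construction only.

DATA_BITS = 64
PARITY_BITS = 7  # Hamming parity; bit 0 holds the overall parity

def encode(data: int) -> int:
    """Place data at non-power-of-two positions 1..71, set Hamming parity
    at positions 1,2,4,...,64, and put overall parity in bit 0."""
    word, pos = 0, 1
    for i in range(DATA_BITS):
        while pos & (pos - 1) == 0:      # skip parity positions
            pos += 1
        if (data >> i) & 1:
            word |= 1 << pos
        pos += 1
    for p in range(PARITY_BITS):
        parity = 0
        for j in range(1, 72):
            if j & (1 << p) and (word >> j) & 1:
                parity ^= 1
        if parity:
            word |= 1 << (1 << p)
    overall = bin(word).count("1") & 1   # make total parity even
    return word | overall

def decode(word: int):
    """Return (data, status): 'ok', 'corrected', or 'double-error'."""
    syndrome = 0
    for p in range(PARITY_BITS):
        parity = 0
        for j in range(1, 72):
            if j & (1 << p) and (word >> j) & 1:
                parity ^= 1
        syndrome |= parity << p
    overall = bin(word).count("1") & 1
    status = "ok"
    if syndrome and overall:             # single-bit error: correct it
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome and not overall:       # two errors: detect, don't correct
        return None, "double-error"
    elif not syndrome and overall:       # the overall parity bit itself flipped
        word ^= 1
        status = "corrected"
    data, pos = 0, 1
    for i in range(DATA_BITS):
        while pos & (pos - 1) == 0:
            pos += 1
        if (word >> pos) & 1:
            data |= 1 << i
        pos += 1
    return data, status

w = encode(0x0123456789ABCDEF)
print(decode(w ^ (1 << 5)))  # a single flipped bit comes back corrected
```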
So what does this give you?
Well, it gives you a device that's going to meet
all of the timings of a DDR device.
It's going to provide non-volatility.
And that's kind of the purpose of this talk: the non-volatile aspects of a DRAM-class device. What happens when
you don't need to refresh your device,
and what are the impacts of a non-destructive read? What can you do with something that, when
you read the data, it doesn't
discharge a capacitor, but it is just reading a resistance from a cell that's not going to change
based on the fact that you've read it. So these things have a big impact, not only in our chip
design, but for you and your systems as to what you can do with this technology.
So you can do a comparison of DRAM to NRAM.
The active power is the same.
A DRAM cell, roughly 5 to 7 femtojoules per bit.
An NRAM cell, roughly 5 to 7 femtojoules per bit.
Therefore, runtime performance, power is going to be the same.
However, that's when things stop.
First of all, DRAMs have what are called banks.
And one of the things you do when you access a DRAM is that you do something called an activate.
An activate takes the contents of the DRAM array
and puts it into sense amplifiers.
You do IOs to that.
But then you have to restore that
because it's a destructive read.
So there's something called a pre-charge
that occurs in a DRAM that says,
take the contents and write it back out into the array.
And then it also does some other magic stuff
about normalizing voltages on lines and stuff like that.
But the main function is that writing back of the array.
But I said we have a non-destructive read.
We don't need pre-charge at all.
And in fact, we just no-op it when a controller sends us a precharge command
because there's nothing to do.
The data is still there.
One of the aspects of that pre-charge
is that you have to close all your banks periodically in order to do a refresh.
Because refresh can only work when all your banks are closed.
So this imposes a pretty big performance penalty on your systems.
And it means that every 3.9 microseconds, you're going to have to go out, close all the banks that were active, and do your refresh command. Then, once you're done with that, you wait 350 nanoseconds, and then you start activating
banks and bringing them back online so you can do some stuff. All of that goes away completely.
You just keep running the device. Once you've activated a bank, it never closes.
It's available forever until you power the chip down or until you activate another row into that same bank
to replace those contents.
So you constantly have your 8 kilobytes of data available at all times,
and you never have to close anything because there's no refresh requirement. So these
are the kinds of things that affect how you start viewing this thing. It starts looking a lot
more like an SRAM than a DRAM, if you think about it, and it becomes a lot more deterministic
from the system standpoint. Power is a great one. So with the NRAM, there's no self-refresh mode. You don't need self-refresh
mode. The only time you do anything like that is if you want to change the operation frequency:
you want to shut off the DLL and turn it back on at the new frequency. That's the only time we need
anything resembling a self-refresh mode, because
data is retained all the time, and you have no power burn if you're not using
the device. Plus we have the additional feature that you can turn power
completely off to the device, and the data is still there when power comes
back up. It starts changing how you think about system design
when you don't have to worry about these things
like self-refresh modes and energy storage.
So this gives you an idea of the additional determinism. The other thing that refresh does
is gobble up a lot of your bandwidth.
The 15 percent number comes from this:
a 16 gigabit device at high temperature requires refresh every 3.9 microseconds,
and the refresh recovery time is 315 nanoseconds. Do the math: 15 percent of your system bandwidth
is given up to the fact that the DRAM is not available.
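The refresh overhead is worth sketching with the figures quoted in the talk. The raw recovery-time-over-interval ratio comes to about 8 percent; reaching the quoted 15 percent presumably also counts the precharge-all before and the re-activation after each refresh, which the sketch below models with an assumed extra cost per refresh event:

```python
# Back-of-the-envelope refresh overhead, using the figures quoted above:
# a refresh every 3.9 us (tREFI at high temperature) and ~315 ns of
# refresh recovery (tRFC).
T_REFI_NS = 3900
T_RFC_NS = 315

raw_overhead = T_RFC_NS / T_REFI_NS
print(f"{raw_overhead:.1%}")    # ~8.1% from refresh recovery alone

# Assumed: additional precharge-all + re-activate cost around each refresh.
EXTRA_NS = 270
total_overhead = (T_RFC_NS + EXTRA_NS) / T_REFI_NS
print(f"{total_overhead:.1%}")  # ~15% once bank close/reopen is included
```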
The other things that we were able to take out of the architecture, and this might be good.
I'm sure you probably have some telecom people in the room.
Another thing is that we got rid of the four activations window.
If you're a telecom guy, you know that four activations window is really painful because
that says that once you've opened four banks,
you can't do diddly squat for a long time before you activate the fifth bank, because of
power supply issues. We don't have that. We took that out too. Bank groups: we don't have
bank groups. We're emulating that bank group structure. We have the same timing all the way
across the chip. We don't need bank group timings.
You start adding these things up, and you can see perhaps as much as 20% additional data throughput
at the same clock frequency as the DRAM while being in a compatible footprint and application space.
So let's think about what the implications are. So now you've got the background on what it is
that we want to build so let's talk about what you can do with this in your applications
so we can talk about things like what happens when you don't have power fail concerns how does
that affect your ssd how does that affect your telecom device how does that affect your artificial intelligence device? The elimination
of data reload time. I mean, this is big for everybody, right? Right now, power fails. When
power is restored, it takes you a while before you can get back to work because you have to
reestablish all of your bitmap tables if you're an SSD. You have to zero out your linked list structures. You have to do all that stuff because the data's gone.
Well, that goes away.
Recalculation times.
This is for artificial intelligence people.
The artificial intelligence world,
you're loading in these data models,
then you're streaming user data in,
and then you're updating the models.
If you have power fail, not only do you have to reload the models and reload the user data,
but you have to rerun all the calculations that were not checkpointed when power failed.
And then we'll look at some other things like, you know, just the elimination of checkpointing,
reduction in data buffering for telecom applications.
And then, finally, there are a lot of people that have concerns about,
what, you really have a device that can last for 12,000 years?
There might be some security concerns there, so we want to look at that as well.
So, most of you are familiar with NVDIMM-N. Raise your hand. Yeah, more than
half the room. The NVDIMM-N is kind of a cute little device. It was really how persistent
memory got introduced into the industry. And I don't want to slam it, I just want to kill it. So the cool thing about the NVDIMM
is that you have this FPGA
that's going to sit there and do nothing
except burn power
while you're exercising the DRAM
that's on that memory module.
And then on the back of the module,
just in case power fails,
you're going to take and disconnect that module from the system
and take all the contents of the DRAM and save it in Flash. And then power comes back. And you
take the contents of the Flash and you load it back into the DRAM. And then you tell the system,
let's rock and roll, dudes. Well, that's great. Except it also requires this ugly super capacitor
that's hanging on this cable that's blocking your airflow
and you have this expensive FPGA,
you have the power supply,
you have the fact that it's only half the memory capacity
because you can only fit DRAM on one side of the module
and the math starts not adding up really so well.
So that's an easy market for us to kill
because our registered DIMM uses a standard register,
and standard data buffers if you're going to make an LRDIMM out of it. You just replace the DRAMs with
NRAMs. And guess what? Not only does it operate like a DRAM module, but when power fails, the
data is still there. When power comes back on, you just resume from where you left off. No save and restore function needed, no super capacitor, and twice the memory capacity,
because you can put memory on both sides of the module.
What about power glitches? Yeah, power glitches would affect it the same way anybody else would be affected,
in the sense of: what happens if you have a power glitch today?
Hopefully, your CPU detected it.
If your CPU detected it, what the CPU does is it completes the burst that's in process,
puts the memory in self-refresh,
and that's kind of like the trigger point for NVDIMM-N today.
With this architecture, the same thing.
Completes the burst that's in process
and stops using it,
and then power can go away,
and there's no problem.
It's only when a data burst gets interrupted
that you have data loss.
But that's consistent with any persistent memory, right?
If the bus is screwed up, your data's screwed up.
I can't solve that problem.
So what did persistent memory give the industry?
Persistent memory is a great idea. And the good news is that the NVDIMM-N
has made its way into the marketplace. We have gotten the Windows drivers and the Linux drivers
to support it. We've gotten DAX mode and all that other cool stuff that allowed us to get away from
this problem. This was the problem that we were all trying to solve, which is when you have data
loss, you don't want that data loss to prevent the paycheck your boss pays you from going into the auto deposit at your bank. So you better have
some checkpoints. And that's what was gobbling up system performance was all these damn checkpoints
where you run for a while and then you have to checkpoint how far you got before you run for a while and checkpoint again.
It ate up system performance.
It ate up space.
It really didn't provide anything useful
except for the fact that, well,
DRAMs were built from capacitors that lost data.
Cool thing about persistence
is you just run, run, run, run, run.
So persistent memory was a great thing for the industry.
You just keep running, and you just keep running as long as you want.
Yes, sir?
Hi, Bill.
I have a question about that.
One of the general truths in the software business is that people like to write stuff up,
and checkpoints have always been handy for rolling back to the original code. Yeah, that's a really good point.
So what he's pointing out is that
sometimes you'd want to do checkpointing
even if you have persistent memory.
But the nice thing is,
now your checkpoints can be in memory
instead of the checkpoint that requires
that you copy data out of DRAM,
shove it up through PCIe into your SSD.
That's the part of checkpointing that can go away
because that's what's really killing your performance.
You can checkpoint in main memory,
and it's a very small number of nanoseconds.
When you checkpoint off to SSD,
now you're talking about microseconds of delay,
and it's a whole different category of delay, I think. Don't you agree? Okay. That sounds like a great offline conversation.
Let's talk about that later. Okay, so killing the battery. So killing the battery is great.
But the reason I wanted to bring this topic here,
because this should be obvious to probably everybody in the room,
but there are some things that are not quite as obvious.
So, for example, right over on the left-hand side there,
that is kind of what everybody is building today
and has been building for a long time.
What you have, what I call an SSD controller,
is pretty much that whole variety
of things. That can be sitting on a SATA bus, can be on PCIe, can be a fabric interface.
That controller is going to have typically flash memory for the mass storage. That's the
general architecture of an SSD. But that could also be the rotating media for your hard drive. But hanging off to the side is that cache. And that's the weak link,
of course, that you're putting data in that cache, you're doing command reordering and so forth,
and you're sensitive to data loss. So what everybody in this industry has to do
is put an energy source out there. Super caps,
tantalums, something. God forbid, even batteries. And that external energy source ripples through
the entire design because it limits how big your cache can be. And that's kind of the weak link here. And in fact, it has this funny ripple effect
that, for example, you might have, say,
a one terabyte and a two terabyte SSD,
and your one terabyte SSD might perform better
than your two terabyte SSD
because keeping two terabytes of flash alive requires more energy than keeping one terabyte
of flash alive. And so you might actually have less available cache in the
two terabyte configuration versus the one. Big problem for marketing guys today.
And it's all because you can only fit so many tantalums on your module.
Well, what if you got decoupled from that?
Well, there are a lot of obvious things that come from decoupling and getting rid of that energy store.
First of all, big one is you can add more flash and increase the capacity of your drive and make it more marketable.
So that's pretty big.
But the second thing is you've now decoupled your cache size from your storage, right? Now people generally have like a gigabyte of cache
per terabyte of mass storage. That formula is not driven by performance optimizations.
It's driven by how many capacitors you can fit. Now you're decoupled from that. What if your performance metric said
you could do better with four gigabytes of cache?
Well, now you can do that
because it doesn't require any additional energy.
So it changes how you think about architecting this.
There are other things you can do
now that you have this DDR4 class cache device hanging off the side.
What if this SSD controller was going into a notebook?
Right now, notebooks, when they go into Hibernate, they have to take the contents of DRAM,
stream it up through PCIe into the flash at flash speeds.
So it takes 20, 30 seconds to hibernate your notebook.
And then you have to load it back once power is restored. Another long delay, which makes our
notebooks so painful to use that most of us probably use our cell phones instead of the
computer just because we don't want to deal with boot up times. What if you put your Hibernate partition right here? And now,
instead of it being 20 seconds to go to sleep and 20 seconds to wake back up, what if it's now
less than half of a second to store all the contents of DRAM into your Hibernate partition
and then restore it back when the lid is reopened? You could have an instant-on notebook. Like I say, it
starts changing how you approach system design as well. So I'm pretty excited about this kind of stuff
that you can do. Cell phones have the same kind of problem. Right now, when you're running on your cell phone and
you're not doing anything, the cell phone likes to put its LPDDR memory into
self-refresh mode. Well, guess what? Self-refresh mode still draws power. It's sitting there
internally every 3.9 to 7.8 microseconds doing a read-modify-write on a row of memory,
and it has to do that all the time,
and it runs your battery down.
With NRAM, you can turn power off completely,
turn it back on, data's still there,
and you keep going.
So it's a true zero standby power implementation.
Big change.
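The hibernate arithmetic a moment ago can be sketched with rough numbers. All of the figures below are assumptions chosen to match the 20-to-30-second and sub-half-second times quoted, not measurements:

```python
# Illustrative hibernate-time comparison. Assumed figures: 16 GB of DRAM
# to save, ~800 MB/s effective flash write path over PCIe, vs an assumed
# 40 GB/s when the hibernate partition is NRAM on the memory bus.
DRAM_GB = 16
FLASH_MBPS = 800          # assumed effective flash write bandwidth
NRAM_GBPS = 40            # assumed memory-bus-class bandwidth

flash_seconds = DRAM_GB * 1024 / FLASH_MBPS
nram_seconds = DRAM_GB / NRAM_GBPS
print(round(flash_seconds, 1), round(nram_seconds, 2))  # 20.5 0.4
```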
But like I keep alluding,
artificial intelligence is really a home run.
If you've looked at the artificial intelligence architectures,
they're pretty different than the traditional von Neumanns that we've been working with.
Instead of going with a single central processing unit,
maybe with a few cores, but still basically one execution unit. Artificial intelligence architectures
have tens to hundreds to thousands
of execution units across.
Very little ones: 8- to 16-bit kind of processing elements.
Very simple multiply, accumulate kind of stuff
for giving weighting, for doing compression to JPEG,
that kind of stuff.
And they need a really wide data set coming in and out.
So the AI guys are bringing in something like
high bandwidth memory to get the width of the data interface.
And this, for example, is the Intel Nirvana chip.
That's 32 gigabytes of HBM sitting there.
They're going to load in... And then how do you get in and out of
this chip? Serial pipes. And then they take these serial pipes and interconnect all these
Nirvanas into a toroid or a hypercube. So now let's imagine that this is the one that's
connected to your I/O subsystem, and you're now going to take your memory from your SSD
and load it into your cube structure.
Well, that means you have to go through all these serial hops
to fill each of these chips one at a time.
Multiply 32 gigabytes times, say, 1024 processors,
and you can see that the boot times for some of these devices can be a day.
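The load-time multiplication above can be sketched quickly. The per-chip HBM size and chip count are the figures quoted in the talk; the effective ingress bandwidth through the serial hops is an assumption picked to show how "a day" becomes plausible:

```python
# Rough load-time estimate: 32 GB of HBM per chip, ~1,024 chips filled
# through serial hops from the one chip attached to the I/O subsystem.
HBM_GB_PER_CHIP = 32
CHIPS = 1024
INGRESS_GBPS = 0.4  # assumed effective rate after serial hops and protocol

total_gb = HBM_GB_PER_CHIP * CHIPS        # 32 TB to load in total
hours = total_gb / INGRESS_GBPS / 3600
print(round(hours, 1))  # ~22.8 hours -> "boot times ... can be a day"
```

The point is that the total bytes scale with the chip count while the fill path stays serial, so the load time grows linearly with the size of the toroid or hypercube.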
Now talk about power loss and the cost of power loss. I'm sure you guys have all seen the statistics on how much power failures cost data centers. And so this is one of those great cases where, holy moly, you'd be out of a lot of money if this thing took a power fail. And to complicate things, the early artificial intelligence applications just loaded
a model and then ran user data through the model. Well, now artificial intelligence is giving way to
deep learning. And deep learning is a different approach.
Deep learning says, learn from your mistakes.
Take and feed back into the model the choices that you made that turned out to be wrong
so that you can do better next time.
Now what they're doing is modifying the model in that HBM
so that the next time user data comes through,
it makes a better choice.
Now you're not only in danger of losing all of that load time on power fail, but also the learned user data, and that is even more valuable.
So what do these guys have to do?
Exactly what the database guys have to do.
They have to checkpoint,
which means that they update their models,
and then periodically, at an interesting point,
they have to take the contents of the HBM,
shove it back out over that series of links on the serial channel
to get back out to an SSD to checkpoint.
Extremely expensive, and more importantly,
it gobbles up system bandwidth that could be better used running user applications.
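The train-then-checkpoint pattern described above can be sketched as a loop. Everything here is illustrative: the names, the sizes, and the "back-propagation" (a toy accumulation); `save_checkpoint` stands in for the expensive HBM-over-serial-links-to-SSD transfer:

```python
# Sketch of the checkpoint pattern: update the model held in (volatile)
# memory, and periodically serialize it out to stable storage.
# All names and values are illustrative, not from the talk.
import pickle

def save_checkpoint(model, path):
    # Stand-in for the HBM -> serial links -> SSD transfer; this is the
    # step that burns I/O bandwidth and that persistent memory would remove.
    with open(path, "wb") as f:
        pickle.dump(model, f)

def train(model, batches, checkpoint_every=100, path="model.ckpt"):
    for step, batch in enumerate(batches, start=1):
        for k in batch:                              # toy "back-propagation":
            model[k] = model.get(k, 0.0) + batch[k]  # fold feedback into the weights
        if step % checkpoint_every == 0:
            save_checkpoint(model, path)             # expensive on a real fabric
    return model

weights = train({"w0": 0.0}, [{"w0": 0.5}] * 3,
                checkpoint_every=2, path="/tmp/model.ckpt")
print(weights)   # prints: {'w0': 1.5}
```

Note that the saved checkpoint lags the live model (it holds the step-2 state while the model is at step 3), which is exactly the window of learned data that's lost on power fail.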
Now imagine putting persistent memory in those architectures.
And so replacing HBM with an NRAM HBM is an obvious win for this marketplace as well.
And this is graphically showing what I was just talking about. This is where these architectures are built: the data comes in on the left, it runs through these algorithms doing weighting, and when it makes the determination whether you had a hit or a miss, that's when it has to do back-propagation into the model and update the model. And that is the danger of these architectures.
So logically, that's what this is going to look like.
It looks a whole lot like that database flowchart, right?
A little bit more complicated.
You have the initialization stuff coming off of reset and/or power fail.
You load the HBM with the code and the initial models.
But this big ticket item down at the bottom here
is when you decide that you need to checkpoint the model
and transmit that model,
you're gobbling up a ton of available bandwidth
on your IO subsystem.
And we can do better.
Persistent memory is not going to eliminate the load time, but it will eliminate reload time, because reset is no longer going to take you all the way back to the beginning, and it's going to eliminate that feedback path where you have to do the checkpointing.
So how would you take this idea of NRAM? I've told you I can build a DDR4 chip out of it, but what if you wanted to go right down in and incorporate it right into the silicon? That's a different model, but it uses the same concept. Our DDR4 control silicon is literally just a piece of silicon that does DDR4 stuff. The carbon nanotubes are literally layered on top of the silicon.
You could do the same with a chip.
So let's say you had an AI device that wanted to incorporate persistent memory into it.
So I'm going to walk you through how you could then take that architecture
and merge the functionality of persistent memory right into this device. I'm also going to suggest that, with your SRAM as the execution part of the artificial intelligence engines, you can, one, either replace the SRAM completely (five nanoseconds might not be stellar, but it might be good enough for these little execution units), or, at very worst, maybe it's a shadow: on power fail, you can quickly copy from the SRAM up into the persistent memory just by doing a transfer straight up out of the cells down below into the persistent memory cells up above, which means that essentially in five nanoseconds you could checkpoint all of your SRAM. So those are the two reasons why a 1T1R structure might make sense in that part of the
application. So this is what it would look like. And now keep in mind, this is just a conceptual diagram. Don't start thinking that it looks like you're die-stacking. Literally, the carbon nanotubes are spread in layers right on top of the silicon; it covers the whole thing.
Carbon is a great conductor of heat,
so it actually acts as a heat spreader as well.
It's really the best thing you want to do,
as long as you pay my royalties.
So how does this work?
Well, what you do is you incorporate
the drivers for the carbon nanotube memory cells
right into the customer logic. So, diagrammed, you know, 3D over there or vertically here, what you do is you design your circuits, you incorporate our drivers and receivers right into your design, and you place the carbon nanotube array right over the top. You can build 1T1R and crosspoint at the same time, because they're the same manufacturing steps; it's only a choice of how you do the cell drivers in your logic. You incorporate that as your memory interface into your customer logic and fabricate the whole thing together, and now all of a sudden you have an artificial intelligence device that has full persistence. Taking that a step further, it doesn't even have to be SRAM replacement. What about registers? You can do shadow registers.
What about latches?
You can do persistent latches.
The technology is kind of mind-boggling
once you get into looking at what you can do with it.
Elimination of refresh.
Okay, so we talked about that additional 15% bandwidth,
but what about the latency hit?
If you're doing a 400-gigabit-per-second Ethernet card, a 350-nanosecond refresh recovery time is 140,000 bit times.
Where's that data going to go?
Well, that means that your controllers have to buffer all of that while waiting for the DRAM to become available
and then empty that buffer when the DRAM becomes available.
With an NRAM, no refresh,
you can start decreasing the amount of IO buffering
that you have to do because it's available all the time.
It doesn't go offline for refresh functions.
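The arithmetic behind that 140,000-bit-time figure checks out; the buffer-size conversion at the end is my own extrapolation, not the speaker's:

```python
# 400 Gb/s Ethernet line rate vs. a 350 ns DRAM refresh-recovery stall.
line_rate_bps = 400e9   # bits per second
stall_s = 350e-9        # refresh recovery time in seconds

bits_in_flight = line_rate_bps * stall_s
print(f"{bits_in_flight:,.0f} bit times")        # prints: 140,000 bit times

# Data arriving during the stall has to land somewhere, so the controller
# needs at least this much extra buffering per refresh event:
print(f"{bits_in_flight / 8 / 1024:.1f} KiB")    # prints: 17.1 KiB
```

That roughly 17 KiB per stall is the buffering the talk says you can start shrinking once the memory never goes offline for refresh.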
That's kind of what I meant by, you know,
changing how you think about doing
system architectures.
Now,
talk about data security
though. One of the big problems that
people have is they say,
well, persistent memory is great
except, you know,
12,000 years of data might
not be the best thing for everybody
because somebody could steal that module,
plug it into another system,
and have my bank account pin or something.
Yeah, that's a real concern.
So how do you address that problem
when you have a persistent memory?
Well, with the NVDIMM-N, where flash backs up the DRAM, one of the mechanisms is the incorporation of encryption and decryption when they do the save and restore.
We don't have that.
We literally are going to be persistent memory forever.
So there are a number of ways that you can solve this problem. First of all, you can do encryption in the CPU before the data is sent out.
And with the vast majority of my customers,
that's exactly what they say to do.
They say, don't put encryption in every chip
because that's going to cost an arm and a leg.
We'll take care of the problem, and they'll do the encryption.
However, there are other architectures.
OpenCAPI, for example, goes to a smart controller. In that context, you have that smart controller do the encryption and decryption.
Problem solved as well.
NVDIMM-P. There's no reason why we can't be the media in an NVDIMM-P
and let the NVDIMM-P controller do encryption and decryption.
So the right answer is
don't pay for every chip
to do encryption and decryption.
Centralize your encryption and decryption function
so you only pay for it once.
So, we went over a lot of material. We talked about the carbon nanotube structure and how we can make a great memory core. We talked about how this can be built into any kind of DRAM replacement function, including standard devices or custom devices. We talked a little bit about how crosspoint and 1T1R types of structures both have applications, and they're pretty distinct. We looked at an example of how you can incorporate this right into your own silicon designs as well. We are a licensing company; you guys are more than welcome to come up and license this from us and build it yourself. And we talked a lot about how persistent memory is literally more efficient than DRAM. DRAM has these refresh cycles, it has power loss issues, and persistent memory solves a lot of these problems.
I think we talked quite a bit about a few applications.
Hopefully, at least one of those applications
connected to everybody in the room.
And then we talked a little bit
about data encryption. So
at that point, more than willing to
end the talk and
ask for your questions. Thank you very much.
Question in the back.
It sounds better than DRAM in what?
No, not really.
These chips are terrible with salsa.
You can never get that carbon taste out of your mouth.
Aside from that, I can't think of a downside
Literally, that's the most important question.
My new best friend.
At 28 nanometers,
our die size is 60% the size of a DRAM done on 14 nanometer.
So that's cost. Right, but it isn't price. We don't control price because we're not a product company.
But if you look at who our funders are, you get a very good idea of what our supply chain looks like, and our customer will be shipping this
in pre-packaged server form to their customers. However, our supply chain does allow for loose
chip sales, but like I said, you can also license it directly. So let's say you're with
a Marvell, and Marvell wants to do an SSD controller
and suck the cache right into the controller.
You could take that Marvell controller,
design our IO blocks for the drivers and receivers
for the carbon nanotube arrays,
and that Marvell controller would have, say,
16 gigabits of persistent memory
for its cache on the same chip.
That's exactly the right business model.
We license it to anybody and everybody
who has a credit card.
Timeline.
Again, we're an IP company,
so we can't speak to product timelines
because we're not the ones that do product.
That being said,
tape out of our DDR4 device, done.
DDR5 device, well underway.
And you can do the math.
I think that's my warning.
Yeah, so that kind of gives you a sense
of where the timeline is
Right, yeah.
Yes, sir.
It's a DDR4 drop-in replacement, so therefore everything runs off of either the 2.5-volt or 1.2-volt rails. And technically it's an electrostatic force, it's not even a voltage, that does the set and reset. And so it's a little bit buzzier, but we can always have an offline discussion about that.
And they're telling me I'm out of time.
So I'll be available for offline conversations.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org. There you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.