Storage Developer Conference - #91: Memory Class Storage and its Impact
Episode Date: April 8, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 91. Well, I'm going to introduce
this term, and I'm going to try to back it up with some logic as to why the industry needs yet another three-letter acronym.
What I'm going to do is I'm going to talk about the technology that's developed by my company, Nantero, and it's carbon nanotube-based memory technology. What I'm going to show you are some basics about how carbon nanotubes work
and how you can apply this function of carbon nanotubes to memory technology.
Now, this is not the first in this series of talks that I've been giving all year.
I've been speaking at Persistent Memory Summit, Flash Memory Summit,
Hot Chips, and so forth.
So the first part of this is actually going to be a little bit on the sketchy side.
It's not the level of detail that I've disclosed at the previous conferences.
So by all means, grab my card if you would like to follow up on more levels of detail about the CNT technology itself.
I'm just going to sketch that.
What I wanted to do for this particular conference, and what I negotiated with the people coordinating the conference, is to now take that idea of how
the carbon nanotube memory is going to work and start applying it to real applications
that apply directly to the topic of the Storage Developer Conference. So let's talk a little bit about the carbon nanotube
basics. So carbon nanotubes are these cool little magical things. They're strands of
carbon in a circular form and they have some really great properties. They're
resistant to heat. They're great conductors of heat. Heat doesn't bother them.
They're very strong.
And they have nice, predictable resistive properties.
So what you're going to be seeing here is I'm essentially going to walk you through
how we're going to turn this into a version of a resistive ram.
So what happens is with these carbon nanotubes,
the effect that we're going to take advantage of
is something that you may remember from your physics class.
It's the Van der Waals barrier.
The Van der Waals barrier is that energy barrier
that keeps atoms apart when the atoms are apart.
And then once you cross the Van der Waals barrier,
it's what keeps atoms together once they're together.
That distance for carbon atoms happens to be 1.7 angstroms.
And so the idea is that you generate energy to cross over the barrier,
and then it requires absolutely no energy whatsoever
to maintain that connection,
and in fact requires an injected energy
to break it back over the barrier to separate them.
As a result, what we end up with is a connection
that is going to last literally for thousands of years.
Our data retention analysis shows that at normal temperatures,
we're looking at 12,000 years of data retention from a media using carbon nanotubes for that connectivity.
Now, granted, this is also a weakness because our studies also show that that data retention
reduces to 300 years at 300 degrees C. So if you have systems that require
higher than 300 degrees C operation, let me know and I'll laugh at you,
because the screws are going to fall apart before the carbon nanotubes do.
We've also seen no wear-out mechanism, and it's for the same reason.
One of the things that we do is have lots and lots of these tubes switching,
and once these things are switched, they stay switched.
The endurance of this has turned out to be really great too,
because these things are going to be connecting and disconnecting
in a void. There's no substrate.
So it's not like filament growing or anything like that that's going to be done in a substrate.
This is done just out in a void.
And so these connections can be made and broken an infinite number of times.
So far, I have to admit, we've only built enough test devices and run enough cycles
to talk about numbers like 10 to the 12th, 10 to the 13th for now. But that's just because we haven't had time to run the additional cycles
to prove it all the way out to the unlimited number, which people generally consider to be
10 to the 15th. So we've talked about how the cells work, the individual carbon nanotubes.
Let's now build
something with this. And so what we're going to do is we're going to take a bunch of these
carbon nanotubes and put them into a memory cell. So it's not a single connection of one nanotube.
It's literally going to be hundreds to thousands of carbon nanotubes that are mechanically wiggling
around inside the cell, making contact and breaking contact. So the idea is that when
you have a random array of these guys, you end up with this stochastic connection of a random
number of tubes. And as long as we get a minimum of 100 tubes switching from connected to disconnected,
we can detect that resistance change
and do it in a way that's one setting
for an entire boatload of wafers.
It's not something that has to be tuned
down at the individual level
because what we're getting out of this
is as we do sets and resets on these cells,
we're getting between 10x and 20x resistance change.
And that's plenty for us to be able to sense in a reliable way between the set and the reset states.
Some of the other random data I threw together on this slide here covers things like,
what's the performance of this? Well, it's only a few angstroms of movement. The fundamental cell
performance is five nanoseconds, and it's a balanced read and write at five nanoseconds.
So it's a persistent memory cell, runs at five nanoseconds. What could you build with that?
And the other thing that I threw in here was the bit about the
temperature. You can see that this thing operates identically across temperature extremes. Now,
keep in mind that's just the cell. You do have to attach that to logic, but the logic
is going to be the weak link. It's not the memory cell itself. So what are we going to do with this? Well, lots of stuff. First thing that we can
do is just layer this stuff onto a piece of logic, which is absolutely process agnostic.
This can be done on a logic process. It could be done on a memory process. It doesn't even
have to be silicon. It could be silicon, germanium, or whatever else, underneath.
As long as you expose metal contacts on top of that logic,
you can connect these carbon nanotube arrays into that and do cool stuff with it.
So what are the things that we can do? Well, at 28 nanometers, we're fitting 4 gigabits of carbon nanotube cells in 100 square millimeters of die space.
You can do the math yourself pretty quickly to figure out what you could do with that.
But also, this can be taken three-dimensionally.
We can layer multiple layers of carbon nanotubes up on top of one another,
alternate word lines and bit lines,
and turn this into a 3D cross-point array.
So that's one way that we can expand this,
and that takes us with four layers of carbon nanotubes
from 4 gigabits to 16 gigabits per chip.
However, there are other things you can do,
especially like with the DDR4 protocol.
DDR4 has chip IDs, so you can stack multiple die together.
And so we support that as well,
and you'll see that that allows us
to go to 128 gigabits per device. But there's other things that we can do in the future.
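The layer and stacking arithmetic above can be sketched as a quick calculation. The per-layer density, layer count, and stack height are the figures quoted in the talk:

```python
# Density arithmetic quoted in the talk: 4 Gbit per CNT layer at 28 nm,
# four layers per die, and an 8-die stack using DDR4 chip IDs.
GBIT_PER_LAYER = 4
LAYERS_PER_DIE = 4
DIES_PER_STACK = 8          # DDR4 chip-ID (3DS-style) stacking

gbit_per_die = GBIT_PER_LAYER * LAYERS_PER_DIE      # 16 Gbit per chip
gbit_per_device = gbit_per_die * DIES_PER_STACK     # 128 Gbit per device

print(gbit_per_die, gbit_per_device)  # 16 128
```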
I already told you that this is on a 28 nanometer process. At 28 nanometers, we have some 2,500
carbon nanotubes switching per bit.
But I said all we need is 100.
So if you do the testing or the modeling down to the finer geometries,
where does 100 nanotubes land?
Turns out that's a one nanometer process.
This is very, very scalable, well into the future.
And we could probably get below one nanometer by dialing in the size of the carbon nanotubes to match the process geometries. So we're very,
very comfortable that in the DDR5 timeframe, seven nanometer process
will be commonplace. And with that, we'll be at
512 gigabits per die
or a terabit in a stack of
memories.
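The scaling claim can be checked with a rough calculation. The key assumption (not spelled out in the talk) is that the number of switching tubes per cell scales linearly with the process geometry; that is what makes the quoted numbers line up:

```python
# Rough check of the scaling claim: ~2,500 tubes per bit at 28 nm, with a
# floor of 100 switching tubes needed for reliable sensing. Assumes tube
# count per cell scales linearly with feature size (an assumption).
TUBES_AT_28NM = 2500
MIN_TUBES = 100

def tubes_per_bit(node_nm: float) -> float:
    return TUBES_AT_28NM * (node_nm / 28.0)

# Node at which only the minimum 100 tubes remain per cell:
limit_nm = 28.0 * MIN_TUBES / TUBES_AT_28NM
print(round(limit_nm, 2))  # 1.12 -> consistent with the "one nanometer" claim
```

At 7 nm this model still leaves well over 600 tubes per bit, which is why the DDR5-era claim has plenty of margin.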
This is
just using the carbon nanotubes as a single
bit. A piece of data that I didn't
really show you was that we also
have tested this thing for multi-level
cell functionality. Turns out
that there's a nice linear response
of the injected voltage versus the resistance of the cell.
Basically, the higher the voltage,
the more carbon nanotubes spring up.
And so we can adjust that voltage
to get multi-level cell operation as well.
So we clearly could be in the multiple terabits per die range
in the very near future.
We're pretty excited about that.
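The multi-level cell math is simple to sketch. The 512 gigabit single-bit figure is the one quoted for a 7 nm DDR5-era die; the number of resistance levels below is illustrative, not a product spec:

```python
import math

# Sketch: how multi-level cell (MLC) operation multiplies per-die capacity.
# SLC_GBIT is the single-bit-per-cell figure quoted in the talk; the level
# counts are illustrative assumptions.
SLC_GBIT = 512

def mlc_gbit(levels: int) -> float:
    bits_per_cell = math.log2(levels)
    return SLC_GBIT * bits_per_cell

print(mlc_gbit(4))   # 4 resistance levels -> 2 bits/cell -> 1024 Gbit (1 Tbit)
```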
But again, what can you do with this?
You can do lots of things.
Except change the screen.
I need that abracadabra word.
So one of the ways you can use this thing
is just build a simple transistor for every resistor.
And it's pretty obvious how you would build that.
You just deposit the carbon nanotubes
right on top of the drain of your transistor,
and bang, you have a memory cell.
Very, very simple.
But it kind of consumes a lot of space
because now that means you need a transistor for every cell,
and the layout of your transistors becomes your gating item for how tight you can pack things.
So generally, we save this approach for when customers want megabits to low gigabits kind of stuff.
This is not going to give us the 16 gigabit device that we want to take to market.
That's going to require something more like this.
And this is the cross point that I was describing.
So what you're seeing here
is that you alternate vertical and horizontal lines
and put carbon nanotube structures
in between each of those lines.
And without a selector,
you're able to address any of those four resistors
simply by the assertion of the appropriate word lines and bit lines.
So it's pretty powerful.
This is what's allowing us to get that four gigabits per layer.
And the other interesting thing about this is if you think about it,
this is what determines your access time.
If you have a 5 nanosecond core,
that doesn't mean you have a 5 nanosecond memory.
Remember, you have to send energy down a word line.
You have to wait for it to settle.
You have to get the energy off of the bit line.
Then you have to put it into a sense amplifier. Then you have to put it into a FIFO. Then you
have to go through the IO drivers at the pins. And so that's what turns a five nanosecond core
into something like a 56 nanosecond cycle time for a DDR4 DRAM, or in this case, a DDR4 NRAM.
And that's exactly where I'm taking this line of reasoning.
Any questions?
Oh, there's more stuff.
Oh.
Okay, so now that you have this core,
what can you do with this core?
Well, we're just going to throw this array of carbon nanotubes
into the background,
and then what we can do is attach any PHY that we need to it
for the front end.
Again, it's a 5-nanosecond persistent core.
You can put a DDR4 or a DDR5 interface on that.
HBM. You can do LPDDR or GDDR. It doesn't matter what the PHY is on the front end. That's the only difference between these basic memory
technologies. So what I'm saying is that we're going to be able to offer multi-gigabit level
devices in all of these device categories. And that should get you guys a little bit more excited.
So that's how I justify the term memory class storage.
Storage class memory terminology has been around for a while.
I know SNIA has tried to change the industry terminology to persistent memory, and that's fine.
But that doesn't really capture any difference.
That's just a semantic change, at least to me.
And so the problem that I had with storage class memory is that those are really those devices
that sit in that wasteland between DRAM and Flash.
And sure enough, you know, there's, what,
a 10,000x difference in performance between DRAM and flash.
So that's fine that there's a wasteland, but I sure as hell don't want my parts to be caught in that trap of being in the wasteland.
Because the devices in the wasteland have a number of problems.
They don't meet DDR timings.
Therefore, they can't plug into the socket
that a DDR chip plugs into.
They don't have unlimited write endurance.
And that's really the big ticket item.
If you don't have unlimited write endurance,
you can't be a DRAM replacement
because a DRAM is a fully deterministic interface.
It gets an address, it expects data,
and that data has to be good.
If you have a device that has endurance issues,
it has to go offline periodically for housekeeping,
wear leveling, and so forth.
That's the biggest difference.
So what I'm suggesting is that we need another term
that I've called memory class storage
to capture that device
that meets all of the timings of a DDR device,
has unlimited write endurance,
and therefore never has to go offline for housekeeping.
And that's why I think we need these terminologies.
Now, does that mean we will not have
other versions of this N-RAM technology
that will fit into the storage class memory wasteland? No, we can do that too. Like,
let's say we stacked 128 die high. We're not going to meet DDR4 or DDR5 timings,
but we can give pretty close to DRAM timings and get to flash levels of density.
So there's another future there, but that's not what I'm going to focus on today.
What I want to focus on today
is what you can do with this technology.
So here's our DDR4 and DDR5 devices.
They're basically the same thing with slightly different PHYs.
What we do is we take that DDR
physical interface, and then
we're going to translate that.
So, we do
have impurities in our carbon.
Impurities that are in the parts
per billion, but they are impurities.
So we do need to deal with bit fallout
in the big arrays.
Two ways that we deal
with that. One is we just do post-package repair
and map around bad blocks.
But the other one is that we do accommodate
that we might miss something in testing.
So every one of our designs
does a full single-bit correct,
double-bit detect error correction scheme
on the incoming data.
So we bring the data in,
translate 64 bits to 72 bits, and we store all
72. When we do a read cycle, we collapse that 72 back down to 64 and ship it out on the channel.
That was not in DDR4. It is in DDR5. DDR4e was a device that had been proposed in the industry
that would have added this ECC.
We've just taken it one step further
and incorporated it right into our DDR4 interface as well.
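The 64-to-72-bit single-bit-correct, double-bit-detect scheme described above is the classic extended Hamming SECDED construction. Here is a minimal sketch of that textbook code; it illustrates the "store 72, ship 64" flow but is not Nantero's actual implementation:

```python
# Minimal (72,64) SECDED sketch: 64 data bits -> 7 Hamming parity bits
# plus 1 overall parity bit = 72 stored bits. Textbook construction only.

DATA_BITS = 64
PARITY_BITS = 7  # Hamming parity; bit 0 holds the overall parity

def encode(data: int) -> int:
    """Place data at non-power-of-two positions 1..71, set Hamming parity
    at positions 1,2,4,...,64, and put overall parity in bit 0."""
    word, pos = 0, 1
    for i in range(DATA_BITS):
        while pos & (pos - 1) == 0:      # skip parity positions
            pos += 1
        if (data >> i) & 1:
            word |= 1 << pos
        pos += 1
    for p in range(PARITY_BITS):
        parity = 0
        for j in range(1, 72):
            if j & (1 << p) and (word >> j) & 1:
                parity ^= 1
        if parity:
            word |= 1 << (1 << p)
    overall = bin(word).count("1") & 1   # make total parity even
    return word | overall

def decode(word: int):
    """Return (data, status): 'ok', 'corrected', or 'double-error'."""
    syndrome = 0
    for p in range(PARITY_BITS):
        parity = 0
        for j in range(1, 72):
            if j & (1 << p) and (word >> j) & 1:
                parity ^= 1
        syndrome |= parity << p
    overall = bin(word).count("1") & 1
    status = "ok"
    if syndrome and overall:             # single-bit error: correct it
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome and not overall:       # two errors: detect, don't correct
        return None, "double-error"
    elif not syndrome and overall:       # the overall parity bit itself flipped
        word ^= 1
        status = "corrected"
    data, pos = 0, 1
    for i in range(DATA_BITS):
        while pos & (pos - 1) == 0:
            pos += 1
        if (word >> pos) & 1:
            data |= 1 << i
        pos += 1
    return data, status

w = encode(0x0123456789ABCDEF)
print(decode(w ^ (1 << 5)))  # a single flipped bit comes back corrected
```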
So what does this give you?
Well, it gives you a device that's going to meet
all of the timings of a DDR device.
It's going to provide non-volatility.
And that's kind of the purpose of this talk: the non-volatile aspects of a DRAM-class device. What happens when
you don't need to refresh your device,
and what are the impacts of a non-destructive read? What can you do with something that, when
you read the data, it doesn't
discharge a capacitor, but it is just reading a resistance from a cell that's not going to change
based on the fact that you've read it. So these things have a big impact, not only in our chip
design, but for you and your systems as to what you can do with this technology.
So you can do a comparison of DRAM to NRAM.
The active power is the same.
A DRAM cell, roughly 5 to 7 femtojoules per bit.
An NRAM cell, roughly 5 to 7 femtojoules per bit.
Therefore, runtime performance, power is going to be the same.
However, that's when things stop.
First of all, DRAMs have what are called banks.
And one of the things you do when you access a DRAM is that you do something called an activate.
An activate takes the contents of the DRAM array
and puts it into sense amplifiers.
You do IOs to that.
But then you have to restore that
because it's a destructive read.
So there's something called a pre-charge
that occurs in a DRAM that says,
take the contents and write it back out into the array.
And then it also does some other magic stuff
about normalizing voltages on lines and stuff like that.
But the main function is that writing back of the array.
But I said we have a non-destructive read.
We don't need pre-charge at all.
And in fact, we just no-op it when a controller sends us a precharge command
because there's nothing to do.
The data is still there.
One of the aspects of that pre-charge
is that you have to close all your banks periodically in order to do a refresh.
Because refresh can only work when all your banks are closed.
So this imposes a pretty big performance penalty on your systems.
And it means that every 3.9 microseconds, you're going to have to go out, close all the banks that were active, and do your refresh command. Then, once you're done with that, you wait 350 nanoseconds, and then you start activating
banks and bringing them back online so you can do some stuff. All of that goes away completely.
You just keep running the device. Once you've activated a bank, it never closes.
It's available forever until you power the chip down or until you activate another row into that same bank
to replace those contents.
So you constantly have your 8 kilobytes of data available at all times,
and you never have to close anything because there's no refresh requirement. So these
are the kinds of things that affect how you start viewing this thing. It starts looking a lot
more like an SRAM than a DRAM, if you think about it, and it becomes a lot more deterministic
from the system standpoint. Power is a great one. So with the NRAM, there's no self-refresh mode. You don't need self-refresh
mode. The only time you do anything like that is if you want to change the operation frequency:
you want to shut off the DLL and turn it back on at the new frequency. That's the only time we need
anything resembling a self-refresh mode, because
data is retained all the time, and you have no power burn if you're not using
the device. Plus we have the additional feature that you can turn power
completely off to the device, and the data is still there when power comes
back up. It starts changing how you think about system design
when you don't have to worry about these things
like self-refresh modes and energy storage.
So this gives you an idea of the additional determinism. The other thing that refresh does
is gobble up a lot of your bandwidth.
The 15 percent number comes from this:
a 16 gigabit device at high temperature requires refresh every 3.9 microseconds,
and the refresh recovery time is 315 nanoseconds. Do the math: 15 percent of your system bandwidth
is given up to the fact that the DRAM is not available.
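The refresh overhead is worth sketching with the figures quoted in the talk. The raw recovery-time-over-interval ratio comes to about 8 percent; reaching the quoted 15 percent presumably also counts the precharge-all before and the re-activation after each refresh, which the sketch below models with an assumed extra cost per refresh event:

```python
# Back-of-the-envelope refresh overhead, using the figures quoted above:
# a refresh every 3.9 us (tREFI at high temperature) and ~315 ns of
# refresh recovery (tRFC).
T_REFI_NS = 3900
T_RFC_NS = 315

raw_overhead = T_RFC_NS / T_REFI_NS
print(f"{raw_overhead:.1%}")    # ~8.1% from refresh recovery alone

# Assumed: additional precharge-all + re-activate cost around each refresh.
EXTRA_NS = 270
total_overhead = (T_RFC_NS + EXTRA_NS) / T_REFI_NS
print(f"{total_overhead:.1%}")  # ~15% once bank close/reopen is included
```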
The other things that we were able to take out of the architecture, and this might be good.
I'm sure you probably have some telecom people in the room.
Another thing is that we got rid of the four activations window.
If you're a telecom guy, you know that four activations window is really painful because
that says that once you've opened four banks,
you can't do diddly squat for a long time before you activate the fifth bank, because of
power supply issues. We don't have that. We took that out too. Bank groups: we don't have
bank groups. We're emulating that bank group structure. We have the same timing all the way
across the chip. We don't need bank group timings.
You start adding these things up, and you can see perhaps as much as 20% additional data throughput
at the same clock frequency as the DRAM while being in a compatible footprint and application space.
So let's think about what the implications are. So now you've got the background on what it is
that we want to build so let's talk about what you can do with this in your applications
so we can talk about things like what happens when you don't have power fail concerns how does
that affect your ssd how does that affect your telecom device how does that affect your artificial intelligence device? The elimination
of data reload time. I mean, this is big for everybody, right? Right now, power fails. When
power is restored, it takes you a while before you can get back to work because you have to
reestablish all of your bitmap tables if you're an SSD. You have to zero out your linked list structures. You have to do all that stuff because the data's gone.
Well, that goes away.
Recalculation times.
This is for artificial intelligence people.
The artificial intelligence world,
you're loading in these data models,
then you're streaming user data in,
and then you're updating the models.
If you have power fail, not only do you have to reload the models and reload the user data,
but you have to rerun all the calculations that were not checkpointed when power failed.
And then we'll look at some other things like, you know, just the elimination of checkpointing,
reduction in data buffering for telecom applications.
And then, finally, there are a lot of people that have concerns about,
what, you really have a device that can last for 12,000 years?
There might be some security concerns there, so we want to look at that as well.
So, most of you are familiar with NVDIMM-N. Raise your hand. Yeah, more than
half the room. The NVDIMM-N is kind of a cute little device. It was really how persistent
memory got introduced into the industry. And I don't want to slam it, I just want to kill it. So the cool thing about the NVDIMM
is that you have this FPGA
that's going to sit there and do nothing
except burn power
while you're exercising the DRAM
that's on that memory module.
And then on the back of the module,
just in case power fails,
you're going to take and disconnect that module from the system
and take all the contents of the DRAM and save it in Flash. And then power comes back. And you
take the contents of the Flash and you load it back into the DRAM. And then you tell the system,
let's rock and roll, dudes. Well, that's great. Except it also requires this ugly super capacitor
that's hanging on this cable that's blocking your airflow
and you have this expensive FPGA,
you have the power supply,
you have the fact that it's only half the memory capacity
because you can only fit DRAM on one side of the module
and the math starts not adding up really so well.
So that's an easy market for us to kill
because our registered DIMM uses a standard register,
and standard data buffers if you're going to make an LRDIMM out of it. You just replace the DRAMs with
NRAMs. And guess what? Not only does it operate like a DRAM module, but when power fails, the
data is still there. When power comes back on, you just resume from where you left off. No save and restore function needed, no super capacitor, and twice the memory capacity,
because you can put memory on both sides of the module.
What about power glitches? Yeah, power glitches would affect it the same way anybody else would be affected,
in the sense of: what happens if you have a power glitch today?
Hopefully, your CPU detected it.
If your CPU detected it, what the CPU does is it completes the burst that's in process,
puts the memory in self-refresh,
and that's kind of like the trigger point for NVDIMM-N today.
With this architecture, the same thing.
Completes the burst that's in process
and stops using it,
and then power can go away,
and there's no problem.
It's only when a data burst gets interrupted
that you have data loss.
But that's consistent with any persistent memory, right?
If the bus is screwed up, your data's screwed up.
I can't solve that problem.
So what did persistent memory give the industry?
Persistent memory is a great idea. And the good news is that the NVDIMM-N
has made its way into the marketplace. We have gotten the Windows drivers and the Linux drivers
to support it. We've gotten DAX mode and all that other cool stuff that allowed us to get away from
this problem. This was the problem that we were all trying to solve, which is when you have data
loss, you don't want that data loss to prevent the paycheck your boss pays you from going into the auto deposit at your bank. So you better have
some checkpoints. And that's what was gobbling up system performance was all these damn checkpoints
where you run for a while and then you have to checkpoint how far you got before you run for a while and checkpoint again.
It ate up system performance.
It ate up space.
It really didn't provide anything useful
except for the fact that, well,
DRAMs were built from capacitors that lost data.
Cool thing about persistence
is you just run, run, run, run, run.
So persistent memory was a great thing for the industry.
You just keep running, and you just keep running as long as you want.
Yes, sir?
Hi, Bill.
I have a question about that.
One of the general truths in the software business is that people like to write stuff up,
and checkpoints have always been handy for rolling back to the original code. Yeah, that's a really good point.
So what he's pointing out is that
sometimes you'd want to do checkpointing
even if you have persistent memory.
But the nice thing is,
now your checkpoints can be in memory
instead of the checkpoint that requires
that you copy data out of DRAM,
shove it up through PCIe into your SSD.
That's the part of checkpointing that can go away
because that's what's really killing your performance.
You can checkpoint in main memory,
and it's a very small number of nanoseconds.
When you checkpoint off to SSD,
now you're talking about microseconds of delay,
and it's a whole different category of delay, I think. Don't you agree? Okay. That sounds like a great offline conversation.
Let's talk about that later. Okay, so killing the battery. So killing the battery is great.
But the reason I wanted to bring this topic here,
because this should be obvious to probably everybody in the room,
but there are some things that are not quite as obvious.
So, for example, right over on the left-hand side there,
that is kind of what everybody is building today
and has been building for a long time.
What you have, what I call an SSD controller,
is pretty much that whole variety
of things. That can be sitting on a SATA bus, can be on PCIe, can be a fabric interface.
That controller is going to have typically flash memory for the mass storage. That's the
general architecture of an SSD. But that could also be the rotating media for your hard drive. But hanging off to the side is that cache. And that's the weak link,
of course, that you're putting data in that cache, you're doing command reordering and so forth,
and you're sensitive to data loss. So what everybody in this industry has to do
is put an energy source out there. Super caps,
tantalums, something. God forbid, even batteries. And that external energy source ripples through
the entire design because it limits how big your cache can be. And that's kind of the weak link here. And in fact, it has this funny ripple effect
that, for example, you might have, say,
a one terabyte and a two terabyte SSD,
and your one terabyte SSD might perform better
than your two terabyte SSD
because keeping two terabytes of flash alive requires more energy than keeping one terabyte
of flash alive. And so you might actually have less available cache in the
two terabyte configuration versus the one. Big problem for marketing guys today.
And it's all because you can only fit so many tantalums on your module.
Well, what if you got decoupled from that?
Well, there are a lot of obvious things that come from decoupling and getting rid of that energy store.
First of all, big one is you can add more flash and increase the capacity of your drive and make it more marketable.
So that's pretty big.
But the second thing is you've now decoupled your cache size from your storage, right? Now people generally have like a gigabyte of cache
per terabyte of mass storage. That formula is not driven by performance optimizations.
It's driven by how many capacitors you can fit. Now you're decoupled from that. What if your performance metric said
you could do better with four gigabytes of cache?
Well, now you can do that
because it doesn't require any additional energy.
So it changes how you think about architecting this.
There are other things you can do
now that you have this DDR4 class cache device hanging off the side.
What if this SSD controller was going into a notebook?
Right now, notebooks, when they go into Hibernate, they have to take the contents of DRAM,
stream it up through PCIe into the flash at flash speeds.
So it takes 20, 30 seconds to hibernate your notebook.
And then you have to load it back once power is restored. Another long delay, which makes our
notebooks so painful to use that most of us probably use our cell phones instead of the
computer just because we don't want to deal with boot up times. What if you put your Hibernate partition right here? And now,
instead of it being 20 seconds to go to sleep and 20 seconds to wake back up, what if it's now
less than half of a second to store all the contents of DRAM into your Hibernate partition
and then restore it back when the lid is reopened? You could have an instant-on notebook. Like I say, it
starts changing how you approach system design as well. So I'm pretty excited about this kind of stuff
that you can do. Cell phones have the same kind of problem. Right now, when you're running on your cell phone and
you're not doing anything, the cell phone likes to put its LPDDR memory into
self-refresh mode. Well, guess what? Self-refresh mode still draws power. It's sitting there
internally every 3.9 to 7.8 microseconds doing a read-modify-write on a row of memory,
and it has to do that all the time,
and it runs your battery down.
With NRAM, you can turn power off completely,
turn it back on, data's still there,
and you keep going.
So it's a true zero standby power implementation.
Big change.
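The hibernate arithmetic a moment ago can be sketched with rough numbers. All of the figures below are assumptions chosen to match the 20-to-30-second and sub-half-second times quoted, not measurements:

```python
# Illustrative hibernate-time comparison. Assumed figures: 16 GB of DRAM
# to save, ~800 MB/s effective flash write path over PCIe, vs an assumed
# 40 GB/s when the hibernate partition is NRAM on the memory bus.
DRAM_GB = 16
FLASH_MBPS = 800          # assumed effective flash write bandwidth
NRAM_GBPS = 40            # assumed memory-bus-class bandwidth

flash_seconds = DRAM_GB * 1024 / FLASH_MBPS
nram_seconds = DRAM_GB / NRAM_GBPS
print(round(flash_seconds, 1), round(nram_seconds, 2))  # 20.5 0.4
```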
But like I keep alluding,
artificial intelligence is really a home run.
If you've looked at the artificial intelligence architectures,
they're pretty different than the traditional von Neumanns that we've been working with.
Instead of going with a single central processing unit,
maybe with a few cores, but still basically one execution unit. Artificial intelligence architectures
have tens to hundreds to thousands
of execution units across.
Very little ones: 8- to 16-bit kind of processing elements.
Very simple multiply, accumulate kind of stuff
for giving weighting, for doing compression to JPEG,
that kind of stuff.
And they need a really wide data set coming in and out.
So the AI guys are bringing in something like
high bandwidth memory to get the width of the data interface.
And this, for example, is the Intel Nirvana chip.
That's 32 gigabytes of HBM sitting there.
They're going to load in... And then how do you get in and out of
this chip? Serial pipes. And then they take these serial pipes and interconnect all these
Nirvanas into a toroid or a hypercube. So now let's imagine that this is the one that's
connected to your I/O subsystem, and you're now going to take your memory from your SSD
and load it into your cube structure.
Well, that means you have to go through all these serial hops
to fill each of these chips one at a time.
Multiply 32 gigabytes times, say, 1024 processors,
and you can see that the boot times for some of these devices can be a day.
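The load-time multiplication above can be sketched quickly. The per-chip HBM size and chip count are the figures quoted in the talk; the effective ingress bandwidth through the serial hops is an assumption picked to show how "a day" becomes plausible:

```python
# Rough load-time estimate: 32 GB of HBM per chip, ~1,024 chips filled
# through serial hops from the one chip attached to the I/O subsystem.
HBM_GB_PER_CHIP = 32
CHIPS = 1024
INGRESS_GBPS = 0.4  # assumed effective rate after serial hops and protocol

total_gb = HBM_GB_PER_CHIP * CHIPS        # 32 TB to load in total
hours = total_gb / INGRESS_GBPS / 3600
print(round(hours, 1))  # ~22.8 hours -> "boot times ... can be a day"
```

The point is that the total bytes scale with the chip count while the fill path stays serial, so the load time grows linearly with the size of the toroid or hypercube.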
Now talk about power loss and the cost of power loss. I'm sure you guys have all seen the statistics on how much power failures cost data centers. And so this is one of those great cases where, holy moly, you'd be out of a lot of money if this thing took a power fail. And to complicate things, the early artificial intelligence applications just loaded
a model and then ran user data through the model. Well, now artificial intelligence is giving way to
deep learning. And deep learning is a different approach.
Deep learning says, learn from your mistakes.
Take and feed back into the model the choices that you made that turned out to be wrong
so that you can do better next time.
Now what they're doing is modifying the model in that HBM
so that the next time user data comes through,
it makes a better choice.
Now you're not only in danger of losing all of that load time on power fail, but also the learned user data, and that is even more valuable.
So what do these guys have to do?
Exactly what the database guys have to do.
They have to checkpoint,
which means that they update their models,
and then periodically, at an interesting point,
they have to take the contents of the HBM,
shove it back out over that series of links on the serial channel
to get back out to an SSD to checkpoint.
Extremely expensive, and more importantly,
it gobbles up system bandwidth that could be better used running user applications.
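The train-then-checkpoint pattern described above can be sketched as a loop. Everything here is illustrative: the names, the sizes, and the "back-propagation" (a toy accumulation); `save_checkpoint` stands in for the expensive HBM-over-serial-links-to-SSD transfer:

```python
# Sketch of the checkpoint pattern: update the model held in (volatile)
# memory, and periodically serialize it out to stable storage.
# All names and values are illustrative, not from the talk.
import pickle

def save_checkpoint(model, path):
    # Stand-in for the HBM -> serial links -> SSD transfer; this is the
    # step that burns I/O bandwidth and that persistent memory would remove.
    with open(path, "wb") as f:
        pickle.dump(model, f)

def train(model, batches, checkpoint_every=100, path="model.ckpt"):
    for step, batch in enumerate(batches, start=1):
        for k in batch:                              # toy "back-propagation":
            model[k] = model.get(k, 0.0) + batch[k]  # fold feedback into the weights
        if step % checkpoint_every == 0:
            save_checkpoint(model, path)             # expensive on a real fabric
    return model

weights = train({"w0": 0.0}, [{"w0": 0.5}] * 3,
                checkpoint_every=2, path="/tmp/model.ckpt")
print(weights)   # prints: {'w0': 1.5}
```

Note that the saved checkpoint lags the live model (it holds the step-2 state while the model is at step 3), which is exactly the window of learned data that's lost on power fail.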
Now imagine putting persistent memory in those architectures.
And so replacing HBM with an NRAM HBM is an obvious win for this marketplace as well.
And this is graphically showing what I was just talking about. This is where these architectures are built: the data comes in on the left, it runs through these algorithms doing weighting, and when it makes the determination whether you had a hit or a miss, that's when it has to do back-propagation into the model and update the model. And that is the danger of these architectures.
So logically, that's what this is going to look like.
It looks a whole lot like that database flowchart, right?
A little bit more complicated.
You have the initialization stuff coming off of reset and/or power fail.
You load the HBM with the code and the initial models.
But this big ticket item down at the bottom here
is when you decide that you need to checkpoint the model
and transmit that model,
you're gobbling up a ton of available bandwidth
on your IO subsystem.
And we can do better.
Persistent memory is not going to eliminate the load time, but it will eliminate reload time, because reset is no longer going to take you all the way back to the beginning, and it's going to eliminate that feedback path where you have to do the checkpointing.
So how would you take this idea of NRAM? I've told you I can build a DDR4 chip out of it, but what if you wanted to go right down in and incorporate it right into the silicon? That's a different model, but it uses the same concept. Our DDR4 control silicon is literally just a piece of silicon that does DDR4 stuff. The carbon nanotubes are literally layered on top of the silicon.
You could do the same with a chip.
So let's say you had an AI device that wanted to incorporate persistent memory into it.
So I'm going to walk you through how you could then take that architecture
and merge the functionality of persistent memory right into this device. I'm also going to suggest that, with your SRAM as the execution part of the artificial intelligence engines, you can, one, either replace the SRAM completely (five nanoseconds might not be stellar, but it might be good enough for these little execution units), or, at very worst, maybe it's a shadow: on power fail, you can quickly copy from the SRAM up into the persistent memory just by doing a transfer straight up out of the cells down below into the persistent memory cells up above, which means that essentially in five nanoseconds you could checkpoint all of your SRAM. So those are the two reasons why a 1T1R structure might make sense in that part of the
application. So this is what it would look like. And now keep in mind, this is just a conceptual diagram. Don't start thinking that it looks like you're die-stacking. Literally, the carbon nanotubes are spread in layers right on top of the silicon; it covers the whole thing.
Carbon is a great conductor of heat,
so it actually acts as a heat spreader as well.
It's really the best thing you want to do,
as long as you pay my royalties.
So how does this work?
Well, what you do is you incorporate
the drivers for the carbon nanotube memory cells
right into the customer logic. So, diagrammed, you know, 3D over there or vertically here, what you do is you design your circuits, you incorporate our drivers and receivers right into your design, and you place the carbon nanotube array right over the top. You can build 1T1R and crosspoint at the same time, because they're the same manufacturing steps; it's only a choice of how you do the cell drivers in your logic. You incorporate that as your memory interface into your customer logic and fabricate the whole thing together, and now all of a sudden you have an artificial intelligence device that has full persistence. Taking that a step further, it doesn't even have to be SRAM replacement. What about registers? You can do shadow registers.
What about latches?
You can do persistent latches.
The technology is kind of mind-boggling
once you get into looking at what you can do with it.
Elimination of refresh.
Okay, so we talked about that additional 15% bandwidth,
but what about the latency hit?
If you're doing a 400-gigabit-per-second Ethernet card, a 350-nanosecond refresh recovery time is 140,000 bit times.
Where's that data going to go?
Well, that means that your controllers have to buffer all of that while waiting for the DRAM to become available
and then empty that buffer when the DRAM becomes available.
With an NRAM, no refresh,
you can start decreasing the amount of IO buffering
that you have to do because it's available all the time.
It doesn't go offline for refresh functions.
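The arithmetic behind that 140,000-bit-time figure checks out; the buffer-size conversion at the end is my own extrapolation, not the speaker's:

```python
# 400 Gb/s Ethernet line rate vs. a 350 ns DRAM refresh-recovery stall.
line_rate_bps = 400e9   # bits per second
stall_s = 350e-9        # refresh recovery time in seconds

bits_in_flight = line_rate_bps * stall_s
print(f"{bits_in_flight:,.0f} bit times")        # prints: 140,000 bit times

# Data arriving during the stall has to land somewhere, so the controller
# needs at least this much extra buffering per refresh event:
print(f"{bits_in_flight / 8 / 1024:.1f} KiB")    # prints: 17.1 KiB
```

That roughly 17 KiB per stall is the buffering the talk says you can start shrinking once the memory never goes offline for refresh.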
That's kind of what I meant by, you know,
changing how you think about doing
system architectures.
Now,
talk about data security
though. One of the big problems that
people have is they say,
well, persistent memory is great
except, you know,
12,000 years of data might
not be the best thing for everybody
because somebody could steal that module,
plug it into another system,
and have my bank account pin or something.
Yeah, that's a real concern.
So how do you address that problem
when you have a persistent memory?
Well, with the NVDIMM-N, where flash backs up the DRAM, one of the mechanisms is the incorporation of encryption and decryption when they do the save and restore.
We don't have that.
We literally are going to be persistent memory forever.
So there are a number of ways that you can solve this problem. First of all, you can do encryption in the CPU before the data is sent out.
And with the vast majority of my customers,
that's exactly what they say to do.
They say, don't put encryption in every chip
because that's going to cost an arm and a leg.
We'll take care of the problem, and they'll do the encryption.
However, there are other architectures.
OpenCAPI, for example, goes to a smart controller. In that context, you have that smart controller do the encryption and decryption.
Problem solved as well.
NVDIMM-P. There's no reason why we can't be the media in an NVDIMM-P
and let the NVDIMM-P controller do encryption and decryption.
So the right answer is
don't pay for every chip
to do encryption and decryption.
Centralize your encryption and decryption function
so you only pay for it once.
So, we went over a lot of material. We talked about the carbon nanotube structure and how we can make a great memory core. We talked about how this can be built into any kind of DRAM replacement function, including standard devices or custom devices. We talked a little bit about how crosspoint and 1T1R types of structures both have applications, and they're pretty distinct. We looked at an example of how you can incorporate this right into your own silicon designs as well. We are a licensing company; you guys are more than welcome to come up and license this from us and build it yourself. And we talked a lot about how persistent memory is literally more efficient than DRAM. DRAM has these refresh cycles, it has power loss issues, and persistent memory solves a lot of these problems.
I think we talked quite a bit about a few applications.
Hopefully, at least one of those applications
connected to everybody in the room.
And then we talked a little bit
about data encryption. So
at that point, more than willing to
end the talk and
ask for your questions. Thank you very much.
Question in the back.
It sounds better than DRAM in what?
No, not really.
These chips are terrible with salsa.
You can never get that carbon taste out of your mouth.
Aside from that, I can't think of a downside
Literally, that's the most important question.
My new best friend.
At 28 nanometers,
our die size is 60% the size of a DRAM done on 14 nanometer.
So that's cost. Right, but it isn't price. We don't control price because we're not a product company.
But if you look at who our funders are, you get a very good idea of what our supply chain looks like, and our customer will be shipping this
in pre-packaged server form to their customers. However, our supply chain does allow for loose
chip sales, but like I said, you can also license it directly. So let's say you're with
a Marvell, and Marvell wants to do an SSD controller
and suck the cache right into the controller.
You could take that Marvell controller,
design our IO blocks for the drivers and receivers
for the carbon nanotube arrays,
and that Marvell controller would have, say,
16 gigabits of persistent memory
for its cache on the same chip.
That's exactly the right business model.
We license it to anybody and everybody
who has a credit card.
Timeline.
Again, we're an IP company,
so we can't speak to product timelines
because we're not the ones that do product.
That being said,
tape out of our DDR4 device, done.
DDR5 device, well underway.
And you can do the math.
I think that's my warning.
Yeah, so that kind of gives you a sense
of where the timeline is
Right, yeah.
Yes, sir.
It's a DDR4 drop-in replacement, so therefore everything runs off of either the 2.5-volt or 1.2-volt rails. And technically it's an electrostatic force, it's not even a voltage, that does the set and reset. And so it's a little bit buzzier, but we can always have an offline discussion about that.
And they're telling me I'm out of time.
So I'll be available for offline conversations.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org. There you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.