Storage Developer Conference - #186: The Looming need for Molecular Storage
Episode Date: April 4, 2023
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast
episode number 186.
Okay, so Murphy's Law is in full effect today.
I got whacked out of my laptop just before the presentation,
but thank you for providing me with another.
So kind of, as was said,
my job is roughly making sure we have the right hardware, software,
and, I'll add, scale in Azure Storage.
And also, people have a lot of difficulty
thinking about exponential growth. People aren't
evolved for it. And, you know, we tend to look two, three years down the horizon and figure out
what we need to do. But there's some things where we need to look very far ahead to try and
understand the actions we have to take now to get ready for it.
So just as a quick survey here, in terms of generations, how many boomers do we have?
All right.
How many Gen Xers do we have?
Gen Y?
Okay, that's not good.
Gen Z? Okay, so we need a little more youth in the storage area in order to kind of get ready for this future that's coming.
So in Azure Storage, I started in 2008.
We had a fairly modest footprint,
and we inherited a lot of our technology from our search team.
And we've grown quite a bit,
thousands of times since the inception. And even at the beginning, it was very difficult
to get the team ready for the future. Again, people don't tend to think exponentially.
So we had like a handful of clusters. We made a big purchase of 12 clusters. And I started telling
people, well, we need to get ready for a thousand.
And people are like, what?
I'm like, yeah, that's only three, four years away.
And now this is what we have.
We started in six data centers.
We're now in almost 100 regions,
over 140 data centers.
And we're building data centers at a rate you wouldn't believe.
It's more than one a month.
I can't tell you exactly.
And the question is, well, what are we doing to store data?
How are we deploying it?
And there's kind of a picture of one of our
deployments, about 20 racks. And, you know, we have, you know, tens to hundreds of exabytes of
data. We're approaching zettabytes in the next few years, as the industry is. And what this
translates into physically is tens of kilometers of those racks. You could run a marathon beside
Azure Storage. It's big. It's hard to comprehend, even as you write down the numbers. We have
thousands of deployments. We do deployments every day. And in terms of the scale and power, we're into the hundreds of megawatts,
you know, so think small cities.
And there's a problem coming.
You know, there's these problems we deal with day to day
with this growth,
but there's a much bigger problem coming.
So this data growth curve is kind of the industry HDD curve, but this maps to all types of data storage equally.
And the growth curve is about 40% year over year.
And this is a very strong signal.
Predicting the future is notoriously difficult.
Predicting exponential curves is even harder.
I gave a presentation years ago, like 2011, talking about storing zettabytes.
And last year, the hard drive industry shipped a zettabyte of capacity for the first time.
And when you were talking about shipping zettabytes, when the industry was shipping,
you know, hundreds of petabytes, you sound like you're a tinfoil hat guy.
What are you talking about?
People think linearly.
So much so that I even went to give a presentation to Seagate because sometime around 2014,
they were seeing a decline in their head counts.
And they were like, oh my gosh,
our hard drive's going out of business.
Should we be lowering our investment
and shutting this thing down?
I mean, it wasn't quite that extreme,
but I was looking at these curves in the cloud,
and I say, no, your business is just shifting.
You were selling all these customers
these multi-hundred gigabyte, terabyte drives.
They were putting like 10, 20% on them.
And then they weren't using
the capacity. When we went to the cloud, we started buying your very high capacity drives,
the biggest ones we can. We run them at, you know, a lot of people didn't believe it, over 90%
capacity effectively. And we compress everything and we erasure code everything and we store it
much more efficiently.
So they were seeing a decline in heads,
and they're like, oh, my God, the world's coming to an end.
I'm like, no, no, no, just wait a few years and follow this curve. And I was lucky.
The growth was pretty consistent,
and it actually tracked that curve almost perfectly.
And now the hard drive industry recognizes
that their entire business is selling to cloud service providers.
There's no more consumer hard drive revenue of any significance, like if you just fast forward a few more years.
And still, this is going to be a complete boom for them because they have the most efficient dollar per gigabyte answer today.
And there's nothing that I'm aware of that's coming
that will beat them for online hot access
in the dollar per gigabyte range.
And then the question is when you're looking at, you know,
an exponential curve, you say, well, that can't go on forever.
Absolutely, it can't go on forever. But the
question is, are we in the middle? Are we near the end? So two things can happen, right? Things can keep
doing what they're doing, or things can change. And you've got to ask the question: what are
the things that are coming into the cloud, and are things likely to change? And
how big is this data really?
And when we think about a zettabyte,
well, that sounds like a lot of data.
But it depends on how you look at it
and what type of analogies you use for scale.
So my current favorite one is
if we took all the storage that mankind has ever produced,
we can't describe the state of one mole of gas.
So from that perspective, it looks pretty small.
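To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The 48-bytes-per-molecule encoding and the few-zettabyte world total are assumptions for illustration, not figures from the talk:

```python
# Back-of-the-envelope: bytes needed to describe one mole of gas
# versus a rough estimate of all storage ever shipped.
AVOGADRO = 6.022e23            # molecules in one mole
BYTES_PER_MOLECULE = 48        # assumed: 3D position + 3D velocity as 8-byte floats
WORLD_STORAGE_BYTES = 3e21     # assumed: a few zettabytes, order of magnitude

mole_state = AVOGADRO * BYTES_PER_MOLECULE
print(f"one mole of gas: {mole_state / 1e21:,.0f} ZB of state")
print(f"that is ~{mole_state / WORLD_STORAGE_BYTES:,.0f}x everything we've ever built")
```

Under those assumptions, one mole of gas needs tens of thousands of zettabytes of state, thousands of times more than all storage ever shipped.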
And then you've got to think about
what are the applications that are coming
and what type of storage are they going to need?
We'll get more into that later.
So this is kind of taking you down the journey
of how we've worked to improve the efficiency
of storage.
And I strongly believe that as we reduce the cost of online storage, we are enabling more
and more applications.
If it costs too much to retain the data, the data isn't retained.
But I think we also are creating a virtuous cycle in that the number of applications that
can come grows faster than the efficiency
improvements. So we create a bigger and bigger business. If you look at the projections for the
cloud, you know, IDC or whoever, you know, they predict that the revenue for CSPs by the end of
the decade will be a trillion dollars a year and continue to grow. And a lot of that is going to
be data storage. And my favorite line is, there's a reason they call it a data center and not a
compute center, because the data is the important part. And here at SDC, you guys should know that
you're making the most important changes to the future by enabling new storage technologies and making the cloud more effective and more efficient.
So let's go through the efficiency journey.
When we started in Azure, like I said,
we inherited technology that came from Search,
and Search had this very strong meme around
you want all your hardware to be fungible or reusable
because you don't know
where the applications are going to be. And they were riding the 40% improvement from Moore's Law,
and they're like, hey, every year we're doing great. We're getting better and better and better.
But they were buying hardware and selling ads. So the coupling between what they're buying
and what they're selling wasn't very tight.
Their margins on ads are huge.
So they're not looking very hard at their hardware.
When we came in with Azure, we looked at their system,
and the finance people looked at their system,
and we had a benchmark.
We had AWS.
And it turned out that to store data in that system
as a service cost over five times what AWS
was charging. So I kind of started on this journey to work to improve the efficiency of data storage,
along with, of course, a huge team of people at Microsoft. And a lot of the lifting has been done,
of course, by the hard drive industry. When we started, we had 500 gigabyte drives. That's a little
out of date now; I think we're now getting 22 terabyte
drives.
But, you know, if you follow the drive
capacity curve,
they went 1 terabyte, 2 terabyte,
3 terabyte, 4, 6,
8, 10, 12.
Well, it's not exponential.
It means that in order to handle the curve of growth,
we need to deploy more and more hardware.
As I've said earlier, we have kilometers and kilometers of it now.
And to continue to lower the friction for more applications,
we have to continue to push down that cost.
So besides the hardware improvement in hard drives,
we've done a lot of work in how we store the data.
We've added compression systems,
very sophisticated erasure coding systems.
We offer different classes.
We've deployed archival storage.
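As a side note on why erasure coding is such an efficiency lever, here is a minimal sketch comparing raw-storage overhead. The 12+4 layout is a hypothetical example, not Azure's actual scheme:

```python
# Raw bytes stored per logical byte: replication vs. erasure coding.
def replication_overhead(copies: int) -> float:
    """N full copies cost N raw bytes per logical byte."""
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    """k data + m parity fragments cost (k+m)/k raw bytes per logical
    byte, while surviving the loss of any m fragments."""
    return (k + m) / k

print(replication_overhead(3))   # 3.0  -> 3x raw storage for durability
print(erasure_overhead(12, 4))   # ~1.33x, still tolerating 4 lost fragments
```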
I gave a talk at Peter Faulhaber
and Fujifilm's conference.
At the time, that was the cheapest way.
I think it still is, for idle bytes,
the cheapest way to store.
I even went to a data at scale conference
where Facebook was pushing optical storage,
which was derived from DVDs,
and they were saying,
this is the future of cold storage,
and anybody with a little background in physics
can look at the wavelength of light,
the surface area of the disk, and say,
well, that's probably not the future.
There's a lot more surface area on a tape.
So I actually presented after their presentation on optical
to tell them actually tape was just fine
and was going to be huge.
Facebook, from what I hear,
is now the biggest consumer of tape on the planet
and they back everything up.
But these are just some of the opportunities.
But in my day-to-day,
I spend a lot of time worrying about the future,
how we make it cheaper, how we
enable more storage applications.
So
the HDD story has a little,
not a problem for the HDD manufacturers,
but a problem for us, which is
that they're hitting the top of an S-curve,
kind of the capacities I described a minute ago, and they need to make a technology shift. So some of them are shifting to
MAMR, and some are shifting to HAMR, and the reason for this shift is that the bits on the
disk are so small now that using the regular media types, which are stable at room temperature,
the bits won't stay where they are.
They flip.
Some call it coercivity of the bits.
But basically, at room temperature on the media, they aren't stable.
So they need to use a media that is more stable and needs to be excited with energy before it can be programmed.
So that's what MAMR and HAMR are about.
And from talking to them,
MAMR, they think, might get into the mid-to-high tens of terabytes.
HAMR has a roadmap maybe to 100,
but they always surprise us.
So you might assume,
okay, maybe they'll get twice as good
as they claim they're going to get.
In that world, let's say they get to 230 terabytes.
If we look at the amount of power that we consume just to spin the drives,
like forget the data centers, fans, servers, and we follow the current curve, even if we had this 230 terabyte mythical drive,
by 2030, or sorry, 2042,
we would use 5% of the current US generating capacity
just to spin the drives.
And of course, the curve doesn't end there.
By 2050, if we've tried to follow this curve
and provide this amount of storage,
we would be using 60% of the current US generating capacity.
And then if those drives are not 230,
but really 100, we'd be using more power
than we currently generate in the US.
Which, when you talk about exponential curves,
are things gonna stay the same? I can very strongly say things cannot stay the same.
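The percentages in that argument are easy to reproduce. Here is a sketch using the talk's own inputs plus a few assumptions: a 1 zettabyte base in 2022, 40% annual growth, 10 watts per drive, and roughly 700 gigawatts of US generating capacity. The exact shares shift with the base-year assumption, but the shape of the conclusion doesn't:

```python
# Power to spin the projected drive fleet, as a share of US generating capacity.
ZB = 1e21
US_CAPACITY_W = 700e9          # ~700 GW nameplate capacity (per the talk)
DRIVE_POWER_W = 10.0           # watts per spinning drive (per the talk)
GROWTH = 1.40                  # 40% year-over-year data growth

def spin_share(year: int, drive_tb: float, base_year: int = 2022) -> float:
    """Fraction of US capacity spent just spinning drives in a given year."""
    stored = ZB * GROWTH ** (year - base_year)    # projected bytes stored
    drives = stored / (drive_tb * 1e12)           # drives needed to hold it
    return drives * DRIVE_POWER_W / US_CAPACITY_W

for year, tb in [(2042, 230), (2050, 230), (2050, 100)]:
    print(f"{year}, {tb} TB drives: {spin_share(year, tb):.0%} of US capacity")
# Lands in the same ballpark as the talk's ~5%, ~60%, and >100% figures;
# exact numbers depend on the installed-base assumption.
```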
So the question is, well, what has to change?
Well, one of three things has to change.
We have to slow data growth.
Tell people, yeah, this is how much data there's going to be.
This is what it costs.
We don't have any more improvement.
Work with it.
I don't think that's a future anybody wants
because it means all those applications
will not be enabled.
We can generate a lot more capacity and power.
That is going to be extremely controversial
given all our efforts to conserve energy,
use renewables.
So kind of bad timing for, say,
let's stoke more furnaces to generate the power.
There's another thing we can do.
We can change data storage technology.
And then the question becomes, well, how?
I mean, we don't have the playbook.
We don't have the tech.
We haven't heard about anything that can do this.
I'll tell you just historically,
since I started talking about tape and other media types, I've been getting a lot of emails from everybody with a, I won't say crazy
idea, but an innovative idea on how to store data. And some of them are, well, if you just cool down
the data center to like two Kelvin and you put this device in there, then I can store all this data. I'm like,
yeah, that's great. Except for the two Kelvin part. There are people who've come up and said,
well, I can print on paper in multiple colors, and you can use the colors, you know, bit depth and all that. And I'm like, well, how are you going to get the consistent color? And, you know, show me
a prototype. Then we can talk. But it's a constant stream. So lots of people recognize this is a big business.
There's lots of justification to invest and innovate.
But we need a platform that's going to work.
So when Azure Storage started, we had a dozen clusters.
We made our first big purchase.
It was like $80 million for 12 clusters.
Small business. We're a pretty big business now, right? We measure, you know, revenue in the
billions. And we can afford to try and help answer this question on our own. So we've had research
projects in MSR to do DNA storage. Molecular storage is kind of the panacea, right? I mean,
yeah, maybe you can store things in electron spin or something, but DNA has the highest density of
anything that we've actually seen be used to store data. To put it in perspective,
the raw data, not error-corrected, you can get about an exabyte in a cubic centimeter.
That's pretty dense.
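That density claim can be sanity-checked from the physical properties of DNA. A rough sketch, assuming 2 bits per base pair, about 650 daltons per double-stranded pair, and about 1.7 g/cm³ for dry DNA; real systems pay extra for error correction and physical packing:

```python
# Sanity check: raw information density of DNA.
AVOGADRO = 6.022e23
BITS_PER_BP = 2                 # four bases -> 2 bits per base pair
GRAMS_PER_MOL_BP = 650.0        # approx. molar mass of one double-stranded bp
DENSITY_G_CM3 = 1.7             # approx. density of dry DNA

bp_per_cm3 = (AVOGADRO / GRAMS_PER_MOL_BP) * DENSITY_G_CM3
bytes_per_cm3 = bp_per_cm3 * BITS_PER_BP / 8
print(f"~{bytes_per_cm3 / 1e18:.0f} EB per cubic centimeter, raw")
# Hundreds of exabytes per cc raw, so "about an exabyte per cubic
# centimeter" after coding overhead is a conservative figure.
```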
We've done more research into storing data in glass,
which has some very nice characteristics.
And this is actually a case of a crazy email gone good. I got an email forwarded to me about a research project at the University of
Southampton storing data
in glass.
My friend in MSR,
Ant Rowstron,
read the same article, and I'd
already scheduled a flight to go meet the
guy at the University of Southampton.
Ant went and met him later.
And we worked with them
to figure out how to develop
a commercial system based on this.
The very interesting characteristic here
is that as we generate more and more data,
our media types today actually are not
very durable. If I store data on a hard drive
and I try to put it on a shelf
you can come back in five or six years,
but if you come back in 10 years, your data is probably not going to be readable.
So in a sense, as technology has advanced, we've kind of gone backwards, right? We have stone
tablets that are thousands of years old. We have no media type that we use today in a data center
that can retain data for thousands of years. This can retain data for we don't know
how long, as long as we've ever been able to test it for. You can boil it, you can run it at high
temperature, you can hit it with an EMP, and your data is still good. So this is kind of very exciting
work, but it's more archival, not really hot data. We've been doing research into holographic storage.
This is an area where IBM made a significant investment
two decades ago, and people haven't revisited it.
There's a bunch of technology involved in how you create the image,
how you project it onto the crystal, and how you retrieve it.
And we have a lot better technology today for doing that,
so we're exploring this.
The very nice thing about holographic storage
is it's extremely fast.
The images are very large,
so you can retrieve a large image in a millisecond.
That image can have gigabytes of data in it,
so you can figure out that data rate.
It's pretty darn good.
But there are a lot of challenges with holographic storage.
That's why we're doing research.
And we're willing to try lots of different things.
When you're making billions, you can invest millions in the research.
But we're not seeing anything yet that we have a lot of confidence
will displace the primary store.
So the question is, well, what can we do?
Well, obviously, when we look at DNA, we love the density.
It is molecular.
It is small.
It is very high capacity.
We could definitely store the state of several moles of gas with it.
But the problem today is that it's leveraging technology
for medical applications. Medical's incentives around performance
are not aligned with what we need in the data center.
And we need something that can be faster.
So,
when I was thinking about this problem, and many people are thinking about this problem, I asked this question, which is, where is most of humanity's data stored?
Anyone want to guess? Anyone want to put a number out?
There you go.
Okay, so the HDD industry, I don't know if this is a little too small,
like I said, ships about a zettabyte a year.
To power a zettabyte, you need about 50 million hard drives,
which is 500 megawatts, so 500 megawatts per zettabyte.
You know, this is state of the art, maybe.
Well, human brains, by estimates, you can kind of look it up.
Many people have tried to estimate the capacity of a human brain.
And, you know, this is kind of the raw bits, not the full capacity.
I assume there's some deduplication, some very novel representations.
But basically, the human brain embarrasses us.
Eight megawatts per zettabyte.
And, you know, this is kind of a proof point
that within the universe and the world of physics,
there exist solutions for data storage and access
that are far superior to the systems that we have today.
But the question is, what investments do we need to make
in order to access them?
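The two efficiency figures being compared reduce to simple arithmetic. A sketch, taking 20 TB drives at 10 watts from the talk, plus the commonly cited but very rough estimates of about 2.5 petabytes and 20 watts for a human brain:

```python
# Megawatts per zettabyte: spinning HDDs vs. an estimated human brain.
ZB = 1e21

# HDD fleet: one zettabyte on 20 TB drives at 10 W each.
hdd_drives = ZB / 20e12                           # ~50 million drives
hdd_mw_per_zb = hdd_drives * 10 / 1e6
print(f"HDD:   {hdd_mw_per_zb:.0f} MW per ZB")    # ~500 MW

# Brain: ~2.5 PB estimated capacity on ~20 W (rough literature figures).
brain_mw_per_zb = (20 / 2.5e15) * ZB / 1e6
print(f"Brain: {brain_mw_per_zb:.0f} MW per ZB")  # ~8 MW
```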
So, you know, this is just one example.
I could write a similar roadmap, you know, for any type of storage.
But, you know, this is talking about specifically the integrated circuit.
So my entire life, since 1971, and I was born in '67,
we've been on the integrated circuit.
And if somebody asked me what happened before that,
I'm like, I don't know.
So I went and looked a little.
There were designs for computational devices even in the 1800s.
They didn't actually get built or work.
We've had mechanical systems.
We've had systems based on thermionic valves, relays, vacuum tubes, and then big systems built
on individual transistors. And probably everybody in this, well, there's some older people here,
the boomers, they're probably aware of those older systems. But, you know, I've never even considered, until we went through this exercise, that we're just sitting on one platform.
And that platform has done so well that we haven't thought about other platforms for computing and storage.
You know, we've done pretty well with the integrated circuit.
You know, we're sitting at 100 million times improvement.
Do your own calculation. It's something between this and a billion times. We probably have 100x
to go. People are talking about 1.6 nanometer or even maybe half nanometer. Still a lot of research.
But it's pretty clear to me, my opinion, that we've kind of pushed this platform pretty far.
And we're now into the world of optimizing for the application.
We're seeing many more application-specific designs. Using kind of the von Neumann machine as a general-purpose answer
isn't working anymore.
We've gone to GPGPUs.
We have other classes of accelerators like DPUs. So we're
specializing. We're not in the general purpose processor anymore. And sometime in the next few
years, decade, two decades, if we want to keep riding this exponential improvement, we're going
to have to look deeper into the world of physics and different principles in order to continue going.
So if we follow the curve and we want to get to yottascale,
we're at zettascale today,
the curve intercepts yottascale in 2042.
And we discussed how much power that would take,
which is unacceptable.
And we look at the existing roadmaps that we have.
And for capacity, only DNA on here
can hit these capacity things at a reasonable power.
DNA is, I'll say more generally,
is a molecular class of storage.
There are other ways we can do storage based on molecules.
And I don't know if DNA is the right molecule,
but it's the one we have the tools for right now.
And we're going to start investing there.
We are investing.
And we're going to try and build useful applications.
And we'll hopefully learn a lot of things
about manipulating molecules.
But there are other people looking at manipulating molecules.
And what's pretty clear, you know, I mentioned you can't simulate a mole of gas.
You can't even represent it in storage.
So people are trying to use AI in order to not simulate things,
but infer through AI how things will behave.
So Google has produced AlphaFold,
which is an AI that figures out how proteins are going to fold.
It is extremely good at predicting it,
and it is not sitting there simulating the atoms and molecules
inside a protein to figure out how it's going to fold.
It's learning in a neural net based on other rules and experiences.
And from what I've read,
they've successfully predicted the folding
of something like 200 million known proteins.
Pretty good result.
Microsoft has a research project called AI for Science
where we're doing molecular simulation.
There's a lot of medical research,
and a little controversial,
but there's a lot of billionaires walking around,
and they have a limited time span.
And this type of technology might give them a longer time span.
And they're directing a lot of their financial net worth
towards figuring out how they could stick around a little longer.
And fortunately for us, that investment overlaps with storage needs and maybe computing needs.
In terms of this area, besides figuring out how to manipulate molecules, how to fold proteins,
and how to build molecular machines that we can possibly build systems out of,
it's pretty clear that our new tech is going to have to bridge to the old tech,
which is our integrated circuits.
So we're going to have to figure out how to interface with it.
And the prediction for me is if we can get this type of system working,
we can enable yottascale and zano-scale without melting the planet. Any questions? Seriously? Okay.
Several slides back, you had a projection of energy consumption.
Was that just Microsoft's or?
No, that's all hard drive capacity.
And that's global? Was that U.S.?
That's a global measurement using U.S. power generation.
So the United States generates, sorry, has capacity for 700 gigawatts; it isn't generating it all the time.
So that's like an average. Worldwide, the generation capacity is 7.1 terawatts.
In terms of the hard drive or whatever drive you're using, what's the basis for that?
A 10-watt hard drive. Okay, 10-watt hard drive.
Okay.
Yeah, so the question is, you know, we have an estimate on, like, how much storage we need.
Do we have an estimate on what it would cost to develop DNA storage?
And the answer is, no, I don't.
But, you know, I think that we're talking about materials that are, like, generally available.
They're all in you, presumably. You know, there's a path to synthesis
that is pretty cheap and inexpensive. So one kind of placeholder I use in my head is, like, you
know, every one of you has a bunch of ribosomes in you. That's a pretty good engine for translating
DNA into proteins. Maybe we can leverage something like that. Yeah, the question is,
we have an estimate on power,
but do we have an estimate on the depth
that we would have to cover the planet in HDDs?
Yeah, the planet is really, really big,
and HDDs actually aren't that bad.
Definitely at yottascale,
you wouldn't even cover the Earth with HDDs.
Definitely
at zano-scale,
you'd probably be a few feet deep.
Anyway,
the other thing that I want to point out
for everyone here, especially the younger
ones, is that these are projections based on exponentials.
The future isn't kind of written, right?
But the pressure from the industry and mankind to store more data is pretty obvious when
we look at these curves.
This is a future that we would have to enable by making the right investments.
It's a future that we have to create if we want all the technology and capabilities that this type of storage will enable.
You could argue that you're sort of blindly accepting the need for data growth.
Those same kind of projections are used for demand for water, for example.
If we can start putting real prices on what water costs to consumers,
it can knock the demand down
pretty significantly. Is there a way
to do something like that for data growth?
I'm sure there is.
I think we're about to do it
and see what happens if we don't get
some new technology.
When I look
at the data sources,
there are more than enough data sources
to continue driving this.
There's sensors, there's cameras.
An interesting statistic is
there will be more security cameras than people
by the middle of the decade.
And the question is,
what is the value of the data
versus the cost of storing it?
And as long as we continue to shift the equation to the cost of storing it being cheaper,
then the data will get stored, as long as it has the value.
I'm sure there's lots of video cameras that post 9-11,
the U.S. government would have wished had longer retention,
so they could go and trace back things that happened.
So I really kind of believe that the data growth and the different sources,
when we started, there was a clear one.
It's like all the cell phones are going to get backed up.
Now it's clear these AI data sets are massive.
We have weather data, weather sensors.
We have space telescopes capturing incredible resolution images. And the counter to that is there's a whole bunch of those things pointing back at the Earth.
And they might be even better tech.
And what are they recording and what type of retention?
I don't see any shortage of data sources if we can make the data inexpensive enough.
And I think that the applications that get enabled
are helpful for mankind.
I would like a personal digital assistant
that is as smart as I am to help me out day to day.
I think everybody would.
We know that it's going to take two petabytes per person,
which across everybody is a few yottabytes of storage.
I think it's a useful application.
If we make it cheap enough, it's going to happen.
Let's make it cheap enough.
Have you done any calculations with that?
I haven't, and the reason is, you know,
we obviously are very focused on all classes of storage.
We currently have about, I don't know,
between 5% and 10% of our storage on Flash.
Flash continues to be significantly more expensive
than hard drive, and with the hard drive roadmap,
we aren't getting any signal yet
that the Flash industry has a path
to be more cost-effective per byte than the hard drives are today.
Now, certainly, there's always innovation, right?
Lots of smart people are working on this problem, so that could change.
So, yeah, maybe there's some solid-state solution that will help solve this. Maybe there's something with, what is it called,
carbon nanotubes or sheets of carbon storing data in there.
I mean, there's lots of research going on.
We might be able to achieve much higher densities
with things that are much less radical.
And I think that relative to the opportunity,
the cost of doing all this research is kind of peanuts.
You know, you tell me, when I talk about a trillion dollar
a year industry, you're spending a few billion dollars
a year on research.
I think that's probably a good budget.
And I think the world's doing it.
I mean, I know definitely in the direction of molecules,
that kind of money is being spent.
This is the flip side of this, an economic argument.
In other words, if we don't find a new technology medium,
you know, me storing 150 pictures of my cat,
the incremental cost of that is going to go up exponentially because
I'm going to start eating into the 500 megawatts per zettabyte and all that, right?
Isn't that kind of a flip side of this?
The question, I'm not quite sure what the question is.
Can you repeat it again?
If we don't do this, if we don't find another medium, then if we continue down this HDD path,
in 2042, it's going to cost me, you know, 20 bucks a year to store a picture of my cat.
Well, if the resolution of your cat picture continues to accelerate, yes.
But, you know, we keep getting better, right?
So, you know, we have a clear roadmap to the middle of the decade,
maybe through the end of the decade,
to continue to reap improvements from storage density.
That's not where the problem is.
And as we go, your cat pictures are going to get cheaper and cheaper.
I hope to make your cat picture one-tenth the cost to store.
The question is beyond that, right?
As the growth continues,
how do we make it so that we can
support that growth
economically?
I think you need to add
time to access
into your thought.
Absolutely.
If you look at the current interface, it's certainly the only way to get serious...
Absolutely.
Right, but we have a few proof points that there exist systems that are much denser
that are efficient.
So like you can sort through all the data
of all the people you've ever seen
in about 20 milliseconds to do recognition.
Now that's not strictly DNA storage,
but I'd say it's based on molecular
or electro-molecular machinery.
So there's a huge space of design to explore here.
Molecular is just kind of our first
toe in the water to start understanding this space.
All right.
Is that two petabytes per brain, is that DNA storage?
No, no, no.
That's neuron storage.
That's neural interconnect.
There's a bunch of different analysis of it.
They all come to the same number.
I don't know if they cheated off each other,
but they did it through different sets of calculations.
And this tends to be the range.
I mean, maybe it's a tenth that.
Maybe it's ten times that.
The point is, you know, still the same, right?
That's neurons, not DNA.
Correct. That's interconnected neurons, not DNA.
Thank you.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers
in the storage developer
community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.