Storage Developer Conference - #183: DNA Data Storage Alliance: Building a DNA Data Storage Ecosystem
Episode Date: March 7, 2023...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode number 183.
Good morning, everybody. Welcome to the DNA Data Storage Track.
My name is Dave Landsman, Western Digital, and I'll be kicking it off.
I'm going to talk about building a DNA data storage ecosystem,
as well as giving a little bit of an overview of DNA data storage.
I'm not a molecular biologist or a chemist, so bear with me, but I'll try to give a short summary. So let's just dive in.
So the problem we're here to talk about in the track is fairly straightforward. People are digitizing more data than we can save cost-effectively,
and we need new mechanisms for archival storage at high volumes, potentially zettabyte scale.
And simultaneously, the value of saved data is growing. I like this chart that Fred Moore at Horison did because, while it's somewhat qualitative, it shows a curve of value for data. You can see the pointer. Data, typically when it's saved, starts out useful; that's why you're saving it. Then the value tends to taper off, but it rises again as time goes by, when people want to retrieve it or mine it or something like that. So this is driving some of the trends. The basic problem is too much data, and the value of it is growing.
People want to mine it and extract value from it.
So why are we talking about DNA?
So primarily, DNA bits are really small. This is a picture of an LTO tape cartridge, and with some rough estimates of bit density and packing, if we filled the space of this cartridge with DNA, you could fit about two exabytes of data in this small container, which is over 100,000 tapes' worth. Everybody has their own metaphor or analogy for scale, but this gives you the scale. The other thing about DNA bits is they last a really long time.
You can think woolly mammoth, right? It's a little bit of an oversimplification, but we've recovered DNA, at least pieces of DNA, from fossilized bones.
They last a long time, and the bits don't need much care and feeding; again, think of the fossilized example.
The DNA is very durable.
And storing digital data in DNA is potentially a lot less resource intensive and less costly than existing storage.
And then finally, DNA bits don't need migration.
You don't need to keep migrating the tape library every five years or 10 years. And so these factors add up to DNA
having a very good potential TCO story.
And lastly, the DNA ecosystem already exists, right, for medical and scientific applications, and it's exploded in the last few decades.
And that's why we're even able to be here talking about DNA data storage because of all the techniques that have evolved there.
But so the DNA data storage ecosystem will benefit from that investment and that energy in the market.
And it will also add its own.
So we just see a virtuous cycle evolving with respect to DNA storage and the market.
So with all this goodness, potential goodness, for DNA as a storage medium, there are still questions.
Do we really need something as dense as DNA? It's really dense, but maybe we don't need that many bits. Also, can we scale the underlying technologies? I'll talk a little bit about this as we go, but synthesis and sequencing are areas where we're working to get performance and cost to scale. And then, how do we create an interoperable DNA ecosystem?
Because if we create DNA storage products or an ecosystem, they need to fit in.
it needs to fit in.
It's going to be a complement to existing storage.
It's not going to replace hard drives or tapes.
It's going to augment them.
So we need an interoperable ecosystem.
So on this first question, Aaron Ogus of Microsoft will talk tomorrow about what Microsoft feels is the compelling need for molecular data storage. Today in the track,
I'm going to continue here with an overview of DNA storage and then a little bit about what the DNA Storage Alliance is doing to build the alliance.
Joel Christner is going to talk about an issue we call Rosetta Stone, which is kind of like: how do we bootstrap a DNA archive and discover what's in it so that we can decode the rest of it?
So it's kind of a master boot record talk.
Alessia Morelli and Rino Micheloni of DNA Algo are going to talk about a full system simulator for the DNA storage pipeline.
So we need tools to model the channel. And let's see,
Zhao Reis and Marila Menossi,
they're from a research institute in Brazil, the Instituto de Pesquisas Tecnológicas, and they're associated with Lenovo,
and they're going to talk about end-to-end DNA storage,
an end-to-end DNA storage system that they built and studied.
And then João Gervasio and Adriano Galindo-Leal are going to talk about DNA coding, kind of the last 10 years of encoding for DNA data storage.
And lastly, Luca Piantanida from Boise State is going to talk about nucleic acid memory.
So hopefully you'll enjoy the track.
Oh, and then we'll have an informal Q&A at the end.
I like this room for that, so we'll just chat.
And yes, they have uploaded the slides if you want to get a copy. Hopefully the screen is big enough, but it's a little small.
Okay, so.
You know, I think everybody is familiar, even if it was from, you know, high school biology courses, right? So the DNA molecule is a chain of bases, you know, adenine, thymine,
cytosine, and guanine. And the bases have a natural affinity for each other. So A and T
bind to each other, and C and G bind to each other. And they're connected in the molecule by a
sugar phosphate backbone. So this is our building block. And in concept, DNA data
storage is very simple, right? It's the devil is in the details making it work. But we take bits,
digital bits and files, we encode them into this language of A, T, C, G, because, I mean, DNA in
our bodies is a storage and encoding mechanism.
And so we encode the ones and zeros to the bases.
Then we build a molecule from those bases and store it.
And on the way out, we do the reverse:
we sequence it, or read it,
and we get our bases back, and then we decode them back into digital bits.
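To make the round trip concrete, here is a minimal sketch in Python. The fixed two-bits-per-base mapping is an illustrative assumption, not a codec from the talk, and it omits the ECC, metadata, and sequence constraints that a real pipeline adds.

```python
# Toy sketch of the encode/decode round trip: digital bits -> A/C/G/T -> bits.
# The 2-bits-per-base mapping is an assumption for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Map every 2 bits of the payload to one DNA base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(oligo: str) -> bytes:
    """Reverse the mapping: bases back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in oligo)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

At this density of two bits per base, one byte becomes a four-base run, and decoding is a pure table lookup.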
And, like in electronic channels, we add ECC and metadata for various purposes into the chain of bases. One more thing: I'm going to use a term as I talk called oligos. It's short for oligonucleotide, and it's basically a short strand of synthetic DNA or RNA.
So you'll hear this term a lot, probably from me and other speakers. So that's
what that is. Okay. So I kind of talked about DNA as a channel, right? And when I was getting into this, it helped me to think about some of the analogies between DNA storage and an electrical channel.
So we are building a protocol through the pipeline, right?
Some of the DNA bits will be protocol bits,
some will be payload bits,
and there may be protocol at the end.
A particular string of DNA can have fields and protocol,
just like we're used to with storage protocols.
In an electrical channel, of course, ones and zeros are converted to analog waveforms;
in DNA, the waveforms are replaced with the bases. So we convert bits to bases.
And we have to worry about various aspects.
Just as in an electrical channel we add ECC bits to the digital bit stream,
in DNA we do the same:
before we synthesize, we add ECC bits and other metadata to help us. And in the case of DNA,
we're worried about errors like insertions, deletions, and substitutions. And then in an electrical channel, in some cases, like with memory buses, we'll add
scrambling patterns at the transmitter, because certain patterns of ones and zeros can create
electrical interference on the wire. Similarly, certain patterns of bases in DNA processing can be problematic, so we may alter the symbol stream
after we've added the ECC and metadata.
So as with an electrical channel,
this kind of line protocol that we've got for DNA storage
is critical to overall channel efficiency
and how everything works.
And you'll hear more about that later today.
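A constraint screen of the kind just described can be sketched as follows. The specific thresholds here (maximum homopolymer run of 3, GC content between 40% and 60%) are illustrative assumptions, not values from the talk.

```python
# Hedged sketch of a sequence-constraint check, analogous to scrambling in an
# electrical channel: reject base patterns that are hard to synthesize or read.
# Thresholds are assumptions chosen for illustration.
def violates_constraints(oligo: str, max_run: int = 3,
                         gc_lo: float = 0.4, gc_hi: float = 0.6) -> bool:
    run = 1
    for prev, cur in zip(oligo, oligo[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True  # long homopolymer runs are error-prone
    gc = (oligo.count("G") + oligo.count("C")) / len(oligo)
    return not (gc_lo <= gc <= gc_hi)  # extreme GC content is also problematic
```

An encoder would re-map or re-randomize any candidate oligo that fails this check before sending it to synthesis.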
And there's also logical protocol layers above the line protocol, file tagging, packetization.
If we have a file, we break it into pieces. Okay, so now let's look at two of the important areas, writing and reading, starting with synthesis. There are two main approaches: phosphoramidite chemistry, which is the mainstream, most productized technique, and an evolving area of techniques called enzymatic synthesis. One of the interesting things about enzymatic chemistry is that it doesn't involve as many caustic chemicals, and it may also enable us to build molecules more effectively.
So this area is base-by-base synthesis. So we build a molecule, one base at a time, in kind of a cyclic process.
We start with a single base with a blocker on it.
We deblock it.
Then we add a new base that has a blocker on it, and then oxidize to solidify things.
And then we start again.
So very, very simplistically, we build molecules base by base.
And with either of these, phosphoramidite or enzymatic,
the limit today is about 200 to 300 bases in an oligo.
If you go beyond that, you start getting too many errors.
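The length limit follows from simple arithmetic: if each coupling cycle fails with some small probability, the fraction of perfect strands decays exponentially with length. Here is a toy simulation, with an assumed 1% per-cycle failure rate (an illustrative number, not a figure from the talk).

```python
import random

def synthesize(target: str, p_fail: float = 0.01, rng=None) -> str:
    """Simulate base-by-base synthesis (deblock, couple, oxidize each cycle).
    A failed coupling is modeled as a deletion; real error modes are richer."""
    rng = rng or random.Random(0)
    return "".join(base for base in target if rng.random() >= p_fail)

# Expected fraction of error-free strands of length n is (1 - p_fail) ** n:
# at 1% per cycle, ~90% of 10-mers are perfect, but only ~5% of 300-mers.
perfect_300 = (1 - 0.01) ** 300
```

This exponential falloff is why pushing past a few hundred bases per oligo yields too many defective strands to be practical today.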
Now, if people want to put together longer strands of synthetic DNA, they can use ligation techniques.
So this is, again, where the affinity between the bases comes in, right? If we have some short oligos, like the yellow and blue here,
we can construct what we call a DNA splint,
and you can actually use the affinity of the complementary bases
to glue longer strands together, and thereby get strands of synthetic DNA that are many
hundreds of bases long, or maybe more.
There's a very deep debate going on about whether smaller oligos are better in general, or longer ones.
One thing about having longer oligos is you have more payload bits per oligo, so your protocol overhead is lower, just like you're used to.
But the industry is still evolving here on the best way to build DNA molecules.
So here I wanted to talk briefly about kind of where some of the progress is.
So this is a study that was done at the University of Washington and Microsoft.
And they built an electrochemical array to synthesize DNA. The basic result of the study was that they were able to do synthesis on this chip
with 650-nanometer electrodes and 200-nanometer wells.
So they actually are putting chemicals and molecules and reagents in little nanostructure wells.
They were able to contain the acid diffusion at each site.
The chip they built was two-micron pitch, and controlling the acid diffusion at each of these wells is very important:
when you take the blocker off the molecule so that you can add a new base,
if the acid diffuses too much over the array, you might affect other molecules.
So they were able to show 100-base-long oligos at these feature sizes.
Now, some people ask how quickly you can write DNA data.
This chip in the study reached a synthesis density of 25 million synthesis sites per square centimeter,
which was about three orders of magnitude greater than previous work.
So it was quite a step forward.
And if you take an array with this density, you could achieve a write rate in the kilobytes per second per square centimeter.
And the authors felt this was like a practical minimum.
So for certain archival applications
where the write speed is not so critical,
this might be commercially viable.
It's a judgment call, but they could see an array of this size serving a purpose.
At this density, though, if you wanted to get to, say, a megabyte per second, you would need a 360-square-centimeter chip, which is a little big, or you would need many, many chips.
So there's a long way to go to scale, but this was a very good experiment, and the paper documents real progress in synthesis.
And furthermore, we're seeing continued progress. The chip in that study had 25 million synthesis sites per square centimeter.
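A back-of-envelope check of that arithmetic: the 2.78 kB/s per square centimeter figure below is my assumption, chosen only to reproduce the 360-square-centimeter number; the talk itself just says "kilobytes per second."

```python
# Consistency check of the write-rate figures: if an array writes on the
# order of kilobytes per second per square centimeter, how big a chip do
# you need for 1 MB/s? (2.78 kB/s/cm^2 is an assumed value for illustration.)
write_rate_kBps_per_cm2 = 2.78
target_kBps = 1000.0  # 1 MB/s
chip_area_cm2 = target_kBps / write_rate_kBps_per_cm2
```

So a roughly 360-square-centimeter chip, or equivalently many smaller chips, as the talk notes.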
And Twist Bioscience just announced a chip with 100 million synthesis sites.
And they can write about a gigabyte per run.
You load it up with reagents, et cetera, do a run, and get one gigabyte. The reason I mention the one gigabyte is that IARPA created a program, the Molecular Information Storage program, and they've set goals that they think we need to achieve to get commercially viable molecular storage.
And if you look there, the goals by mid-2024 were to have a 10-gigabyte chip at about a dollar per gigabyte.
I can't talk about any prices here because they're really hard to come by from the vendors, but we have a one-gigabyte chip in 2022, and some others. So the indicators are that the industry is starting to scale.
That's the real message. And at least against the goals that IARPA set, we seem to be moving along this curve reasonably well.
Okay, so much more scaling is needed, but the foundations are established is the main message.
How am I doing on time? I don't know if I'm going really slow.
Let's see, 8:50.
Okay, so we talked a little about synthesis. Now we're going to talk about storage in the middle. I'm not going to go into any detail here, but there are many preservation
methods being looked at for when we store DNA.
There's chemical encapsulation, there's physical encapsulation; you can bind the molecules in a material matrix of some kind; you can even use adsorption on paper.
So there are many techniques, and there have been many studies of biological DNA, how the molecules simply exist and how slowly they degrade in nature. There is now an effort to start looking at how we store synthetic DNA with an eye toward data retention. I'll come back to that point a little later in the talk.
The basic thing about DNA storage here is keep it away from water and air.
I mean, if you do that, you're in pretty good shape.
But there are many more aspects of how you store it and for how long you need to store it and how hard it is to retrieve for a particular method.
So I'll talk about that briefly later.
Okay, so sequencing, snapshot of sequencing.
So there's kind of two main techniques for sequencing today.
One is sequencing by synthesis,
and it's called sequencing by synthesis
because you start with a single-stranded DNA.
That's the template.
And you put it in solution,
and then you add bases into the mix.
And as each base binds to the template, in the Illumina case,
the events are detected by visible light. Each reaction event is detectable,
and because of the complementarity,
you can read what was in the original strand.
So it's kind of an indirect way of reading DNA.
Nanopore is the other technique that's really getting a lot of attention and
investment. And that's where we guide a DNA strand through a very small channel. And it either can be
a natural biological channel, or a semiconductor, a synthetic one. And as the DNA molecule transits through the pore, it creates ionic current disruptions or
tunneling currents. These events can be detected, and the bases are directly read.
So there's a great deal of energy going into this. Oxford Nanopore has a product here; Illumina is the biggest, and there are many sequencing vendors doing SBS.
In general today, SBS is more accurate and slower, which is maybe more appropriate for medical and the traditional uses of DNA technology,
and nanopore is slightly less accurate on a per-base basis, but faster,
and in terms of system throughput there's a battle going on over which is better.
Okay. So as far as scaling sequencing, today the high-end Illumina machines, for example,
are at a throughput level of tens of gigabytes of data per day.
So it's not very high.
We probably need to get to hundreds of terabytes per day to be really useful.
And from a cost standpoint, Illumina presented this slide at FMS, showing where their products are with respect to gigabases per day. If we
assume one bit per base, which is conservative (there are nuances here, and we might be able to get higher bit density), then that puts us at $48,000 per terabyte today, which is a little pricey.
But we see a direct line of sight to $8,000 per terabyte if we get to $1 per gigabase,
and no conceptual hurdles to $800 per terabyte.
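Those figures are easy to check. Assuming exactly one bit per base, one terabyte is 8e12 bits, hence 8e12 bases, or 8,000 gigabases:

```python
# Sequencing cost per terabyte as a function of cost per gigabase,
# at an assumed one bit per base (the slide's conservative assumption).
def usd_per_terabyte(usd_per_gigabase: float, bits_per_base: float = 1.0) -> float:
    gigabases_per_tb = 8e12 / bits_per_base / 1e9  # 1 TB = 8e12 bits
    return usd_per_gigabase * gigabases_per_tb
```

At today's implied ~$6 per gigabase that gives $48,000 per terabyte; $1 per gigabase gives $8,000, and $0.10 gives $800, matching the slide.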
And these prices are still kind of oriented around the markets that they serve, which
is medical and scientific.
So there's, I mean, cost and price are flexible, as we know.
So as DNA data storage evolves, not only will the technology evolve, but the pricing models
and things around it will evolve.
So the other thing to mention is that DNA storage can tolerate much higher error rates than medical and scientific genomic applications.
So the key will be what are the error correction technologies and how efficient can we make the pipeline for DNA storage versus these other markets.
So in conclusion, we have probably three orders
of magnitude to go on cost, price,
and throughput for sequencing.
But that said, there are many ways to manipulate
all phases of the pipeline that we talked about
from synthesis to sequencing,
where we can balance the error tolerance at each step and thus the performance of the whole pipeline.
So, okay.
So that's all I had on sequencing.
So one last thing.
I think I'm going faster than I thought.
So just to give you a very small peek:
there are people working on doing selective retrieval of data in DNA,
so that you don't necessarily have to read
the whole archive at once,
though that would be one application.
But if you want to retrieve files,
there's a lot of work going on to do random access
or in this case, file filtering.
So the basic idea here is that we have a database of pictures,
and we encode the pictures themselves somehow, either in DNA or otherwise.
But we also encode an index.
For each picture, we encode an index oligo that has some kind of feature, like catness, right?
So we have a few cats, and another one might be called automobile.
So it's a category index.
And each feature oligo also has an ID.
So it's got a feature and an ID. And then we can construct,
again by synthesizing, a query to the database.
That's this green arrow here, and the bases in this query oligo
will be complementary to the feature region
in some of these indices.
And in this example, in this paper, we also put a
little tag on the end of the query; I'll talk about what that is in a minute. So when you put the query oligo in the presence of the index oligos,
they will bind by hybridization.
That is, by the affinity of the bases, they will bind.
And then that's where this little biotin tag comes in.
In this case, it's attached to a magnetic nanoparticle,
so you can then pull this little Pac-Man thing here.
Right. So you can pull the indices that correspond to cats out of the database, and we get our query results.
We don't get the automobile, et cetera.
I'm showing one paper here.
James Tuck of North Carolina State gave a talk about some of this also at SDC last year, and the presentation is there and has references.
There are many, many papers on random access.
Okay, so I wanted to paint a picture of the progress there.
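In software terms, that selection step behaves like matching on reverse complements. Here is a toy sketch, with invented sequences and record IDs, simulating perfect hybridization (the real pulldown uses the biotin and magnetic-bead step described above, and binding is not perfect).

```python
# Toy model of hybridization-based retrieval. Feature sequences and IDs are
# made up for illustration; real systems tolerate imperfect binding.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """The strand that hybridizes to `seq` (antiparallel, complementary)."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

# Index oligos: (feature region, record ID).
index_oligos = [
    ("ACGGTACC", "cat_001"),
    ("ACGGTACC", "cat_002"),
    ("TTGACGTA", "car_001"),
]

def pulldown(feature_region: str) -> list:
    """Simulate the query: synthesize the reverse complement of the feature
    region, then collect every index oligo that would bind it."""
    probe = reverse_complement(feature_region)
    return [rec for seq, rec in index_oligos if reverse_complement(seq) == probe]
```

Querying for the "cat" feature region returns both cat records and leaves the automobile behind, mirroring the Pac-Man pulldown in the figure.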
So in conclusion, in terms of DNA data storage, it's resting on a solid foundation. I mean, when I
first got into this a couple of years ago, it felt like science fiction to me, but it has moved.
It's science and it's technology, and now we have to scale it. Okay.
So now I'm going to talk a little bit about use cases.
If we build DNA data storage, who wants to use it and why? The DNA Data Storage Alliance did some research:
we held a user conference in 2020,
and we've done interviews with people in different markets, from automated driver assistance to media and entertainment to digital art and preservation,
trying to get a sense of what people want. There's quite a diversity of needs,
and honestly, when you ask somebody how they would use DNA, they're doing a thought
experiment; they're not quite sure. But they do know that they want to save
data for a really long time. Before I go there, I'd say there are two categories.
If you look at digital art and preservation, the amount of data is
very small, but artists want to save it for a long time, forever. And things like the Shoah Foundation, which we talked to,
and historical societies and libraries,
they want to save things even if the data is not gigantic.
On the other hand, you've got hyperscale vendors,
which Aaron is going to talk about tomorrow,
and governmental requirements, and streaming and media entertainment.
So there's a great deal of data.
There's also sensor data coming from all the smart cars and cities.
So it runs the gamut from people who want to keep a small amount of data forever
to others who want a large amount of data for a long time, but maybe use it more;
it's not a pure archival use case. In all of this, data retention is the key.
This is what's changing: we're digitizing all this data either to discover new things in it or to
monetize it. Fields like healthcare, astronomy, climate science, sports. When we
did our user forum, Major League Baseball came in and talked about how they have a gigantic tape library that
they need to migrate, and while something like DNA is obviously not real time,
at the archival end of their system they need a lot of scale. And there are others like that. So
everybody's trying to save data and not delete it, because we don't know what we might
discover in that data, or we want to try to monetize something in it. And if we can
store more data at lower cost, then we don't have to throw it away, which is what is happening
today in many cases. People throw data away because it's too expensive to keep it.
And this emphasis on data retention puts a lot of emphasis on total cost of ownership. So we did an analysis of the costs of keeping data.
For the price of tape, we used the Fujifilm TCO calculator; tape is the red bar.
We took cloud list prices from AWS, just public pricing.
And we estimated some DNA data storage prices based on selected cost scenarios:
$100 per terabyte is the yellow, $50 per terabyte is the green,
and the orange on the end is if we get to $25 per terabyte.
And the basic message is that the fixity checks and migrations for traditional storage begin to swamp the cost of ownership over time.
So if you want to keep a lot of data for a long time, you need something, whether it's glass, whether it's DNA, something.
So the market needs a solution.
And the other thing about DNA: it definitely minimizes energy consumption and improves sustainability.
And this model was based on one copy of the database.
And obviously, if you have a very valuable archive and you want to keep a few copies of it geographically dispersed, you can do that easily and more cheaply with DNA.
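The shape of that argument can be captured in a toy retention-cost model. All dollar figures below are invented placeholders, not the Fujifilm or AWS numbers from the slide.

```python
# Toy retention-cost model (my own simplification, not the Fujifilm TCO
# calculator): media needing periodic migration accumulates cost over time,
# while a long-lived medium is dominated by its one-time write cost.
def cumulative_cost_per_tb(write_cost: float, years: int,
                           annual_upkeep: float = 0.0,
                           migrate_every: int = 0,
                           migrate_cost: float = 0.0) -> float:
    cost = write_cost + annual_upkeep * years  # upkeep: power, fixity checks
    if migrate_every:
        cost += (years // migrate_every) * migrate_cost  # refresh to new media
    return cost

# Invented illustrative numbers: over 50 years, a cheap-to-write medium with
# migrations overtakes an expensive-to-write, migration-free one.
tape_like = cumulative_cost_per_tb(20.0, years=50, annual_upkeep=1.0,
                                   migrate_every=7, migrate_cost=15.0)
dna_like = cumulative_cost_per_tb(100.0, years=50)
```

The crossover point depends entirely on the chosen numbers; the structural point is that migration and fixity costs scale with time while a durable medium's cost mostly does not.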
So like I said, you'll hear more about this general problem of too much data tomorrow.
So how do we build the ecosystem?
So there's an organization called the DNA Data Storage Alliance.
And this was started by Twist Bioscience. It was formed in October of 2020,
and Illumina, Microsoft, Twist, and Western Digital agreed to be founding members.
We created a promoter-group kind of structure, and we climbed to about 60 members by the second quarter of this year.
And we realized that if we wanted to, you know, to build standards and really build the industry, we needed to either incorporate or become part of something.
So we decided to join SNIA.
We joined SNIA as a technology affiliate group in June of 2022.
And our mission is to create and promote an interoperable storage system based on DNA.
And the scope is to educate the market, one.
Two, to develop a DNA storage roadmap, kind of like, you know, IARPA has worked on this,
but we want to put together a DNA data storage industry roadmap that kind of points the way
for research and development investment.
And then we want to develop standards and specs as warranted.
I think we're trying to be very humble about not prematurely standardizing things that aren't naturally suited to it, because you don't want to stifle innovation.
But we do believe there are some areas that we can work on now.
So what we're working on now is the roadmap,
and we've started some work groups in the Alliance for standardization.
We published a white paper, an introduction to DNA data storage.
I should have put a link to it in here, but it's on our website.
And we're working on white paper number two, describing some of the market data and observations we got from what I was talking about a few minutes ago. We started a newsletter, and we're doing events. As for the industry technology roadmap, I think I've told you what it is, so I won't reiterate too much. We hope to have it done, at least drafted, by the end
of the year. It's looking a little aggressive at the pace things are going, but we're
going to try. Among the work groups in the TWIG, we have three.
And this first one on the left, you'll hear about from Joel in the next talk.
And that's about how we design a DNA archive, and how you discover what's in that archive and are able to decode it when there will be different DNA codecs; people have different ways of doing things.
So we want to support innovation, but you need a standard way to find out how you bootstrap
yourself into the archive. So Joel will talk a lot more about that. There's another group
called Interoperable Interfaces. This one is taking a little while to get going,
but the purpose is to ensure physical compatibility of synthesis and storage.
It's about the mechanics of doing DNA storage,
so that you can end up with plug-and-play swaps of instruments,
and recovery of molecules for read irrespective of the supplier, whether the supplier still exists at the time or not, plus general issues of fluidics in data centers.
It's not going to involve massive gallons of liquids, but there are things we have to account for.
So we're starting to look at the mechanical and interoperability aspects of the pipeline.
And then the third work group we're starting is about retention and endurance. We will have multiple solutions for storing data, from multiple vendors, and we need ways to compare them.
We also need ways to do accelerated wear testing and the like, so that we understand what the metrics and claims made by a particular solution mean, and so that we can believe them. This has been done for different media and different aspects of technology around the industry.
And this picture here, which I'm realizing is way too small,
was from a study that Microsoft and the University of Washington did, where they exposed samples of DNA to
different environmental and other conditions and measured the data retention properties.
So this is a very complex area and we're just starting to get going.
We want to really try and define terms and get everybody speaking the same language
so that we can proceed.
So these are the three work groups
that we've got going now.
There will obviously probably be more later on
in the alliance.
Okay, so just one final disclaimer before closing.
DNA is really not like other media, right?
I actually had somebody ask me, only part joking, does this mean I'm going to be putting my music in my dog? So we always try to say that
we're just using the building blocks of molecules to build storage. We're storing
digital data, and DNA data storage doesn't require, use, or create any cells, organisms, or life. Instead of electrons, we're using molecules.
So that's all I've got. Enjoy the rest of the track. I hope you'll stay, and then come back
tomorrow at 11:20, when Aaron will talk about Azure's observations on the data explosion.
And that's it.
I guess I am done.
Any questions?
Yes.
Well, I noticed that one on the storage.
Yeah, I think you might have listed storage in a...
Yes, you're right.
Somebody can read that.
That's very good.
Very good.
Yes.
These, yeah.
So I don't really think bacteria are going to be the medium; I think DNA data storage is going to be more like this capsule from Imagene. This table came out of studies that some folks at Imagene and some universities did about biological DNA and how it degrades.
But you're right.
We could do that.
Another question.
Is there a likelihood that there's going to be any patent issues?
I mean, certain patents or something.
Is it likely there's going to be a standard algorithm,
as well as the interoperability and so forth?
Like, somebody says, okay, I'm going to be able to get this data element
through random access, and there's a built-in version, I can get the ones I want.
But would it make sense for somebody to patent that, and how would you do this for retrieving stuff?
So my personal opinion is that for the foreseeable future, there's going to be an amazing amount of competition and innovation and development here, so there will be people who may try to patent methods. But I don't see us developing a standard for encoding or for random access methods for a long time. Now, whether a company outside this SNIA environment, or within it, tries to build something and patent it, I don't know. But I don't see that, because we don't really patent. I mean, take SSD data placement, just as one example, where we have been arguing for 10 years about whether it's streams or zoned storage or whatever it is. Those are being developed in standards orgs, and nobody's arguing about patents. But that's not to say it couldn't happen, right? I'm not sure if that answered your question.
Interesting. So, DNA encoding is probably as close as anything else to an acceptable argument for those standards.
So, I think, just like what happened with other critical patent developments, I wouldn't doubt that de facto standards, and maybe some actual agreed standards, will emerge. It's just a very fluid environment right now, no pun intended, and I don't see it yet.
Yeah, and you'll hear more about that. We have a talk on encoding, and later on a really good, deep talk on the last 10 years of encoding, so you may get some more insights on that later.
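To make the encoding discussion concrete, here is the simplest textbook scheme, not any vendor's actual method or a standard: mapping each pair of bits to one of the four nucleotides, for 2 bits per base. Real codecs layer on biochemical constraints (avoiding homopolymer runs, balancing GC content) and error-correcting codes, so treat this only as a minimal sketch:

```python
# Minimal illustrative binary-to-nucleotide codec: 2 bits per base.
# A textbook sketch, not a production or standardized encoding; real
# schemes also avoid homopolymer runs, balance GC content, and add ECC.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string (4 bases per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Turn a nucleotide string back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(encode(b"Hi"))  # prints "CAGACGGC"
```

Even this toy version shows why random access is a research topic: to read one record you still have to know which strands hold it, which is why real systems attach primer or address sequences to each strand.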
Yes?
Did I take the what? The density.
Yes.
I mean, I think realistically what you have to do is imagine an installation of the technology as being in so many racks. And so I've spent some time working with people on the robotics and the storage mechanisms. And it adds a lot of bulk to DNA, but the advantage of DNA is that, today, if you do an exabyte of tape storage, you build a building, and if you do an exabyte of DNA storage, you do it in that rack format.
Or you don't even need that. I mean, like I showed in this picture, two exabytes of DNA data storage could fit in the physical space of a single LTO tape today. So the scale is so dramatically small that, whether we put it in a steel capsule like the Imagene capsule or others, yes, there are storage requirements, but they're vastly smaller than anything we've got today.
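A rough sanity check on those scale claims (my own back-of-the-envelope numbers, not figures from the talk): at 2 bits per nucleotide and roughly 330 daltons per single-stranded nucleotide, the theoretical ceiling works out to hundreds of exabytes per gram of DNA, which is why an exabyte-class archive can be imagined in a capsule rather than a building:

```python
# Back-of-the-envelope theoretical density of DNA data storage.
# Assumptions (round illustrative numbers, not from the talk):
#   - 2 bits encoded per nucleotide
#   - ~330 g/mol average mass of a single-stranded nucleotide
AVOGADRO = 6.022e23                   # molecules per mole
NUCLEOTIDE_MASS_G = 330 / AVOGADRO    # grams per nucleotide
BITS_PER_NUCLEOTIDE = 2

bits_per_gram = BITS_PER_NUCLEOTIDE / NUCLEOTIDE_MASS_G
exabytes_per_gram = bits_per_gram / 8 / 1e18
print(f"~{exabytes_per_gram:.0f} EB per gram (theoretical ceiling)")
```

Practical densities are orders of magnitude lower once you add addressing, error correction, physical copies, and the capsule itself, but the headroom over tape remains enormous.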
You were trying to imagine the scale. So one of the interesting things about the data retention techniques we're thinking about is that you put data in one of these sealed capsules and keep it inert, right? The best way to store DNA is inert, surrounded by an inert gas, probably, or some inert compound, so that there's no moisture and no air. But different storage methods are harder or easier to recover or retrieve from. You might need specialized equipment to pull things out of the Imagene capsule, whereas if you did it some other way, filter paper, say, just as a wild example, you might not need anything as specialized. You might just need to rehydrate. So, depending on the use cases, for long-term archival, using something like a sealed capsule will almost certainly make sense, and the scale will be really dramatically different than tape or anything else we've got today. But there may be some other use cases where people want to, say, leave the data for a year or two, and they still want access to it, and they'll use one of these less reliable storage methods, in the sense of lower data retention, but they can get the data back more easily.
My question was, if you say it's 100,000 times denser, but if I actually need to get to the data, once you count the surrounding infrastructure, the density is maybe, what, 10,000 times?
Oh, still very, very much larger than anything around today.
I mean, by the time you add the surrounding infrastructure, you kind of have to imagine what that infrastructure is. Yeah, there are things that read, you know, and those are more or less fixed, right? So you can have any amount of DNA storage and use the same retrieval equipment. And I think the trick will be, as I was saying about random access, what is the granularity at which we can access a database in DNA. Once those techniques evolve, we'll figure out the most efficient way to store it. In some cases you'll want to put the whole archive in a single thing, or, I think, an ecosystem of storage methods will evolve. So anyway, okay.
Any other questions for now?
Yeah.
So one of the things I've been running across, as I think about ways to do this and what you're doing, is the idea that users have this compliance requirement. They periodically want to ensure that their data is readable. And it doesn't matter whether it's a small business or hyperscale; all along this spectrum, it's real. Periodically they check that the data still reads. And yet, as long as sequencing and the sequencing methods are destructive, I mean, there are biologically derived ways of making statements about the quality of your data, but how do you change the users from reading the data? That's how they know it's good. They're going to read it. How do you introduce that kind of change? Is anybody thinking about that kind of problem?
Well, I'm always kind of like a deer in the headlights up here. So your question is about the fact that if you have a destructive read, you've got to get people to change the way they think about what they're doing. Right. So, yeah, we are thinking about all of that. At the moment, we're just kind of in a dialogue with customers and people who are envisioning using this, trying to explain what the characteristics will be and see if they still want to adopt it. Right. And I again want to emphasize, I think this layer of storage will be a complement to other things. So it's not like we're trying to say, use DNA and throw away your tapes. We'll balance out the economic tradeoffs of where this fits, and as the capability of DNA storage evolves, people will figure out the right use cases, or the viable use cases, for it.
Yeah.
And, you know, again, we have a talk today that doesn't directly address this, but you'll see some stuff today about some of the really interesting techniques and ways people are thinking about accessing and manipulating DNA media. The encoding talk, the nucleic acid talk, will be interesting in that respect. Anyway.
Okay, that's about all I've got.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.