Storage Developer Conference - #183: DNA Data Storage Alliance: Building a DNA Data Storage Ecosystem
Episode Date: March 7, 2023...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode number 183.
Good morning, everybody. Welcome to the DNA Data Storage Track.
My name is Dave Landsman, Western Digital, and I'll be kicking it off.
I'm going to talk about building a DNA data storage ecosystem,
as well as giving a little bit of an overview of DNA data storage.
I'm not a molecular biologist or a chemist, so bear with me, but I'll try to give a short summary. So let's just dive in.
So the problem we're here to talk about in the track is fairly straightforward. People are digitizing more data than we can save cost-effectively,
and we need new mechanisms for archival storage at high volumes, potentially zettabyte scale.
And simultaneously, the value of saved data is growing. I like this chart that Fred Moore at Horison did because, while it's somewhat qualitative, it shows a curve of value for data. You can see the pointer. Data, typically when it's saved, starts out useful; that's why you're saving it. Then the value tends to taper off, but it rises again as time goes by, when people want to retrieve it or mine it or something like that. So this is driving some of the trends. The basic problem is too much data, and the value of it is growing.
People want to mine it and extract value from it.
So why are we talking about DNA?
So primarily, DNA bits are really small. This is a picture of an LTO tape cartridge, and with some rough estimates of bit density and packing, if we filled the space of this cartridge with DNA, you could fit about two exabytes of data in this small container, which is over 100,000 tapes' worth. Everybody has their own metaphor or analogy for scale, but this gives you the scale. The other thing about DNA bits is they last a really long time.
You can think woolly mammoth, right? It's a little bit of an oversimplification, but we've recovered DNA, at least pieces of DNA, from fossilized bones.
They last a long time, and the bits don't need much care and feeding; again, think of the fossilized example.
The DNA is very durable.
And storing digital data in DNA is potentially a lot less resource intensive and less costly than existing storage.
And then finally, DNA bits don't need migration.
You don't need to keep migrating the tape library every five years or 10 years. And so these factors add up to DNA
having a very good potential TCO story.
And lastly, the DNA ecosystem already exists, right, for medical and scientific applications, and it's exploded in the last few decades.
And that's why we're even able to be here talking about DNA data storage because of all the techniques that have evolved there.
But so the DNA data storage ecosystem will benefit from that investment and that energy in the market.
And it will also add its own.
So we just see a virtuous cycle evolving with respect to DNA storage and the market.
So with all this goodness, potential goodness, for DNA as a storage medium, there are still questions.
Do we really need something as dense as DNA? It's really dense, but maybe we don't need that many bits. Also, can we scale the underlying technologies? I'll talk a little bit about this as we go, but synthesis and sequencing are areas where we're working to get performance and cost to scale. And then, how do we create an interoperable DNA ecosystem?
Because if we create DNA storage products or an ecosystem, they need to fit in.
it needs to fit in.
It's going to be a complement to existing storage.
It's not going to replace hard drives or tapes.
It's going to augment them.
So we need an interoperable ecosystem.
So on this first question, Aaron Ogus of Microsoft will talk tomorrow about what Microsoft feels is the compelling need for molecular data storage. Today in the track,
I'm going to continue here with an overview of DNA storage and then a little bit about what the DNA Storage Alliance is doing to build the alliance.
Joel Christner is going to talk about an issue we call Rosetta Stone, which is kind of like: how do we bootstrap a DNA archive and discover what's in it so that we can decode the rest of it?
So it's kind of a master boot record talk.
Alessia Morelli and Rino Micheloni of DNA Algo are going to talk about a full system simulator for the DNA storage pipeline.
So we need tools to model the channel. And let's see,
Zhao Reis and Marila Menossi,
they're from a research institute in Brazil, the Instituto de Pesquisas Tecnológicas, and they're associated with Lenovo,
and they're going to talk about end-to-end DNA storage,
an end-to-end DNA storage system that they built and studied.
And then João Gervasio and Adriano Galindo-Leal are going to talk about DNA coding, kind of the last 10 years of encoding for DNA data storage.
And lastly, Luca Piantanida from Boise State is going to talk about nucleic acid memory.
So hopefully you'll enjoy the track.
Oh, and then we'll have an informal Q&A at the end.
I like this room for that, so we'll just chat.
And yes, they have uploaded the slides if you want to get a copy. Hopefully the screen is big enough, but it's a little small.
Okay, so.
You know, I think everybody is familiar, even if it was from, you know, high school biology courses, right? So the DNA molecule is a chain of bases, you know, adenine, thymine,
cytosine, and guanine. And the bases have a natural affinity for each other. So A and T
bind to each other, and C and G bind to each other. And they're connected in the molecule by a
sugar phosphate backbone. So this is our building block. And in concept, DNA data
storage is very simple, right? It's the devil is in the details making it work. But we take bits,
digital bits and files, we encode them into this language of A, T, C, G, because, I mean, DNA in
our bodies is a storage and encoding mechanism.
And so we encode the ones and zeros to the bases.
Then we build a molecule from those bases and store it.
And on the way out, we do the reverse:
we sequence it, or read it,
and we get our bases back, and then we decode them back into digital bits.
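To make the round trip concrete, here is a minimal sketch in Python. The fixed two-bits-per-base mapping is an illustrative assumption, not a codec from the talk, and it omits the ECC, metadata, and sequence constraints that a real pipeline adds.

```python
# Toy sketch of the encode/decode round trip: digital bits -> A/C/G/T -> bits.
# The 2-bits-per-base mapping is an assumption for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Map every 2 bits of the payload to one DNA base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(oligo: str) -> bytes:
    """Reverse the mapping: bases back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in oligo)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

At this density of two bits per base, one byte becomes a four-base run, and decoding is a pure table lookup.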
And, like in electronic channels, we add ECC and metadata for various purposes into the chain of bases. One more thing: I'm going to use a term as I talk called oligos. It's short for oligonucleotide, and it's basically a short strand of synthetic DNA or RNA.
So you'll hear this term a lot, probably from me and other speakers. So that's
what that is. Okay. So I kind of talked about DNA as a channel, right? And when I was getting into this, it helped me to think about some of the analogies between DNA storage and an electrical channel.
So we are building a protocol through the pipeline, right?
Some of the DNA bits will be protocol bits,
some will be payload bits,
and there may be protocol at the end.
A particular string of DNA can have fields and protocol,
just like we're used to with storage protocols.
In an electrical channel, of course, ones and zeros are converted to analog waveforms;
in DNA, the waveforms are replaced with the bases. So we convert bits to bases.
And we have to worry about various aspects.
Just as in an electrical channel we add ECC bits to the digital bit stream,
in DNA we do the same:
before we synthesize, we add ECC bits and other metadata to help us. And in the case of DNA,
we're worried about errors like insertions, deletions, and substitutions. And then in an electrical channel, in some cases, like with memory buses, we'll add
scrambling patterns at the transmitter, because certain patterns of ones and zeros can create
electrical interference on the wire. Similarly, certain patterns of bases in DNA processing can be problematic, so we may alter the symbol stream
after we've added the ECC and metadata.
So as with an electrical channel,
this kind of line protocol that we've got for DNA storage
is critical to overall channel efficiency
and how everything works.
And you'll hear more about that later today.
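A constraint screen of the kind just described can be sketched as follows. The specific thresholds here (maximum homopolymer run of 3, GC content between 40% and 60%) are illustrative assumptions, not values from the talk.

```python
# Hedged sketch of a sequence-constraint check, analogous to scrambling in an
# electrical channel: reject base patterns that are hard to synthesize or read.
# Thresholds are assumptions chosen for illustration.
def violates_constraints(oligo: str, max_run: int = 3,
                         gc_lo: float = 0.4, gc_hi: float = 0.6) -> bool:
    run = 1
    for prev, cur in zip(oligo, oligo[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True  # long homopolymer runs are error-prone
    gc = (oligo.count("G") + oligo.count("C")) / len(oligo)
    return not (gc_lo <= gc <= gc_hi)  # extreme GC content is also problematic
```

An encoder would re-map or re-randomize any candidate oligo that fails this check before sending it to synthesis.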
And there's also logical protocol layers above the line protocol, file tagging, packetization.
If we have a file, we break it into pieces. Okay, so now let's look at two of the important areas, writing and reading, starting with synthesis. There are two main approaches: phosphoramidite chemistry, which is the mainstream, most productized technique, and an evolving area of techniques called enzymatic synthesis. One of the interesting things about enzymatic chemistry is that it doesn't involve as many caustic chemicals, and it may also enable us to build molecules more effectively.
So this area is base-by-base synthesis. So we build a molecule, one base at a time, in kind of a cyclic process.
We start with a single base with a blocker on it.
We deblock it.
Then we add a new base that has a blocker on it, and then oxidize to solidify things.
And then we start again.
So very, very simplistically, we build molecules base by base.
And with either of these, phosphoramidite or enzymatic,
the limit today is about 200 to 300 bases in an oligo.
If you go beyond that, you start getting too many errors.
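The length limit follows from simple arithmetic: if each coupling cycle fails with some small probability, the fraction of perfect strands decays exponentially with length. Here is a toy simulation, with an assumed 1% per-cycle failure rate (an illustrative number, not a figure from the talk).

```python
import random

def synthesize(target: str, p_fail: float = 0.01, rng=None) -> str:
    """Simulate base-by-base synthesis (deblock, couple, oxidize each cycle).
    A failed coupling is modeled as a deletion; real error modes are richer."""
    rng = rng or random.Random(0)
    return "".join(base for base in target if rng.random() >= p_fail)

# Expected fraction of error-free strands of length n is (1 - p_fail) ** n:
# at 1% per cycle, ~90% of 10-mers are perfect, but only ~5% of 300-mers.
perfect_300 = (1 - 0.01) ** 300
```

This exponential falloff is why pushing past a few hundred bases per oligo yields too many defective strands to be practical today.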
Now, if people want to put together longer strands of synthetic DNA, they can use ligation techniques.
So this is, again, where the affinity between the bases comes in, right? If we have some short oligos, like the yellow and blue here,
we can construct what we call a DNA splint,
and you can actually use the affinity of the complementary bases
to glue longer strands together, and thereby get strands of synthetic DNA that are many
hundreds of bases long, or maybe more.
There's a very deep debate going on about whether smaller oligos are better in general, or longer ones.
One thing about having longer oligos is you have more payload bits per oligo, so your protocol overhead is lower, just like you're used to.
But the industry is still evolving here on the best way to build DNA molecules.
So here I wanted to talk briefly about kind of where some of the progress is.
So this is a study that was done at the University of Washington and Microsoft.
And they built an electrochemical array to synthesize DNA. The basic result of the study was that they were able to do synthesis on this chip
with 650-nanometer electrodes and 200-nanometer wells.
So they actually are putting chemicals and molecules and reagents in little nanostructure wells.
They were able to contain the acid diffusion at each site.
The chip they built was two-micron pitch, and controlling the acid diffusion at each of these wells is very important:
when you take the blocker off the molecule so that you can add a new base,
if the acid diffuses too much over the array, you might affect other molecules.
So they were able to show 100-base-long oligos at these feature sizes.
Now, some people ask how quickly you can write DNA data.
This chip in the study reached a synthesis density of 25 million synthesis sites per square centimeter,
which was about three orders of magnitude greater than previous work.
So it was quite a step forward.
And if you take an array with this density, you could achieve a write rate in the kilobytes per second per square centimeter.
And the authors felt this was like a practical minimum.
So for certain archival applications
where the write speed is not so critical,
this might be commercially viable.
It's a judgment call, but they could see an array of this size serving a purpose.
At this density, though, if you wanted to get to, say, a megabyte per second, you would need a 360-square-centimeter chip, which is a little big, or you would need many, many chips.
So there's a long way to go to scale, but this was a very good experiment, and the paper documents real progress in synthesis.
And furthermore, we're seeing continued progress. The chip in that study had 25 million synthesis sites per square centimeter.
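A back-of-envelope check of that arithmetic: the 2.78 kB/s per square centimeter figure below is my assumption, chosen only to reproduce the 360-square-centimeter number; the talk itself just says "kilobytes per second."

```python
# Consistency check of the write-rate figures: if an array writes on the
# order of kilobytes per second per square centimeter, how big a chip do
# you need for 1 MB/s? (2.78 kB/s/cm^2 is an assumed value for illustration.)
write_rate_kBps_per_cm2 = 2.78
target_kBps = 1000.0  # 1 MB/s
chip_area_cm2 = target_kBps / write_rate_kBps_per_cm2
```

So a roughly 360-square-centimeter chip, or equivalently many smaller chips, as the talk notes.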
And Twist Bioscience just announced a chip with 100 million synthesis sites.
And they can write about a gigabyte per run.
You load it up with reagents, et cetera, do a run, and get one gigabyte. The reason I mention the one gigabyte is that IARPA created a program, the Molecular Information Storage program, and they've set goals that they think we need to achieve to get commercially viable molecular storage.
And if you look there, the goals by mid-2024 were to have a 10-gigabyte chip at about a dollar per gigabyte.
I can't talk about any prices here because they're really hard to come by from the vendors, but we have a one-gigabyte chip in 2022, and some others. So the indicators are that the industry is starting to scale.
That's the real message. And at least against the goals that IARPA set, we seem to be moving along this curve reasonably well.
Okay, so much more scaling is needed, but the foundations are established is the main message.
How am I doing on time? I don't know if I'm going really slow.
Let's see, 8:50.
Okay, so we talked a little about synthesis. Now we're going to talk about storage in the middle. I'm not going to go into any detail here, but there are many preservation
methods being looked at for when we store DNA.
There's chemical encapsulation, there's physical encapsulation; you can bind the molecules in a material matrix of some kind; you can even use adsorption on paper.
So there are many techniques, and there have been many studies of biological DNA, how the molecules simply exist and how slowly they degrade in nature. There is now an effort to start looking at how we store synthetic DNA with an eye toward data retention. I'll come back to that point a little later in the talk.
The basic thing about DNA storage here is keep it away from water and air.
I mean, if you do that, you're in pretty good shape.
But there are many more aspects of how you store it and for how long you need to store it and how hard it is to retrieve for a particular method.
So I'll talk about that briefly later.
Okay, so sequencing, snapshot of sequencing.
So there's kind of two main techniques for sequencing today.
One is sequencing by synthesis,
and it's called sequencing by synthesis
because you start with a single-stranded DNA.
That's the template.
And you put it in solution,
and then you add bases into the mix.
And as each base binds to the template, in the Illumina case,
the events are detected by visible light. Each reaction event is detectable,
and because of the complementarity,
you can read what was in the original strand.
So it's kind of an indirect way of reading DNA.
Nanopore is the other technique that's really getting a lot of attention and
investment. And that's where we guide a DNA strand through a very small channel. And it either can be
a natural biological channel, or a semiconductor, a synthetic one. And as the DNA molecule transits through the pore, it creates ionic current disruptions or
tunneling currents. These events can be detected, and the bases are directly read.
So there's a great deal of energy going into this. Oxford Nanopore has a product here; Illumina is the biggest, and there are many sequencing vendors doing SBS.
In general today, SBS is more accurate and slower, which is maybe more appropriate for medical and the traditional uses of DNA technology,
and nanopore is slightly less accurate on a per-base basis, but faster,
and in terms of system throughput there's a battle going on over which is better.
Okay. So as far as scaling sequencing, today the high-end Illumina machines, for example,
are at a throughput level of tens of gigabytes of data per day.
So it's not very high.
We probably need to get to hundreds of terabytes per day to be really useful.
And from a cost standpoint, Illumina presented this slide at FMS, showing where their products are with respect to gigabases per day. If we
assume one bit per base, which is conservative (there are nuances here, and we might be able to get higher bit density), then that puts us at $48,000 per terabyte today, which is a little pricey.
But we see a direct line of sight to $8,000 per terabyte if we get to $1 per gigabase,
and no conceptual hurdles to $800 per terabyte.
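Those figures are easy to check. Assuming exactly one bit per base, one terabyte is 8e12 bits, hence 8e12 bases, or 8,000 gigabases:

```python
# Sequencing cost per terabyte as a function of cost per gigabase,
# at an assumed one bit per base (the slide's conservative assumption).
def usd_per_terabyte(usd_per_gigabase: float, bits_per_base: float = 1.0) -> float:
    gigabases_per_tb = 8e12 / bits_per_base / 1e9  # 1 TB = 8e12 bits
    return usd_per_gigabase * gigabases_per_tb
```

At today's implied ~$6 per gigabase that gives $48,000 per terabyte; $1 per gigabase gives $8,000, and $0.10 gives $800, matching the slide.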
And these prices are still kind of oriented around the markets that they serve, which
is medical and scientific.
So there's, I mean, cost and price are flexible, as we know.
So as DNA data storage evolves, not only will the technology evolve, but the pricing models
and things around it will evolve.
So the other thing to mention is that DNA storage can tolerate much higher error rates than medical and scientific genomic applications.
So the key will be what are the error correction technologies and how efficient can we make the pipeline for DNA storage versus these other markets.
So in conclusion, we have probably three orders
of magnitude to go on cost, price,
and throughput for sequencing.
But that said, there are many ways to manipulate
all phases of the pipeline that we talked about
from synthesis to sequencing,
where we can balance the error tolerance at each step and thus the performance of the whole pipeline.
So, okay.
So that's all I had on sequencing.
So one last thing.
I think I'm going faster than I thought.
So just to give you a very small peek:
there are people working on doing selective retrieval of data in DNA,
so that you don't necessarily have to read
the whole archive at once,
though that would be one application.
But if you want to retrieve files,
there's a lot of work going on to do random access
or in this case, file filtering.
So the basic idea here is that we have a database of pictures,
and we encode the pictures themselves somehow, either in DNA or otherwise.
But we also encode an index.
For each picture, we encode an index oligo that has some kind of feature, like catness, right?
So we have a few cats, and another one might be called automobile.
So it's a category index.
And each feature oligo also has an ID.
So it's got a feature and an ID. And then we can construct,
again by synthesizing, a query to the database.
That's this green arrow here, and the bases in this query oligo
will be complementary to the feature region
in some of these indices.
And in this example, in this paper, we also put a
little tag on the end of the query; I'll talk about what that is in a minute. So when you put the query oligo in the presence of the index oligos,
they will bind by hybridization.
That is, by the affinity of the bases, they will bind.
And then that's where this little biotin tag comes in.
In this case, it's attached to a magnetic nanoparticle,
so you can then pull this little Pac-Man thing here.
Right. So you can pull the indices that correspond to cats out of the database, and we get our query results.
We don't get the automobile, et cetera.
I'm showing one paper here.
James Tuck of North Carolina State gave a talk about some of this also at SDC last year, and the presentation is there and has references.
There are many, many papers on random access.
Okay, so I wanted to paint a picture of the progress there.
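In software terms, that selection step behaves like matching on reverse complements. Here is a toy sketch, with invented sequences and record IDs, simulating perfect hybridization (the real pulldown uses the biotin and magnetic-bead step described above, and binding is not perfect).

```python
# Toy model of hybridization-based retrieval. Feature sequences and IDs are
# made up for illustration; real systems tolerate imperfect binding.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """The strand that hybridizes to `seq` (antiparallel, complementary)."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

# Index oligos: (feature region, record ID).
index_oligos = [
    ("ACGGTACC", "cat_001"),
    ("ACGGTACC", "cat_002"),
    ("TTGACGTA", "car_001"),
]

def pulldown(feature_region: str) -> list:
    """Simulate the query: synthesize the reverse complement of the feature
    region, then collect every index oligo that would bind it."""
    probe = reverse_complement(feature_region)
    return [rec for seq, rec in index_oligos if reverse_complement(seq) == probe]
```

Querying for the "cat" feature region returns both cat records and leaves the automobile behind, mirroring the Pac-Man pulldown in the figure.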
So in conclusion, in terms of DNA data storage, it's resting on a solid foundation. I mean, when I
first got into this a couple of years ago, it felt like science fiction to me, but it has moved.
It's science and it's technology, and now we have to scale it. Okay.
So now I'm going to talk a little bit about use cases.
If we build DNA data storage, who wants to use it and why? The DNA Data Storage Alliance did some research:
we held a user conference in 2020,
and we've done interviews with people in different markets, from automated driver assistance to media and entertainment to digital art and preservation,
trying to get a sense of what people want. There's quite a diversity of needs,
and honestly, when you ask somebody how they would use DNA, they're doing a thought
experiment; they're not quite sure. But they do know that they want to save
data for a really long time. Before I go there, I'd say there are two categories.
If you look at digital art and preservation, the amount of data is
very small, but artists want to save it for a long time, forever. And things like the Shoah Foundation, which we talked to,
and historical societies and libraries,
they want to save things even if the data is not gigantic.
On the other hand, you've got hyperscale vendors,
which Aaron is going to talk about tomorrow,
and governmental requirements, and streaming and media entertainment.
So there's a great deal of data.
There's also sensor data coming from all the smart cars and cities.
So it runs the gamut from people who want to keep a small amount of data forever
to others who want a large amount of data for a long time, but maybe use it more;
it's not a pure archival use case. In all of this, data retention is the key.
This is what's changing: we're digitizing all this data either to discover new things in it or to
monetize it. Fields like healthcare, astronomy, climate science, sports. When we
did our user forum, Major League Baseball came in and talked about how they have a gigantic tape library that
they need to migrate, and while something like DNA is obviously not real time,
at the archival end of their system they need a lot of scale. And there are others like that. So
everybody's trying to save data and not delete it, because we don't know what we might
discover in that data, or we want to try to monetize something in it. And if we can
store more data at lower cost, then we don't have to throw it away, which is what is happening
today in many cases. People throw data away because it's too expensive to keep it.
And this emphasis on data retention puts a lot of emphasis on total cost of ownership. So we did an analysis of the costs of keeping data.
For the price of tape, we used the Fujifilm TCO calculator; tape is the red bar.
We took cloud list prices from AWS, just public pricing.
And we estimated some DNA data storage prices based on selected cost scenarios:
$100 per terabyte is the yellow, $50 per terabyte is the green,
and the orange on the end is if we get to $25 per terabyte.
And the basic message is that the fixity checks and migrations for traditional storage begin to swamp the cost of ownership over time.
So if you want to keep a lot of data for a long time, you need something, whether it's glass, whether it's DNA, something.
So the market needs a solution.
And the other thing about DNA: it definitely minimizes energy consumption and improves sustainability.
And this model was based on one copy of the database.
And obviously, if you have a very valuable archive and you want to keep a few copies of it geographically dispersed, you can do that easily and more cheaply with DNA.
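The shape of that argument can be captured in a toy retention-cost model. All dollar figures below are invented placeholders, not the Fujifilm or AWS numbers from the slide.

```python
# Toy retention-cost model (my own simplification, not the Fujifilm TCO
# calculator): media needing periodic migration accumulates cost over time,
# while a long-lived medium is dominated by its one-time write cost.
def cumulative_cost_per_tb(write_cost: float, years: int,
                           annual_upkeep: float = 0.0,
                           migrate_every: int = 0,
                           migrate_cost: float = 0.0) -> float:
    cost = write_cost + annual_upkeep * years  # upkeep: power, fixity checks
    if migrate_every:
        cost += (years // migrate_every) * migrate_cost  # refresh to new media
    return cost

# Invented illustrative numbers: over 50 years, a cheap-to-write medium with
# migrations overtakes an expensive-to-write, migration-free one.
tape_like = cumulative_cost_per_tb(20.0, years=50, annual_upkeep=1.0,
                                   migrate_every=7, migrate_cost=15.0)
dna_like = cumulative_cost_per_tb(100.0, years=50)
```

The crossover point depends entirely on the chosen numbers; the structural point is that migration and fixity costs scale with time while a durable medium's cost mostly does not.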
So like I said, you'll hear more about this general problem of too much data tomorrow.
So how do we build the ecosystem?
So there's an organization called the DNA Data Storage Alliance.
And this was started by Twist Bioscience. It was formed in October of 2020,
and Illumina, Microsoft, Twist, and Western Digital agreed to be founding members.
We created a promoter-group kind of structure, and we climbed to about 60 members by the second quarter of this year.
And we realized that if we wanted to, you know, to build standards and really build the industry, we needed to either incorporate or become part of something.
So we decided to join SNIA.
We joined SNIA as a technology affiliate group in June of 2022.
And our mission is to create and promote an interoperable storage system based on DNA.
And the scope is to educate the market, one.
Two, to develop a DNA storage roadmap, kind of like, you know, IARPA has worked on this,
but we want to put together a DNA data storage industry roadmap that kind of points the way
for research and development investment.
And then we want to develop standards and specs as warranted.
I think we're trying to be very humble about not prematurely standardizing things that aren't naturally suited to it, because you don't want to stifle innovation.
But we do believe there are some areas that we can work on now.
So what we're working on now is the roadmap,
and we've started some work groups in the Alliance for standardization.
We published a white paper, an introduction to DNA data storage.
I should have put a link to it in here, but it's on our website.
And we're working on white paper number two, describing some of the market data and observations we got from what I was talking about a few minutes ago. We started a newsletter, and we're doing events. As for the industry technology roadmap, I think I've told you what it is, so I won't reiterate too much. We hope to have it done, at least drafted, by the end
of the year. It's looking a little aggressive at the pace things are going, but we're
going to try. Among the work groups in the TWIG, we have three.
And this first one on the left, you'll hear about from Joel in the next talk.
And that's about how we design a DNA archive, and how you discover what's in that archive and are able to decode it when there will be different DNA codecs; people have different ways of doing things.
So we want to support innovation, but you need a standard way to find out how you bootstrap
yourself into the archive. So Joel will talk a lot more about that. There's another group
called Interoperable Interfaces. This one is taking a little while to get going,
but the purpose is to ensure physical compatibility of synthesis and storage.
It's about the mechanics of doing DNA storage,
so that you can end up with plug-and-play swaps of instruments,
and recovery of molecules for read irrespective of the supplier, whether the supplier still exists at the time or not, plus general issues of fluidics in data centers.
It's not going to involve massive gallons of liquids, but there are things we have to account for.
So we're starting to look at the mechanical and interoperability aspects of the pipeline.
And then the third work group we're starting is about retention and endurance. We will have multiple solutions for storing data, from multiple vendors, and we need ways to compare them.
We also need ways to do accelerated wear testing and the like, so that we understand what the metrics and claims made by a particular solution mean, and so that we can believe them. This has been done for different media and different aspects of technology around the industry.
And this picture here, which I'm realizing is way too small,
was from a study that Microsoft and the University of Washington did, where they exposed samples of DNA to
different environmental and other conditions and measured the data retention properties.
So this is a very complex area and we're just starting to get going.
We want to really try and define terms and get everybody speaking the same language
so that we can proceed.
So these are the three work groups
that we've got going now.
There will obviously probably be more later on
in the alliance.
Okay, so just one final disclaimer before closing.
DNA is really not like other media, right?
I actually had somebody ask me, only part joking, does this mean I'm going to be putting my music in my dog? So we always try to say that
we're just using the building blocks of molecules to build storage. We're storing
digital data, and DNA data storage doesn't require, use, or create any cells, organisms, or life. Instead of electrons, we're using molecules.
So that's all I've got. Enjoy the rest of the track. I hope you'll stay, and then come back
tomorrow at 11:20, when Aaron will talk about Azure's observations on the data explosion.
And that's it.
I guess I am done.
Any questions?
Yes.
Well, I noticed that one on the storage.
Yeah, I think you might have listed storage in a...
Yes, you're right.
Somebody can read that.
That's very good.
Very good.
Yes.
These, yeah.
So I don't really think bacteria are going to be the medium; I think DNA data storage is going to be more like this capsule from Imagene. This table came out of studies that some folks at Imagene and some universities did about biological DNA and how it degrades.
But you're right.
We could do that.
Another question.
Is there a likelihood that there's going to be any patent issues?
I mean, certain patents or something.
Is it likely there's going to be a standard algorithm,
as well as the interoperability and so forth?
Like, somebody says, okay, I'm going to be able to get this data element
through random access, and there's a built-in version, I can get the ones I want.
But would it make sense for somebody to patent that, and how would you do this for retrieving stuff?
So my personal opinion is that for the foreseeable future, there's going to be an amazing amount of competition and innovation and development here, so there will be people who may try to patent methods. But I don't see us developing a standard for encoding or for random access methods for a long time. Now, whether a company outside this SNIA environment, or within it, tries to build something and patent it, I don't know. But I don't see that, because we don't really patent. I mean, take SSD data placement, just as one example, where we have been arguing for 10 years about whether it's streams or zoned storage or whatever it is. Those are being developed in standards orgs, and nobody's arguing about patents. But that's not to say it couldn't happen, right? I'm not sure if that answered your question.
Interesting. So, DNA encoding is probably as close as anything else to an acceptable argument for those standards.
So, I think, just like what happened with other critical patent developments, I wouldn't doubt that de facto standards, and maybe some actual agreed standards, will emerge. It's just a very fluid environment right now, no pun intended, and I don't see it yet.
Yeah, and you'll hear more about that. We have a talk on encoding, and later on a really good, deep talk on the last 10 years of encoding, so you may get some more insights on that later.
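To make the encoding discussion concrete, here is the simplest textbook scheme, not any vendor's actual method or a standard: mapping each pair of bits to one of the four nucleotides, for 2 bits per base. Real codecs layer on biochemical constraints (avoiding homopolymer runs, balancing GC content) and error-correcting codes, so treat this only as a minimal sketch:

```python
# Minimal illustrative binary-to-nucleotide codec: 2 bits per base.
# A textbook sketch, not a production or standardized encoding; real
# schemes also avoid homopolymer runs, balance GC content, and add ECC.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string (4 bases per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Turn a nucleotide string back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(encode(b"Hi"))  # prints "CAGACGGC"
```

Even this toy version shows why random access is a research topic: to read one record you still have to know which strands hold it, which is why real systems attach primer or address sequences to each strand.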
Yes?
Did I take the what? The density.
Yes.
I mean, I think realistically what you have to do is imagine an installation of the technology as being in so many racks. And so I've spent some time working with people on the robotics and the storage mechanisms. And it adds a lot of bulk to DNA, but the advantage of DNA is that, today, if you do an exabyte of tape storage, you build a building, and if you do an exabyte of DNA storage, you do it in that rack format.
Or you don't even need that. I mean, like I showed in this picture, two exabytes of DNA data storage could fit in the physical space of a single LTO tape today. So the scale is so dramatically small that, whether we put it in a steel capsule like the Imagene capsule or others, yes, there are storage requirements, but they're vastly smaller than anything we've got today.
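A rough sanity check on those scale claims (my own back-of-the-envelope numbers, not figures from the talk): at 2 bits per nucleotide and roughly 330 daltons per single-stranded nucleotide, the theoretical ceiling works out to hundreds of exabytes per gram of DNA, which is why an exabyte-class archive can be imagined in a capsule rather than a building:

```python
# Back-of-the-envelope theoretical density of DNA data storage.
# Assumptions (round illustrative numbers, not from the talk):
#   - 2 bits encoded per nucleotide
#   - ~330 g/mol average mass of a single-stranded nucleotide
AVOGADRO = 6.022e23                   # molecules per mole
NUCLEOTIDE_MASS_G = 330 / AVOGADRO    # grams per nucleotide
BITS_PER_NUCLEOTIDE = 2

bits_per_gram = BITS_PER_NUCLEOTIDE / NUCLEOTIDE_MASS_G
exabytes_per_gram = bits_per_gram / 8 / 1e18
print(f"~{exabytes_per_gram:.0f} EB per gram (theoretical ceiling)")
```

Practical densities are orders of magnitude lower once you add addressing, error correction, physical copies, and the capsule itself, but the headroom over tape remains enormous.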
You were trying to imagine the scale. So one of the interesting things about the data retention techniques we're thinking about is that you put data in one of these sealed capsules and keep it inert, right? The best way to store DNA is inert, surrounded by an inert gas, probably, or some inert compound, so that there's no moisture and no air. But different storage methods are harder or easier to recover or retrieve from. You might need specialized equipment to pull things out of the Imagene capsule, whereas if you did it some other way, filter paper, say, just as a wild example, you might not need anything as specialized. You might just need to rehydrate. So, depending on the use cases, for long-term archival, using something like a sealed capsule will almost certainly make sense, and the scale will be really dramatically different than tape or anything else we've got today. But there may be some other use cases where people want to, say, leave the data for a year or two, and they still want access to it, and they'll use one of these less reliable storage methods, in the sense of lower data retention, but they can get the data back more easily.
My question was, if you say it's 100,000 times denser, but if I actually need to get to the data, once you count the surrounding infrastructure, the density is maybe, what, 10,000 times?
Oh, still very, very much larger than anything around today.
I mean, by the time you add the surrounding infrastructure, you kind of have to imagine what that infrastructure is. Yeah, there are things that read, you know, and those are more or less fixed, right? So you can have any amount of DNA storage and use the same retrieval equipment. And I think the trick will be, as I was saying about random access, what is the granularity at which we can access a database in DNA. Once those techniques evolve, we'll figure out the most efficient way to store it. In some cases you'll want to put the whole archive in a single thing, or, I think, an ecosystem of storage methods will evolve. So anyway, okay.
Any other questions for now?
Yeah.
So one of the things I've been running across, as I think about ways to do this and what you're doing, is the idea that users have this compliance requirement. They periodically want to ensure that their data is readable. And it doesn't matter whether it's a small business or hyperscale; all along this spectrum, it's real. Periodically they check that the data still reads. And yet, as long as sequencing and the sequencing methods are destructive, I mean, there are biologically derived ways of making statements about the quality of your data, but how do you change the users from reading the data? That's how they know it's good. They're going to read it. How do you introduce that kind of change? Is anybody thinking about that kind of problem?
Well, I'm always kind of like a deer in the headlights up here. So your question is about the fact that if you have a destructive read, you've got to get people to change the way they think about what they're doing. Right. So, yeah, we are thinking about all of that. At the moment, we're just kind of in a dialogue with customers and people who are envisioning using this, trying to explain what the characteristics will be and see if they still want to adopt it. Right. And I again want to emphasize, I think this layer of storage will be a complement to other things. So it's not like we're trying to say, use DNA and throw away your tapes. We'll balance out the economic tradeoffs of where this fits, and as the capability of DNA storage evolves, people will figure out the right use cases, or the viable use cases, for it.
Yeah.
And, you know, again, we have a talk today that doesn't directly address this, but you'll see some stuff today about some of the really interesting techniques and ways people are thinking about accessing and manipulating DNA media. The encoding talk, the nucleic acid talk, will be interesting in that respect. Anyway.
Okay, that's about all I've got.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.