Storage Developer Conference - #170: DNA Data Storage and Near-Molecule Processing for the Yottabyte Era
Episode Date: June 20, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 170.
Hello, everyone. My name is Luis Ceze. I'm a professor of computer science at the University
of Washington and co-founder and CEO at OctoML.
It's really great to have an opportunity to give a talk at SNIA SDC again.
So it turns out that I gave a talk here with Karin Strauss five years ago on DNA data storage.
And it's just so great to see how much has happened since then and how much excitement there is in this new field.
It's great to see a fantastic session here full of great results.
So we're going to tag team.
I just introduced myself.
Now I'm going to pass the talk to Karin,
and I'll come back halfway through the talk.
Hi, everyone.
I'm Karin Strauss,
the Senior Principal Research Manager
at Microsoft Research.
And it's really, as Luis mentioned,
a pleasure to come back to
SNIA SDC and talk about the progress in the area. For five years, we've been working on
DNA data storage. There are other companies now who have joined this area, and you're going to
hear from them later on. And we've also created the DNA Data Storage Alliance to push the field
forward. So let's dig into it. Today, what I'm going to do is give you an overview of DNA Data
Storage and sort of provide a little bit of context of the talks that you're about to hear
later in the presentation. And Luis will cover some of the trends and also some of the work
that we've been doing during these five years. All right, so this is a collaboration, as you can
probably gather by now, between Microsoft and University of Washington, and together we created
the Molecular Information Systems Lab to work on chemistry and biology innovations that can help with issues in the IT industry.
So DNA data storage is our flagship project.
And...
Sorry about the technical issues here.
So it's our flagship project. And just for this audience, I don't need to belabor this. But really, there's a gap between what we generate
and our ability to store information. And that gap is growing if we simply follow the trend.
So what this results in is that the percentage of all the data we generate that we can actually store is shrinking
over time. So we need different ways to store information, and DNA data storage provides
a different way to do that. So here's the idea now and how it's different from what's
been done before. So the storage industry,
and semiconductor industry in general has evolved by following Moore's law.
And so essentially increasing the number of transistors or the number of cells that you have to store information over time and packing more on the same area.
And with that, getting gains in capacity and gains in cost. Now, I want to contrast this approach with an approach that was proposed in
the 60s, in the 50s actually, by Richard Feynman, where he pointed out that if we have the ability
to arrange atoms in the way we want, we may be able to get to molecular computing, molecular
storage. And this is what we're trying to do with DNA data storage. He even used DNA as an example of a molecule that can store information.
He stopped short of proposing to use it to store digital information,
but that didn't take very long.
You know, people in the sixties proposed that.
And with the increased technological progress
in biotechnology,
now we have the tools to really realize that vision.
So this era of DNA data storage started in about 2012, 2013, when folks observed
that the tools now make it possible for us to store and retrieve digital data from DNA. So what's DNA data storage? If we think of
DNA as a material, DNA is the double helix, right? And each side of the double helix is a sequence of
bases, A, T, C, and G, right? And these bases, today we have the tools to arrange these bases
in arbitrary sequences so that if we have a sequence of bits
that we want to store, we can store that sequence of bits by translating, by doing a mapping between
the bits and a sequence of bases. So for example, you could use the simple mapping that I'm showing
here, or you could use a completely different mapping. In fact, we use more sophisticated
codes to do that translation. But bottom line, translate a sequence of bits into a sequence of bases. Now, I talked about
the double helix. The information in the double helix is redundant.
What I mean by that is that if you have a particular sequence on one side of the double helix, then, if you remember from biology classes, A complements T.
And if the A is on one side of the double helix, the T is on the other side.
If a C is on one side, the G is on the other side.
And so there's complementarity there.
But from an information storage point of view, that's sort of redundant information.
So we typically think of the DNA as one side of the double helix, and this is what's drawn here.
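Just to make that mapping concrete, here is a minimal sketch in Python of a two-bits-per-base mapping, together with the complement relationship just described. This is illustrative only; the actual codecs are more sophisticated, with constraints such as GC balance and avoiding long repeats.

# Illustrative 2-bits-per-base mapping (not the production codec).
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {base: pair for pair, base in BIT_PAIR_TO_BASE.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def bits_to_bases(bits):
    # Two bits per base; 'bits' must have even length.
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def bases_to_bits(seq):
    # Inverse mapping: DNA sequence back to the original bit string.
    return "".join(BASE_TO_BIT_PAIR[b] for b in seq)

def reverse_complement(seq):
    # The other side of the double helix carries the same information, complemented.
    return "".join(COMPLEMENT[b] for b in reversed(seq))

assert bases_to_bits(bits_to_bases("0111001000")) == "0111001000"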
Now, as I mentioned, now we have tools to write and read the DNA in arbitrary ways.
And so we have technology to make synthetic DNA.
And this is what I'm going to be talking about during this
presentation and probably throughout the track, we're all talking about synthetic DNA as a way
to store information. And so, to point out, this is the material, but there are no cells,
no organisms, no life involved here. It's really the material used to store the information.
So why do we want to store digital information in synthetic DNA?
First is density.
So here's a test tube.
This is sort of an expanded picture of the test tube.
And what you see at the bottom of that test tube, that pink smear, is dried DNA.
And that's enough DNA to store, enough physical DNA to store the equivalent of 10 terabytes. So, you know, a pretty reasonable hard drive. And you can store all that in the
bottom of the test tube. So density is quite high for DNA. And if you extrapolate that to what this means data center-wide: what today we
need a building to house, tomorrow you'd need essentially a very small space. And this is
really sort of proportionally what you'd need. So it's really probably just a pixel in your screen.
So very high density. In real size, that's about one cubic inch that can house about one exabyte
of data in it. So quite, quite dense. So density is certainly one property that we like about DNA.
The other is durability. So DNA can be encapsulated. There are even
demonstrations in the wild of DNA that can preserve its information for thousands of years,
even a million years. But obviously that's under very special conditions. But it turns out that
those conditions can be reproduced synthetically and we can store DNA for quite a long time without having the information in it degrade.
And the extreme density of DNA also helps because if you need to, for example, create some conditions like cooling it, it's very dense.
And so it's pretty easy to cool quite a lot of material and therefore quite a lot of data.
So that begs the question, how does it compare with other types of media?
So here we're plotting both density and the durability lifetime of different media.
And so you can see that DNA, in projections and at the limit, is many orders of magnitude better than the best types of technologies we have. But even once we discount
everything that we need, that we think we need to build a practical system, so overheads from
error correction, metadata, the containers that hold the DNA themselves, even if we discount that,
we still have a few orders of magnitude higher density. And also, you know, thousands of years, which is
sort of a conservative prediction here, is actually pretty good from a preservation point of view.
So next is relevance. So now that we know how to read DNA, and we use it to read our genomic DNA, we will always have readers to recover the digital information that we store in synthetic DNA.
And so we can essentially ride all the improvements that the biotechnology industry has been making, essentially to our benefit.
And what's interesting about DNA is that the medium doesn't change.
So as we improve the technology, we don't need to migrate to the next generation of
medium.
So that actually makes it even more relevant and reduces some of the headaches that we
typically have in doing migration.
And finally, one other property, nice property of DNA is the ability to make copies.
So essentially, there's this reaction that you may have heard of due to COVID times, the polymerase
chain reaction, where you can use it to copy the DNA and copy it in bulk so that you
end up with many copies of the same sequence. And so, you know, for making multiple copies,
and for example, data distribution, this could be quite interesting. And finally, recently,
we did a study on environmental sustainability. That's very much a topic that's front and center these days with climate change.
And what we did was compare what a system would look like if we deployed DNA at large scale and how it compares to a tape system, in terms of the carbon emissions, the energy consumption, and the water consumption needed
to store one terabyte for a year.
So that was sort of our unit.
And what this screening life cycle assessment has shown is that along these axes,
DNA is actually quite promising.
So we're very excited about that as well. So DNA
could provide a more sustainable way to store digital information. All right. So I'm going to
cover quickly what we, you know, our view of what the DNA data storage system should look like.
And you're going to hear a lot more in this track. You're going to hear a lot more from other companies working on different parts of this,
uh, of this pipeline, this system here.
So we start with bits.
Um, and as I mentioned, we can encode into sequences of those bases, right?
And that's still an electronic representation.
And then it's time to make the molecules.
That's the process of synthesis, which is our write process.
Next, we preserve the DNA, creating the conditions for the information in it to be preserved. When it's time to read, we'll perform random access and recover the molecules
that we're interested in reading. We'll sequence them, which is the process of reading them,
and we'll decode them and then recover the bits that we had originally stored.
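As a very rough sketch of that flow, here is a toy end-to-end skeleton in Python. Every stage below is just a stand-in for the real codec or chemistry, not the actual software.

def encode(bits):
    # Real encoders split data into indexed segments, add error-correction
    # redundancy, and map bits to A/C/G/T sequences.
    return [bits[i:i + 16] for i in range(0, len(bits), 16)]

def synthesize(sequences):
    # Stand-in for the write process: array synthesis of physical molecules.
    return list(sequences)

def random_access(pool):
    # Stand-in for PCR-based selection of one file's strands.
    return list(pool)

def sequence(pool):
    # Stand-in for the read process (Illumina or nanopore), noisy in practice.
    return list(pool)

def decode(reads):
    # Real decoders cluster reads, take a consensus, and correct errors.
    return "".join(reads)

payload = "0111001011010001"
assert decode(sequence(random_access(synthesize(encode(payload))))) == payload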
So let me show you how this is done in a little bit more detail.
But before I do that, I just wanted to comment on some of the results so far.
So we've been working, again, Microsoft, University of Washington,
and for this particular work, we collaborated with Twist Bioscience.
We encoded one gigabyte of data in DNA.
Twist synthesized that DNA, and we were able to recover it.
And we wanted to include archival-quality audio as well.
So we encoded a number of files, and then we were able to recover them using this end-to-end system that I was talking about.
So, again, I'm going to walk you through what the system looks like. So let's
take this OK Go video that we've encoded in DNA. So it's 44 megabytes of information. First step
was to partition it into smaller segments of different sequences of bits that we wanted to
store and number them like we used to number floppies just so that we can reorder the information
on the way out. We add some redundancy for error correction, and then we translate those bits
into bases of DNA. We add an additional tag that's sort of a file ID and also allows us to
more easily copy the information. And then those sequences go into the writing process, which is DNA synthesis. DNA synthesis is
essentially a parallel process where many molecules are grown at the same time,
growing from a surface like a lawn growing from the ground up, and it's a series of chemical steps. So here, just to cover quickly, we add a base.
There's a process that strengthens the bond of that base.
And then there's a process that allows the next base to attach,
which is the deblocking step.
And that cycle goes over and over
and that's what makes the DNA molecules.
We don't grow one molecule at a time.
We grow many, as I mentioned,
and we grow many different sequences at a time
using a method called array synthesis.
And essentially, we get the parallelism of growing multiple sequences at a time,
and that's what gives us throughput.
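As a toy illustration of that parallelism (ignoring all of the real coupling, capping, and deblocking chemistry), you can picture array synthesis like this: one base is offered per cycle, and only the spots whose next position needs that base are extended.

# Toy model of array synthesis: many spots grow different sequences in parallel.
targets = ["ACGTAC", "AAGTCC", "CGGTAA"]   # sequences we want to grow, one per spot
grown = ["" for _ in targets]

cycle = 0
while any(len(g) < len(t) for g, t in zip(grown, targets)):
    base = "ACGT"[cycle % 4]               # the base offered in this chemical cycle
    for i, target in enumerate(targets):
        # A spot is extended only if its next needed base matches this cycle's base.
        if len(grown[i]) < len(target) and target[len(grown[i])] == base:
            grown[i] += base
    cycle += 1

assert grown == targets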
Next step is to preserve the DNA.
So we'll remove the molecules from the substrate.
We'll encapsulate, and this is work we've done
in collaboration with ETH Zurich,
where the DNA is encapsulated, in this case,
in silicon dioxide nanoparticles.
And those are stored into a DNA library,
very much like we have
tape libraries today except it's a lot smaller. And then when it's time to recover the information
we do the random access using a combination of fluidics, to retrieve the right library,
and PCR,
which is that process I mentioned that copies the DNA.
So we essentially can selectively copy the DNA based on that tag that we added,
that file ID that we added in the beginning when we encoded the data.
So again, we can only copy molecules that belong to the file
that we are interested in reading and not other files.
So we copy those and then we sample from them. And most of the molecules after we sample,
after we do PCR, are the molecules that ultimately make it into a reader.
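Conceptually, that random access step behaves like a filter keyed on the file tag. A toy sketch of the idea in Python (real PCR amplifies exponentially, and real primer design is far more careful than a string prefix match):

# Toy model of PCR-based random access: selectively copy only the strands
# whose leading tag (the primer binding site) matches the requested file ID.
pool = [
    "ACGTACGT" + "TTGACCTA",   # file tag ACGTACGT + payload
    "ACGTACGT" + "GGATTCCA",
    "TGCATGCA" + "CCATTGGA",   # a strand belonging to a different file
]

def pcr_select(strands, file_tag, copies=4):
    selected = [s for s in strands if s.startswith(file_tag)]
    return selected * copies   # stand-in for exponential amplification

reads = pcr_select(pool, "ACGTACGT")
assert all(r.startswith("ACGTACGT") for r in reads)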
Okay, so then the next step is to sequence it. So to read the DNA, Illumina, who's going to talk later today, has a sequencer that's based on optical technology. And with that, you can use very smart computer vision tricks
to essentially detect which molecules are in a particular space.
That generates a number of reads that will then later decode
into the sequences of bits that we originally stored.
Another way to do that is with using nanopore devices.
And so here the reading is electrical
and it's really dragging the DNA through a nanoscale pore
and measuring electrical disturbances
that it causes as it goes through.
And this is what generates the reads.
So there are errors in both platforms. There's different types of errors,
not just substitutions, which would be the equivalent of a bit flip, but also insertions
where symbols appear and deletions where symbols disappear from our sequences. But we can still
recover the information. And so we can use many tricks from error correction and coding theory to do that.
So let me just quickly walk you through that. So here are sequences that we read. We reordered
those sequences and clustered them so that we essentially grouped together sequences that are
similar. We perform consensus analysis and come up with a sequence that's inferred to be the sequence that came in originally.
And then we'll translate that into bits.
If there are any sequences that are still missing, which are erasure errors,
then we're going to use that redundancy that we added in the beginning to recover from those. And the code can also recover from a few additional errors if they trickle in.
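Here is a minimal sketch of that consensus step, assuming the reads are already clustered and aligned and only contain substitutions; the real pipeline also handles insertions, deletions, and missing strands via the outer erasure code.

from collections import Counter

def consensus(reads):
    # Per-position majority vote over reads believed to come from the same strand.
    length = min(len(r) for r in reads)
    return "".join(Counter(r[i] for r in reads).most_common(1)[0][0]
                   for i in range(length))

reads = ["ACGTTACA", "ACGTTACA", "ACCTTACA", "ACGTTACA"]  # one read has a substitution
assert consensus(reads) == "ACGTTACA"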
All right.
And then we have all the sequences.
We can use those sequence numbers to reorder them and then to recreate the file that we had originally stored. So going back to this pipeline here, I just wanted to give you an overview of what's to
come in the rest of the session.
For encoding and decoding, we're going to have Los Alamos National Lab talk about their
system.
On synthesis, we're going to hear from Twist about what they've been
up to. We're going to hear from Imagene on preservation
and from Denali on random access, and finally from Illumina on sequencing. So with that,
I'm going to say bye and Luis will come back and talk some more about the trends and also some more of our collaboration and our work.
Thank you.
Thank you, Karen.
That's a tough act to follow, but I'll try my best here.
So I wanted to start with a few thoughts, just zooming out and seeing the progress that DNA data storage has made as a field.
And what this plot is showing is the year on the x-axis and the amount of data stored
on the y-axis, and, you know,
the different forms of synthesis used to get there.
And what's interesting first to note
is that this is a logarithmic scale,
so we are seeing exponential progress,
which is great to see.
You know, so far lately,
we're roughly on the gigabyte scale right now,
you know, definitely on array-based synthesis.
And, you know, you're going to hear more about this
later today from the Twist folks.
So with that, I want to step back
and look at the trends here.
You know, unfortunately, this is 2015,
you know, we should get Rob Carlson to update this.
But what this is, this is known as the Moore's Law of DNA reading and writing.
And a few things are of note here.
Again, this is on a logarithmic scale.
And the red line is synthesis.
So synthesis has made some exponential progress, but then kind of slowed down.
And I'm fully confident that DNA data storage is going to push this back onto a faster-paced curve.
And so the black line here is transistors per chip. So this is essentially Moore's law as a
reference. And it's interesting to see that, you know, DNA sequencing reading is improving faster
than Moore's law, at least for a while. The point here is to show that, you know, this gives a lot
of confidence that DNA data storage is likely to be viable within a reasonable amount of time.
And I'm sure we're going to hear a lot more about sequencing from Craig later today.
But now, thinking from a systems perspective, what I just talked about was throughput.
Now let's think about latency. You know, synthesis and sequencing are typically done in
batch and involve fluidics and physical actuation to a degree that's way beyond what we
typically deal with in storage systems. So when we talk about latency, it's probably going to
be on the order of tens of minutes to hours for synthesis, and for reading too if we do sequencing by synthesis.
But emerging technologies like nanopore, which Karin talked about, give you data in real time, so you're likely to push that latency,
at least for the readout side, closer to real time.
So that's one trend.
The other thing that's interesting to think about on read and write mechanisms
is looking at the trade-offs between what has been done for life sciences
and what we need for data storage.
The fundamental write and read mechanisms are very similar,
but actually the trade-offs are different.
For example, you know, think about error rates.
In life sciences, you know, a single base flip or mutation could lead to
very significant effects.
But with data storage, we can, of course, build error correcting schemes.
You're going to hear a lot more about this,
and there has been significant progress on that. And, you know, you have
several error types, and we can actually deal with them
with redundant information, which nature does to some extent,
but I would say not as robustly as computer scientists have figured out. So one
ends up wondering, like, if we did have stronger correcting codes in nature, maybe
we wouldn't have, you know, horrible diseases that we have today.
Okay.
Now, on length: for DNA data storage, of course, the longer the strand of DNA, the more data you can put in it.
But it turns out that, you know, there's a diminishing return. So after you amortize the overhead of primer sequences or
other tags, or maybe an addressing scheme, those either stay constant or grow logarithmically with
the payload. So after a few hundred bases or so, it's unlikely you're going to get significant
benefit in terms of reducing overheads. Whereas in life sciences, longer sequences tend to have
a lot more function
and you have really long genes.
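A quick back-of-the-envelope calculation shows the diminishing return on the data storage side. The 40 bases of fixed per-strand overhead assumed here is purely an illustrative number, not a measured one.

# Illustrative only: fraction of each strand that carries payload, assuming a
# fixed per-strand overhead of about 40 bases for primers, index, and tags.
OVERHEAD = 40
for strand_len in (100, 150, 200, 300, 500, 1000):
    payload_fraction = (strand_len - OVERHEAD) / strand_len
    print(f"{strand_len:5d} bases -> {payload_fraction:.0%} payload")
# Going from 150 to 300 bases helps a lot (73% -> 87%); going from
# 500 to 1000 bases helps much less (92% -> 96%).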
The reason that I mentioned this is that this suggests a lot of optimizations
to make DNA data storage more practical, right?
You can trade off accuracy for lower cost and higher throughput.
You can trade off sequence size for faster write and read.
And, you know, this is, of course,
a lot of folks working in this space
are building on these trade-offs.
Now, all right.
So we're talking about these trade-offs.
Let me now discuss something
that we're particularly passionate about,
which is the following question.
When we succeed, I'm saying when,
when we succeed in storing a lot of data
in molecular form in DNA,
this does beg the question of, you know, given that the bandwidth between the molecular world and
the electronic world is likely to be limited, even with all the progress that we have ahead
that, you know, I'm sure we're going to get to, wouldn't it be nice if we could actually do a lot
of computation in molecular form directly? Because then you enable highly parallel efficient
and energy efficient computing,
and then also use a lot less bandwidth
between the molecular world and the electronic world.
So here's one example that we're excited about.
So you might have heard of DNA computing in the 80s.
There's a fantastic paper by Len Adleman in DNA1,
the first DNA conference, in 1994.
He talked about solving an exponential-time problem, a Hamiltonian path problem, with DNA.
And it worked in the following way.
So here you have a graph and you want to find a path that visits all of the nodes in the graph.
So what is the shortest path that would do that? Well, the way you do that in DNA, you encode
each node here as a DNA sequence, and then edges are an overhang sequence that actually connects
the two. So after I encode all nodes and all edges in a bunch of DNA molecules, I put them in a
solution, I shake them, right, or I let them settle. And then I look at the longest molecule
that has formed,
that shows us what the result was.
This is super cool.
And it was an incredible idea
that opened up a lot of possibilities
in thinking about molecular computing.
But the problem with this specific solution
is that it shifts the complexity
from time to amount of material, right?
So this is an exponential time problem.
Now, if you were to do this in space,
you're going to need an exponential amount of space.
So for you to solve any reasonable size problem here,
you're going to need, you know, a lot of DNA,
potentially even all of the atoms of the universe
in form of DNA, which, you know, is definitely not practical.
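You can see the scaling problem with a quick count; this is purely illustrative arithmetic.

import math

# A brute-force molecular search over all node orderings needs on the order
# of n! candidate molecules for n nodes.
for n in (7, 20, 30, 60):
    print(f"{n:3d} nodes -> about {math.factorial(n):.2e} candidate orderings")
# By around 60 nodes this is roughly 1e82, on the order of the number of atoms
# in the observable universe, so brute force in matter does not scale.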
But what we've been thinking about is this:
what would DNA computing look like in the age of big data?
What we want here is essentially to operate on data already stored in DNA. We want to target
polynomial time algorithms like search. And if we did that, we would potentially have an extremely
parallel and energy efficient way of manipulating information in a molecular form. The problem that
we decided to explore is content-based image and video search
for several reasons.
First, you know, this is very bandwidth intensive
and it's also a very key primitive in systems today.
So content-based image and video search exists today
and it's used in a variety of day-to-day systems.
You can do image-based similarity search
in Google or Bing.
And also, this similarity
search is a key primitive in machine learning systems that is part of a bigger flow.
Now, how would you do that in DNA? And the way this works is you give an image to a database,
and you retrieve images that are similar to your input image, in this case, our airplanes.
How do we do that in DNA? Well, we want to encode our database
in DNA and be able to search it. So how do we do that? Well, first, we need to start with an
observation that's fairly straightforward, especially in hindsight, that as you know,
DNA forms this double helix. And if you have a complete match, this binding between the two sides is very, very strong. But if you have a
reasonably good partial match, you still bind the two sides of a double helix, and you can have a
poor match here that still binds, but it's a little bit less stable. So you probably can guess
where I'm getting at here, but suppose that I had the following. Suppose that I had the database
here on the left, encoded as a bunch of single-stranded DNA molecules, and then you have a query,
something that I want to search for a match for, attached to magnetic beads. Okay. So now if I
actually mix this, if I have a perfect match, there's a much higher likelihood that I'm going
to retrieve the perfect match. Now, as I go with poorer and poorer matches, I decrease the probability
and the frequency with which those resulting molecules will come out of the magnetic extraction
process, right? So, okay, so how do we build on that to do search? Here's how we do it. We
came up with this idea of essentially encoding features, the same features used in computer vision,
where you extract visual features,
say in this case it's images,
this is not constrained only to images,
it could be video,
it could be other forms of data.
We're going to extract feature vectors
from those data items.
And then the question here is,
how do we encode those data items,
those feature vectors, into DNA form
such that you get the following property?
So features that are similar should have DNA sequences that are more likely to actually stick to each other.
Okay.
And the way we did that is via a learning process.
Okay.
So we, and I'm going to tell you more about it in a second,
but keep in mind that
the property that we want here is that similar features lead to DNA sequences that are more
likely to stick to each other.
Okay.
So proportionally to the similarity.
Now, how would we build on that?
Suppose that a database strand looks like
this: I have the feature vector encoded in part of my DNA strand, and then I have a tag, just like an ID of the image,
in the rest of the strand. Okay. So now I have a query, say that binocular image, that I encoded
in DNA using the same form, using the encoder that we talked about, and attached to a magnetic
bead. So now, if they are similar, you know, they're going to hybridize.
So the way we're going to do that as a database is to encode all of your images
into database strands and have a
large number of copies of your query attached to magnetic beads.
I'm going to mix it all in a solution, and when everything gets stable,
because it's the state of lowest energy,
the closest matches are the ones that are going to actually hybridize onto the magnetic beads.
So when I take that with the magnets, you know, take a picture, that's what it looks like.
Now, those are the magnets and magnetic nanoparticles.
And I take it out from the solution.
And what I get back are things that look like the query image.
So if you're into machine learning, you probably think that this is a form of semantic hashing
just like the way we encode, say, a feature in Euclidean space, given as a vector of
floating point numbers, into binary strings such that in Hamming space they
have similar properties, similar distances, because in this case it's cheaper to compute in
a binary representation than in floating point form. We did a similar thing here, but into DNA.
What we did is a translation from these feature vectors in Euclidean space into DNA sequences, such that similarity in Euclidean space leads to higher extraction and reaction yields in the DNA molecular space.
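As a toy stand-in for that idea, here is how a feature vector could be turned into a short DNA sequence so that nearby vectors share more bases. This uses plain random-hyperplane hashing, not the learned encoder from the paper, and the dimensions and lengths are arbitrary.

import numpy as np

# SimHash-style toy: similar feature vectors agree on more hyperplane signs,
# so their DNA sequences share more positions.
rng = np.random.default_rng(0)
planes = rng.normal(size=(40, 128))            # 40 sign bits -> 20 bases, 128-dim features
BASE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}

def feature_to_dna(v):
    bits = (planes @ v > 0).astype(int)
    return "".join(BASE[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2))

x = rng.normal(size=128)
near = x + 0.1 * rng.normal(size=128)          # a very similar feature vector
far = rng.normal(size=128)                     # an unrelated one
shared = lambda a, b: sum(p == q for p, q in zip(a, b))
print(shared(feature_to_dna(x), feature_to_dna(near)),   # many shared bases
      shared(feature_to_dna(x), feature_to_dna(far)))    # far fewer shared bases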
Okay.
So I won't spend too much time on this,
but you can read about it in our paper in Nature Communications 2021.
Basically the way this works is starting with a set of images.
We start with, you know, pairs.
These are images whose features are initially mapped randomly to DNA.
Then, as I keep getting these pairs of images, I run them through the encoder, I measure their distance, I estimate the yield, and then I compare that estimated yield with what the predictor said.
If they don't agree, I keep mutating the encoding process such that, for the training set, for all the similar images that are close enough, the estimated yield of this extraction, the
probability of sticking, is proportional to the similarity. You can read more about the structure of
the neural network that does that. The key thing I want to point out here, and that's the only reason I'm
showing the structure, is that we definitely need a fully connected layer, because we're
going to reduce a space that's much larger than what we're going to put into DNA into a relatively small set of bases. So we need to have a fully connected
layer to actually spread out and make use of the encoding space as well as possible.
And if you're interested in the results, you should read our paper. Basically, what we did, we encoded 1.6 million images in a database and synthesized with Twist services.
Thank you, Twist.
They were collaborators with us on this project.
This was funded by the DARPA Molecular Informatics Program.
So thank you, Anne Fisher.
And this was a result that got us really excited because when we actually ran the process and took the images out and ran it through a sequencer,
we got what we were expecting, which is that molecules corresponding to images with low Euclidean distance, that is, very similar images,
are much more frequent than the ones with high Euclidean distance.
This was very encouraging, and we refined this; there are a lot more results in the paper.
Okay, so now why am I excited about this as a computer architect?
The reason is because I know there's a bunch of storage folks
in this conference, so I'm preaching to the choir here, but
thinking about a storage device as a 3D shape, right?
Typically the capacity is a function of the volume of this physical object,
and the bandwidth is a function of the surface area.
So that's why, as you have a larger and larger storage device, you tend to have a much higher time to read it all.
Right. So the time for reading a whole hard drive is going up really fast. Today it's on
the order of days, when it used to be on the order of hours not too long ago. That's because
the capacity grows with the cube and the bandwidth grows with the surface area. So capacity
grows as L cubed and bandwidth grows as L squared. Now, as a computer architect, you know, I'm a big fan of
near data processing. You probably heard of that before. The key idea there is to, you know, break
down storage, which in this case happens to be solid state memory, into smaller units and pair them with processing elements. And the reason is because you have
much higher bandwidth; they're smaller, they're closer, and they have lower volumes. You can actually have a much lower time to read it all,
right? You change the capacity-to-bandwidth ratio. That's a trend in computer architecture
to access data at high bandwidth and low latency. And the way I think about what we are
doing here, with this way of embedding computation in molecular form and diffusing it through the data,
is really that, essentially, one way of calling this is diffusive computing. Right? You can
store the data in a bunch of DNA molecules, and then you encode processing elements into molecular
form, and then you let them diffuse through the data. And the reason this is cool is that, since you're
not attached to any specific 3D physical organization of your data,
you're not bound to a predefined surface-to-volume ratio.
So you can choose and dial up your bandwidth depending on the needs
of your application. Now, this is a long argument, but I can't resist making it, because I
do think this is a fundamental advantage that molecular storage would have over traditional data storage, be it in optical form or in electronic form:
the fact that you can diffuse and rearrange it physically leads to really interesting systems properties that are likely to be useful.
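The underlying arithmetic is simple; here it is sketched with made-up unit constants just to show the shape of the argument.

# Capacity scales with volume (~L^3) and bandwidth with surface area (~L^2),
# so the time to read everything grows ~L for one monolithic device. Splitting
# the same capacity into many unit-sized near-data devices keeps it flat.
for L in (1, 2, 4, 8):
    capacity = L ** 3                # arbitrary units of data
    bandwidth = L ** 2               # arbitrary units of data per second
    monolithic_time = capacity / bandwidth
    n_units = L ** 3                 # same capacity split into unit-sized devices
    partitioned_time = (capacity / n_units) / 1.0
    print(f"L={L}: monolithic {monolithic_time:.0f}, partitioned {partitioned_time:.0f}")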
Okay, so now that was a form of near-molecule processing.
And I want to transition to a different point here,
which is how are we going to package all of this
into a system we can deploy, say, in a data center, right?
So one way to think about this, neatly organized,
is having stations, a write station, a storage station,
a read station, at rack scale.
And that's the vision. We'll be able to encapsulate this in a way that you can actually put it in a data center.
But the reality is that today we're doing a lot of this still like this.
You know, this is an old photo by now, but this is one of our grad students, Lee Organick,
who did a bunch of the work that we talked about today. And this is me pretending to be useful. It's all very manual. So there's a very big difference between this and where we want to be in a fully automated system.
So we've been thinking a lot about what it would take to automate the whole process from digital data:
encoding into molecules, storing them, retrieving them, and running them through a sequencer.
So we decided to build a system where, you know, the point was just to show automation, not to be fast. This was not fast at all. But we built a synthesizer.
You know, you see the phosphoramidites, the As, Cs, Gs, and Ts. And, you know, there's a bunch
of valves here that control the synthesis process. We store it away, and then we can retrieve it,
prep it for sequencing, and run it through a nanopore sequencer here.
So it was great to see that it's possible to show that,
and it opened our eyes to what's needed to be fully automated.
But that's still a long ways from this.
This is a tape library that's probably several years old already.
That's still a long ways away.
So the way we are thinking about bridging that gap
is extending the system
with a much more flexible way of manipulating samples.
And our bet is that digital microfluidics is one of those
because you can rearrange and configure
and program it in a flexible way.
And this technology has been around for some time.
We've been building an open source one called PurpleDrop.
And the way this works, for those of you who don't know,
is the following.
So digital microfluidics has an array of electrodes and a conductive top plate here.
And as you apply a voltage,
the droplets move to where the voltage is.
Okay?
Here's some examples, some videos.
There's a green droplet moving on our electrode board.
And in order to control and see where things are,
we've been applying computer vision to watch those droplets
or using capacitive sensing to monitor that
just because it still needs closed-loop control.
The reason I'm showing you this is that that's what we've
used to prototype what would be the equivalent
of a library on a
card of DNA.
The idea here is to store
DNA into pools
that are dehydrated on the surface, and then
use droplets to retrieve
those spots of
DNA. So in this top plate that I was talking about, you have spots of DNA that are dehydrated.
And when you want to retrieve it, a droplet visits there, stays there for a little bit,
soaks up some DNA, and then we move it and steer it to the DNA sequencer.
And we demonstrated that this works and it can work fairly well.
There was a Nature Communications paper a couple of years ago that, you know, detailed how we did that and looked at contamination issues.
We looked at what the chances are of droplets leaving molecules behind in a path and then later contaminating other droplets that cross the same path.
And we didn't see any very significant contamination. This suggests a solution of using microfluidics as a liquid robot
to go pick data from a location and move it to another,
kind of similar to a robotic arm in a tape library.
So we've been very excited about digital microfluidics in general.
And we've been, you know, as part of this,
we're building an affordable, full-stack,
software and hardware digital microfluidics platform
where you can express your protocols in Python
and compile it down to the assembly code
of controlling the actuation of electrodes on the boards
to move the droplets where they need to go and so on.
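To give a flavor of what a Python-level protocol could look like, here is a hypothetical sketch. The Board class and its methods are placeholders for illustration, not PurpleDrop's actual API.

class Board:
    # Minimal stub standing in for the real device driver.
    def dispense(self, volume_ul):
        return {"volume_ul": volume_ul, "at": "reservoir"}
    def move(self, droplet, to):
        droplet["at"] = to
    def wait(self, seconds):
        pass

def retrieve_spot(board, spot, sequencer_port, dwell_s=30):
    droplet = board.dispense(volume_ul=2.0)   # create a buffer droplet
    board.move(droplet, to=spot)              # steer it over the dried DNA spot
    board.wait(dwell_s)                       # dwell so it rehydrates and soaks up DNA
    board.move(droplet, to=sequencer_port)    # deliver it to the sequencer inlet
    return droplet

droplet = retrieve_spot(Board(), spot=(3, 7), sequencer_port=(0, 0))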
So this is far from being a professional, large-scale solution,
but we feel like these are ways of actually de-risking and finding what are potential ways forward of building data center scale fluidic manipulation
systems. The price has to be lower.
The reliability can be lower too, as long as there's enough redundancy.
And we've been thinking about this,
not just in the context of data storage and molecular computing,
but also how do we scale this up to build truly,
truly large scale automation for sample processing or forms of molecular
computing, of molecular manipulation.
So zooming out, you know, the way we like thinking about, you know,
DNA data storage and coupling it with a form of molecular computing,
is really having hardware, software, and wetware.
And we have interfaces that are highly flexible,
like for example, digital microfluidics to interface electronic domain
with the molecular domain.
And we see this as a part of what could be a future
hybrid system. It's pretty clear by now that there's specialization in computer systems. We
have CPUs, GPUs, accelerators done with electronics because it's ultra low latency. It's highly
engineerable and allows you to control it perfectly. But we know that there's a lot of promise
in other forms of computing using, for example,
biomolecules that have self-assembly,
massive data storage density,
and potentially very energy-efficient
computing mechanisms, with energy costs orders of magnitude
lower than what electronics can achieve.
And then, of course, there are quantum systems,
which are massively parallel,
computing for problems like optimization,
but which typically have much lower I/O bandwidth in and out of the quantum system.
And the reason I mention all of this is that I think there's an interesting world here where you pick the best type of device technology for the best type of algorithms, right?
And mix them. I don't think electronics and CMOS are ever going to go away.
But, you know, we have to think about hybrid systems that
get the most out of each.
With that, I'll say thank you.
We look forward to answering any
questions you might
have in the online forums now.
Thank you again for having us.
It's really great to see this field
thriving, and we can't wait to
see where it's going to be in the near and medium-term future.
So if you have more questions, you can also explore our website.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.