Storage Developer Conference - #192: DNA data storage: Coding and decoding

Starting point is 00:00:00 Hello, this is Bill Martin, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 192. My name is João, but you may call me Lucas. It's my second name. And together with my colleague Marília, we're going to make a presentation about our work with DNA data storage. So we come from Brazil,

Starting point is 00:00:50 and we started working with DNA data storage this year in January. So we're very new to this field. And we're here to talk about our experiences, share some of our results, and also introduce ourselves to the DNA data storage world. So, I would like to start with this brief introduction. We come from the Institute for Technological Research in São Paulo, Brazil, which we can abbreviate as IPT. And for 123 years, we have been providing technical solutions in many different areas for industry, our government, and society in general, contributing to the technological development of Brazil. And more specifically, inside IPT,

Starting point is 00:01:38 we have many different research centers. We, the ones who are here today, come from the BioNano, I'm so sorry, I thought I had corrected this yesterday, but apparently not. So it's Bio, from biology, the BioNano Manufacturing Center. We are a multidisciplinary research facility focused on research in biotechnology, nanotechnology, materials science, and micromanufacturing. So, in IPT, and more specifically in the BioNano Center, we are always looking for partnerships with different companies to develop innovative and disruptive technologies. And we wanted to work with DNA data storage first because we really believe in the potential of this kind of technology to change how many different industries will store their data and work in the future.

Starting point is 00:02:33 And also, we believe that we could combine all four areas of our expertise, biotechnology, nanotechnology, materials science, and micromanufacturing, to make a good contribution to this area. And Lenovo, which is one of the leaders today in designing and manufacturing data storage devices, has decided to join us in this venture to bring DNA data storage solutions into reality. And we're working together on this project, and since January this year, we have named our project Prometheus. I don't know if this is the correct pronunciation, but it honors the Greek Titan god, as a symbol of our joint efforts in DNA data storage. And we have been working together since then. So I would like to start with this introduction to you

Starting point is 00:03:29 about why DNA data storage. I guess it's no surprise for anyone here that we have been generating more data than we can store, and therefore we need to find some better solutions. And DNA is being presented as this possible solution. But why should we work with DNA data storage? First, DNA presents the possibility to store a very high density of data along its molecules. So some researchers say that we have the potential to store petabytes of data in

Starting point is 00:03:57 just one gram of DNA. That would mean that we could store all the data that we find in a data center today in a very small microtube, which is a very impressive compaction capacity. On the other side, DNA may also be very stable over the years. So I brought this research article here as an example to show that this year, in 2022, researchers were able to recover the DNA from the remains of the victims of the Mount Vesuvius eruption that happened approximately 2,000 years ago. It is important to notice in this example that over all those years, those 2,000 years,

Starting point is 00:04:37 the DNA was not stored in proper conditions so it could remain stable. It was present in the remains. It suffered very harsh conditions, like an eruption. And the amount of sequences they were able to recover was enough to reconstruct the information they needed. So DNA can be a very stable possibility. And another interesting fact about it is that as long as there are humans on Earth, DNA will always have to be read, which means the DNA reading technologies will never be outdated. I mean, they can be outdated, but if they are outdated, they will be replaced

Starting point is 00:05:20 by another reading technology that is way more improved than the previous one. So DNA will always be read throughout human history. I brought this example just to show how we could store data in DNA. I brought the LTO tape. Many people here will certainly know how to explain how an LTO tape works way better than I do. But what I want to show here is that, you know, in an LTO tape, we have the different magnetic states along the tape. So if it's north, in this case, we have a one, and if it's south, it's a zero. So

Starting point is 00:06:08 according to the orientation of the magnetic field, they represent our binary code. In DNA, we can think in a similar way. But instead of using the orientation of the magnetic field, we use the sequences of the four different bases that are part of our DNA, which are A, T, C, and G. So, I guess everyone here has already presented a DNA data storage pipeline, and we'll end up doing the same. The only difference, I guess, is that we ended up breaking down the coding part: first we convert the information to binary code, and then we convert our binary code to DNA code. Then we synthesize the sequences that we need to store.
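To make the binary-to-DNA step concrete, here is a minimal sketch assuming the simplest possible mapping of two bits per base. The table `BITS_TO_BASE` and both function names are illustrative assumptions; the talk does not say the team's codec works this way, and a practical codec would also have to respect synthesis and sequencing constraints.

```python
# Illustrative 2-bits-per-base mapping (an assumption, not the codec from the talk).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Convert bytes to a binary string, then map each bit pair to one base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(dna: str) -> bytes:
    """Reverse the mapping: bases back to bit pairs, bit pairs back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"IPT"
encoded = bytes_to_dna(message)
print(encoded)                 # four bases per byte under this mapping
print(dna_to_bytes(encoded))   # round-trips to the original bytes
```

Under this mapping every byte becomes exactly four bases, which is the upper bound of two bits per base; codecs that add error correction or avoid problematic sequences carry fewer bits per base.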

Starting point is 00:07:01 Then we store the DNA. When we need to access the information, we recover it, sequence the DNA, and use a decoding algorithm. It might seem a little simple because in the end, we know exactly everything that we have to do in order to store and recover our information in DNA. But if it's so simple, and we know what we have to do, why hasn't DNA data storage taken over the world

Starting point is 00:07:31 and why is it not being employed by many different industries around the world? Because in this pipeline, we can say we have a very important bottleneck, which is step number three, the DNA synthesis step. Like I mentioned before in the introduction, we can store petabytes of data in just one gram of DNA. Yeah, it's possible, it's not a lie, but it's not so simple. It's not like I can say, oh, I'm going to come here and I'm going to make one gram of DNA.

Starting point is 00:08:07 And then all of a sudden I will have 200 petabytes in just a microtube. This is not so simple, because it's not the amount of DNA that matters. What really matters here is the sequences that we're making that encode our information. And in order to properly encode our sequences, considering so many constraints like error correction and other stuff, in order to really store our petabytes of data, in the end, we will end up having to synthesize billions of unique sequences in our process.
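A rough back-of-the-envelope sketch of why the number of unique sequences, rather than the mass of DNA, is what matters. Everything numeric here is an assumption for illustration: an average of about 330 g/mol per single-stranded nucleotide, an ideal 2 bits per base, a single copy of each strand, and the 19-byte-per-strand payload implied by the 38-byte, two-sequence demo described later in the talk.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
NT_MASS_G_PER_MOL = 330.0     # assumed average mass of one ssDNA nucleotide
BITS_PER_NT = 2.0             # assumed ideal: four bases carry two bits each


def theoretical_bytes_per_gram() -> float:
    """Idealized ceiling: no primers, no redundancy, one copy per strand."""
    nucleotides_per_gram = AVOGADRO / NT_MASS_G_PER_MOL
    return nucleotides_per_gram * BITS_PER_NT / 8


def strands_needed(total_bytes: int, payload_bytes_per_strand: int) -> int:
    """Number of unique strands needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // payload_bytes_per_strand)


print(f"idealized density: {theoretical_bytes_per_gram():.2e} bytes per gram")
print("strands for the 38-byte demo:", strands_needed(38, 19))
print("strands for one petabyte:", strands_needed(10**15, 19))
```

The idealized density comes out on the order of 10^20 bytes per gram, far above the "petabytes per gram" quoted for practical systems, because primers, error correction, and many physical copies of each strand all consume mass. The strand count, on the other hand, grows linearly with the data and quickly passes the billions mentioned in the talk.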

Starting point is 00:08:50 So this is the bottleneck, because up to today, no one has really been able to do this in a fast, affordable, and precise way. This is the biggest challenge that we have been facing up to today, and it is the problem that most people who work with synthesis are trying to solve. So, having in mind that synthesis is today the most important bottleneck that we have, we at IPT and Lenovo have divided our work between two different DNA synthesis routes.

Starting point is 00:09:41 The first one is enzymatic synthesis, and the second one is chemical synthesis. So why are we working with two different approaches? Because each one of them has its advantages and disadvantages, and it's important to say that it's not a competition over which one is better. But we believe that in the future,

Starting point is 00:10:03 we can explore both of them, because both of them may provide good solutions for the DNA data storage world. And both of these groups, the enzymatic and chemical, have the support of an engineering team that is responsible for the design and construction of micro-devices, which are the devices that we actually use to synthesize the DNA; a codec team that is responsible for the coding and decoding of our information; and a sequencing team that assures that we can properly recover the information we stored and that our synthesizers are working properly. So, going a little bit further into detail, I'm going to talk about chemical DNA synthesis.

Starting point is 00:10:57 I'm going to give a brief introduction, and Marília will go deeper later. So chemical DNA synthesis is a more standardized technique. It has been used for decades around the world to make primers for PCR reactions, as many speakers have presented before. And our challenge here is to miniaturize this process, because today the primers that are made,

Starting point is 00:11:28 they're being synthesized in machines like the one I show in this image. It's a very big machine that requires a lot of reagents to function properly. And our challenge here is to build the micro-device so that we can miniaturize this process and make it more affordable. One good advantage of chemical synthesis is that we have finer control of the synthesis, which means it is easier to make sure that we are properly synthesizing the sequence that we want and having fewer errors. But on the other side, a big disadvantage of this kind of process

Starting point is 00:12:06 is that we employ a lot of different chemicals, like acids or organic solvents, and at the end of the process we need to properly dispose of those chemicals. So let's imagine that one day, hopefully, we have a DNA synthesizer that employs this kind of process, and we want to store the data of a large bank, like JP Morgan, for example. For that process to work,

Starting point is 00:12:46 we would have to employ a great amount of those kinds of chemicals, which are not easy to dispose of. And the process would not be so environmentally friendly. But hopefully, when we see all the research that has been done in this area, we believe this is something that is being considered throughout the development, and it is not going to be a problem in the future. On the other hand, we also have enzymatic DNA synthesis. So, what is the most important

Starting point is 00:13:29 thing here? For this kind of synthesis, there is one enzyme that is called TdT, is it correct? It has been widely adopted for this kind of application. So one great feature of this enzyme is that it's capable of taking a single-stranded DNA and adding nucleotides to this DNA without the need of a template. So this is one very interesting feature of this enzyme that allows it to be used for the data storage process, because we can synthesize DNA without the need of a template, which is the opposite of what happens in nature.
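A toy model of template-independent extension, only to illustrate the idea: whatever nucleotide is supplied gets appended to the end of an existing single strand, and no template is consulted. The initiator sequence, the `max_adds` limit, and the random number of additions are all assumptions for illustration; they stand in for the control difficulty and the blocked nucleotides that the speaker raises later in the talk, not for real enzyme kinetics.

```python
import random


def extend(strand: str, base: str, blocked: bool,
           rng: random.Random, max_adds: int = 4) -> str:
    """Append the supplied base to the end of the strand.

    With blocked nucleotides exactly one base is added per cycle; otherwise the
    enzyme keeps adding, modeled here as a random count (an assumption).
    """
    copies = 1 if blocked else rng.randint(1, max_adds)
    return strand + base * copies


def synthesize(target: str, blocked: bool,
               initiator: str = "TTTTT", seed: int = 0) -> str:
    """Feed one nucleotide type per cycle, starting from a hypothetical initiator."""
    rng = random.Random(seed)
    strand = initiator
    for base in target:
        strand = extend(strand, base, blocked, rng)
    return strand


def collapse_runs(strand: str) -> str:
    """Collapse each run of identical bases to a single base."""
    collapsed = []
    for base in strand:
        if not collapsed or collapsed[-1] != base:
            collapsed.append(base)
    return "".join(collapsed)


print(synthesize("ATCG", blocked=True))    # one base per cycle
uncontrolled = synthesize("ATCG", blocked=False)
print(uncontrolled)                        # runs of unpredictable length
print(collapse_runs(uncontrolled))         # base order is kept, run lengths are not
```

In this model the order of base transitions survives uncontrolled addition while the run lengths do not, which is why single-base control matters whenever the information is encoded in exact sequences.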

Starting point is 00:14:19 So how exactly does an enzyme create the DNA? I brought this analogy here with a hard disk, so we can think of an enzyme as the head of the hard disk. The head is the part that is responsible for the recording of the information. And the DNA, we can think of it as the physical medium where the information is being recorded. So basically what happens is we use the enzyme to make the DNA, and as the enzyme goes on adding the nucleotides, the A's, T's, C's and G's, we are recording our information along the DNA. In this case the enzyme is essential for the DNA synthesis to happen. One of the advantages of working with DNA enzymatic

Starting point is 00:15:08 synthesis is that it's a very novel technique. On one side, the processes, the techniques of this kind of synthesis are not so mature when compared to chemical synthesis, but on the other side, we believe there's a lot of room for improvement and many new possibilities that we can discover along the way. And it is also a much more environmentally friendly process

Starting point is 00:15:34 because what we use here is just water, some salts, some very small amounts of organic solvents, and the enzyme itself, which is naturally degradable, so it's not going to be a great environmental issue. But one disadvantage is that, using this technique, it is not so simple to control the synthesis, because when we are performing the reaction, as long as there are nucleotides, the bases, in the medium, the enzyme will keep adding them with no control. One possible way to solve this is to use blocked nucleotides, as Marília will

Starting point is 00:16:12 present later, but we're also working on this. So, one of the biggest contributions we want to make to this DNA data storage environment is that we would like to use our expertise to help make this technology more affordable. We are focusing on the production of this kind of enzyme, because we want to develop a protocol so we can synthesize the TdT with high productivity and high conversion, so that with this kind of process we end up diluting some of our fixed costs and the enzyme becomes much cheaper to go to market. And if the enzyme becomes much cheaper, the enzymatic DNA synthesis process as a whole becomes more affordable. So, how do we do this?

Starting point is 00:17:15 So first, we have been testing many different versions of the TdT enzyme. As we can see in this green image right here, this is the structure of the TdT enzyme. We have been making small changes along the structure of the enzyme, and we have been testing whether these changes make the enzyme better. For example, whether, when we change a small portion of this enzyme, the reaction becomes faster, or the enzyme becomes more stable, resisting higher temperatures. We have been making this kind of test. And once we get an enzyme that we believe is good to be applied in the process,

Starting point is 00:18:02 we start the production. And how do we produce the enzyme? First, we program bacteria. When I say program, I mean we're genetically modifying the bacteria, so the bacteria will be able to produce this enzyme. We cultivate them at a large scale. We purify. And once purified, the enzyme will be ready to be applied in DNA synthesis devices. And speaking of devices, we have also been developing micro-devices for enzymatic synthesis, and the advantage of working with this kind of technology is that we can parallelize the process, which means we can synthesize many, up to millions of, sequences at the same time, and we can also work with small volumes, so it's easier to scale up the process later.

Starting point is 00:19:02 But like I said before, we are very young in this DNA data storage field. It's been a short but very intensive eight months. So it's still ongoing work, and I hope that next year we can come back here and share some interesting results that we have in DNA enzymatic synthesis. And now Marília will come here and show us some progress in chemical synthesis. Thank you, João. Hello, everyone. I'm Marília, and I'm going to continue the presentation.

Starting point is 00:20:06 I'm going to talk a little about our results and a little bit about DNA chemical synthesis. So to synthesize DNA chemically, we first need a surface or substrate, and on this surface we need a molecule called a linker. We need this linker to couple the nitrogen bases, called phosphoramidites here. And these phosphoramidites are adenine, thymine, cytosine and guanine: A, T, C, G. So with this process, we can obtain a single-strand DNA, that is, only one part of the double-helix DNA. And to understand this process better, here we have a simple cycle for DNA synthesis. This cycle has four different steps. So, for example, we need to synthesize that sequence, A, T, C, G.
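A minimal simulation of stepwise, one-base-per-cycle synthesis, sketched under stated assumptions: every base (including the first) needs one coupling, each coupling succeeds with a fixed probability, and a strand whose coupling fails is capped and never grows again. The 99% efficiency is a placeholder chosen for illustration, not a measured value from the talk, and the function names are hypothetical.

```python
import random


def synthesize_pool(target: str, strands: int = 10_000,
                    coupling_efficiency: float = 0.99, seed: int = 0) -> list[int]:
    """Return the final length of every strand in a simulated synthesis pool."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(strands):
        length = 0
        for _base in target:
            if rng.random() < coupling_efficiency:
                length += 1   # coupling worked; oxidation and deblocking follow
            else:
                break         # coupling failed; capping ends this strand for good
        lengths.append(length)
    return lengths


def expected_full_length_fraction(n_bases: int, coupling_efficiency: float) -> float:
    """Probability that all n_bases couplings succeed."""
    return coupling_efficiency ** n_bases


target = "ATCG" * 36 + "AT"   # 146 bases, the strand length quoted later in the talk
lengths = synthesize_pool(target)
full = sum(1 for n in lengths if n == len(target)) / len(lengths)
print(f"simulated full-length fraction: {full:.3f}")
print(f"expected full-length fraction:  {expected_full_length_fraction(146, 0.99):.3f}")
```

With these assumed numbers only about 23% of strands reach full length, since 0.99 raised to the 146th power is roughly 0.23. Capping turns a failed coupling into a short, truncated strand rather than a full-length strand with a missing base, which is easier to filter out afterwards.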

Starting point is 00:21:12 So first, we need to couple the base A. We need to wash the system with a chemical solution in a capping process. We oxidize the system to make the bonds between the molecules stronger. And we deblock the system to remove the protecting group. Okay? So now the process starts again and we need to add the base T. And we have the same steps. And with this process we can grow the DNA, but we can observe that in this example we couldn't add the base C, and because of this

Starting point is 00:22:08 the capping process is important. This process is important so that we can grow this molecule without errors. With this process, and so on, we can grow the DNA. In our team we are working with DNA synthesis on different substrates, on different surfaces with different materials. And we are working with this process in micro-reactors, using microfluidic devices. About our different materials, we are working with new particles for DNA synthesis with different materials, and we are working with different microfluidic devices. Here we have an example of one of our microfluidic devices. It was only a test for DNA chemical synthesis, and we are working with this process using an electrochemical process for

Starting point is 00:23:50 DNA synthesis. In detail, we can see the device that I showed you. And it's important to observe that we have this geometry here, because this is important to synthesize more than one type of DNA. We can have a parallelized process with this device. That is, we can synthesize different molecules at the same time and inside the same device. Now, a little part of our results. It's only a demonstration of our work. We encoded a text file with 38 bytes into two DNA sequences with 146 bases each. The message was "IPT and Lenovo, a successful partnership."

Starting point is 00:25:10 And in this picture, we have the structure of the message. So we have 19 bases at the beginning and at the end of the sequence, known as primers. We have 8 bases to check redundancy using the Reed-Solomon algorithm. And we have a payload with 100 bases, with one bit of address and 59 bits of the file specifically. Okay? And we converted the message into DNA sequences with a forward primer in yellow and a reverse primer in blue. And we synthesized those sequences on different materials using the chemical

Starting point is 00:26:09 process for DNA synthesis. We synthesized using particles and columns of different materials. And we studied two options for storing this material, the synthesized DNA. First, we can store the DNA after a cutting process that cuts the molecule from the surface. Or we can store the surface itself, particles and columns, for example, with the synthesized DNA. To recover this information, we need to create copies using PCR, the polymerase chain reaction. And then we can sequence this to recover those two DNA sequences that I presented to you a few slides ago. And to recover the message, to recover the information, we needed to align the sequences, we needed to detach the primers, and we needed to check the errors using the logical redundancy with the Reed-Solomon algorithm. And we can... Okay, yes. We used a known algorithm cited in the literature, from Blawat.
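The first two recovery steps, detaching primers and reconciling several reads of the same strand, can be sketched as follows. This is an illustration under stated assumptions, not the team's pipeline: both 19-base primer sequences are hypothetical placeholders, the reverse primer is treated as a literal suffix of the read, alignment is reduced to a per-position majority vote over reads of equal length, and the Reed-Solomon check mentioned in the talk is not reproduced here.

```python
from collections import Counter

FORWARD_PRIMER = "ACGTACGTACGTACGTACG"   # hypothetical 19-base primer
REVERSE_PRIMER = "TGCATGCATGCATGCATGC"   # hypothetical 19-base primer


def detach_primers(read, forward=FORWARD_PRIMER, reverse=REVERSE_PRIMER):
    """Return the region between the primers, or None if a primer is missing."""
    if read.startswith(forward) and read.endswith(reverse):
        return read[len(forward):len(read) - len(reverse)]
    return None


def consensus(reads):
    """Majority vote per position over the reads sharing the most common length.

    Assumes a non-empty list; reads of any other length are simply ignored.
    """
    common_length = Counter(len(r) for r in reads).most_common(1)[0][0]
    usable = [r for r in reads if len(r) == common_length]
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*usable))


inner = "GATTACA"
raw_reads = [
    FORWARD_PRIMER + inner + REVERSE_PRIMER,
    FORWARD_PRIMER + "GATTTCA" + REVERSE_PRIMER,   # one substitution error
    FORWARD_PRIMER + inner + REVERSE_PRIMER,
    "NNNNNN",                                      # junk read without primers
]
trimmed = [t for t in (detach_primers(r) for r in raw_reads) if t is not None]
print(consensus(trimmed))
```

Majority voting only repairs substitutions that a minority of reads share. Insertions and deletions change read length, so a real pipeline needs a proper alignment step plus the error-correcting code to handle what voting cannot.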

Starting point is 00:28:07 I don't... Jay, can you help me? Blawat from... I don't remember the year of the paper. Do you remember? Which I would actually, if you guys want to see it after lunch, I can present roughly how this is done. I don't remember the year of the paper. Do you remember? It's 2015, I guess. Okay, so we used only this algorithm. We are developing our own new algorithm, but in this exercise, it was the first exercise that we did, we used this algorithm. No problem.

Starting point is 00:28:51 So we could recover the zeros-and-ones sequence in binary code, and we needed to reassemble these files to recover the information without any errors: "IPT and Lenovo, a successful partnership." Our work is ongoing, and we need to select the best surface for DNA synthesis and how to recover the synthesized DNA. We are developing a DNA storage protocol in our institute. And we are working a lot with DNA synthesis inside microfluidic devices using electrochemical synthesis. And we would like to be a part of the DNA Data Storage Alliance, and we are working on this.

Starting point is 00:29:56 And now we have a picture of our team. We are almost 50 researchers from many areas of knowledge. We have biologists and chemists and data analysts and engineers, with our t-shirt, working hard together to go further. So that's it. Thank you, everyone. Thanks for listening. For additional information on the material presented in this podcast, be sure to check out our educational library at snia.org slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.
