Storage Developer Conference - #192: DNA data storage: Coding and decoding
Episode Date: June 12, 2023...
Transcript
Hello, this is Bill Martin, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast.
Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 192.
My name is João, but you may call me Lucas. It's my second name. And together with my colleague Marília,
we're going to make a presentation about our work with DNA data storage. So we come from Brazil,
and we have started working with DNA data storage this year in January. So we're very new to this
field. And we're here to talk about our experiences, share some of our results, and also introduce ourselves to the DNA data storage world.
So, I would like to start with this brief introduction. We come from the Institute for
Technological Research in São Paulo, Brazil, which we abbreviate as IPT. For 123 years,
we have been providing technical solutions in many different areas for industry, our government,
and society in general,
contributing to the technological development of Brazil.
And more specifically, inside IPT,
we have many different research centers.
We, the ones who are here today, come from the Bio... I'm so sorry,
I thought I had corrected this yesterday, but apparently not. So it's Bio, from biology:
the BioNano Manufacturing Center. We are a multidisciplinary research facility
focused on research in biotechnology, nanotechnology, materials science, and micromanufacturing.
So, in IPT, and more specifically in the Bionano Center,
we are always looking for partnerships with different companies to develop innovative and disruptive technologies.
And we wanted to work with DNA Data Storage first because we really believe in the potential of this kind of technology to change how many different industries will store their data and work in the future.
And also, we believe that we could combine all four areas of our expertise, biotechnology, nanotechnology, material science, and micromanufacturing, to make a good contribution to this area.
And Lenovo, which is one of the leaders today in designing and manufacturing data storage
devices, they have decided to join us in this venture to bring DNA data storage solutions
into reality.
We have been working together on this project since January this year, and we have named our project Prometheus.
I don't know if this is the correct pronunciation, but it's a tribute to the Greek Titan, as a symbol of our joint efforts in DNA data storage.
And we have been working together since then.
So I would like to start with this introduction to you
about why DNA data storage.
I guess it's no surprise for anyone here
that we have been generating more data than we can store,
and therefore we need to find some better solutions.
And DNA is being presented as this possible solution.
But why should we work with DNA data storage?
First, DNA presents the possibility to store a very high density of data along its molecules.
Some researchers say that we have the potential to store petabytes of data in
just one gram of DNA.
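A back-of-the-envelope sketch of where a per-gram figure can come from. The constants are assumptions for illustration and were not given in the talk: an average mass of about 330 g/mol per single-stranded nucleotide, and the ideal case of two bits per base. The result is a physical ceiling, not a practical capacity.

```python
# Back-of-the-envelope ceiling on DNA storage density.
# All constants below are assumptions for illustration, not figures from the talk.
AVOGADRO = 6.02214076e23       # molecules per mole
NT_MASS_G_PER_MOL = 330.0      # assumed average mass of one single-stranded nucleotide
BITS_PER_NT = 2.0              # ideal case: four bases carry two bits each


def max_bytes_per_gram(bits_per_nt: float = BITS_PER_NT,
                       nt_mass: float = NT_MASS_G_PER_MOL) -> float:
    """Upper bound on the bytes held by one gram of single-stranded DNA."""
    nucleotides_per_gram = AVOGADRO / nt_mass
    return nucleotides_per_gram * bits_per_nt / 8


ceiling = max_bytes_per_gram()
print(f"{ceiling / 1e15:,.0f} PB per gram, before primers, redundancy, or extra copies")
```

Primers, error-correction redundancy, and the many physical copies kept of each strand all pull a real system well below this ceiling, which is why quoted figures are usually much more modest.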
That would mean that we could store all the data that we find in a data center today in
a very small microtube,
which is a very impressive compaction capacity. On the other hand, DNA may also be
very stable along the years. So I brought this research article here as an example to show that
this year, in 2022, researchers were able to recover the DNA from the remains of the victims of the Mount Vesuvius eruption
that happened approximately 2,000 years ago.
It is important to notice in this example that over those 2,000 years
the DNA was not stored in conditions chosen to keep it stable.
It was present in the remains and was subjected to very
harsh conditions, like an eruption. Still, the amount of sequences they were able to recover was
enough to reconstruct the information they needed. So DNA can be a very stable medium.
And another interesting fact about it is that as
long as there are humans on Earth, DNA will always have to be read, which means
DNA reading technologies will never be obsolete. I mean, a given technology
can become outdated, but if it does, it will be replaced
by another reading technology that is much improved over the previous one.
So DNA will always be readable throughout human history.
I brought this example to show
how we could store data in DNA.
I brought the LTO tape.
Many people here will certainly know how to explain how an LTO tape
works way better than I do. But what I want to show here is that, in an LTO tape,
we have different magnetic states along the tape. If it's north, in this case, we have a one, and if it's south, it's a zero. So
the orientation of the magnetic field represents our binary code. In DNA,
we can think in a similar way. But instead of using the orientation of the magnetic field, we use the sequences of the four different bases that
are part of our DNA today, which are the A, T, C, and G.
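The analogy can be sketched as a direct mapping from bits to bases. This is only the simplest possible illustration, assuming two bits per base with the arbitrary assignment 00 to A, 01 to C, 10 to G, 11 to T. It is not the coding scheme the speakers used, and it ignores constraints such as error correction.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of four."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"IPT")
print(encoded)
print(dna_to_bytes(encoded))
```

Practical codes usually add rules on top of a mapping like this, for example to avoid long runs of the same base, at the cost of storing fewer than two bits per base.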
So, I guess everyone here has already presented a DNA data storage pipeline, and we'll end up doing the same.
The only difference, I guess, is that we ended up breaking down
the coding part: first we convert the information
to binary code, and then we convert our binary code to DNA code.
Then we synthesize the sequences that we need to store.
Then we store the DNA.
When we need to access the information, we recover. We sequence
the DNA and use a decoding algorithm.
It might seem a little simple because in the end, we know
exactly everything that we have to do in order to
store and recover our information in DNA. But
if it's so simple, and we know what we have to do,
why hasn't DNA data storage taken over the world,
and why is it not being employed by many different industries?
Because in this pipeline we have a very important bottleneck,
which is step number three, the DNA synthesis step.
As I mentioned in the introduction, we can store
petabytes of data in just one gram of DNA.
It's possible, it's not a lie, but
it's not so simple. It's not as if I can say, oh, I'm going to
come here and make one gram of DNA,
and then all of a sudden I will have 200 petabytes in just a microtube.
It is not so simple because what really matters here is not the amount of DNA
but the sequences that we make to encode our information. And in order to properly encode our information,
considering many constraints
like error correction and other things,
and to really store our petabytes of data,
in the end, we will end up having to synthesize
billions of unique sequences in our process.
So this is why this is the bottleneck:
up to today,
no one has really been able to do this
in a fast, affordable, and precise way.
This is the biggest challenge that we have been facing up to today.
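To get a feel for the scale of the synthesis problem, here is a rough count of how many unique strands a given amount of data needs. The 19 file bytes per strand is an assumption taken from the small demonstration described later in this talk (38 bytes across two sequences); longer payloads reduce the count proportionally.

```python
def strands_needed(total_bytes: int, file_bytes_per_strand: int) -> int:
    """Number of unique strands needed, rounding up to whole strands."""
    if file_bytes_per_strand <= 0:
        raise ValueError("file_bytes_per_strand must be positive")
    return -(-total_bytes // file_bytes_per_strand)  # integer ceiling division


DEMO_BYTES_PER_STRAND = 19  # assumption: 38 bytes split across two sequences

print(strands_needed(38, DEMO_BYTES_PER_STRAND))           # the small demo file
print(strands_needed(10**15, DEMO_BYTES_PER_STRAND))       # one petabyte, same payload
print(strands_needed(10**15, 10 * DEMO_BYTES_PER_STRAND))  # ten times the payload
```

Each of those strands is a different sequence that has to be synthesized correctly, which is why throughput, cost, and accuracy of synthesis dominate the discussion.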
And this is the problem that most people who work with synthesis
are trying to solve.
So, having in mind that synthesis is today the most important bottleneck that we have, we at IPT and Lenovo have divided our work between two different DNA synthesis routes.
The first one is the enzymatic synthesis, and the second one is the chemical synthesis.
So why are we working
with two different approaches?
Because each one of them has its advantages
and disadvantages, and it's important to say
that it's not a competition
over which one is better.
We believe that in the future
we can explore both of them, because both of them
may provide good solutions for the DNA data storage world. And both of these groups,
the enzymatic and chemical, have the support of an engineering team that is responsible for the
design and construction of micro devices that would be the devices that we actually use to synthesize the DNA,
a codec team that is responsible for the coding and decoding of our information,
and a sequencing team that assures that we can properly recover the information
we stored and our synthesizers are properly working.
So, going a little bit further in detail, I'm going to talk about the chemical DNA synthesis.
I'm going to give a brief introduction, and Marília will go deeper later.
So the chemical DNA synthesis,
it is a more standardized technique.
It has been used for decades in the world
to make primers for PCR reactions,
like many speakers have presented before.
And our challenge here is to miniaturize this process,
because today the primers
are synthesized in machines like the one
I show in this image.
It's a very big machine that requires a lot of reagents
to function properly.
Our challenge is to build a micro device
so that we can miniaturize this process and make it more affordable.
One good advantage of chemical synthesis is that we have finer control of the synthesis,
which means it is easier to make sure that we are synthesizing the sequence we want, with fewer errors. On the other hand, a big disadvantage of this kind of process
is that we employ a lot of different chemicals,
like acids or organic solvents,
and at the end of the process
we need to dispose of those chemicals properly.
So let's imagine that one day, hopefully,
we have a DNA synthesizer that employs this
kind of process, and we want to store the data of a large bank, like JP Morgan, for example.
For that process to work,
we would have to employ a great amount
of those chemicals, which are not easy to dispose of,
and the process would not be so environmentally friendly.
But when we see all the research
that has been done in this area,
we believe this is something that is being considered throughout the development,
and hopefully it is not going to be a problem in the future.
On the other hand, we also have enzymatic DNA synthesis. So, what is the most important
thing here? For this kind of synthesis, there is one enzyme,
called TdT, is it correct? It has been widely adopted for this kind of application. One great feature of this enzyme is that it's capable of taking a single-stranded DNA
and adding nucleotides to it without the need for a template.
This is the feature
that allows it to be used in the data storage process:
we can synthesize DNA
without a template,
which is the opposite of how DNA is usually copied in nature.
So how exactly does an enzyme create the DNA?
I brought this analogy here with a hard disk, so we
can think of an enzyme like the head of the hard disk. The head is the part that is responsible
for the recording of the information. And the DNA, we can think of it as the physical
medium where the information is being recorded. So basically what happens is that we use the enzyme to make the DNA, and
as the enzyme keeps adding the nucleotides, the A's, T's, C's, and G's, we are
recording our information along the DNA. In this case the enzyme is
essential for the DNA synthesis to happen. One of the advantages of working with DNA enzymatic
synthesis is that it's a very novel technique.
On one side, the processes, the techniques of this kind of
synthesis are not so mature when compared
to the chemical synthesis, but on the other side, we believe
there's a lot of room
for improvement and many new possibilities
that we can discover along the process.
It is also a much more environmentally friendly process,
because what we use here is just water, some salts,
some very small amounts of organic solvents,
and the enzyme itself, which is
naturally degradable, so it's not going to be a great environmental issue. But one
disadvantage is that with this technique it is not so simple to
control the synthesis, because during the reaction, as long as there
are nucleotides, the bases, in the medium, the enzyme will keep adding them
with no control. One possible way to solve this is to use blocked nucleotides, as Marília will
present later, and we're also working on this. So, one of the biggest contributions we want to
make to this DNA data storage field is to use our expertise to help make this technology more affordable. We are working on the production of this kind of enzyme, because we want to develop a protocol
to synthesize TdT with high productivity and high
conversion. With this kind of process we will end up diluting
some of our fixed costs, and the enzyme will become much cheaper to bring
to the market. And if the enzyme
becomes much cheaper, the enzymatic DNA synthesis process as a whole becomes more affordable.
So, how do we do this?
First, we have been testing many different versions of the TdT enzyme. As we can see in this green image right here,
this is the structure of the TdT enzyme.
We have been making small changes along the structure of the enzyme, and we have been
testing whether these changes make the enzyme better. For example, when we change a small
portion of the enzyme, does the reaction become faster,
or does the enzyme become more stable, withstanding higher temperatures?
We have been running this kind of test.
And once we get a good enzyme that we believe is good to be applied for the process,
we start the production.
And how do we produce the enzyme?
First, we program bacteria. When I say program, I mean we genetically modify the
bacteria so that they are able to produce this enzyme. We cultivate them at a large scale.
We purify. And once purified, the enzyme is ready to be applied in DNA synthesis devices.
And speaking of devices, we have also been developing micro-devices for enzymatic synthesis. One advantage of working with this kind of technology is that we can parallelize the process,
which means we can synthesize a lot of sequences, up to millions, at the same time,
and we can also work with small volumes, so it's easier to scale up the process later.
But like I said before, we are very new to the DNA data storage
field. It's been a short but very intensive eight months. So it's still ongoing work,
and I hope that next year we can come back here and share some interesting results in enzymatic DNA synthesis.
And now Marília will come here and show
us some progress in chemical synthesis.
Thank you, João.
Hello, everyone.
I'm Marília, and I'm going to continue the presentation.
I'm going to talk a little about our results and a little bit about the DNA chemical synthesis.
So to synthesize DNA chemically, we first need a surface, or substrate, and on this surface we need a molecule called a linker. We need this linker to couple the building blocks that carry the nitrogenous bases, called phosphoramidites.
These phosphoramidites carry adenine, thymine, cytosine, and guanine: A, T, C, G.
With this process, we obtain single-stranded DNA, that is, only one strand of the DNA double helix.
And to understand better this process,
here we have a simple cycle for DNA synthesis.
This cycle has four different steps.
So, for example, say we need to synthesize the sequence A, T, C, G.
First, we need to couple the base A.
We need to wash the system with a chemical solution
in a capping process.
We oxidize the system to make the bonds between the molecules stronger.
And we deblock the system to remove the protecting group.
Okay?
Now the cycle starts again and we need to add the base T, with the same steps.
With this process we can grow the DNA. But we can observe that in this example one strand couldn't add the base C, and this is why
the capping process is important: it blocks the strands that failed to couple, so they do not keep growing
with a missing base. With this cycle, repeated step by step, we can grow the DNA without those errors.
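Because every strand that fails a coupling is capped and stops growing, the fraction of strands that reach full length shrinks with each cycle. A minimal sketch of that relationship, assuming one coupling per base and the same efficiency at every step; the efficiency values are illustrative and were not given in the talk.

```python
def full_length_fraction(n_bases: int, coupling_efficiency: float) -> float:
    """Fraction of strands that couple successfully at every one of n_bases cycles."""
    if n_bases < 0:
        raise ValueError("n_bases must be non-negative")
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling_efficiency must be between 0 and 1")
    return coupling_efficiency ** n_bases


# Illustrative efficiencies for a 146-base strand (the length used in the demo).
for efficiency in (0.98, 0.99, 0.995):
    fraction = full_length_fraction(146, efficiency)
    print(f"efficiency {efficiency:.3f}: {fraction:.1%} of strands reach full length")
```

Even at 99% per step, only roughly a quarter of the strands would reach 146 bases under these assumptions, which is one reason strand length is kept short and redundancy is added.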
In our team we are working with DNA synthesis on different substrates, on different surfaces made of different materials. And we are working with this process in micro-reactors, using
microfluidic devices.
As for the materials, we are working with nanoparticles of different
materials for DNA synthesis, and we are working with different microfluidic devices.
Here we have an example of one of our microfluidic devices.
It was only a test for chemical DNA synthesis, and we are working with an electrochemical process for
DNA synthesis. Here we can see that device in detail. It's important to observe the geometry here,
because it is what lets us synthesize more than one type of DNA.
We can have a parallelized process with this device.
That is, we can synthesize different molecules at the same time, inside the same device.
A little part of our results.
It's only a demonstration of our work.
We encoded a text file of 38 bytes into two DNA sequences of 146 bases each.
The message was IPT and Lenovo, a successful partnership.
And in this picture, we have the structure of the message.
So we have 19 bases at the beginning
and at the end of the sequence, known as primers. We have 8 bases
of redundancy for error checking, using the Reed-Solomon algorithm. And we have a payload
of 100 bases, with one byte of address and 19 bytes of the file itself.
Okay? And we converted the message into DNA sequences
with a forward primer in yellow and a reverse primer in blue.
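The strand structure can be checked with a little arithmetic. The sketch below assumes an 8-base redundancy field, which is the value that makes two 19-base primers and a 100-base payload add up to the stated 146 bases, and five bases per byte, which makes the payload hold one address byte plus 19 file bytes (38 bytes over two sequences). The field order and the `StrandLayout` and `split_strand` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandLayout:
    primer_len: int = 19      # from the talk: 19 bases at each end
    payload_len: int = 100    # from the talk: 100-base payload
    redundancy_len: int = 8   # assumption: the value that makes the total 146
    bases_per_byte: int = 5   # assumption about the byte-to-base mapping

    @property
    def total_len(self) -> int:
        return 2 * self.primer_len + self.payload_len + self.redundancy_len

    @property
    def payload_bytes(self) -> int:
        return self.payload_len // self.bases_per_byte


def split_strand(strand: str, layout: StrandLayout) -> dict:
    """Cut a strand into its fields, assuming the order primer, payload, redundancy, primer."""
    if len(strand) != layout.total_len:
        raise ValueError(f"expected {layout.total_len} bases, got {len(strand)}")
    p, n, r = layout.primer_len, layout.payload_len, layout.redundancy_len
    return {
        "forward_primer": strand[:p],
        "payload": strand[p:p + n],
        "redundancy": strand[p + n:p + n + r],
        "reverse_primer": strand[p + n + r:],
    }


layout = StrandLayout()
placeholder = "A" * 19 + "C" * 100 + "G" * 8 + "T" * 19  # not a real sequence
fields = split_strand(placeholder, layout)
print(layout.total_len, layout.payload_bytes)
print({name: len(value) for name, value in fields.items()})
```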
And we synthesized those sequences on different materials using the chemical
process for DNA synthesis. We synthesized using particles and columns of different materials. And we studied two options for storing the synthesized DNA.
First, we can store the free DNA, after a cleavage step that cuts the molecule from the surface.
Or we can store the support itself, particles and columns, for example, with the synthesized DNA still attached. To recover the information, we need to create copies using PCR, the polymerase
chain reaction. And then we can sequence them to recover the two DNA sequences that I presented a few slides ago.
And to recover the message, to recover the information,
we needed to align the sequences, remove the primers,
and check for errors using the redundancy
with the Reed-Solomon algorithm. And we can... Okay, yes. We used a known algorithm cited in the literature, from Blawat.
I don't... Jay, can you help me? Blawat from...
I don't remember the year of the paper. Do you remember?
If you guys want to see it after lunch, I can present roughly how this is done.
It's 2015, I guess.
Okay, so we used only this algorithm.
We are developing our own algorithm, but in this exercise, the first exercise that we did,
we used this algorithm.
No problem.
So we could recover the sequence of zeros and ones, the binary code,
and we needed to reassemble the file
to recover the information without any errors:
IPT and Lenovo, a successful partnership.
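The recovery steps (remove the primers, read the address, reassemble in order) can be sketched as follows. This is a simplified illustration, not the speakers' decoder: it uses a plain two-bits-per-base mapping, made-up primers, and no Reed-Solomon check, so it assumes error-free reads.

```python
BASES = "ACGT"                      # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T
FWD_PRIMER = "ACGTACGTACGTACGTACG"  # hypothetical 19-base forward primer
REV_PRIMER = "TGCATGCATGCATGCATGC"  # hypothetical 19-base reverse primer


def bytes_to_dna(data: bytes) -> str:
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))


def dna_to_bytes(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def strip_primers(read: str) -> str:
    """Return the part of the read between the two primers."""
    if not (read.startswith(FWD_PRIMER) and read.endswith(REV_PRIMER)):
        raise ValueError("primers not found at the ends of the read")
    return read[len(FWD_PRIMER):len(read) - len(REV_PRIMER)]


def reassemble(reads: list[str]) -> bytes:
    """Decode each read, then join the chunks in address order."""
    chunks = {}
    for read in reads:
        body = dna_to_bytes(strip_primers(read))
        address, chunk = body[0], body[1:]  # first byte is the address
        chunks[address] = chunk
    return b"".join(chunks[address] for address in sorted(chunks))


# Build two strands from the message, then recover it from shuffled reads.
MESSAGE = b"IPT and Lenovo, a successful partnership."
half = (len(MESSAGE) + 1) // 2
strands = [
    FWD_PRIMER + bytes_to_dna(bytes([address]) + chunk) + REV_PRIMER
    for address, chunk in enumerate((MESSAGE[:half], MESSAGE[half:]))
]
recovered = reassemble(list(reversed(strands)))
print(recovered.decode("ascii"))
```

The address byte is what lets the two sequences come back in any order from the sequencer and still be joined correctly.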
Our work is ongoing. We still need to select the best surface for DNA synthesis and the best way to recover the synthesized DNA.
We are developing a DNA storage protocol in our institute.
And we are working a lot with DNA synthesis inside microfluidic devices using electrochemical synthesis.
We would also like to be a part of the DNA Data Storage Alliance, and we are working on this.
And now we have a picture of our team. We are almost 50 researchers from many areas of knowledge. We have biologists,
chemists, data analysts, and engineers, with our t-shirts, working hard together to
go further. So that's it. Thank you, everyone.
Thanks for listening.
For additional information on the material presented in this podcast,
be sure to check out our educational library at snia.org slash library.
To learn more about the Storage Developer Conference,
visit storagedeveloper.org.