Storage Developer Conference - #170: DNA Data Storage and Near-Molecule Processing for the Yottabyte Era
Episode Date: June 20, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 170.
Hello, everyone. My name is Luis Ceze. I'm a professor of computer science at the University
of Washington and co-founder and CEO at OctoML.
It's really great to have an opportunity to give a talk at SNIA SDC again.
So it turns out that I gave a talk here with Karin Strauss five years ago on DNA data storage.
And it's just so great to see how much has happened since then and how much excitement there is in this new field.
It's great to see a fantastic session here full of great results.
So we're going to tag team.
I just introduced myself.
Now I'm going to pass the talk to Karin,
and I'll come back halfway through the talk.
Hi, everyone.
I'm Karin Strauss,
the Senior Principal Research Manager
at Microsoft Research.
And it's really, as Luis mentioned,
a pleasure to come back to
SNIA SDC and talk about the progress in the area. For five years, we've been working on
DNA data storage. There are other companies now who have joined this area, and you're going to
hear from them later on. And we've also created the DNA Data Storage Alliance to push the field
forward. So let's dig into it. Today, what I'm going to do is give you an overview of DNA Data
Storage and sort of provide a little bit of context of the talks that you're about to hear
later in the presentation. And Luis will cover some of the trends and also some of the work
that we've been doing during these five years. All right, so this is a collaboration, as you can
probably gather by now, between Microsoft and University of Washington, and together we created
the Molecular Information Systems Lab to work on chemistry and biology innovations that can help with issues in the IT industry.
So DNA data storage is our flagship project.
And...
Sorry about the technical issues here.
So it's our flagship project. And just for this audience, I don't need to belabor this. But really, there's a gap between what we generate
and our ability to store information. And that gap is growing if we simply follow the trend.
So what this results in is that the percentage of all the data we generate that we can actually store is shrinking
over time. So we need different ways to store information, and DNA data storage provides
a different way to do that. So here's the idea now and how it's different from what's
been done before. So the storage industry,
and semiconductor industry in general has evolved by following Moore's law.
And so essentially increasing the number of transistors or the number of cells that you have to store information over time and packing more on the same area.
And with that, getting gains in capacity and gains in cost. Now, I want to contrast this approach with an approach that was proposed in
the 60s, in the 50s actually, by Richard Feynman, where he pointed out that if we have the ability
to arrange atoms in the way we want, we may be able to get to molecular computing, molecular
storage. And this is what we're trying to do with DNA data storage. He even used DNA as an example of a molecule that can store information.
He stopped short of proposing to use it to store digital information,
but that didn't take very long.
You know, people in the sixties proposed that.
And with the increased technological progress
in biotechnology,
now we have the tools to really realize that vision.
So this era of DNA data storage started in about 2012, 2013, when folks observed
that the tools now make it possible for us to store and retrieve digital data from DNA. So what's DNA data storage? If we think of
DNA as a material, DNA is the double helix, right? And each side of the double helix is a sequence of
bases, A, T, C, and G, right? And these bases, today we have the tools to arrange these bases
in arbitrary sequences so that if we have a sequence of bits
that we want to store, we can store that sequence of bits by translating, by doing a mapping between
the bits and a sequence of bases. So for example, you could use the simple mapping that I'm showing
here, or you could use a completely different mapping. In fact, we use more sophisticated
codes to do that translation. But bottom line, translate a sequence of bits into a sequence of bases. Now, I talked about
the double helix. The information in the double helix is redundant.
What I mean by that is that if you have a particular sequence on one side of the double helix, then, if you remember from biology classes, A complements T.
And if the A is on one side of the double helix, the T is on the other side.
If a C is on one side, the G is on the other side.
And so there's complementarity there.
But from an information storage point of view, that's sort of redundant information.
So we typically think of the DNA as one side of the double helix, and this is what's drawn here.
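Just to make that mapping concrete, here is a minimal sketch in Python of a two-bits-per-base mapping, together with the complement relationship just described. This is illustrative only; the actual codecs are more sophisticated, with constraints such as GC balance and avoiding long repeats.

# Illustrative 2-bits-per-base mapping (not the production codec).
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {base: pair for pair, base in BIT_PAIR_TO_BASE.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def bits_to_bases(bits):
    # Two bits per base; 'bits' must have even length.
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def bases_to_bits(seq):
    # Inverse mapping: DNA sequence back to the original bit string.
    return "".join(BASE_TO_BIT_PAIR[b] for b in seq)

def reverse_complement(seq):
    # The other side of the double helix carries the same information, complemented.
    return "".join(COMPLEMENT[b] for b in reversed(seq))

assert bases_to_bits(bits_to_bases("0111001000")) == "0111001000"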
Now, as I mentioned, now we have tools to write and read the DNA in arbitrary ways.
And so we have technology to make synthetic DNA.
And this is what I'm going to be talking about during this
presentation and probably throughout the track, we're all talking about synthetic DNA as a way
to store information. And so, to point out, this is the material, but there are no cells,
no organisms, no life involved here. It's really the material used to store the information.
So why do we want to store digital information in synthetic DNA?
First is density.
So here's a test tube.
This is sort of an expanded picture of the test tube.
And what you see at the bottom of that test tube, that pink smear, is dried DNA.
And that's enough DNA to store, enough physical DNA to store the equivalent of 10 terabytes. So, you know, a pretty reasonable hard drive. And you can store all that in the
bottom of the test tube. So density is quite high for DNA. And if you extrapolate that to what this means data center-wide: what today we
need a building to house, tomorrow you'd need essentially a very small space. And this is
really sort of proportionally what you'd need. So it's really probably just a pixel in your screen.
So very high density. In real size, that's about one cubic inch that can house about one exabyte
of data in it. So quite, quite dense. So density is certainly one property that we like about DNA.
The other is durability. So DNA can be encapsulated. There are even
demonstrations in the wild of DNA that can preserve its information for thousands of years,
even a million years. But obviously that's under very special conditions. But it turns out that
those conditions can be reproduced synthetically and we can store DNA for quite a long time without having the information in it degrade.
And the extreme density of DNA also helps because if you need to, for example, create some conditions like cooling it, it's very dense.
And so it's pretty easy to cool quite a lot of material and therefore quite a lot of data.
So that begs the question, how does it compare with other types of media?
So here we're plotting both density and the durability lifetime of different media.
And so you can see that DNA, in projections and at the limit, is many orders of magnitude better than the best types of technologies we have. But even once we discount
everything that we need, that we think we need to build a practical system, so overheads from
error correction, metadata, the containers that hold the DNA themselves, even if we discount that,
we still have a few orders of magnitude higher density. And also, you know, thousands of years, which is
sort of a conservative prediction here, is actually pretty good from a preservation point of view.
So next is relevance. So now that we know how to read DNA, and we use it to read our genomic DNA, we will always have readers to recover the digital information that we store in synthetic DNA.
And so we can essentially ride all the improvements that the biotechnology industry has been making, essentially to our benefit.
And what's interesting about DNA is that the medium doesn't change.
So as we improve the technology, we don't need to migrate to the next generation of
medium.
So that actually makes it even more relevant and reduces some of the headaches that we
typically have in doing migration.
And finally, one other property, nice property of DNA is the ability to make copies.
So essentially, there's this reaction that you may have heard of due to COVID times, the polymerase
chain reaction, where you can use it to copy the DNA and copy it in bulk so that you
end up with many copies of the same sequence. And so, you know, for making multiple copies,
and for example, data distribution, this could be quite interesting. And finally, recently,
we did a study on environmental sustainability. That's very much a topic that's front and center these days with climate change.
And what we did was compare what a system would look like if we deployed DNA at large scale and how it compares to a tape system, in terms of the carbon emissions, the energy consumption, and the water consumption needed
to store one terabyte for a year.
So that was sort of our unit.
And what this screening life cycle assessment has shown is that along these axes,
DNA is actually quite promising.
So we're very excited about that as well. So DNA
could provide a more sustainable way to store digital information. All right. So I'm going to
cover quickly what we, you know, our view of what the DNA data storage system should look like.
And you're going to hear a lot more in this track. You're going to hear a lot more from other companies working on different parts of this,
uh, of this pipeline, this system here.
So we start with bits.
Um, and as I mentioned, we can encode into sequences of those bases, right?
And that's still an electronic representation.
And then it's time to make the molecules.
That's the process of synthesis, which is our write process.
Next, we preserve the DNA, creating the conditions for the information in it to be preserved. When it's time to read, we'll perform random access and recover the molecules
that we're interested in reading. We'll sequence them, which is the process of reading them,
and we'll decode them and then recover the bits that we had originally stored.
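As a very rough sketch of that flow, here is a toy end-to-end skeleton in Python. Every stage below is just a stand-in for the real codec or chemistry, not the actual software.

def encode(bits):
    # Real encoders split data into indexed segments, add error-correction
    # redundancy, and map bits to A/C/G/T sequences.
    return [bits[i:i + 16] for i in range(0, len(bits), 16)]

def synthesize(sequences):
    # Stand-in for the write process: array synthesis of physical molecules.
    return list(sequences)

def random_access(pool):
    # Stand-in for PCR-based selection of one file's strands.
    return list(pool)

def sequence(pool):
    # Stand-in for the read process (Illumina or nanopore), noisy in practice.
    return list(pool)

def decode(reads):
    # Real decoders cluster reads, take a consensus, and correct errors.
    return "".join(reads)

payload = "0111001011010001"
assert decode(sequence(random_access(synthesize(encode(payload))))) == payload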
So let me show you how this is done in a little bit more detail.
But before I do that, I just wanted to comment on some of the results so far.
So we've been working, again, Microsoft, University of Washington,
and for this particular work, we collaborated with Twist Bioscience.
We encoded one gigabyte of data in DNA.
Twist synthesized that DNA, and we were able to recover it.
And we wanted to include archival-quality audio as well.
So we encoded a number of files, and then we were able to recover them using this end-to-end system that I was talking about.
So, again, I'm going to walk you through what the system looks like. So let's
take this OK Go video that we've encoded in DNA. So it's 44 megabytes of information. First step
was to partition it into smaller segments of different sequences of bits that we wanted to
store and number them like we used to number floppies just so that we can reorder the information
on the way out. We add some redundancy for error correction, and then we translate those bits
into bases of DNA. We add an additional tag that's sort of a file ID and also allows us to
more easily copy the information. And then those sequences go into the writing process, which is DNA synthesis. DNA synthesis is
essentially a parallel process where many molecules are grown at the same time,
growing from a surface like a lawn growing from the ground up, and it's a series of chemical steps. So here, just to cover quickly, we add a base.
There's a process that strengthens the bond of that base.
And then there's a process that allows the next base to attach,
which is the deblocking step.
And that cycle goes over and over
and that's what makes the DNA molecules.
We don't grow one molecule at a time.
We grow many, as I mentioned,
and we grow many different sequences at a time
using a method called array synthesis.
And essentially, we get the parallelism of growing multiple sequences at a time,
and that's what gives us throughput.
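As a toy illustration of that parallelism (ignoring all of the real coupling, capping, and deblocking chemistry), you can picture array synthesis like this: one base is offered per cycle, and only the spots whose next position needs that base are extended.

# Toy model of array synthesis: many spots grow different sequences in parallel.
targets = ["ACGTAC", "AAGTCC", "CGGTAA"]   # sequences we want to grow, one per spot
grown = ["" for _ in targets]

cycle = 0
while any(len(g) < len(t) for g, t in zip(grown, targets)):
    base = "ACGT"[cycle % 4]               # the base offered in this chemical cycle
    for i, target in enumerate(targets):
        # A spot is extended only if its next needed base matches this cycle's base.
        if len(grown[i]) < len(target) and target[len(grown[i])] == base:
            grown[i] += base
    cycle += 1

assert grown == targets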
Next step is to preserve the DNA.
So we'll remove the molecules from the substrate.
We'll encapsulate, and this is work we've done
in collaboration with ETH Zurich,
where the DNA is encapsulated, in this case,
in silicon dioxide nanoparticles.
And those are stored into a DNA library,
very much like we have
tape libraries today except it's a lot smaller. And then when it's time to recover the information
we do the random access using a combination of fluidics, to retrieve the right library,
and PCR,
which is that process I mentioned that copies the DNA.
So we essentially can selectively copy the DNA based on that tag that we added,
that file ID that we added in the beginning when we encoded the data.
So again, we can only copy molecules that belong to the file
that we are interested in reading and not other files.
So we copy those and then we sample from them. And most of the molecules after we sample,
after we do PCR, are the molecules that ultimately make it into a reader.
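Conceptually, that random access step behaves like a filter keyed on the file tag. A toy sketch of the idea in Python (real PCR amplifies exponentially, and real primer design is far more careful than a string prefix match):

# Toy model of PCR-based random access: selectively copy only the strands
# whose leading tag (the primer binding site) matches the requested file ID.
pool = [
    "ACGTACGT" + "TTGACCTA",   # file tag ACGTACGT + payload
    "ACGTACGT" + "GGATTCCA",
    "TGCATGCA" + "CCATTGGA",   # a strand belonging to a different file
]

def pcr_select(strands, file_tag, copies=4):
    selected = [s for s in strands if s.startswith(file_tag)]
    return selected * copies   # stand-in for exponential amplification

reads = pcr_select(pool, "ACGTACGT")
assert all(r.startswith("ACGTACGT") for r in reads)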
Okay, so then the next step is to sequence it. So to read the DNA, Illumina, who's going to talk later today, has a sequencer that's based on optical technology. And with that, you can use very smart computer vision tricks
to essentially detect which molecules are in a particular space.
That generates a number of reads that will then later decode
into the sequences of bits that we originally stored.
Another way to do that is with using nanopore devices.
And so here the reading is electrical
and it's really dragging the DNA through a nanoscale pore
and measuring electrical disturbances
that it causes as it goes through.
And this is what generates the reads.
So there are errors in both platforms. There's different types of errors,
not just substitutions, which would be the equivalent of a bit flip, but also insertions
where symbols appear and deletions where symbols disappear from our sequences. But we can still
recover the information. And so we can use many tricks from error correction and coding theory to do that.
So let me just quickly walk you through that. So here are sequences that we read. We reordered
those sequences and clustered them so that we essentially grouped together sequences that are
similar. We perform consensus analysis and come up with a sequence that's inferred to be the sequence that came in originally.
And then we'll translate that into bits.
If there are any sequences that are still missing, which are erasure errors,
then we're going to use that redundancy that we added in the beginning to recover from those. And the code can also recover from a few additional errors if they trickle in.
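Here is a minimal sketch of that consensus step, assuming the reads are already clustered and aligned and only contain substitutions; the real pipeline also handles insertions, deletions, and missing strands via the outer erasure code.

from collections import Counter

def consensus(reads):
    # Per-position majority vote over reads believed to come from the same strand.
    length = min(len(r) for r in reads)
    return "".join(Counter(r[i] for r in reads).most_common(1)[0][0]
                   for i in range(length))

reads = ["ACGTTACA", "ACGTTACA", "ACCTTACA", "ACGTTACA"]  # one read has a substitution
assert consensus(reads) == "ACGTTACA"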
All right.
And then we have all the sequences.
We can use those sequence numbers to reorder them and then to recreate the file that we had originally stored. So going back to this pipeline here, I just wanted to give you an overview of what's to
come in the rest of the session.
For encoding and decoding, we're going to have Los Alamos National Lab talk about their
system.
On synthesis, we're going to hear from Twist about what they've been
up to. We're going to hear from Imagene on preservation
and from Denali on random access, and finally from Illumina on sequencing. So with that,
I'm going to say bye and Luis will come back and talk some more about the trends and also some more of our collaboration and our work.
Thank you.
Thank you, Karen.
That's a tough act to follow, but I'll try my best here.
So I wanted to start with a few thoughts, just zooming out and seeing the progress that DNA data storage has made as a field.
And what this plot is showing is the year on the x-axis and the amount of data stored
on the y-axis, and, you know,
the different forms of synthesis used to get there.
And what's interesting first to note
is that this is a logarithmic scale,
so we are seeing exponential progress,
which is great to see.
You know, so far lately,
we're roughly on the gigabyte scale right now,
you know, definitely on array-based synthesis.
And, you know, you're going to hear more about this
later today from the Twist folks.
So with that, I want to step back
and look at the trends here.
You know, unfortunately, this is 2015,
you know, we should get Rob Carlson to update this.
But what this is, this is known as the Moore's Law of DNA reading and writing.
And a few things are of note here.
Again, this is on a logarithmic scale.
And the red line is synthesis.
So synthesis has made some exponential progress, but then kind of slowed down.
And I'm fully confident that DNA data storage is going to push this back onto a faster-paced curve.
And so the black line here is transistors per chip. So this is essentially Moore's law as a
reference. And it's interesting to see that, you know, DNA sequencing reading is improving faster
than Moore's law, at least for a while. The point here is to show that, you know, this gives a lot
of confidence that DNA data storage is likely to be viable within a reasonable amount of time.
And I'm sure we're going to hear a lot more about sequencing from Craig later today.
But now, thinking from a systems perspective, what I just talked about was throughput.
Now let's think about latency. You know, synthesis and sequencing are typically done in
batch and involve fluidics and physical actuation to a degree that's way beyond what we
typically deal with in storage systems. So when we talk about latency, it's probably going to
be on the order of tens of minutes to hours for synthesis, and for reading too if we do sequencing by synthesis.
But emerging technologies like nanopore, which Karin talked about, give you data in real time, so you're likely to push that latency,
at least for the readout side, closer to real time.
So that's one trend.
The other thing that's interesting to think about on read and write mechanisms
is looking at the trade-offs between what has been done for life sciences
and what we need for data storage.
The fundamental write and read mechanisms are very similar,
but actually the trade-offs are different.
For example, you know, think about error rates.
In life sciences, you know, a single base flip or mutation could lead to
very significant effects.
But with data storage, we can, of course, build error correcting schemes.
You're going to hear a lot more about this,
and there has been significant progress on that. And, you know, you have
several error types, and we can actually deal with them
with redundant information, which nature does to some extent,
but I would say not as robustly as computer scientists have figured out. So one
ends up wondering, like, if we did have stronger correcting codes in nature, maybe
we wouldn't have, you know, horrible diseases that we have today.
Okay.
Now, on length: for DNA data storage, of course, the longer the strand of DNA, the more data you can put in it.
But it turns out that, you know, there's a diminishing return. So after you amortize the overhead of primer sequences or
other tags, or maybe an addressing scheme, those either stay constant or grow logarithmically with
the payload. So after a few hundred bases or so, it's unlikely you're going to get significant
benefit in terms of reducing overheads. Whereas in life sciences, longer sequences tend to have
a lot more function
and you have really long genes.
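A quick back-of-the-envelope calculation shows the diminishing return on the data storage side. The 40 bases of fixed per-strand overhead assumed here is purely an illustrative number, not a measured one.

# Illustrative only: fraction of each strand that carries payload, assuming a
# fixed per-strand overhead of about 40 bases for primers, index, and tags.
OVERHEAD = 40
for strand_len in (100, 150, 200, 300, 500, 1000):
    payload_fraction = (strand_len - OVERHEAD) / strand_len
    print(f"{strand_len:5d} bases -> {payload_fraction:.0%} payload")
# Going from 150 to 300 bases helps a lot (73% -> 87%); going from
# 500 to 1000 bases helps much less (92% -> 96%).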
The reason that I mentioned this is that this suggests a lot of optimizations
to make DNA data storage more practical, right?
You can trade off accuracy for lower cost and higher throughput.
You can trade off sequence size for faster write and read.
And, you know, this is, of course,
a lot of folks working in this space
are building on these trade-offs.
Now, all right.
So we're talking about these trade-offs.
Let me now discuss something
that we're particularly passionate about,
which is the following question.
When we succeed, I'm saying when,
when we succeed in storing a lot of data
in molecular form in DNA,
this does beg the question of, you know, given that the bandwidth between the molecular world and
the electronic world is likely to be limited, even with all the progress that we have ahead
that, you know, I'm sure we're going to get to, wouldn't it be nice if we could actually do a lot
of computation in molecular form directly? Because then you enable highly parallel efficient
and energy efficient computing,
and then also use a lot less bandwidth
between the molecular world and the electronic world.
So here's one example that we're excited about.
So you might have heard of DNA computing in the 80s.
There's a fantastic paper by Len Adleman in DNA1,
the first DNA conference, in 1994.
He talked about solving an exponential-time problem, a Hamiltonian path problem, with DNA.
And it worked in the following way.
So here you have a graph and you want to find a path that visits all of the nodes in the graph.
So what is the shortest path that would do that? Well, the way you do that in DNA, you encode
each node here as a DNA sequence, and then edges are an overhang sequence that actually connects
the two. So after I encode all nodes and all edges in a bunch of DNA molecules, I put them in a
solution, I shake them, right, or I let them settle. And then I look at the longest molecule
that has formed,
that shows us what the result was.
This is super cool.
And it was an incredible idea
that opened up a lot of possibilities
in thinking about molecular computing.
But the problem with this specific solution
is that it shifts the complexity
from time to amount of material, right?
So this is an exponential time problem.
Now, if you were to do this in space,
you're going to need an exponential amount of space.
So for you to solve any reasonable size problem here,
you're going to need, you know, a lot of DNA,
potentially even all of the atoms of the universe
in form of DNA, which, you know, is definitely not practical.
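You can see the scaling problem with a quick count; this is purely illustrative arithmetic.

import math

# A brute-force molecular search over all node orderings needs on the order
# of n! candidate molecules for n nodes.
for n in (7, 20, 30, 60):
    print(f"{n:3d} nodes -> about {math.factorial(n):.2e} candidate orderings")
# By around 60 nodes this is roughly 1e82, on the order of the number of atoms
# in the observable universe, so brute force in matter does not scale.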
But what we've been thinking about is this:
what would DNA computing look like in the age of big data?
What we want here is essentially to operate on data already stored in DNA. We want to target
polynomial time algorithms like search. And if we did that, we would potentially have an extremely
parallel and energy efficient way of manipulating information in a molecular form. The problem that
we decided to explore is content-based image and video search
for several reasons.
First, you know, this is very bandwidth intensive
and it's also a very key primitive in systems today.
So content-based image and video search exists today
and it's used in a variety of day-to-day systems.
You can do image-based similarity search
in Google or Bing.
And also, this similarity
search is a key primitive in machine learning systems that is part of a bigger flow.
Now, how would you do that in DNA? And the way this works is you give an image to a database,
and you retrieve images that are similar to your input image, in this case, our airplanes.
How do we do that in DNA? Well, we want to encode our database
in DNA and be able to search it. So how do we do that? Well, first, we need to start with an
observation that's fairly straightforward, especially in hindsight, that as you know,
DNA forms this double helix. And if you have a complete match, this binding between the two sides is very, very strong. But if you have a
reasonably good partial match, you still bind the two sides of a double helix, and you can have a
poor match here that still binds, but it's a little bit less stable. So you probably can guess
where I'm getting at here, but suppose that I had the following. Suppose that I had the database
here on the left, encoded as a bunch of single-stranded DNA molecules, and then you have a query,
something that I want to search for a match for, attached to magnetic beads. Okay. So now if I
actually mix this, if I have a perfect match, there's a much higher likelihood that I'm going
to retrieve the perfect match. Now, as I go with poorer and poorer matches, I decrease the probability
and the frequency with which those resulting molecules will come out of the magnetic extraction
process, right? So, okay, so how do we build on that to do search? Here's how we do it. We
came up with this idea of essentially encoding features, the same features used in computer vision,
where you extract visual features,
say in this case it's images,
this is not constrained only to images,
it could be video,
it could be other forms of data.
We're going to extract feature vectors
from those data items.
And then the question here is,
how do we encode those data items,
those feature vectors, into DNA form
such that you get the following property?
So features that are similar should have DNA sequences that are more likely to actually stick to each other.
Okay.
And the way we did that is via a learning process.
Okay.
So we, and I'm going to tell you more about it in a second,
but keep in mind that
the property that we want here is that similar features lead to DNA sequences that are more
likely to stick to each other.
Okay.
So proportionally to the similarity.
Now, how would we build on that?
Suppose that a database strand looks like
this: I have the feature vector encoded in part of my DNA strand, and then I have a tag, just like an ID of the image,
in the rest of the strand. Okay. So now I have a query, say that binocular image, that I encoded
in DNA using the same form, using the encoder that we talked about, and attached to a magnetic
bead. So now, if they are similar, you know, they're going to hybridize.
So the way we're going to do that as a database is to encode all of your images
into database strands and have a
large number of copies of your query attached to magnetic beads.
I'm going to mix it all in a solution, and when everything gets stable,
because it's the state of lowest energy,
the closest matches are the ones that are going to actually hybridize onto the magnetic beads.
So when I take that with the magnets, you know, take a picture, that's what it looks like.
Now, those are the magnets and magnetic nanoparticles.
And I take it out from the solution.
And what I get back are things that look like the query image.
So if you're into machine learning, you probably think that this is a form of semantic hashing
just like the way we encode, say, a feature in Euclidean space, given as a vector of
floating point numbers, into binary strings such that in Hamming space they
have similar properties, similar distances, because in this case it's cheaper to compute in
a binary representation than in floating point form. We did a similar thing here, but into DNA.
What we did is a translation from these feature vectors in Euclidean space into DNA sequences, such that similarity in Euclidean space leads to higher extraction and reaction yields in the DNA molecular space.
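As a toy stand-in for that idea, here is how a feature vector could be turned into a short DNA sequence so that nearby vectors share more bases. This uses plain random-hyperplane hashing, not the learned encoder from the paper, and the dimensions and lengths are arbitrary.

import numpy as np

# SimHash-style toy: similar feature vectors agree on more hyperplane signs,
# so their DNA sequences share more positions.
rng = np.random.default_rng(0)
planes = rng.normal(size=(40, 128))            # 40 sign bits -> 20 bases, 128-dim features
BASE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}

def feature_to_dna(v):
    bits = (planes @ v > 0).astype(int)
    return "".join(BASE[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2))

x = rng.normal(size=128)
near = x + 0.1 * rng.normal(size=128)          # a very similar feature vector
far = rng.normal(size=128)                     # an unrelated one
shared = lambda a, b: sum(p == q for p, q in zip(a, b))
print(shared(feature_to_dna(x), feature_to_dna(near)),   # many shared bases
      shared(feature_to_dna(x), feature_to_dna(far)))    # far fewer shared bases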
Okay.
So I won't spend too much time on this,
but you can read about it in our paper in Nature Communications 2021.
Basically the way this works is starting with a set of images.
We start with, you know, pairs.
These are images whose features are initially mapped randomly to DNA.
Then, as I keep getting these pairs of images, I run them through the encoder, I measure their distance, I estimate the yield, and then I compare that estimated yield with what the predictor said.
If they don't agree, I keep mutating the encoding process such that, for the training set, for all the similar images that are close enough, the estimated yield of this extraction, the
probability of sticking, is proportional to the similarity. You can read more about the structure of
the neural network that does that. The key thing I want to point out here, and that's the only reason I'm
showing the structure, is that we definitely need a fully connected layer, because we're
going to reduce a space that's much larger than what we're going to put into DNA into a relatively small set of bases. So we need to have a fully connected
layer to actually spread out and make use of the encoding space as well as possible.
And if you're interested in the results, you should read our paper. Basically, what we did, we encoded 1.6 million images in a database and synthesized with Twist services.
Thank you, Twist.
They were collaborators with us on this project.
This was funded by the DARPA Molecular Informatics Program.
So thank you, Anne Fisher.
And this was a result that got us really excited because when we actually ran the process and took the images out and ran it through a sequencer,
we got what we were expecting, which is that molecules corresponding to images with low Euclidean distance, that is, very similar images,
are much more frequent than the ones with high Euclidean distance.
This was very encouraging, and we refined this; there are a lot more results in the paper.
Okay, so now why am I excited about this as a computer architect?
The reason is because I know there's a bunch of storage folks
in this conference, so I'm preaching to the choir here, but
thinking about a storage device as a 3D shape, right?
Typically the capacity is a function of the volume of this physical object,
and the bandwidth is a function of the surface area.
So that's why, as you have a larger and larger storage device, you tend to have a much higher time to read it all.
Right. So the time for reading a whole hard drive is going up really fast. Today it's on
the order of days, when it used to be on the order of hours not too long ago. That's because
the capacity grows with the cube and the bandwidth grows with the surface area. So capacity
grows as L cubed and bandwidth grows as L squared. Now, as a computer architect, you know, I'm a big fan of
near data processing. You probably heard of that before. The key idea there is to, you know, break
down storage, which in this case happens to be solid state memory, into smaller units and pair them with processing elements. And the reason is because you have
much higher bandwidth; they're smaller, they're closer, and they have lower volumes. You can actually have a much lower time to read it all,
right? You change the capacity-to-bandwidth ratio. That's a trend in computer architecture
to access data at high bandwidth and low latency. And the way I think about what we are
doing here, with this way of embedding computation in molecular form and diffusing it through the data,
is really that, essentially, one way of calling this is diffusive computing. Right? You can
store the data in a bunch of DNA molecules, and then you encode processing elements into molecular
form, and then you let them diffuse through the data. And the reason this is cool is that, since you're
not attached to any specific 3D physical organization of your data,
you're not bound to a predefined surface-to-volume ratio.
So you can choose and dial up your bandwidth depending on the needs
of your application. Now, this is a long argument, but I can't resist making it, because I
do think this is a fundamental advantage that molecular storage would have over traditional data storage, be it in optical form or in electronic form:
the fact that you can diffuse and rearrange it physically leads to really interesting systems properties that are likely to be useful.
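The underlying arithmetic is simple; here it is sketched with made-up unit constants just to show the shape of the argument.

# Capacity scales with volume (~L^3) and bandwidth with surface area (~L^2),
# so the time to read everything grows ~L for one monolithic device. Splitting
# the same capacity into many unit-sized near-data devices keeps it flat.
for L in (1, 2, 4, 8):
    capacity = L ** 3                # arbitrary units of data
    bandwidth = L ** 2               # arbitrary units of data per second
    monolithic_time = capacity / bandwidth
    n_units = L ** 3                 # same capacity split into unit-sized devices
    partitioned_time = (capacity / n_units) / 1.0
    print(f"L={L}: monolithic {monolithic_time:.0f}, partitioned {partitioned_time:.0f}")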
Okay, so now that was a form of near-molecule processing.
And I want to transition to a different point here,
which is how are we going to package all of this
into a system we can deploy, say, in a data center, right?
So one way to think about this, neatly organized,
is having stations, a write station, a storage station,
a read station, at rack scale.
And that's the vision. We'll be able to encapsulate this in a way that you can actually put it in a data center.
But the reality is that today we're doing a lot of this still like this.
You know, this is an old photo by now, but this is one of our grad students, Lee Organick,
who did a bunch of the work that we talked about today. And this is me pretending to be useful. It's all very manual. So there's a very big difference between this and where we want to be in a fully automated system.
So we've been thinking a lot about what it would take to automate the whole process from digital data:
encoding into molecules, storing them, retrieving them, and running them through a sequencer.
So we decided to build a system where, you know, the point was just to show automation, not to be fast. This was not fast at all. But we built a synthesizer.
You know, you see the phosphoramidites, the As, Cs, Gs, and Ts. And, you know, there's a bunch
of valves here that control the synthesis process. We store it away, and then we can retrieve it,
prep it for sequencing, and run it through a nanopore sequencer here.
So it was great to see that it's possible to show that,
and it opened our eyes to what's needed to be fully automated.
But that's still a long ways from this.
This is a tape library that's probably several years old already.
That's still a long ways away.
So the way we are thinking about bridging that gap
is extending the system
with a much more flexible way of manipulating samples.
And our bet is that digital microfluidics is one of those
because you can rearrange and configure
and program it in a flexible way.
And this technology has been around for some time.
We've been building an open source one called PurpleDrop.
And the way this works, for those of you who don't know,
is the following.
So digital microfluidics has an array of electrodes and a conductive top plate here.
And as you apply a voltage,
the droplets move to where the voltage is.
Okay?
Here's some examples, some videos.
There's a green droplet moving on our electrode board.
And in order to control and see where things are,
we've been applying computer vision to watch those droplets
or using capacitive sensing to monitor that
just because it still needs closed-loop control.
The reason I'm showing you this is that that's what we've
used to prototype what would be the equivalent
of a library on a
card of DNA.
The idea here is to store
DNA into pools
that are dehydrated on the surface, and then
use droplets to retrieve
those spots of
DNA. So in this top plate that I was talking about, you have spots of DNA that are dehydrated.
And when you want to retrieve it, a droplet visits there, stays there for a little bit,
soaks up some DNA, and then we move it and steer it to the DNA sequencer.
And we demonstrated that this works and it can work fairly well.
There was a Nature Communications paper a couple of years ago that, you know, detailed how we did that and looked at contamination issues.
We looked at what the chances are of droplets leaving molecules behind in a path and then later contaminating other droplets that cross the same path.
And we didn't see any very significant contamination. This suggests a solution of using microfluidics as a liquid robot
to go pick data from a location and move it to another,
kind of similar to a robotic arm in a tape library.
So we've been very excited about digital microfluidics in general.
And we've been, you know, as part of this,
we're building an affordable, full-stack,
software and hardware digital microfluidics platform
where you can express your protocols in Python
and compile it down to the assembly code
of controlling the actuation of electrodes on the boards
to move the droplets where they need to go and so on.
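To give a flavor of what a Python-level protocol could look like, here is a hypothetical sketch. The Board class and its methods are placeholders for illustration, not PurpleDrop's actual API.

class Board:
    # Minimal stub standing in for the real device driver.
    def dispense(self, volume_ul):
        return {"volume_ul": volume_ul, "at": "reservoir"}
    def move(self, droplet, to):
        droplet["at"] = to
    def wait(self, seconds):
        pass

def retrieve_spot(board, spot, sequencer_port, dwell_s=30):
    droplet = board.dispense(volume_ul=2.0)   # create a buffer droplet
    board.move(droplet, to=spot)              # steer it over the dried DNA spot
    board.wait(dwell_s)                       # dwell so it rehydrates and soaks up DNA
    board.move(droplet, to=sequencer_port)    # deliver it to the sequencer inlet
    return droplet

droplet = retrieve_spot(Board(), spot=(3, 7), sequencer_port=(0, 0))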
So this is far from being a professional, large-scale solution,
but we feel like these are ways of actually de-risking and finding what are potential ways forward of building data center scale fluidic manipulation
systems. The price has to be lower.
The reliability can be lower too, as long as there's enough redundancy.
And we've been thinking about this,
not just in the context of data storage and molecular computing,
but also how do we scale this up to build truly,
truly large scale automation for sample processing or forms of molecular
computing, of molecular manipulation.
So zooming out, you know, the way we like thinking about, you know,
DNA data storage and coupling it with a form of molecular computing,
is really having hardware, software, and wetware.
And we have interfaces that are highly flexible,
like for example, digital microfluidics to interface electronic domain
with the molecular domain.
And we see this as a part of what could be a future
hybrid system. It's pretty clear by now that there's specialization in computer systems. We
have CPUs, GPUs, accelerators done with electronics because it's ultra low latency. It's highly
engineerable and allows you to control it perfectly. But we know that there's a lot of promise
in other forms of computing using, for example,
biomolecules that have self-assembly,
massive data storage density,
and potentially very energy-efficient
computing mechanisms, with energy costs orders of magnitude
lower than what electronics can achieve.
And then, of course, there are quantum systems,
which are massively parallel,
computing for problems like optimization,
but which typically have much lower I/O bandwidth in and out of the quantum system.
And the reason I mention all of this is that I think there's an interesting world here where you pick the best type of device technology for the best type of algorithms, right?
And mix them. I don't think electronics and CMOS are ever going to go away.
But, you know, we have to think about hybrid systems that
get the most out of each.
With that, I'll say thank you.
We look forward to answering any
questions you might
have in the online forums now.
Thank you again for having us.
It's really great to see this field
thriving, and we can't wait to
see where it's going to be in the near and medium-term future.
So if you have more questions, you can also explore our website.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.