Microsoft Research Podcast - Ideas: The journey to DNA data storage
Episode Date: November 19, 2024
Research manager Karin Strauss and members of the DNA Data Storage Project reflect on the path to developing a synthetic DNA–based system for archival data storage, including the recent open-source release of its most powerful algorithm for DNA error correction.
Get the Trellis BMA code: GitHub - microsoft/TrellisBMA: Trellis BMA: coded trace reconstruction on IDS channels for DNA storage
Transcript
This really starts from the fundamental data production, data storage gap, where we produce
way more data nowadays than we could ever have imagined years ago.
And it's more than we can practically store in magnetic media.
And so we really need a denser medium on the other side to contain that.
DNA is extremely dense.
It holds far, far more information per unit volume, per unit mass than any storage media that we have available today.
This, along with the fact that DNA is itself a relatively rugged molecule, it lives in our body, it lives outside our body for thousands and thousands of years if we leave it alone to do its thing, makes it a very attractive medium. It's such a futuristic technology, right?
When you begin to work on the tech,
you realize how many disciplines and domains
you actually have to reach in and leverage.
It's really interesting, this multidisciplinarity,
because we're, in a way, bridging software with wetware with hardware. And so you
kind of need all the different disciplines to actually get you to where you need to go.
We all work for Microsoft. We're all Microsoft researchers. Microsoft isn't a startup,
but that team, the team that drove the DNA data storage project,
it did feel like a startup and it was something unusual and exciting for me.
You're listening to Ideas, a Microsoft Research podcast that dives deep into the world of technology research and the profound questions behind the code.
In this series, we'll explore the technologies that are shaping our future and the big ideas that propel them forward.
I'm your guest host, Karin Strauss, a Senior Principal Research Manager at Microsoft.
For nearly a decade, my colleagues and I, along with a fantastic and talented group of collaborators from academia and industry,
have been working together to help close the data creation, data storage gap.
We're producing far more digital information than we can possibly store.
One solution we've explored uses synthetic DNA as a medium,
and over the years we've contributed to steady and promising progress in the area.
We've helped push the boundaries of how much DNA a writer can simultaneously store, shown
that full automation is possible, and helped create an ecosystem for the commercial success
of DNA data storage.
And just this week, we've made one of our most advanced tools for encoding and decoding
data in DNA open source.
Joining me today to discuss the state of DNA data storage and some of our contributions
are several members of the DNA Data Storage Project at Microsoft Research.
Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith,
and Partner Research Manager Sergey Yekhanin.
Bichlien, Jake, and Sergey, welcome to the podcast.
Thanks for having us, Karin.
Thank you so much.
Thank you.
So before getting into the details of DNA data storage and our work, I'd like to talk
about the big idea behind the work and how we got here.
I've often described the DNA data storage project as turning science fiction into reality. When we started the project
in 2015, though, the idea of using DNA for archival storage was already out there and had been for
over five decades. Still, when I talked about the work in the area, people were pretty skeptical in
the beginning, and I heard things like, wow, why are you thinking about that? It's so far off. So first, please share a bit of
your research backgrounds and then how you came to work on this project. Where did you first
encounter this idea? What do you remember about your initial impressions or the impressions of
others? And what made you want to get involved? Sergey, why don't you start? Thanks a lot. So I'm
a coding theorist by training, so my core areas of research have been error-correcting
codes and also computational complexity theory.
And so I joined the project probably like within half a year of the time that it was
born, and thanks, Karin, for inviting me to join.
So that was roughly the time when I moved from a different lab, from the Silicon Valley
lab in California to the Redmond lab. And actually, it just so happened that at that moment, I was thinking
about what to do next. In California, I was mostly working on coding for distributed storage.
And when I joined here, that effort kept going, but I had some free cycles. And that was the
moment when Karin came to my office and told me about the project. So indeed, initially it did feel a lot like science fiction because, I mean,
we are used to coding for, uh, for digital storage media, like for magnetic
storage media and, uh, here, like this is biology and like, why, why exactly
these kinds of molecules? There are so many different molecules, like, why that?
But honestly, like, I didn't try to pretend to be a biologist and make
conclusions about whether this is the right medium or the wrong medium. So I tried to look at these kinds of questions from a technical standpoint,
and there was a lot of kind of deep, interesting coding questions. And that was the main attraction
for me. At the same time, I wasn't convinced that we would get as far as we actually got,
and I wasn't immediately convinced about the future of the field. But just the depth and the richness of the
actual technical problems, that's what made it appealing for me, and I kind of enthusiastically
joined. And also, I guess, the culture of the team. So it did feel like a startup. We all work
for Microsoft. We're all Microsoft researchers. Microsoft isn't a startup. But that team,
the team that drove the DNA Data Storage project, it did feel like a startup,
and it was something unusual and exciting for me. Oh, I love that, Sergey. So my background is in organic chemistry.
And Karin had reached out to me. And I interviewed not knowing what Karin wanted,
actually. So I took the job kind of blind because I was like, hmm, Microsoft, research, DNA,
biotech. I was very, very curious. And then when she told me that
this project was about DNA data storage, I was like, this is a crazy, crazy idea.
I definitely was not sold on it, but I was like, well, look, I get to meet and work with so many
interesting people from different backgrounds that one, even if it doesn't work out, I'm going to learn something. And two, I think it could work. Like it could
work. And so I think that's really what motivated me to join. The first thing that you think when
you hear about, we're going to take what is our hard drive and we're going to turn that into DNA
is that this is nuts. But you know, it didn't take very long after that. I come from
a chemistry biotech type background where I've been working on designing drugs, and there DNA is
this thing off in the nethers. You look at it every now and then to see what information it
can tell you about what maybe your drug might be hitting on the target side. And it's that connection that the DNA contains the information in the living systems.
The DNA contains the information in our assays.
Why could the DNA not contain the information that we think more about every day,
that information that lives in our computers?
That's an extremely cool idea.
Through our work, we've had years to wrap our heads around DNA data storage.
But Jake, could you tell us a little bit about how DNA data storage works and why we're interested in looking into the technology?
So you mentioned it earlier, Karin, that this really starts from the fundamental data production, data storage gap, where we produce way more data nowadays than we could
ever have imagined years ago. And it's more than we can practically store in magnetic media.
This is a problem because we have data. We have recognized the value of data with the rise of
large language models and these other big generative models. The data that we do produce, our video,
has gone from substantially small, down at 480 resolution, all the way up to things at 8K resolution that now take orders of magnitude more storage. And so we really need a denser medium on
the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today.
This, along with the fact that DNA is itself a relatively rugged molecule, it lives in our body, it lives outside our body for thousands and thousands of years
if we leave it alone to do its thing, makes it a very attractive medium, particularly compared to
traditional magnetic media, which has lower density and a much shorter lifetime, on the scale of
decades at most. So how does DNA data storage actually work? Well, at a very high level,
we start out in the digital domain where we have our information represented as ones and zeros,
and we need to convert that into a series of A's, C's,
T's, and G's that we could then actually produce. And this is really the domain of Sergey. He'll
tell us much more about how this works later on. For now, let's just assume we've done this and
now our information lives in the DNA-based domain. It's still in the digital world. It's just
represented as A's, C's, T's, and G's,
and we now need to make this physical so that we can store it. This is accomplished through
large-scale DNA synthesis. Once the DNA has been synthesized with the sequences that we specified,
we need to store it. There's a lot of ways we can think about storing it. Bichlien's done
great work looking at DNA encapsulation, as well as other more raw, just DNA on glass type techniques.
And we've done some work looking at the susceptibility of DNA stored in this unencapsulated form to things like atmospheric humidity, to temperature changes, and most excitingly to things like neutron radiation.
So we've stored our data in this physical form.
We've archived it.
And coming back to it, likely many years in the future,
because the properties of DNA match up very well with archival storage,
we need to convert it back into the digital domain.
And this is done through a technique called DNA sequencing.
What this does is it puts the molecules through some sort of machine. And on the other side of the machine,
we get out a noisy representation of what the actual sequence of bases in the molecules were.
Now, we have one final step. We need to take this series of noisy sequences and convert it back into ones and zeros.
Once we do this, we return to our original data and we've completed, let's call it, one DNA data storage cycle.
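To make the digital-to-DNA conversion step concrete, here is a toy sketch (my own illustration, not the project's actual encoder): a direct two-bits-per-base mapping between binary strings and the bases A, C, G, and T. Real schemes layer error-correcting redundancy on top and avoid sequences that are hard to synthesize or read, such as long homopolymer runs.

```python
# Toy illustration only: 2 bits per base, no redundancy, no sequence
# constraints. Real DNA data storage encoders are far more elaborate.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(bits: str) -> str:
    """Map an even-length binary string to a DNA sequence, 2 bits per base."""
    assert len(bits) % 2 == 0, "pad to an even number of bits first"
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> str:
    """Invert encode(): map a DNA sequence back to its binary string."""
    return "".join(BASE_TO_BITS[base] for base in strand)

print(encode("01101100"))          # CGTA
print(decode(encode("01101100")))  # 01101100
```

The round trip only works on a clean strand; the rest of the conversation is about what happens when synthesis, storage, and sequencing corrupt it.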
We'll get into this in more detail later, but maybe, Sergey, we can dig a little bit into the encoding and decoding side of things, and how DNA is different as a medium
from other types of media? Sure. So, like, I mean, coding is an important aspect of this whole idea
of DNA data storage, because we have to deal with errors. It's a new medium. But talking about
error-correcting codes in the context of DNA data storage. So, I mean, usually, like, what are error-correcting
codes about? Like, on a very high level, right?
I mean, you have some data.
Think of it, I don't know, as a binary string.
You want to store it, but there are errors.
So, and usually, like, in most kind of forms of media,
the errors are bit flips.
Like, you store a zero, you get a one.
And when you store a one, you get a zero.
So these are called substitution errors.
The field of error-correcting codes,
it started, like, in the 1950s,
so it's 70 years old at least.
So we understand how to deal with this kind of error reasonably well.
So that's substitution errors. Now, in DNA data storage,
the way you store your data is that given some large amount of digital data,
you have the freedom of choosing which short DNA molecules to generate.
So in a DNA molecule,
it's a sequence of these bases, A, G, C, and T. You get the freedom to decide which of the short
molecules you need to generate. And then these molecules get stored. And then during the storage,
some of them are lost. Some of them can be damaged. There can be insertions and deletions
of bases in every molecule, which we call strands. So you need redundancy. And
there are two forms of redundancy. There is redundancy that goes across strands,
and there is redundancy on the strand. And so, yeah, so kind of from the error correcting side
of things, like we get to decide what kind of redundancy we want to introduce across strands
and on the strand. And then, like, we want to make sure that our encoding and decoding algorithms
are efficient. So that's the coding theory angle on the field. Yeah. And then from there, once you
have that data encoded into DNA, the question is, how do you make that data on a scale that's
compatible with digital data storage? And so that's where a lot of the work came in for really
automating the synthesis process and also the reading process as well.
So synthesis is what we consider the writing process of DNA data storage.
And so we came up with some unique ideas there.
We made a chip that enabled us to get to the densities that we needed.
And then on the reading side, we used different sequencing technologies. And it was
great to see that we could actually just kind of pull sequencing technologies off the shelf
because people are so interested in reading biological DNA. So we explored the Illumina
technologies and also Oxford Nanopore, which is a new technology coming on the horizon. And then preservation, too, because we have to make sure that the data that's stored in the DNA doesn't get damaged
and that we can recover it using the error-correcting codes.
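To make one flavor of recovery concrete, here is a toy sketch of the across-strand redundancy Sergey described (my own minimal example, not the project's actual codes): a single XOR parity strand lets you rebuild any one lost data strand.

```python
# Toy across-strand redundancy: one XOR parity strand recovers any
# single lost strand. Real schemes use much stronger codes and also
# handle errors within each strand.
from functools import reduce

def xor_strands(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(strands: list) -> list:
    """Append a parity strand that is the XOR of all data strands."""
    return strands + [reduce(xor_strands, strands)]

def recover_lost(coded: list, lost_index: int) -> bytes:
    """XOR all surviving strands (including parity) to rebuild the lost one."""
    survivors = [s for i, s in enumerate(coded) if i != lost_index]
    return reduce(xor_strands, survivors)

data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]
coded = add_parity(data)
print(recover_lost(coded, 1) == b"\xab\xcd")  # True
```

This handles a dropped strand; redundancy on each strand (discussed later) is what handles insertions, deletions, and substitutions within a strand.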
Yeah, absolutely.
And it's clear that, and it's also been our experience,
that DNA data storage and projects like this require more than just a team of computer scientists.
Bichlien, you had the opportunity to collaborate with many people in all different disciplines.
So do you want to talk a little bit about that?
What kinds of expertise, you know, what other disciplines are relevant to bringing DNA data storage to reality?
Yeah, well, it's such a futuristic technology, right? When you begin to work on
the tech, you realize how many disciplines and domains you actually have to reach in and
leverage. One concrete example is that in order to fabricate an electronic chip to synthesize DNA,
we really had to pull in a lot of material science research
because there's different capabilities that are needed when trying to use liquid on a chip.
We, you know, have to think about DNA data storage itself, and that's a very different beast than,
you know, the traditional storage mediums.
And so we worked with teams who literally create, you know, these little tiny micro or nano capsules in glass and being able to store that there.
It's really interesting, this multidisciplinarity, because we're in a way bridging software with wetware with hardware.
And so you kind of need all the different disciplines to actually get you to where you need to go.
Yeah, absolutely.
And, you know, building on, you know, collaborators, I think one area that was super interesting as well and was pretty early on in the project was building that first end-to-end system
that we collaborated with the University of Washington,
the molecular information systems lab there, to build.
And really, at that point, there had
been work suggesting that DNA data storage was viable,
but nobody had really shown an end-to-end system
from beginning to end.
And in fact, my manager at the time, Doug Carmean, used to call it the bubble gum and shoestring system.
But it was a crucial first step because it shows it was possible to really fully automate the process.
And there were several interesting challenges in that system,
but we noticed that one particularly challenging one was synthesis.
That first system that we built was capable of storing the word hello,
and that was all we could store, so it wasn't a very high-capacity system,
but in order to store much larger volumes of data instead of a simple word, we really needed much more advanced synthesis systems, and this is what both Bichlien and Jake ended up working on. So
do you want to talk a little bit about that and the importance of that particular work?
Absolutely. As you said, Karin,
the amount of DNA that is required to store the massive amount of data we spoke about earlier is far beyond the amount of DNA that's needed for any, air quotes, traditional applications of synthetic DNA, whether it's your gene construction or it's your primer synthesis or such. And so we really had to rethink how you make DNA at scale and think about
how could this actually scale to meet the demand. And so Bichlien started out looking at a thing called
a microelectrode array, where you have this big checkerboard of small individual reaction sites. Each reaction site, we used electrochemistry in order to control base by base,
A, C, T, or G by A, C, T, or G, the sequence that was growing at that particular reaction site.
We got this down to the nanoscale. And so what this means practically is that on one of these
chips, we could synthesize at any given time on the order of hundreds of
millions of individual strands. So once we had the synthesis working with traditional chemistry,
where you're doing chemical synthesis, each base is added in using a mixture of chemicals that are
added to the individual spots that are activated. But each coupling happens due to some energy you
pre-stored in the synthesis of your reagents. And this makes the synthesis of those reagents
costly and themselves a bottleneck. And so taking a look forward at what else was happening in the
synthetic biology world, the next big word in DNA synthesis was and still is enzymatic synthesis, where rather than having
to spend a lot of energy to chemically pre-activate reagents that will go in to make your actual DNA
strands, we capitalize on nature's synthetic robots, enzymes, to start with less activated,
less expensive to get to, cheaply produced through
natural processes, substrates. And we use the enzymes themselves, toggling their activity
over each of the individual chips or each of the individual spots on our checkerboard
to construct DNA strands. And so we got a little bit into this project. We successfully showed that we could put down selectively one base at a given time.
We hope that others will kind of take up the work that we've put out there,
particularly our wonderful collaborators at Ansa Biotechnologies, who helped us design the enzymatic system.
And one day we will see a truly parallelized enzymatic DNA system in this fashion that can achieve the scales necessary.
It's interesting to note that even though it's DNA and we're still storing data in these DNA strands, chemical synthesis and enzymatic synthesis provide different errors that you see in the actual files, right, in the DNA files.
And so I know that we talked to Sergey about how do we deal with these new types of errors
and also the new capabilities that you can have, for example, if you don't control base
by base the DNA synthesis.
This whole field of DNA data storage, like the technologies on the biologist side
are advancing rapidly, right?
So there are different approaches to synthesis,
there are different approaches to sequencing,
and presumably the way the storage is actually done
is also progressing, right?
And we had works on that.
So there is some very general kind of high-level error profile
that you can say that these are the type of errors
that you encounter in DNA data storage.
Like, DNA molecules are just sequences of these bases, A, G, C, T, of maybe a length like 200 or
so, and you store a very, very large number of them. The errors that you see is that some of
these strands kind of will disappear. Some of these strands can be torn apart, like in, let's
say, in two pieces, maybe even more. And then on every strand, you also encounter these errors,
insertions, deletions, substitutions
with different rates,
like the likelihood of all kinds of these errors
may differ very significantly across different
technologies that you use on the biology side.
And also there can be error bursts
somehow. Maybe you can get an insertion of,
I don't know, 10 A's, like, in a row.
Or you can lose, like, I don't know,
10 bases in a row. So even if you
don't quantify, like, what the likelihoods of all these bad events happening are,
I think that this still kind of fits at least the majority of approaches to DNA data storage.
Maybe not exactly all of them, but it fits the majority.
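The per-strand error profile Sergey describes, insertions, deletions, and substitutions at technology-dependent rates, can be sketched as a toy channel simulator. The rates below are made up for illustration, not drawn from any measured technology:

```python
# Toy insertion/deletion/substitution (IDS) channel. Each base may be
# deleted or substituted, and random bases may be inserted before it.
# Error probabilities here are arbitrary illustrative values.
import random

BASES = "ACGT"

def ids_channel(strand: str, p_sub: float, p_del: float, p_ins: float,
                rng: random.Random) -> str:
    out = []
    for base in strand:
        while rng.random() < p_ins:      # zero or more insertions before base
            out.append(rng.choice(BASES))
        if rng.random() < p_del:         # base is deleted
            continue
        if rng.random() < p_sub:         # base is substituted
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
noisy = ids_channel("ACGTACGTACGT", p_sub=0.05, p_del=0.05, p_ins=0.05, rng=rng)
print(noisy)  # a corrupted read of the original strand
```

Running many strands through such a channel is one way to stress-test a decoder against error rates it was not tuned for, which is the flexibility Sergey mentions next.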
So when we design coding schemes, we are trying also to look ahead
in the sense that we don't know how this error profile would look in five years.
So the technologies that we develop on the error correction side, we try to keep them
very flexible.
So whether it's enzymatic synthesis, whether it's nanopore technology, whether it's
Illumina technology that is being used, the error correction algorithms would be able
to adapt and would still be useful.
But, I mean, this also makes the coding aspect harder, because you want to keep all this flexibility in mind.
So Sergey, we are at an interesting moment
now because you're open sourcing the Trellis BMA piece of code
that you published a few years ago.
Can you talk a little bit about that specific problem
of trace reconstruction and then the paper specifically
and how it solves it? Absolutely. Yeah, so this Trellis BMA paper, for which we are releasing the source code right now,
this is the latest in our sequence of publications on error correction for DNA data storage. And I
should say that we already discussed that the project is kind of very interdisciplinary. So
we have experts from all kinds of fields. But really, even within this
coding theory, like within computer science slash information theory, coding theory,
in our algorithms, we use ideas from very different branches. I mean, there are some
core ideas from core algorithm space, and I won't go into this, but let me just focus kind of on two
aspects. So when we just faced this problem of coding for DNA data storage, and we were thinking about, okay, so how to exactly design the coding scheme,
what are the algorithms that we'll be using for error correction?
So, I mean, we were obviously studying the literature,
and we came upon this problem called trace reconstruction.
So that was pretty popular in, I mean, somewhat popular, I would say,
in computer science and in statistics.
It didn't have much practical motivation, but, like,
very strong mathematicians have been looking at it. And the problem is as follows. So, like,
there is a long binary string picked at random, and then it's transmitted over the deletion channel.
So some bits, some zeros, and some ones at certain coordinates get deleted, and you get to see kind
of the shortened version of the string. But you get to see it multiple times. And the question is,
like, how many times do you need to see it so that you can get a reasonably accurate estimate of the original
string that was transmitted? So that was called trace reconstruction. And we took a lot of
motivation, we took a lot of inspiration from the problem, I would say, because really in DNA data
storage, if we think about a single strand, like a single strand which is being stored, after we
read it, we usually get multiple reads of
this string. And while the errors there are not just deletions, there are insertions, substitutions,
and there can be bursts of errors, but still we could rely on this literature in computer science
that already had some ideas. So there was an algorithm called BMA, a bitwise majority alignment.
We extended it, we adapted it kind of for the needs of DNA data storage, and it became kind of one of the tools in our toolbox for error correction.
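As a rough sketch of the bitwise majority alignment idea for a deletion-only channel (my own simplification, not the exact algorithm from the literature or the paper): keep a pointer into each noisy read, majority-vote the symbols currently under the pointers, and advance only the pointers of reads that agree with the vote, on the assumption that a disagreeing read lost that symbol to a deletion.

```python
# Simplified majority-alignment reconstruction for a deletion-only
# channel. Real BMA variants also handle insertions, substitutions,
# and ties more carefully.
from collections import Counter

def bma(traces, target_len):
    ptrs = [0] * len(traces)
    out = []
    for _ in range(target_len):
        current = [t[p] for t, p in zip(traces, ptrs) if p < len(t)]
        if not current:
            break  # all reads exhausted
        vote = Counter(current).most_common(1)[0][0]
        out.append(vote)
        for i, t in enumerate(traces):
            if ptrs[i] < len(t) and t[ptrs[i]] == vote:
                ptrs[i] += 1  # this read agrees: consume its symbol
            # else: assume a deletion in this read; leave its pointer
    return "".join(out)

# Three reads of "ACGTACGT", each with one base deleted:
print(bma(["CGTACGT", "ACGTCGT", "ACGACGT"], 8))  # ACGTACGT
```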
So we also started to use ideas from the literature on electrical engineering,
what are called convolutional error correcting codes, and a certain kind of class of algorithms
for decoding errors in these convolutional error correcting codes called, I mean,
Trellis is the main data structure, Trellis-based algorithms for decoding convolutional codes
like the Viterbi algorithm or the BCJR algorithm.
Convolutional codes allow you to introduce redundancy on the string.
So with algorithms kind of similar to BMA,
they were good for doing error correction
when there is no redundancy on the strand itself.
When there is redundancy on the strand,
we could do some things, but really it was very limited.
With Trellis-based approaches, again inspired by the literature in electrical
engineering, we had an approach to introduce redundancy on the strand, so that allowed
us to have more powerful error correction algorithms. And then in the end we have this
algorithm which we call Trellis BMA, which kind of combines the ideas from both fields.
So it's based on Trellis, but it's also more efficient than standard Trellis-based algorithms
because it uses the ideas from BMA from computer science literature.
So this is kind of the mix of these two approaches.
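For readers unfamiliar with trellis decoding, here is a textbook-style sketch of a hard-decision Viterbi decoder for a classic rate-1/2 convolutional code (the standard generators 7 and 5 in octal). This is my own illustrative example of what "trellis-based decoding of convolutional codes" means; it is unrelated to the released TrellisBMA code, whose trellises also model insertions and deletions.

```python
# Minimal rate-1/2 convolutional encoder (generators 7,5 octal) and a
# hard-decision Viterbi decoder over its 4-state trellis.

def conv_encode(bits):
    s1 = s2 = 0  # two memory bits
    out = []
    for b in bits:
        out += [b ^ s1 ^ s2, b ^ s2]  # generator taps 111 and 101
        s1, s2 = b, s1
    return out

def viterbi_decode(received, n_bits):
    INF = float("inf")
    metrics = {(0, 0): 0, (0, 1): INF, (1, 0): INF, (1, 1): INF}
    paths = {s: [] for s in metrics}
    for t in range(n_bits):
        r = received[2 * t:2 * t + 2]
        new_metrics = {s: INF for s in metrics}
        new_paths = {}
        for (s1, s2), m in metrics.items():
            if m == INF:
                continue
            for b in (0, 1):
                expect = [b ^ s1 ^ s2, b ^ s2]
                cost = (expect[0] != r[0]) + (expect[1] != r[1])
                ns = (b, s1)  # next state after shifting in b
                if m + cost < new_metrics[ns]:
                    new_metrics[ns] = m + cost
                    new_paths[ns] = paths[(s1, s2)] + [b]
        metrics, paths = new_metrics, new_paths
    best = min(metrics, key=metrics.get)  # survivor with lowest Hamming cost
    return paths[best]

msg = [1, 0, 1, 1, 0, 0, 1]
coded = conv_encode(msg)
coded[3] ^= 1  # flip one bit to simulate a substitution error
print(viterbi_decode(coded, len(msg)) == msg)  # True
```

The trellis here has four states (the encoder memory); the decoder keeps one survivor path per state, which is the dynamic-programming idea that TrellisBMA generalizes to DNA channels.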
And yeah, that's the paper that we wrote about three years ago, and now we are open sourcing
it.
So it is the most powerful algorithm for DNA error correction that we developed in the
group.
We're really happy that now we are making it publicly available so that anybody can experiment with the source code,
because again, the field has expanded a lot. And now there are multiple groups around the globe
that work just specifically on error correction, apart from all other aspects. So yeah, so we are
very happy that it's becoming publicly available to hopefully further advance the field.
Yeah, absolutely. And I'm always amazed by, you know, how it is really
about building on other people's work. Jake and Bichlien, you recently published a paper in Nature
Communications. Can you tell us a little bit about what it was, what you exposed the DNA to,
and what it was specifically about? Yeah, so that paper was on the effects of neutron radiation
on DNA data storage. So, you know, when we started the DNA data storage project, it was
really a comparison, right, between the different storage medias that exist today. And one of the
issues that have come up through the years of development of those technologies was, you know, hard errors and soft errors that were induced by radiation. So we wanted to know,
does that maybe happen in DNA? We know that DNA, in humans at least, is affected by radiation from
cosmic rays. And so that was really the motivation for this type of experiment. So what
we did was we essentially took our DNA files and dried them and threw them in a neutron accelerator,
which was fantastic. It was so exciting. That's kind of the merge of, you know, sci-fi with sci-fi at the same time. It was fantastic. And we
irradiated it with the equivalent of over 80 million years of radiation, because
it's a lot of radiation. It's a lot of radiation, and it's accelerated radiation exposure. Yeah, I
would say it's accelerated aging with radiation.
It's an insane amount of radiation. And it was surprising that even though we irradiated our DNA files with that much radiation, there wasn't that much damage. And that's surprising
because we know that humans, if we were to be irradiated like that, it would be disastrous. But in DNA, our files were able to be recovered with zero bit errors.
And why that difference?
Well, we think there's a few reasons.
One is that when you look at the interaction between a neutron and the actual elemental
composition of DNA, which is basically carbons, oxygens, and hydrogens,
maybe a phosphorus, the neutrons don't interact with the DNA much.
And if it did interact, we would have, for example, a strand break, which based on the
error correcting codes, we can recover from.
So essentially, one, there's not much interaction between neutrons and DNA. And second, we have error-correcting codes that would prevent any data loss.
There are also other conditions that are needed for a technology to be brought to the market.
And one thing I've worked on is to, you know, create the DNA Data Storage Alliance.
This is something Microsoft co-founded with Illumina, Twist Bioscience, and Western Digital.
And the goal there was to essentially provide the right conditions for the technology to
thrive commercially.
We did bring together multiple universities and companies that were interested in the technology.
And one thing that we've seen with storage technologies that's been pretty important is standardization
and making sure that the technology is interoperable. And, you know, we've seen stalemate situations like Blu-ray and high-definition DVD,
where, you know, really the industry couldn't decide on a standard, and
it took a while for the technology to be picked up.
And the intent of the DNA Data Storage Alliance is just to provide an ecosystem of companies, universities, and groups
interested in making sure that this time it's an interoperable technology from the get-go and
that increases the chances of commercial adoption. As a group, we often talk about how amazing it is
to work for a company that empowers us to do this kind of research. And for me, one of Microsoft Research's unique strengths, particularly in this project,
is the opportunity to work with such a diverse set of collaborators on such a multidisciplinary
project like we have.
How do you all think where you've done this work has impacted how you've gone about it
and the contributions you've been able to make?
I'm going to start with, if we look around this table and we see who's sitting at it,
which is two chemists, a computer architect, and a coding theorist, and we come together
and we're like, what can we make that would be super, super impactful?
I think that's the answer right there is that being at Microsoft and
being in a culture that really fosters this type of interdisciplinary collaboration is
the key to getting a project like this off the ground.
Yeah, absolutely.
And we should acknowledge the gigantic contributions made by our collaborators at the University
of Washington.
Many of them would fall
in not any of these three categories. They're electrical engineers, they're mechanical engineers,
they're pure biologists that we worked with. And each of them brought their own perspective.
And particularly when you talk about going to a true end-to-end system, those perspectives
were invaluable as we try and fit all the puzzle pieces together. Yeah, absolutely. We've had great collaborations over time, University of Washington, ETH Zurich,
Los Alamos National Lab, ChipIr, Twist Bioscience, Ansa Biotechnologies. Yeah,
it's been really great and a great set of different disciplines all the way from
coding theory to molecular biology and chemistry, electrical
and mechanical engineering.
One of the great things about research is there's never a shortage of interesting questions
to pursue, and for us, this particular work has opened the door to research in adjacent
domains, including sustainability
fields. DNA data storage requires small amounts of materials to accommodate the large amounts of
data. And early on, we wanted to understand if DNA data storage was, as it seemed, a more
sustainable way to store information. And we learned a lot. Bichlien and Jake, you had experience in green chemistry
when you came to Microsoft.
What new findings did we make and what sustainability benefits
do we get with DNA data storage?
And finally, what new sustainability work has the project led to?
As a part of this project, if we're going to bring new technologies
to the forefront, you know, to the world, we should make sure that they have a lower carbon footprint,
for example, than previous technologies. And so we ran a lifecycle assessment, which is a way
to systematically evaluate the environmental impacts of anything of interest. And we did this
on DNA data storage and compared it to electronic storage media. And we noticed that if we were
able to store all of our digital information in DNA, that we would have benefits associated with
carbon emissions. We would be able to reduce that
because we don't need as much infrastructure compared to the traditional storage methods.
And there would be an energy reduction as well because this is a passive way of archival data
storage. So those were the main takeaways that we had, but that also led us to think about other technologies that would be beneficial beyond data storage, and how we could use the same kind of lifecycle thinking towards them.

And what we ended up stumbling on, not inventing ourselves but seeing other people doing in the literature and trying to implement on the DNA data storage project, is something that can be much bigger than any single material.
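The kind of lifecycle comparison described above can be sketched as a toy calculation. Note that every number below is a hypothetical placeholder chosen for illustration, not a figure from the actual assessment, and the function is illustrative only; it captures just the two effects mentioned in the conversation: embodied emissions from manufacturing the media (paid again each time archival media reaches end of life and must be rewritten) and ongoing operating energy.

```python
# Toy lifecycle-assessment sketch: compare total carbon footprint of storing
# an archive on two media. All inputs are HYPOTHETICAL placeholders, not
# results from the study discussed in the episode.

def lifecycle_footprint(data_tb, years, embodied_kg_per_tb,
                        operating_kg_per_tb_year, media_lifetime_years):
    """Return total kg CO2e for storing data_tb terabytes for `years` years.

    Embodied emissions recur once per media generation; archival media must
    be rewritten onto fresh media when it reaches end of life.
    """
    generations = -(-years // media_lifetime_years)  # ceiling division
    embodied = data_tb * embodied_kg_per_tb * generations
    operating = data_tb * operating_kg_per_tb_year * years
    return embodied + operating

# Placeholder assumptions: DNA has a higher one-time synthesis (embodied)
# cost per TB, but near-zero operating energy (passive storage) and a very
# long media lifetime; tape is cheaper to make but must be powered,
# maintained, and periodically rewritten.
dna = lifecycle_footprint(1000, 100, embodied_kg_per_tb=50,
                          operating_kg_per_tb_year=0.0,
                          media_lifetime_years=100)
tape = lifecycle_footprint(1000, 100, embodied_kg_per_tb=5,
                           operating_kg_per_tb_year=1.0,
                           media_lifetime_years=10)
print(dna, tape)  # with these placeholder numbers, DNA comes out lower
```

This mirrors the two benefits mentioned: less infrastructure (fewer media generations, so lower embodied emissions over a century) and lower energy use (passive archival storage, so near-zero operating emissions).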
And where we think there's a chance for folks like ourselves at Microsoft Research to make
a real impact on this sustainability-focused design is through the application of machine learning, artificial intelligence, the new tools that will allow us to look at much bigger design spaces than we could previously, to evaluate sustainability metrics that were not possible when everything was done manually, and to ultimately, at the end of the day, take a sustainability first look at what a material should be composed of.
And so we've tried to prototype this with a few projects.
We had another wonderful collaboration with the University of Washington, where we looked at recyclable circuit boards and a novel material called a vitrimer that they could possibly be made out of. We've had another great collaboration with the University of Michigan,
where we've looked at the design of charge-carrying molecules and these things called flow batteries
that have good potential for energy smoothing and renewables production,
trying to get us out of that day-night boom-bust cycle.
And we had one more project, this time with collaborators at the University of California, Berkeley, where we looked at the design of a class of materials called metal-organic frameworks, which have great promise in low-energy-cost gas separation, such as pulling CO2 out of the plume of a smokestack or, ideally, out of the air itself.
For me, the DNA work has made me much more open to projects outside my own research area.
As Bichlien mentioned, my core research area is computer architecture, but we've ventured into quite a few other areas here, going way beyond my own comfort zone, and that has really made me love interdisciplinary projects like this and really try to do the most important work I can.
And this is what attracted me to these other areas of environmental sustainability that
Bichlien and Jake covered, where there's absolutely no lack of problems. Like them, I'm super interested in using AI to solve many of them.

So how do each of you think working on the DNA data storage project has influenced your research approach more generally, and how you think about research questions to pursue next?
It definitely expanded my horizons a lot, just having these interactions with people whose core areas of research are so different from my own. And also a lot of learning, even within my own field, that I had to do to carry this project out.
So, I mean, it was a great and rewarding experience.
Yeah, for me, it's kind of the opposite of Karin, right? I started as an organic chemist, and now I really appreciate, one, the breadth and depth of going from a concept to a real end-to-end prototype and all the requirements that you need to get there.
you know, a background in computer science and really being able to understand the lingo that
is used in multidisciplinary projects because you might say something and someone else interprets it
very differently. And it's because you're not speaking the same language. And so understanding that you have to learn a little bit of vocabulary from each person, and understand how they contribute and how your ideas can contribute to their ideas, has been really impactful in my career here.

Yeah, I think the key change in approach that I took away, and I think many of us took away
from the DNA Data Storage Project, was rather than starting with an academic question, we
started with a vision of what we wanted to happen.
And then we derived the research questions from analyzing what would need to happen in
the world. What are the bottlenecks
that need to be solved in order for us to achieve, you know, that goal? And this is something that
we've taken with us into the sustainability-focused research and, you know, something that I think
will affect all the research I do going forward.

Awesome. As we close, let's reflect a bit on what a world in which DNA data storage is widely used
might look like.
If everything goes as planned, what do you hope the lasting impact of this work will
be?
Sergey, why don't you lead us off?
Sure.
I remember, in the early days when I was starting to work on this project, you told me that you were taking an Uber ride somewhere and you were talking to the driver. And the driver, I don't know if you remember this, but he mentioned that he has a camera which records everything that happens in the car. And then you had a discussion with him about how long he keeps the data, how long he keeps the videos. And he told you that he keeps them for about a couple of days because it's too expensive.
But otherwise, if it weren't that expensive,
he would keep it for much, much longer.
Because he wants to have the recordings if later somebody
is upset about the ride and he's getting sued or something.
So this is one small, narrow application area that DNA data storage, if it happens, would clearly solve.
Because then this long-term archival storage
will become very cheap, available to everybody,
will become a commodity, basically.
There are many things that will be enabled,
like helping the Uber drivers, for instance.
But also one has to think, of course,
about the broader implications
so that we don't get into something negative.
Because again, this power of recording everything
and storing everything, it can also lead to some use cases that
might be kind of morally wrong. So again, hopefully, by the time we get to really wide deployments of this technology, the regulation will also have caught up, and we would have great use cases and we won't have bad ones. I mean, that's how I think of it. But definitely, there are lots of
that's how I think of it. But definitely, there are lots of
kind of great scenarios
that this can enable.
I'll grab onto the word you used there, which was making DNA a commodity.
And one of the things that I hope comes out of this project,
besides all the great benefits of DNA data storage itself,
is spillover benefits into the field of health,
where if we make DNA synthesis at large scale truly
a commodity thing, which I hope some of the work that we've done to really accelerate the throughput
of synthesis will do, then this will open new doors in what we can do in terms of gene synthesis,
in terms of fundamental biotech research that will lead to that next set of drugs and give us
medications or treatments that we could not have thought possible if we were not able to synthesize
DNA and related molecules at that scale.

So much information gets lost because of just time.
And so I think being able to recover really ancient history that humans wrote in the future,
I think is something that I really hope could be achieved because we're so information rich,
but in the course of time, we become information poor. So I would like for our future generations
to be able to understand the life of, you know, an everyday 21st century person.
Well, Bichlien, Jake, Sergey, it's been fun having this conversation with you today and collaborating with you on this amazing project and all the research we've done together.
Thank you so much.
Thank you.
Thanks.