Microsoft Research Podcast - Ideas: The journey to DNA data storage

Episode Date: November 19, 2024

Research manager Karin Strauss and members of the DNA Data Storage Project reflect on the path to developing a synthetic DNA–based system for archival data storage, including the recent open-source release of its most powerful algorithm for DNA error correction. Get the Trellis BMA code: GitHub - microsoft/TrellisBMA: Trellis BMA: coded trace reconstruction on IDS channels for DNA storage

Transcript
Starting point is 00:00:00 This really starts from the fundamental data production, data storage gap, where we produce way more data nowadays than we could ever have imagined years ago. And it's more than we can practically store in magnetic media. And so we really need a denser medium on the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today. This, along with the fact that DNA is itself a relatively rugged molecule, it lives in our body, it lives outside our body for thousands and thousands of years if we leave it alone to do its thing, makes it a very attractive medium. It's such a futuristic technology, right? When you begin to work on the tech,
Starting point is 00:00:50 you realize how many disciplines and domains you actually have to reach in and leverage. It's really interesting, this multidisciplinarity, because we're, in a way, bridging software with wetware with hardware. And so you kind of need all the different disciplines to actually get you to where you need to go. We all work for Microsoft. We're all Microsoft researchers. Microsoft isn't a startup, but that team, the team that drove the DNA data storage project, it did feel like a startup and it was something unusual and exciting for me.
Starting point is 00:01:31 You're listening to Ideas, a Microsoft Research podcast that dives deep into the world of technology research and the profound questions behind the code. In this series, we'll explore the technologies that are shaping our future and the big ideas that propel them forward. I'm your guest host, Karin Strauss, a Senior Principal Research Manager at Microsoft. For nearly a decade, my colleagues and I, along with a fantastic and talented group of collaborators from academia and industry, have been working together to help close the data creation, data storage gap. We're producing far more digital information than we can possibly store. One solution we've explored uses synthetic DNA as a medium, and over the years we've contributed to steady and promising progress in the area.
Starting point is 00:02:24 We've helped push the boundaries of how much DNA a writer can simultaneously store, shown that full automation is possible, and helped create an ecosystem for the commercial success of DNA data storage. And just this week, we've made one of our most advanced tools for encoding and decoding data in DNA open source. Joining me today to discuss the state of DNA data storage and some of our contributions are several members of the DNA Data Storage Project at Microsoft Research.
Starting point is 00:02:54 Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith, and Partner Research Manager Sergey Yekhanin. Bichlien, Jake, and Sergey, welcome to the podcast. Thanks for having us, Karin. Thank you so much. Thank you. So before getting into the details of DNA data storage and our work, I'd like to talk about the big idea behind the work and how we got here.
Starting point is 00:03:18 I've often described the DNA data storage project as turning science fiction into reality. When we started the project in 2015, though, the idea of using DNA for archival storage was already out there and had been for over five decades. Still, when I talked about the work in the area, people were pretty skeptical in the beginning, and I heard things like, wow, why are you thinking about that? It's so far off. So first, please share a bit of your research backgrounds and then how you came to work on this project. Where did you first encounter this idea? What do you remember about your initial impressions or the impressions of others? And what made you want to get involved? Sergey, why don't you start? Thanks a lot. So I'm a coding theorist by training, so my core areas of research have been error
Starting point is 00:04:08 correcting codes and also computational complexity theory. And so I joined the project probably like within half a year of the time that it was born, and thanks, Karin, for inviting me to join. So that was roughly the time when I moved from a different lab, from the Silicon Valley lab in California to the Redmond lab. And actually, it just so happened that at that moment, I was thinking about what to do next. In California, I was mostly working on coding for distributed storage. And when I joined here, that effort kept going, but I had some free cycles. And that was the moment when Karin came just to my office and told me about the project. So, so indeed, initially it did feel a lot like science fiction because, I mean,
Starting point is 00:04:47 we are used to coding for, uh, for digital storage media, like for magnetic storage media, and, uh, here, like, this is biology. And, like, why, why exactly these kinds of molecules? There are so many different molecules, like, why that? But honestly, like, I didn't try to pretend to be a biologist and make conclusions about whether this is the right medium or the wrong medium. So I tried to look at these kinds of questions from a technical standpoint, and there were a lot of kind of deep, interesting coding questions. And that was the main attraction for me. At the same time, I wasn't convinced that we would get as far as we actually got, and I wasn't immediately convinced about the future of the field. But just the depth and the richness of the
Starting point is 00:05:25 actual technical problems, that's what made it appealing for me, and I kind of enthusiastically joined. And also, I guess, the culture of the team. So it did feel like a startup. We all work for Microsoft. We're all Microsoft researchers. Microsoft isn't a startup. But that team, the team that drove the DNA Data Storage project, it did feel like a startup, and it was something unusual and exciting for me. Oh, I love that, Sergey. So my background is in organic chemistry. And Karin had reached out to me. And I interviewed not knowing what Karin wanted, actually. So I took the job kind of blind because I was like, hmm, Microsoft, research, DNA, biotech. I was very, very curious. And then when she told me that
Starting point is 00:06:07 this project was about DNA data storage, I was like, this is a crazy, crazy idea. I definitely was not sold on it, but I was like, well, look, I get to meet and work with so many interesting people from different backgrounds that, one, even if it doesn't work out, I'm going to learn something. And two, I think it could work. Like, it could work. And so I think that's really what motivated me to join. The first thing that you think when you hear about, we're going to take what is our hard drive and we're going to turn that into DNA, is that this is nuts. But you know, it didn't take very long after that. I come from a chemistry biotech type background where I've been working on designing drugs, and there DNA is this thing off in the nethers. You look at it every now and then to see what information it
Starting point is 00:06:59 can tell you about what maybe your drug might be hitting on the target side. And it's that connection that the DNA contains the information in the living systems. The DNA contains the information in our assays. Why could the DNA not contain the information that we think more about every day, that information that lives in our computers? That's an extremely cool idea. Through our work, we've had years to wrap our heads around DNA data storage. But Jake, could you tell us a little bit about how DNA data storage works and why we're interested in looking into the technology? So you mentioned it earlier, Karin, that this really starts from the fundamental data production, data storage gap, where we produce way more data nowadays than we could
Starting point is 00:07:46 ever have imagined years ago. And it's more than we can practically store in magnetic media. This is a problem because we have data. We have recognized the value of data with the rise of large language models and these other big generative models. The data that we do produce, our video, has gone from substantially small, down at 480 resolution, all the way up to things at 8K resolution that now take orders of magnitude more storage. And so we really need a denser medium on the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today. This, along with the fact that DNA is itself a relatively rugged molecule, it lives in our body, it lives outside our body for thousands and thousands of years if we leave it alone to do its thing, makes it a very attractive medium, particularly compared to traditional magnetic media, which has lower density and a much shorter lifetime, on the scale of
Starting point is 00:08:51 decades at most. So how does DNA data storage actually work? Well, at a very high level, we start out in the digital domain where we have our information represented as ones and zeros, and we need to convert that into a series of A's, C's, T's, and G's that we could then actually produce. And this is really the domain of Sergey. He'll tell us much more about how this works later on. For now, let's just assume we've done this and now our information lives in the DNA-based domain. It's still in the digital world. It's just represented as A's, C's, T's, and G's, and we now need to make this physical so that we can store it. This is accomplished through
Starting point is 00:09:29 large-scale DNA synthesis. Once the DNA has been synthesized with the sequences that we specified, we need to store it. There's a lot of ways we can think about storing it. Bichlien's done great work looking at DNA encapsulation, as well as other more raw, just DNA on glass type techniques. And we've done some work looking at the susceptibility of DNA stored in this unencapsulated form to things like atmospheric humidity, to temperature changes, and most excitingly to things like neutron radiation. So we've stored our data in this physical form. We've archived it. And coming back to it, likely many years in the future, because the properties of DNA match up very well with archival storage,
Starting point is 00:10:18 we need to convert it back into the digital domain. And this is done through a technique called DNA sequencing. What this does is it puts the molecules through some sort of machine. And on the other side of the machine, we get out a noisy representation of what the actual sequence of bases in the molecules were. Now, we have one final step. We need to take this series of noisy sequences and convert it back into ones and zeros. Once we do this, we return to our original data and we've completed, let's call it, one DNA data storage cycle. We'll get into this in more detail later, but maybe, Sergey, we can dig a little bit into encoding, decoding, that side of things, and how DNA is different as a medium from other types of media? Sure. So, like, I mean, coding is an important aspect of this whole idea
Starting point is 00:11:13 of DNA data storage, because we have to deal with errors. It's a new medium. But talking about error-correcting codes in the context of DNA data storage. So, I mean, usually, like, what are error-correcting codes about? Like, on a very high level, right? I mean, you have some data. Think of it, I don't know, as a binary string. You want to store it, but there are errors. So, and usually, like, in most kinds of media, the errors are bit flips.
Starting point is 00:11:35 Like, you store a zero, you get a one. And when you store a one, you get a zero. So these are called substitution errors. The field of error-correcting codes, it started, like, in the 1950s, so it's 70 years old at least. So we understand how to deal with this kind of error, with substitution errors, reasonably well. Now, in DNA data storage,
Starting point is 00:11:55 the way you store your data is that given some large amount of digital data, you have the freedom of choosing which short DNA molecules to generate. So a DNA molecule, it's a sequence of these bases, A, G, C, and T. You get the freedom to decide which of the short molecules you need to generate. And then these molecules get stored. And then during the storage, some of them are lost. Some of them can be damaged. There can be insertions and deletions of bases on every molecule, like we call them strands. So you need redundancy. And there are two forms of redundancy. There is redundancy that goes across strands,
Starting point is 00:12:30 and there is redundancy on the strand. And so, yeah, so kind of from the error correcting side of things, like, we get to decide what kind of redundancy we want to introduce across strands and on the strand. And then we want to make sure that our encoding and decoding algorithms are efficient. So that's the coding theory angle on the field. Yeah. And then from there, once you have that data encoded into DNA, the question is, how do you make that data on a scale that's compatible with digital data storage? And so that's where a lot of the work came in for really automating the synthesis process and also the reading process as well. So synthesis is what we consider the writing process of DNA data storage.
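As a toy illustration of the two kinds of redundancy Sergey describes, the sketch below uses a fixed two-bits-per-base mapping and a single XOR parity strand computed across the data strands, so that any one lost strand can be rebuilt. This is a hypothetical minimal example for intuition only; the project's actual codes are far more sophisticated and also protect against errors within each strand.

```python
from functools import reduce

# Hypothetical 2-bits-per-base mapping (00->A, 01->C, 10->G, 11->T).
TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
TO_BITS = {b: v for v, b in TO_BASE.items()}

def bytes_to_strand(data: bytes) -> str:
    # Each byte becomes four bases, most significant bit pair first.
    return "".join(TO_BASE[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))

def strand_to_bytes(strand: str) -> bytes:
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | TO_BITS[base]
        out.append(byte)
    return bytes(out)

def xor_parity(chunks):
    # Across-strand redundancy: bytewise XOR over equal-length chunks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

chunks = [b"HELL", b"O, D", b"NA! "]            # payload split into strands
strands = [bytes_to_strand(c) for c in chunks]
strands.append(bytes_to_strand(xor_parity(chunks)))  # parity strand

# Simulate losing the first data strand and rebuilding it from the rest.
recovered = xor_parity([strand_to_bytes(s) for s in strands[1:]])
print(recovered)  # b'HELL'
```

The same XOR trick generalizes to the idea of erasure coding across strands: with stronger codes (e.g., Reed–Solomon), many missing strands can be tolerated at once.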
Starting point is 00:13:12 And so we came up with some unique ideas there. We made a chip that enabled us to get to the densities that we needed. And then on the reading side, we used different sequencing technologies. And it was great to see that we could actually just kind of pull sequencing technologies off the shelf because people are so interested in reading biological DNA. So we explored the Illumina technologies and also Oxford Nanopore, which is a new technology coming on the horizon. And then preservation, too, because we have to make sure that the data that's stored in the DNA doesn't get damaged and that we can recover it using the error-correcting codes. Yeah, absolutely.
Starting point is 00:13:55 And it's clear that, and it's also been our experience, that DNA data storage and projects like this require more than just a team of computer scientists. Bichlien, you had the opportunity to collaborate with many people in all different disciplines. So do you want to talk a little bit about that? What kind of expertise, you know, other disciplines are relevant to bringing DNA data storage to reality? Yeah, well, it's such a futuristic technology, right? When you begin to work on the tech, you realize how many disciplines and domains you actually have to reach in and leverage. One concrete example is that in order to fabricate an electronic chip to synthesize DNA,
Starting point is 00:14:43 we really had to pull in a lot of material science research because there's different capabilities that are needed when trying to use liquid on a chip. We, you know, have to think about DNA data storage itself, and that's a very different beast than, you know, the traditional storage mediums. And so we worked with teams who literally create, you know, these little tiny micro or nano capsules in glass and being able to store that there. It's really interesting, this multidisciplinarity, because we're in a way bridging software with wetware with hardware. And so you kind of need all the different disciplines to actually get you to where you need to go. Yeah, absolutely.
Starting point is 00:15:34 And, you know, building on, you know, collaborators, I think one area that was super interesting as well and was pretty early on in the project was building that first end-to-end system that we collaborated with the University of Washington, the Molecular Information Systems Lab there, to build. And really, at that point, there had been work suggesting that DNA data storage was viable, but nobody had really shown an end-to-end system from beginning to end. And in fact, my manager at the time, Doug Carmean, used to call it the bubblegum and shoestring system.
Starting point is 00:16:20 But it was a crucial first step because it showed it was possible to really fully automate the process. And there had been several interesting challenges there in the system, but we noticed that one particularly challenging one was synthesis. That first system that we built was capable of storing the word hello, and that was all we could store, so it wasn't a very high-capacity system. But in order to be able to store much larger volumes of data instead of a simple word, we really needed much more advanced synthesis systems, and this is what both Bichlien and Jake ended up working on. So do you want to talk a little bit about that and the importance of that particular work? Absolutely. As you said, Karin,
Starting point is 00:17:02 the amount of DNA that is required to store the massive amount of data we spoke about earlier is far beyond the amount of DNA that's needed for any, air quotes, traditional applications of synthetic DNA, whether it's your gene construction or it's your primer synthesis or such. And so we really had to rethink how you make DNA at scale and think about how this could actually scale to meet the demand. And so we started out looking at a thing called a microelectrode array, where you have this big checkerboard of small individual reaction sites. At each reaction site, we used electrochemistry in order to control base by base, A, C, T, or G by A, C, T, or G, the sequence that was growing at that particular reaction site. We got this down to the nanoscale. And so what this means practically is that on one of these chips, we could synthesize at any given time on the order of hundreds of millions of individual strands. So once we had the synthesis working with traditional chemistry, where you're doing chemical synthesis, each base is added in using a mixture of chemicals that are
Starting point is 00:18:19 added to the individual spots that are activated. But each coupling happens due to some energy you pre-stored in the synthesis of your reagents. And this makes the synthesis of those reagents costly and themselves a bottleneck. And so taking a look forward at what else was happening in the synthetic biology world, the next big word in DNA synthesis was and still is enzymatic synthesis, where rather than having to spend a lot of energy to chemically pre-activate reagents that will go in to make your actual DNA strands, we capitalize on nature's synthetic robots, enzymes, to start with less activated, less expensive to get to, cheaply produced through natural processes, substrates. And we use the enzymes themselves, toggling their activity
Starting point is 00:19:13 over each of the individual chips or each of the individual spots on our checkerboard to construct DNA strands. And so we got a little bit into this project. We successfully showed that we could put down selectively one base at a given time. We hope that others will kind of take up the work that we've put out there, particularly our wonderful collaborators at Ansa who helped us design the enzymatic system. And one day we will see a truly parallelized, in this fashion, enzymatic DNA system that can achieve the scales necessary. It's interesting to note that even though it's DNA and we're still storing data in these DNA strands, chemical synthesis and enzymatic synthesis produce different errors that you see in the actual files, right, in the DNA files. And so I know that we talked to Sergey about how do we deal with these new types of errors and also the new capabilities that you can have, for example, if you don't control base
Starting point is 00:20:19 by base the DNA synthesis. This whole field of DNA data storage, like, the technologies on the biology side are advancing rapidly, right? So there are different approaches to synthesis, there are different approaches to sequencing, and presumably the way the storage is actually done is also progressing, right? And we had works on that.
Starting point is 00:20:37 So there is some very general kind of high-level error profile that you can say that these are the types of errors that you encounter in DNA data storage. Like, DNA molecules, they're just sequences of these bases, A, G, C, T, of maybe a length like 200 or so, and you store a very, very large number of them. The errors that you see are that some of these strands kind of will disappear. Some of these strands can be torn apart, like, let's say, in two pieces, maybe even more. And then on every strand, you also encounter these errors, insertions, deletions, substitutions
Starting point is 00:21:06 with different rates, like the likelihood of all kinds of these errors may differ very significantly across different technologies that you use on the biology side. And also there can be error bursts somehow. Maybe you can get an insertion of, I don't know, 10 A's, like, in a row. Or you can lose, like, I don't know,
Starting point is 00:21:21 10 bases in a row. So if you don't quantify, like, what are the likelihoods of all these bad events happening, then I think that this still kind of fits at least the majority of approaches to DNA data storage. Maybe not exactly all of them, but it fits the majority. So when we design coding schemes, we are trying also to look ahead, in the sense that we don't know how this error profile will look in five years. So the technologies that we develop on the error correction side, we try to keep them very flexible.
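The general error profile sketched here, whole strands disappearing plus per-base insertions, deletions, and substitutions, can be mimicked with a toy channel simulator like the one below. The rates are illustrative placeholders, not measured values from any real synthesis or sequencing technology.

```python
import random

BASES = "ACGT"

def ids_channel(strand, rng, p_sub=0.01, p_del=0.01, p_ins=0.01, p_drop=0.02):
    """Pass one strand through a toy insertion/deletion/substitution
    channel; with probability p_drop the whole strand is lost."""
    if rng.random() < p_drop:
        return None                          # strand disappeared entirely
    out = []
    for base in strand:
        while rng.random() < p_ins:          # spurious inserted bases
            out.append(rng.choice(BASES))
        if rng.random() < p_del:             # base deleted
            continue
        if rng.random() < p_sub:             # base substituted
            out.append(rng.choice(BASES.replace(base, "")))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(7)
strand = "".join(rng.choice(BASES) for _ in range(200))
reads = [ids_channel(strand, rng) for _ in range(20)]
surviving = [r for r in reads if r is not None]
print(f"{len(surviving)} of 20 reads survived; lengths vary around 200")
```

Because insertions and deletions shift every later position, the surviving reads have varying lengths, which is exactly why simple bit-flip codes are not enough and alignment-style decoding is needed.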
Starting point is 00:21:51 So whether it's enzymatic synthesis, whether it's nanopore technology, whether it's Illumina technology that is being used, the error correction algorithms would be able to adapt and would still be useful. But, I mean, this also makes the coding aspect harder, because you want to keep all this flexibility in mind. So Sergey, we are at an interesting moment now because you're open sourcing the Trellis BMA piece of code that you published a few years ago. Can you talk a little bit about that specific problem
Starting point is 00:22:20 of trace reconstruction and then the paper specifically and how it solves it? Absolutely. Yeah, so this Trellis BMA paper, for which we are releasing the source code right now, this is the latest in our sequence of publications on error correction for DNA data storage. And I should say that we already discussed that the project is kind of very interdisciplinary. So we have experts from all kinds of fields. But really, even within this coding theory, like, within computer science slash information theory, coding theory, in our algorithms, we use ideas from very different branches. I mean, there are some core ideas from core algorithm space, and I won't go into this, but let me just focus kind of on two
Starting point is 00:22:59 aspects. So when we just faced this problem of coding for DNA data storage, and we were thinking about, okay, so how exactly to design the coding scheme, what are the algorithms that we'll be using for error correction? So, I mean, we were obviously studying the literature, and we came upon this problem called trace reconstruction. So that was pretty popular in, I mean, somewhat popular, I would say, in computer science and in statistics. It didn't have much motivation, but very strong mathematicians have been looking at it. And the problem is as follows. So, like,
Starting point is 00:23:28 there is a long binary string picked at random, and then it's transmitted over the deletion channel. So some bits, some zeros, and some ones at certain coordinates get deleted, and you get to see kind of the shortened version of the string. But you get to see it multiple times. And the question is, like, how many times do you need to see it so that you can get a reasonably accurate estimate of the original string that was transmitted? So that was called trace reconstruction. And we took a lot of motivation, we took a lot of inspiration from the problem, I would say, because really in DNA data storage, if we think about a single strand, like a single strand which is being stored, after we read it, we usually get multiple reads of
Starting point is 00:24:05 this string. And while the errors there are not just deletions, there are insertions, substitutions, and there can be bursts of errors, but still we could rely on this literature in computer science that already had some ideas. So there was an algorithm called BMA, a bitwise majority alignment. We extended it, we adapted it kind of for the needs of DNA data storage, and it became kind of one of the tools in our toolbox for error correction. So we also started to use ideas from the literature on electrical engineering, what are called convolutional error correcting codes, and a certain class of algorithms for decoding errors in these convolutional error correcting codes, where, I mean, Trellis is the main data structure, Trellis-based algorithms for decoding convolutional codes
Starting point is 00:24:45 like the Viterbi algorithm or the BCJR algorithm. Convolutional codes allow you to introduce redundancy on the strand. So algorithms kind of similar to BMA, they were good for doing error correction when there is no redundancy on the strand itself. When there is redundancy on the strand, we could do some things, but really it was very limited. With Trellis-based approaches, again inspired by the literature in electrical
Starting point is 00:25:10 engineering, we had an approach to introduce redundancy on the strand, so that allowed us to have more powerful error correction algorithms. And then in the end we have this algorithm which we call Trellis BMA, which kind of combines the ideas from both fields. So it's based on Trellis, but it's also more efficient than standard Trellis-based algorithms because it uses the ideas from BMA from computer science literature. So this is kind of the mix of these two approaches. And yeah, that's the paper that we wrote about three years ago, and now we are open sourcing it.
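For intuition, here is a sketch of the basic bitwise majority alignment idea on the deletion-only trace reconstruction problem described above: each trace keeps a pointer, the pointed-at symbols are put to a majority vote, and a trace that disagrees with the vote is assumed to have deleted that position, so its pointer stays put. This is a simplified illustration only; the released Trellis BMA code handles the full insertion-deletion-substitution channel and exploits on-strand redundancy.

```python
import random
from collections import Counter

def bma(traces, n):
    """Estimate an n-symbol string from deletion-channel traces
    using bitwise majority alignment."""
    ptrs = [0] * len(traces)
    out = []
    for _ in range(n):
        votes = Counter(t[p] for t, p in zip(traces, ptrs) if p < len(t))
        symbol = votes.most_common(1)[0][0] if votes else "A"
        out.append(symbol)
        for i, t in enumerate(traces):
            # Traces agreeing with the majority consume their symbol;
            # disagreeing traces are assumed to have deleted it.
            if ptrs[i] < len(t) and t[ptrs[i]] == symbol:
                ptrs[i] += 1
    return "".join(out)

def deletion_channel(s, p_del, rng):
    return "".join(c for c in s if rng.random() > p_del)

rng = random.Random(42)
original = "".join(rng.choice("ACGT") for _ in range(100))
traces = [deletion_channel(original, 0.05, rng) for _ in range(10)]
estimate = bma(traces, 100)
matches = sum(a == b for a, b in zip(estimate, original))
print(f"{matches}/100 positions recovered")
```

With enough traces and a modest deletion rate, the majority vote usually tracks the true sequence; the Trellis-based refinement replaces the hard vote with probabilistic decoding over a trellis, which is what makes the combined Trellis BMA both more accurate and still efficient.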
Starting point is 00:25:39 So it is the most powerful algorithm for DNA error correction that we developed in the group. We're really happy that now we are making it publicly available so that anybody can experiment with the source code, because again, the field has expanded a lot. And now there are multiple groups around the globe that work just specifically on error correction, apart from all other aspects. So yeah, so we are very happy that it's becoming publicly available to hopefully further advance the field. Yeah, absolutely. And I'm always amazed by, you know, how it is really about building on other people's work. Jake and Bichlien, you recently published a paper in Nature
Starting point is 00:26:12 Communications. Can you tell us a little bit about what it was, what you exposed the DNA to, and what it was specifically about? Yeah, so that paper was on the effects of neutron radiation on DNA data storage. So, you know, when we started the DNA data storage project, it was really a comparison, right, between the different storage media that exist today. And one of the issues that has come up through the years of development of those technologies was, you know, hard errors and soft errors that were induced by radiation. So we wanted to know, does that maybe happen in DNA? We know that DNA, in humans at least, is affected by radiation from cosmic rays. And so that was really the motivation for this type of experiment. So what we did was we essentially took our DNA files and dried them and threw them in a neutron accelerator,
Starting point is 00:27:16 which was fantastic. It was so exciting. That's kind of the merge of, you know, sci-fi with sci-fi at the same time. It was fantastic. And we irradiated for the equivalent of over 80 million years. Because it's a lot of radiation. It's a lot of radiation, and it's accelerated radiation exposure. Yeah, I would say it's accelerated aging with radiation. It's an insane amount of radiation. And it was surprising that even though we irradiated our DNA files with that much radiation, there wasn't that much damage. And that's surprising because we know that humans, if we were to be irradiated like that, it would be disastrous. But in DNA, our files were able to be recovered with zero bit errors. And why that difference? Well, we think there's a few reasons.
Starting point is 00:28:15 One is that when you look at the interaction between a neutron and the actual elemental composition of DNA, which is basically carbons, oxygens, and hydrogens, maybe a phosphorus, the neutrons don't interact with the DNA much. And if it did interact, we would have, for example, a strand break, which, based on the error correcting codes, we can recover from. So essentially, one, there's not much interaction between neutrons and DNA. And second, we have error correcting codes that would prevent any data loss. There are also other conditions that are needed for a technology to be brought to the market. And one thing I've worked on is to, you know, create the DNA Data Storage Alliance. This is something Microsoft co-founded with Illumina, Twist Bioscience, and Western Digital.
Starting point is 00:29:18 And the goal there was to essentially provide the right conditions for the technology to thrive commercially. We did bring together multiple universities and companies that were interested in the technology. And one thing that we've seen with storage technologies that's been pretty important is standardization and making sure that the technology is interoperable. And, you know, we've seen stalemate situations like Blu-ray and high-definition DVD, where, you know, really we couldn't decide on a standard, and it took a while for the technology to be picked up. And the intent of the DNA Data Storage Alliance is just to provide an ecosystem of companies, universities, groups
Starting point is 00:30:06 interested in making sure that this time it's an interoperable technology from the get-go and that increases the chances of commercial adoption. As a group, we often talk about how amazing it is to work for a company that empowers us to do this kind of research. And for me, one of Microsoft Research's unique strengths, particularly in this project, is the opportunity to work with such a diverse set of collaborators on such a multidisciplinary project like we have. How do you all think where you've done this work has impacted how you've gone about it and the contributions you've been able to make? I'm going to start with, if we look around this table and we see who's sitting at it,
Starting point is 00:30:48 which is two chemists, a computer architect, and a coding theorist, and we come together and we're like, what can we make that would be super, super impactful? I think that's the answer right there: being at Microsoft and being in a culture that really fosters this type of interdisciplinary collaboration is the key to getting a project like this off the ground. Yeah, absolutely. And we should acknowledge the gigantic contributions made by our collaborators at the University of Washington.
Starting point is 00:31:23 Many of them don't fall into any of these three categories. They're electrical engineers, they're mechanical engineers, they're pure biologists that we worked with. And each of them brought their own perspective. And particularly when you talk about going to a true end-to-end system, those perspectives were invaluable as we tried to fit all the puzzle pieces together. Yeah, absolutely. We've had great collaborations over time: University of Washington, ETH Zurich, Los Alamos National Lab, ChipIr, Twist Bioscience, Ansa Biotechnologies. Yeah, it's been really great, and a great set of different disciplines all the way from coding theory to molecular biology and chemistry, electrical
Starting point is 00:32:08 and mechanical engineering. One of the great things about research is there's never a shortage of interesting questions to pursue, and for us, this particular work has opened the door to research in adjacent domains, including sustainability fields. DNA data storage requires small amounts of materials to accommodate the large amounts of data. And early on, we wanted to understand if DNA data storage was, as it seemed, a more sustainable way to store information. And we learned a lot. Biklin and Jake, you had experience in green chemistry when you came to Microsoft.
Starting point is 00:32:49 What new findings did we make, and what sustainability benefits do we get with DNA data storage? And finally, what new sustainability work has the project led to? As a part of this project, if we're going to bring new technologies to the forefront, you know, to the world, we should make sure that they have a lower carbon footprint, for example, than previous technologies. And so we ran a lifecycle assessment, which is a way to systematically evaluate the environmental impacts of anything of interest. And we did this on DNA data storage and compared it to electronic storage media. And we noticed that if we were
Starting point is 00:33:36 able to store all of our digital information in DNA, we would have benefits associated with carbon emissions. We would be able to reduce those because we don't need as much infrastructure compared to traditional storage methods. And there would be an energy reduction as well, because this is a passive way of doing archival data storage. So those were the main takeaways that we had, but that also led us to think about other technologies that would be beneficial beyond data storage, and how we could use the same kind of lifecycle thinking toward them. And what we stumbled on, not inventing it but seeing other people doing it in the literature and trying to implement it ourselves on the DNA data storage project, is something that can be much bigger than any single material. And where we think there's a chance for folks like ourselves at Microsoft Research to make
Starting point is 00:34:38 a real impact on this sustainability-focused design is through the application of machine learning, artificial intelligence, the new tools that will allow us to look at much bigger design spaces than we could previously, to evaluate sustainability metrics that were not possible when everything was done manually, and to ultimately, at the end of the day, take a sustainability-first look at what a material should be composed of. And so we've tried to prototype this with a few projects. We had another wonderful collaboration with the University of Washington, where we looked at recyclable circuit boards and a novel material called a vitrimer that they could possibly be made out of. We've had another great collaboration with the University of Michigan, where we've looked at the design of charge-carrying molecules in these things called flow batteries, which have good potential for energy smoothing in renewable energy production, trying to get us out of that day-night boom-bust cycle. And we had one more project, this time with collaborators at the University of California, Berkeley, where we looked at the design of a class of materials called a metal-organic
Starting point is 00:35:52 framework, which have great promise in low-energy-cost gas separation, such as pulling CO2 out of the plume of a smokestack or ideally out of the air itself? For me, the DNA work has made me much more open to projects outside my own research area. As Biklin mentioned, my core research area is computer architecture, but we've ventured in quite a bit of other areas here and going way beyond my own comfort zone and really made me love interdisciplinary projects like this and really try to do the most important work I can. And this is what attracted me to these other areas of environmental sustainability that Bicklin and Jay covered, where there's absolutely no lack of problems.
Starting point is 00:36:47 Like them, I'm super interested in using AI to solve many of them. So how do each of you think working on the DNA data storage project has influenced your research approach more generally, and how you think about which research questions to pursue next? It definitely expanded my horizons a lot, just having these interactions with people whose core areas of research are so different from my own. And also a lot of learning, even within my own field, that I had to do to carry this project out. So, I mean, it was a great and rewarding experience.
Starting point is 00:37:21 Yeah, for me, it's kind of the opposite of Karen, right? I started as an organic chemist and then now really, one, appreciate the breadth and depth of going from a concept to a real end-to-end prototype and all the requirements that you need to get there. And then also really the importance of having, you know, a background in computer science and really being able to understand the lingo that is used in multidisciplinary projects because you might say something and someone else interprets it very differently. And it's because you're not speaking the same language. And so that understanding that you have to really be, you have to learn a little bit of vocabulary from each person and understand how they contribute and then how your ideas can contribute to their ideas has been really impactful in my career here. Yeah, I think the key change in approach that I took away, and I think many of us took away
Starting point is 00:38:29 from the DNA Data Storage Project, was rather than starting with an academic question, we started with a vision of what we wanted to happen. And then we derived the research questions from analyzing what would need to happen in the world. What are the bottlenecks that need to be solved in order for us to achieve, you know, that goal? And this is something that we've taken with us into the sustainability-focused research and, you know, something that I think will affect all the research I do going forward. Awesome. As we close, let's reflect a bit on what a world in which DNA data storage is widely used might look like.
Starting point is 00:39:09 If everything goes as planned, what do you hope the lasting impact of this work will be? Sergey, why don't you lead us off? Sure. I remember that in the early days, when I was starting to work on this project, you actually told me that you were taking an Uber ride somewhere, and you were talking to the driver. And the driver, I don't know if you remember that, but the driver mentioned that he has a camera which is recording everything that happens in the car. And then you had a discussion with him about how long he keeps the data, how long he keeps the videos.
Starting point is 00:39:41 And he told you that he keeps it for only about a couple of days because it's too expensive. But otherwise, if it weren't that expensive, he would keep it for much, much longer, because he wants to have the recordings if later somebody is upset about the ride and he's getting sued or something. So this is one small, narrow application area that DNA data storage, if it happens, would clearly solve.
Starting point is 00:40:02 Because then this long-term archival storage will become very cheap, available to everybody; it will become a commodity, basically. There are many things that will be enabled, like helping the Uber drivers, for instance. But one also has to think, of course, about the broader implications, so that we don't get into something negative.
Starting point is 00:40:22 Because again, this power of recording everything and storing everything can also lead to some use cases that might be morally wrong. So hopefully, by the time we get to really wide deployments of this technology, the regulation will also be catching up, and we would have great use cases and not the bad ones. I mean, that's how I think of it. But definitely, there are lots of great scenarios
Starting point is 00:40:47 that this can enable. MARK MANDELMANN, I'll grab onto the word you used there, which was making DNA a commodity. And one of the things that I hope comes out of this project, besides all the great benefits of DNA data storage itself, is spillover benefits into the field of health, where if we make DNA synthesis at large scale truly a commodity thing, which I hope some of the work that we've done to really accelerate the throughput
Starting point is 00:41:11 of synthesis will do, then this will open new doors in what we can do in terms of gene synthesis, in terms of fundamental biotech research that will lead to that next set of drugs and give us medications or treatments that we could not have thought possible if we were not able to synthesize DNA and related molecules at that scale. So much information gets lost because of just time. And so I think being able to recover really ancient history that humans wrote in the future, I think is something that I really hope could be achieved because we're so information rich, but in the course of time, we become information poor. So I would like for our future generations to be able to understand the life of, you know, an everyday 21st century person.
Starting point is 00:42:10 Well, Bichlien, Jake, Sergey, it's been fun having this conversation with you today and collaborating with you on this amazing project and all the research we've done together. Thank you so much. Thank you. Thanks.
