Microsoft Research Podcast - 048 - Storing Digital Data in Synthetic DNA with Dr. Karin Strauss

Episode Date: October 31, 2018

As traditional semiconductor technologies for computer storage scale down, everyone is looking for alternative solutions to the growing gap between the amount of data we’re capable of producing and the amount of data we’re capable of storing. While some have focused on hardware accelerators for machine learning, and others are investigating new memory technologies, Dr. Karin Strauss, a Senior Researcher at Microsoft Research in Redmond, has been exploring the role of biotechnology in IT via an end-to-end system that stores digital data in DNA. On today’s podcast, Dr. Strauss talks about life at the intersection of computer science and biology which, for many, is more like the intersection of science fiction and science, and explains how the unique properties of DNA could eventually enable us to store really big data in really small places for a really long time.

Transcript
Starting point is 00:00:00 Right after the structure of DNA started to be more well understood, people looked at it and scratched their heads and thought, you know, it's carrying information about life. Why can't it carry any kind of information? You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga.
Starting point is 00:00:31 As traditional semiconductor technologies for computer storage scale down, everyone is looking for alternative solutions to the growing gap between the amount of data we're capable of producing and the amount of data we're capable of storing. While some have focused on hardware accelerators for machine learning and others are investigating new memory technologies, Dr. Karin Strauss, a senior researcher at Microsoft Research in Redmond, has been exploring the role of biotechnology in IT via an end-to-end system that stores digital data in DNA. On today's podcast, Dr. Strauss talks about life at the intersection of computer science
Starting point is 00:01:10 and biology, which for many is more like the intersection of science fiction and science, and explains how the unique properties of DNA could eventually enable us to store really big data in really small places for a really long time. That and much more on this episode of the Microsoft Research Podcast. Karin Strauss, welcome to the podcast. Thanks for having me. You're a senior researcher at MSR and an affiliate professor in the Department of Computer Science and Engineering at the University of Washington. You situate your research at the intersection of computer science, systems, and biology, which is a fascinating intersection.
Starting point is 00:02:00 That's right. It's been very exciting. How would you describe the work you do in general terms? What gets you up in the morning? The multidisciplinarity of the work is really exciting. We are studying how to store digital data in synthetic DNA and really sort of bringing science fiction into reality. So it's sort of a dream come true. Let's talk about this idea of the end of Moore's law. What do you think people need to know and understand about Moore's law that kind of gives us a framework for why you're doing what you're doing? Moore's law is an observation that Gordon Moore has made, that you can shrink transistors at a certain rate. Every certain number of years, you have a doubling of the number of transistors. So it's essentially this observation that you get more transistors into the same area of a processor, and so you can do more with that over time. That's based on silicon technology and fabrication methods that are really miniaturizing the devices that are built with silicon, therefore giving us more of those devices.
Starting point is 00:03:12 That's improving the technology so that we can do more with the same amount of material, the same amount of silicon. So there's a point where it's too small. You can't do any more with it? Precisely. We're starting to get to that point. Some would argue that we've already gotten to that point, depending how you look at it. The structures that we're manufacturing are already pretty small. We're at the nanoscale.
Starting point is 00:03:37 And in some cases, depending on the parts of the structures and the devices we're fabricating, it's just a layer of a few atoms that are being built there. And so it's getting quite hard and also less profitable from the scaling perspective to really use these devices. All right. So why can't you just, this is one of those crazy questions, but why can't you just say, sorry, guys, we're done. We're as small as we get. Well, there's still more technology tricks that we can play. And as long as we can play those tricks, we will. But it's getting harder, right? It's getting harder. It's getting more expensive.
Starting point is 00:04:11 And so in addition to really continuing the process to the extent we can, people are looking at alternatives to it. It's getting harder. So we need to be smarter. So we're tasked with creating new technology. And by we, I mean you. So talk for a minute, just for context, about the history of data storage. I mean, not like a box in a cave, but digital data.
Starting point is 00:04:35 What have been our technologies and how has it evolved to where we are now? And then we'll talk about DNA in a second. Sure. Let me start with motivation and the trend, which is the data we are capable of generating is growing exponentially. And even though the devices that can store the data are also growing exponentially, it's different exponentials. And so the gap between what we are creating and what we can store is growing. And that's the trend we're seeing that is concerning. So if we just follow the industry, if today we're capable of storing about 30% of the information we generate,
Starting point is 00:05:11 in only 10 or 12 years, we'll be able to store about 3%. And then in 10 more years, about half of a percent of everything that we generate. So that's a concerning trend. We think that we need a radical new solution to address that problem. Now, in terms of the storage evolution, right, we've had magnetic devices like tape and hard drives. And in fact, you know, my first computer had 35 megabytes of storage. I remember those days. So this is storage technology. And then we have memory technology. So for example, DRAM, that's electronic. And most recently, electronic and silicon storage being developed in flash and SSDs. So all these different technologies, they're based on creating structures that can then hold a bit.
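As a rough illustration of the trend quoted above (about 30% storable today, about 3% in roughly 10 years, about 0.5% ten years after that), the following is a minimal sketch that works out the per-year shrink factor those figures imply. It assumes the storable fraction falls by a constant factor each year and takes the first interval as exactly 10 years; both are simplifications made here for illustration, and the helper names `yearly_factor` and `storable_fraction` are hypothetical.

```python
def yearly_factor(start_fraction: float, end_fraction: float, years: float) -> float:
    """Constant per-year multiplier that takes start_fraction to end_fraction."""
    return (end_fraction / start_fraction) ** (1.0 / years)


def storable_fraction(start_fraction: float, factor: float, years: float) -> float:
    """Fraction of generated data that can be stored after `years` years."""
    return start_fraction * factor ** years


# Figures quoted in the interview; the 10-year spacing is an assumption.
first_decade = yearly_factor(0.30, 0.03, 10)    # 30% -> 3%
second_decade = yearly_factor(0.03, 0.005, 10)  # 3% -> 0.5%

print(f"implied yearly factor, first decade:  {first_decade:.2f}")
print(f"implied yearly factor, second decade: {second_decade:.2f}")
print(f"fraction storable after 5 years: {storable_fraction(0.30, first_decade, 5):.1%}")
```

Under these assumptions the storable share shrinks by roughly a fifth each year, which is one way to read the "different exponentials" remark: the two growth rates differ by about that factor.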
Starting point is 00:06:01 And that's sort of what they have in common. But the scaling of the storage technology is also based on making these structures smaller. And so we're also hitting limits on the storage, not only transistors, but also storage technologies are hitting the same limits of miniaturization. Let's talk about DNA, or deoxyribonucleic acid. You are working on putting data, digital data, on DNA. As a level set, what is DNA from a biological and chemical perspective? All right, so DNA is a big, long molecule. It's essentially a chain of what we call bases, and that's what we describe as A, T, C, and G. Those are called monomers that, put together into a chain, make up
Starting point is 00:06:47 DNA. Each side of that double helix is made of complementary bases. So A complements with T on the other side and C complements with G. From an information storage perspective, we only need to look at one of the sides. The other is redundant because there's a direct correspondence between A to T, C to G. Okay. So if I'm going back to my biology class, I remember DNA being framed as the building blocks of life, the molecules that make up our genome. So DNA carries the information, your genetic information. And in fact, it doesn't just store information. It actually performs, along with enzymes, many of the functions that are needed for life. So, okay, that's a thing. It exists. It's already loaded with instructions to make me
Starting point is 00:07:36 and you and everybody else and everything else, right? Living organisms have DNA. How did this idea of, hey, we could put digital data on there come up? When did that happen? And what was the scientific thinking behind that? So the idea actually dates back from the 60s. And it was right after the structure of DNA started to be more well understood. People looked at it and scratched their heads and thought, you know, it's carrying information about life. Why can't it carry any kind of information? So one could use DNA for that, except at that time we didn't have the technology to fabricate DNA or to read DNA, not at reasonable speeds. Before we get into the technical weeds, why are you interested in using DNA for storage?
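The base pairing described a little earlier (A with T, C with G) is why one side of the double helix is redundant from a storage point of view: either side can be rebuilt from the other. A minimal sketch of that correspondence follows. The function name `complement_strand` is hypothetical, and the sketch ignores strand direction; in real DNA the two sides run in opposite directions, so the paired strand is conventionally written as the reverse complement.

```python
# Base pairing as described in the interview: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of a strand (direction ignored)."""
    return "".join(COMPLEMENT[base] for base in strand)


one_side = "ATCGGA"
other_side = complement_strand(one_side)

print(one_side)    # ATCGGA
print(other_side)  # TAGCCT

# Complementing twice gives back the original, so the second side
# carries no information beyond what the first already holds.
assert complement_strand(other_side) == one_side
```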
Starting point is 00:08:23 What's the premise behind it? We're very excited about DNA for at least three of its properties. The first one is density. So instead of really storing the bits into devices that we have to manufacture, we are really looking at a molecule, storing data in a molecule itself. And so a molecule can be a lot smaller than the devices we're making. Just to give an example, you could store the information today stored in a data center, one exabyte of data, into a cubic inch of DNA. So that's quite tiny. Durability is the next interesting property of DNA. And so DNA, if preserved under the right conditions,
Starting point is 00:09:06 can keep for a very long time, which is not necessarily possible with media that's commercial today. I think the longest commercial media is rated for 30 years. That's tape, still tape. Wow. DNA, if encapsulated in the right conditions, has been shown to survive thousands of years. And so it's very interesting from a data preservation perspective as well.
Starting point is 00:09:29 And then one other property is that now that we know how to read DNA, we'll always have the technology to read it, so now we'll have those readers. If we don't have those readers, we have a real problem. But we'll have those readers forever, as long as there's civilization. So it's not like floppy disks that are in the back of a drawer just gathering dust. We'll really have technology to read it. So let's get into the technical weeds on your current work. What's the science behind DNA storage? There's a workflow that starts, as I understand, with binary code, and then it gets crazy from there. Tell us the workflow behind the science. That's right. So if we go back to
Starting point is 00:10:22 the structure of DNA, it's the chain of the different bases A, T, C, and G. And so the way to think about them is they're a sequence of these bases. And the way to think about bits is digital data is essentially a sequence of bits. And so the science behind it starts with translating those bits into bases. So a very simple way to think about that is A corresponds to 0, 0, C corresponds to 0, 1, G to 1, 0, and T to 1, 1. And so if we have a sequence of bits,
Starting point is 00:10:55 we'll take every two bits and translate into a base. We use a lot more sophisticated methods, but that's the first step. So that leads to the next question. You've got the binary code translated into DNA code. Then what do you do with it? So we know which sequences of DNA we want. Now there's a process of manufacturing the DNA, and there's a process where multiple chemicals are flowed, and the DNA sort of grows, right? And so we know which sequences need to be grown, and those sequences are grown from a surface.
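The simple two-bits-per-base mapping described above (A for 00, C for 01, G for 10, T for 11) can be sketched as follows. This is only the naive first step named in the interview, not the more sophisticated encoding the team actually uses, which is not detailed here; the function names `encode` and `decode` are hypothetical.

```python
# Mapping given in the interview: two bits per base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Translate bytes into a base sequence, two bits at a time."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Translate a base sequence back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

Each byte becomes four bases, so "H" (01001000) maps to CAGA and "i" (01101001) maps to CGGC.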
Starting point is 00:11:25 So once we grow the DNA, we'll remove it from where it was grown and we'll encapsulate it. So encapsulation can be done in a number of different ways. One of the types of encapsulation we've been looking at is an encapsulation developed by ETH Zurich. They were inspired by the fact that DNA survives in bone and wanted to do something that had similar properties, but was easier to handle rather than having to grow bone. And so what they developed was a type of chemistry that will encapsulate the DNA in glass. It's actually silicon dioxide, and they developed nanoparticles that the DNA gets attached to, and then a layer of glass is grown around it. And so that keeps it away from water, which is something that degrades the DNA, and UV light. And when the temperature goes higher,
Starting point is 00:12:20 it protects it from really degrading. So what happens if I want to access my data then? Let's say I gave you a video and you say, okay, I'm going to put it on DNA and I'm going to store it. And I say, Karin, I want my video back. What do you do? So when your video was stored, it was stored in a certain location. So it's just organized in a spatial way, right? So there's some way to retrieve the actual molecules encapsulated. The next step is to de-encapsulate them. So we need to remove all that glass that was added for stability and extract the DNA. Once that's done, and that's a first part of a random access process, it's a hierarchical process. First, you physically find the smaller set of DNA molecules you're interested in. But
Starting point is 00:13:07 then within that, there are many molecules that may belong to different movies that you've stored. And you're just interested in one movie, you don't want to read the whole collection, right? And reading the whole collection would actually be wasteful. And so we would like the ability to further select that particular movie you want to watch. And it turns out that there's a process to do that. We do it chemically. And so it's just another reaction that comes from nature, actually, and is repurposed for this goal, for this purpose. So the paper you wrote about it that explains how you do that.
Starting point is 00:13:44 Can you give us an executive summary of how you've gone about that, that chemical process? The chemical process is really borrowed from the biotech industry. It's a pretty standard process called polymerase chain reaction, and it's the process that copies DNA. However, you can use it in a way that will just copy sequences that have a certain sequence at the end. And so we select the right sequences at the end and then use an enzyme to do the DNA copy. Now, that enzyme is selective. It will only copy pieces of DNA that have a little bit of a double helix. And so we store the DNA as a double helix. To do the process, we separate the two sides of the helix
Starting point is 00:14:31 and then only attach a little tail to each of the sides so that that tail matches the object that we're interested in reading. And so the enzymes will only copy the sequences that have that specific tail. So we can select which of the molecules are copied. Now, the other molecules storing the rest of the videos are still there, but in much smaller quantity. PCR, this polymerase chain reaction, actually copies the DNA exponentially. And so we can very quickly get to a lot of copies of the data we're interested in and just a few copies of other data we're not interested in. So that when we sample to read that information, most molecules we're sampling are molecules that
Starting point is 00:15:17 we're interested in. What's the process of identifying a molecule? The technology to read DNA is sort of indirect reading, if you will. It's indirect observation of the molecules. And so there are a number of different technologies to do that. Some of them use optical solutions. And so they multiply the molecules and then make them glow in different colors just based on the sequences that they are made of. There's another technology that is electronic. And so you drag the DNA through a nanopore. It's literally called a nanopore because it's a pore that's nano size. And the DNA goes through and it causes some
Starting point is 00:16:01 current disturbances through that nanopore that are then read and sensed, and then you can tell what the sequence of DNA is. That's, again... So, yeah, it really does, because we have some frame of reference. But when we hit petabyte, exabyte, zettabyte, and I just read the new one, yottabyte, 10 to the 24th power, it's just mind-boggling territory. So here's the funny part. In your research, it was big news that you stored 200 megabytes on strands of DNA. So I think you're well past that now, like around 800 megabytes. But it gives us some indication of how hard it is to
Starting point is 00:17:00 do what you do. I mean, to be excited about 200 megabytes. I remember when I had a 200 megabyte hard drive, but that was in the 80s, right? So what are the challenges that researchers face? If 200 megabytes is a big deal, that indicates that there's still some challenges to getting to scale. That's right. So the challenges are really improving the speeds at which we can read the DNA. So we don't need very low latency. We can wait for the information to come out. But for it to be a practical solution for storage, we want it to give us a high rate of bytes per second. Those rates today are so low.
Starting point is 00:17:41 And that's why the 200 megabytes is a big deal. It's, you know, the best we can do with technology today. But just to put it in perspective, just one year or two years earlier, in our lab, we were working with 200 kilobytes of data, which is a thousand times smaller. And so we've been experiencing improvements already just by doing the research. All of these technologies had to go through this milestone to get to full deployment today. So we're on the way there. You're on track. Yes.
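Returning briefly to the random access step described earlier: polymerase chain reaction copies only the molecules carrying the chosen tail, and it copies them exponentially, so after a few cycles a random sample is dominated by the wanted object. The following is a minimal, idealized sketch of that effect. It assumes perfect doubling each cycle and models the tail as a simple prefix match; real PCR uses primers at both ends and is not perfectly efficient. The names `amplify` and `target_share` and the example sequences are hypothetical.

```python
from typing import Dict


def amplify(pool: Dict[str, int], primer: str, cycles: int) -> Dict[str, int]:
    """Double the count of every strand tagged with `primer`, once per cycle."""
    return {
        strand: count * 2 ** cycles if strand.startswith(primer) else count
        for strand, count in pool.items()
    }


def target_share(pool: Dict[str, int], primer: str) -> float:
    """Fraction of molecules in the pool that carry the primer tag."""
    total = sum(pool.values())
    tagged = sum(count for strand, count in pool.items() if strand.startswith(primer))
    return tagged / total


# Three stored objects, each tagged with its own 4-base prefix, 5 copies each.
pool = {"ACGTTTGACA": 5, "GGCACATTAG": 5, "GGTAAACCGG": 5}

before = target_share(pool, "ACGT")
after = target_share(amplify(pool, "ACGT", cycles=10), "ACGT")

print(f"share of wanted molecules before: {before:.1%}")
print(f"share of wanted molecules after:  {after:.1%}")
```

The unwanted molecules are still present after amplification, just heavily outnumbered, which matches the point made in the interview that most of what gets sampled for reading is the data of interest.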
Starting point is 00:18:14 Back to the challenges. So getting the throughputs up is a challenge and then lowering costs. So DNA manufacturing today is still quite costly. But for both of these challenges, they sort of go hand in hand. If you get the speed up, you also get the cost down. We see no fundamental physical reason why you couldn't really scale it to the level of being acceptable or being suitable for DNA data storage. So what are the roadblocks then getting to that speed? One of the big roadblocks is really scaling the structures,
Starting point is 00:18:51 but also automating the process. So writing DNA and reading DNA today is automated, but there are many steps in between. So for example, the encapsulation and the preparation, the random access that we were talking about, those are not automated yet. We really need to automate the whole process. So automation for us is also a big challenge and a big deal. And you're working on that? Yes, we're working on that as well.
Starting point is 00:19:14 We've talked about quite a few factors that make this more viable. And I would imagine, since we're talking about computer science involved in the biology and systems science, that algorithms play a role here. Could you talk a little bit about what's going on in that arena? Yes, absolutely. So, in fact, algorithms do play a big role here. And we have a fantastic team of coding theorists working here at Microsoft Research on this problem and on the project itself. So they developed algorithms that really reduced the effort to recover the data from DNA. One of the big contributions there was to encode the data in a way that once we read it
Starting point is 00:20:02 on the way out, we need to process minimal amounts of information to really recover the data. So that was also a big contribution in that paper. What's the ultimate goal of DNA storage? What's the aspiration? If you succeeded beyond your wildest dreams, what would success look like to you? Success would look like everyone in the world
Starting point is 00:20:23 has access to DNA data storage. And so really, at Microsoft, our mission is to empower every person and organization to achieve more. With DNA, we would empower every person and organization to store more. Maybe they'll have that as one of the taglines. But right now, where are we in that quest? We are just starting, right? So we're looking at really wrapping our heads around how to build an end-to-end system that will allow us to achieve that goal. We're first targeting archival storage applications in the data center. And then we'll see where it goes.
Starting point is 00:21:06 We think that that's probably the lowest resistance path for DNA data storage. And we think that there may be improvements in the future that will allow us to target different scenarios as well. So that's interesting because there are different markets and different reasons for people storing things, right? And so the archival storage would be more like, I don't need this right away and I can tolerate a gap versus I need this on my computer because I have to deliver it to my boss tomorrow morning. How does DNA play in that realm right now? Yeah, so we look at DNA at least currently or initially as an archival storage technology. And so in addition to the fact you don't need it immediately, you need to store it for a long time
Starting point is 00:21:49 and hopefully not taking too much space, right? So that's what we see as the great match between the DNA technology and the needs of archival storage. You were featured in Fast Company magazine as one of the 100 most creative people in business in 2016, which is super cool. But something in that write-up caught my eye. They projected that we'd hit 16 zettabytes of data by 2017, which was last year. I imagine we're past that now. The premise behind all this is that we need to preserve the world's data. My question then is, is everything
Starting point is 00:22:23 we're doing worth preserving? When do we cross the line to being people on that TV show Hoarders, where everyone feels sorry for people who can't throw anything away? And to be honest, I'm kind of like that person digitally. I have 32,000 photos on my phone. Well, it's in the cloud, right? Let's be honest. I kind of feel like I need help more than a bigger box. Yes. So there is a lot of information we don't need to store, even though I'd argue that digital archaeologists of the future would love to have access to all your 32,000 pictures. No, they wouldn't. Most of them are really bad. But they would really understand how we live today, right? Cultural exploration. Cultural exploration.
Starting point is 00:23:05 That's right. And that's why we still study past civilizations, right? We want to understand how they lived and, you know, how life was at that time and learn from that, right? But also, you know, there are pieces of data that we are today throwing away that may be very helpful, actually. We talk to customers. We talk to different segments of the market who would like to store more information. I just came back from a meeting at the Library of Congress. They would love to store more information if it weren't that expensive, right, to do it.
Starting point is 00:23:37 I actually love that answer because my question was real. And I think what you just said is an application or a reason for exploring this new technology. The bigger question would be, how then will future digital archaeologists sort through everything to make discoveries and inferences and things? That's a great question. So there's a whole field of machine learning and AI that's evolving today. There's quite a lot of advances happening in the field right now. They do help us sort through all the data that we're generating. There's a limit to that too. So that is really, I think, a question that we'll have to do a lot more research to really understand how to deal with it. I'd say the field of data organization is also a pretty interesting one. If we already have structure on the data, then we might as well preserve that. But we need storage to preserve that information as well. Talk about the people you're working with. You have some really interesting partners, both organizationally and individually, and it's bringing some really interesting diversity to your team that I think probably benefits both the work and the industry.
Starting point is 00:24:53 Yes, absolutely. So the DNA Data Storage Project at Microsoft, we've been collaborating with University of Washington since the beginning of the project. And we also work with partners like Twist Bioscience, and we have a collaboration with ETH Zurich. But overall, since it's such a multidisciplinary project, we need people with different backgrounds offering different perspectives. So we have a very diverse team. And one of the interesting things is that once you start a project like this, people are using different vocabularies and there is a little more work that goes into starting to communicate well and as a team. But once that initial obstacle is overcome, things happen so much better just because people are offering different perspectives. There's a lot of diversity, not only personally, but also how people are
Starting point is 00:25:45 thinking and everybody's learning from each other. We have all the way from coding theorists to computer architects and engineers working on mechanics and molecular biologists. And so completely different backgrounds, people from all over the world just working in an exciting multidisciplinary project and learning from each other. That makes it super exciting. Let's talk about you for a minute. What got you interested in what you're doing personally, and how did you end up at Microsoft Research? Well, I am a forever learner. I think what led me to get my PhD was really, I didn't want to stop learning. And so I made it kind of my profession. So that's what led me to keep pursuing more and more degrees until I ran
Starting point is 00:26:33 out of degrees to pursue, maybe an MBA someday. But being a researcher allows me to continue learning and, in fact, learning and then putting together things that I've learned from different areas and different perspectives into the same project. And, as I mentioned earlier, making science fiction into reality, it sort of couldn't be more exciting. So how did you end up here? I don't know. You actually had a job interview at some point, right? Yes. I don't remember. It's one of those things.
Starting point is 00:27:14 You bump into somebody in an airport and, you know, later on they contact you and say, hey, I'm forming a new group at Microsoft. Would you like to come interview? Did that happen? That did happen. Do you remember who you bumped into? Yeah, it was the manager who hired me into Microsoft, Doug Burger. All right.
Starting point is 00:27:33 So one of the frameworks I read about this work in is an article that talked about how biotech has benefited tremendously from the advances or the progress in silicon technology. And to some degree, they suggested that why we're doing this is it's time for biotech to give back. In other words, for computer scientists to start putting biomolecules in their computer architecture. So what's next in the field here? What are the exciting lines of research that emerging researchers might be interested in? One of the things we're starting to look into and are really excited about is once you've stored data into DNA, there is the question of what else can you do with it as a molecule? Can you perform any operations over them at the molecular level, not just selecting them and then reading them?
Starting point is 00:28:28 And it turns out there's a whole field of DNA nanotechnology that looks at computing with these molecules. So we even have at Microsoft Research Cambridge a group that has looked at biocomputation. And so we're starting also to see how we can use such techniques and the properties of DNA to perform operations over the data as it's stored in DNA. As we close, Karin, what would you like people to know about your research that they might not know? Yeah, so I think the thing to be aware of is that this is an emerging technology and we're starting to work towards really building systems and wrapping services around it.
Starting point is 00:29:10 It will take some time to get there, but I think what's cool is that, you know, there's a company like Microsoft who's really willing to invest and really think about, you know, how are we going to build solutions for the IT industry in the future. And so I'm very happy and feel very fortunate to be working on this project.
Starting point is 00:29:31 Karin Strauss, thank you for joining us today. Thank you. To learn more about Dr. Karin Strauss and the biological future of digital data storage, visit Microsoft.com slash research.
