CyberWire Daily - Synthesized DNA Malware with Peter Ney. [Research Saturday]

Episode Date: October 14, 2017

Peter Ney is a PhD candidate in the Allen School of Computer Science and Engineering at the University of Washington where he is advised by Professor Tadayoshi Kohno. His current research is focused on understanding computer security risks in emerging technologies like DNA synthesis and sequencing and the new threats posed by maliciously crafted, synthetic DNA. He and his team found that security of DNA processing programs is poor and show with a proof-of-concept that it is possible to attack computer systems with adversarial synthetic DNA.

Transcript
Starting point is 00:00:00 You're listening to the Cyber Wire Network, powered by N2K. Hello, everyone, and welcome to the CyberWire's Research Saturday.
Starting point is 00:01:36 I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down threats and vulnerabilities and solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.
Starting point is 00:02:19 DNA is a biological molecule that's designed to store information. All living things have DNA. That's Peter Ney. He's a Ph.D. candidate in the Allen School of Computer Science and Engineering at the University of Washington, where he's advised
Starting point is 00:03:30 by Professor Tadayoshi Kohno. His current research is focused on understanding computer security risks in emerging technologies like DNA synthesis and sequencing, and the new threats posed by maliciously crafted synthetic DNA. Along with his colleagues at the University of Washington, he's one of the authors of the paper Computer Security, Privacy, and DNA Sequencing, Compromising Computers with Synthesized DNA, Privacy Leaks, and More. DNA is made up of four types of molecules, adenine, cytosine, guanine, and thymine, which we just shorten to A, C, G, and T. And so DNA molecules are just basically a linear sequence of these A's, C's, G's, and T's. And so you can think of it
Starting point is 00:04:15 as being very similar to digital data, but instead of having binary data like zeros and ones, DNA is actually made up of four different types: A's, C's, G's, and T's. So kind of like base four. DNA sequencing is just the process of when you're given a particular DNA molecule, you want to know what is the actual order of these bases in the DNA strand. That's pretty much at a high level what DNA sequencing is. It's been around for about 40 years since the early 1970s. And DNA sequencing at that time was a fairly slow and expensive process. But all this changed in the early 2000s with the development of a new class of technologies, which are kind of broadly referred to as next generation sequencers. And unlike their predecessors,
Starting point is 00:05:05 these sequencers are actually capable of sequencing a massive quantity of DNA all in parallel. So you can do things like sequence an entire human genome, or maybe 100 human genomes all at once. And so what's happened is that DNA sequencing has gotten really, really cheap since this started. And so in about 2001, it cost around $100 million to sequence one human genome. And today we can do it for about $1,000. So contrast that for me. I think many of us are familiar with some of the consumer DNA sequencing services, you know, that for $100, you can get your DNA sequenced and find out your genetic background. And how does that compare to what you're talking about with this kind of sequencing? When I'm talking about sequencing,
Starting point is 00:05:50 I'm saying given a DNA molecule, I want to know every single base in the order of all the bases in the molecule. So this way you can think of this like, you know, proper full sequencing. I see. There are other kinds of techniques that kind of just sequence little individual bases, but not all of the bases in a DNA molecule. And so that's, for example, if you've heard of 23andMe, that's the kind of sequencing they do. What I'm talking about is typically referred to as full genome sequencing. So it's actually sort of trying to sequence, you know, every single base in the human genome. And how much data are we talking about with a full sequence of, say, the human genome?
Starting point is 00:06:24 So the human genome itself is about 4 billion bases long. So typically when you generate sequencing data, you have lots of redundancy. And so you sequence the same parts of the genome over and over again, maybe upwards of 20, 30 times. So typically you're talking about generating hundreds of billions of DNA bases, which in terms of storage might be upwards of 20 gigabytes or more. And some of the really high throughput sequencing machines, which have been developed, can do, I think they might generate terabytes of data in a single sequencing run. So basically, given that DNA is a way to store information, and these systems are taking the biological thing that is DNA and turning them into computer data. Describe to us your approach for trying to exploit that.
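As a rough illustration of the "base four" comparison and the storage figures mentioned in this exchange, here is a minimal Python sketch. The two-bits-per-base mapping (00 to A, 01 to C, 10 to G, 11 to T) and all function names are assumptions made for this example; they are not taken from the episode or from the paper. The size estimate counts only the bases themselves, so it should be read as a lower bound, since real sequencing files usually also carry per-base quality scores and other metadata.

```python
# Illustrative sketch only: the mapping and all names below are assumptions.
BITS_PER_BASE = 2          # four symbols (A, C, G, T) need log2(4) = 2 bits
BASE_FOR_BITS = "ACGT"     # assumed mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_bases(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(
        BASE_FOR_BITS[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def bases_to_bytes(bases: str) -> bytes:
    """Decode a base string produced by bytes_to_bases back into bytes."""
    if len(bases) % 4 != 0:
        raise ValueError("base string length must be a multiple of 4")
    out = bytearray()
    for start in range(0, len(bases), 4):
        value = 0
        for base in bases[start:start + 4]:
            # str.index raises ValueError for anything outside A, C, G, T
            value = (value << 2) | BASE_FOR_BITS.index(base)
        out.append(value)
    return bytes(out)


def sequencing_run_gigabytes(genome_bases: int, coverage: int) -> float:
    """Lower-bound size of the raw bases alone, in gigabytes (10**9 bytes)."""
    total_bases = genome_bases * coverage
    return total_bases * BITS_PER_BASE / 8 / 1e9


if __name__ == "__main__":
    print(bytes_to_bases(b"A"))                           # byte 0x41 = 01 00 00 01
    print(sequencing_run_gigabytes(4_000_000_000, 20))    # figures quoted above
```

With the figures quoted in the conversation (about 4 billion bases, sequenced 20 to 30 times over), this arithmetic gives roughly 20 to 30 gigabytes for the bases alone, which is in line with the "upwards of 20 gigabytes" estimate.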
Starting point is 00:07:12 I would just add on to what you just said, which is that, you know, really what you can, when you think about DNA sequencing, what's actually happening is that it's this kind of intermediary between biological data and digital data. And so, you know, I think we've known for a long time that anytime computer systems process digital data, there's the possibility that that data could be used maliciously to target vulnerabilities in that software. And so since DNA sequencing is just taking these biomolecules and turning them into digital data, we were really wondering, can we actually start by making particular biological DNA samples so that when they're sequenced, they would actually end up as malicious sequencing data files? So how did you determine what you were going to target? Am I right? Reading through your research, did you sort of artificially
Starting point is 00:08:05 set up some vulnerabilities within the DNA sequencing software? That's correct. So really, our research was kind of two phases. The first phase, we were interested in kind of a proof of concept to see whether you could actually, starting all the way with DNA molecules, end up with sequencing files that would target, say, a vulnerability that was discovered in the software. So we were really more interested in kind of trying to understand the limitations of both generating artificial DNA molecules and the sequencing process. And then later on, we actually did kind of a security analysis of existing DNA analysis utilities.
Starting point is 00:08:47 We have the ability, it's called de novo DNA synthesis. So we have the ability to make completely artificial DNA molecules that don't derive from biological sources. So in some sense, you can think of us as having the ability to write kind of arbitrary DNA sequences. The problem is, is that both our ability to make DNA molecules is somewhat constrained. You can't just make any DNA sequence. There are limitations there, as well as the DNA sequencing process has lots of noise and randomness that happens just inherent in how sequencing works. And so it's not totally clear up front whether you can actually have
Starting point is 00:09:26 enough control over the information that's flowing to digital data files to actually create malware. And so that was really our research question. In terms of creating DNA from whole cloth, did you have to deal with the fact that the scanning software was expecting to see certain things? Yeah. So at the end of the day, what you're still getting out of the sequencer is going to look like basically DNA sequences. But the thing is that these utilities are doing all sorts of analysis on this data and all sorts of complicated algorithms to manipulate it in particular ways. And so, for example, you might say generate sequences so that when they're sequenced, a particular algorithm gets into a weird state or processes it in a
Starting point is 00:10:12 particular way so that it would maybe say it's processing data that's larger than it would expect. So you can think of like buffer overflow vulnerabilities or different things like that. And I would also point out that you might ask, like, sort of what kind of analysis are you doing too? And I think the idea is that the data that comes out of these sequencers is really quite raw and isn't very useful by itself because what you're actually doing is you're taking these long DNA molecules,
Starting point is 00:10:38 say, like a full chromosome from a human genome, and you actually break it into little tiny pieces. And so what you're actually doing are sequencing, you know, hundreds of millions of really short DNA molecules. And then to actually kind of reconstruct larger DNA sequences, or to ask particular biological questions, you're going to do all sorts of analysis and complicated algorithms using these short DNA fragments that you've sequenced. And one of the things you all discovered in your research was that this software that is used for the DNA analysis was lacking some basic security best practices.
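The buffer overflow scenario described above, where a program ends up processing a read that is longer than it expects, is the kind of problem that basic input validation is meant to catch. The following Python sketch shows one way such a check could look. The length limit, the allowed alphabet, and the function name are hypothetical choices for this example and do not come from any of the tools discussed in the episode.

```python
# Illustrative sketch only: limit, alphabet, and names are assumptions.
MAX_READ_LENGTH = 300                 # hypothetical upper bound for a short read
ALLOWED_BASES = frozenset("ACGTN")    # N is commonly used for an unknown base


def validate_read(read: str, max_length: int = MAX_READ_LENGTH) -> str:
    """Return the read unchanged if it passes basic checks, else raise ValueError."""
    if not read:
        raise ValueError("empty read")
    if len(read) > max_length:
        # Reject instead of copying into a fixed-size buffer downstream.
        raise ValueError(f"read length {len(read)} exceeds limit {max_length}")
    unexpected = set(read) - ALLOWED_BASES
    if unexpected:
        raise ValueError(f"unexpected characters in read: {sorted(unexpected)}")
    return read


if __name__ == "__main__":
    print(validate_read("GATTACA"))
    try:
        validate_read("A" * 1000)
    except ValueError as error:
        print("rejected:", error)
```

Python itself is memory safe, so this only illustrates the validation step. In C or C++ utilities the same check would need to happen before any copy into a fixed-size buffer.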
Starting point is 00:11:15 Yeah, that would definitely be true. I think it's helpful, too, to understand who is writing this software. This whole space has been changing so much that a lot of the utilities that are used by scientists to analyze DNA data have actually been written by either biologists or maybe people with some bioinformatics background. But a lot of them (some do, but not all) don't have kind of formal software development experience. And a lot of these programs are written in languages like C and C++ that, if you're not careful, oftentimes contain vulnerabilities. You find that in some sense, because there probably hasn't been much adversarial pressure
Starting point is 00:11:56 on these programs so far, that they are somewhat lacking in security. So we found, for example, many buffer overflow vulnerabilities in these programs, which is mostly what we were looking for, but also that they were using a lot of function calls that are known to create security problems. And just looking at a bunch of metrics, they seem like a broad class of these programs don't seem to be written with security in mind. Yeah, it's an interesting lesson, I think, for security professionals in particular, in that, you know, it seems like this was an attack surface that no one had ever really considered before. Yeah, I think so. And I think people have probably thought, well, maybe the traditional security problems you get, like just sending, say, malicious files back
Starting point is 00:12:41 and forth, maybe even DNA sequencing data files. I think people have thought about that. But it is interesting that, you know, anytime you have information that eventually ends up in a computer system, you have to consider who's generating that data, where it's coming from, and try to design programs that are robust to it. And so I do think there is kind of a broader lesson, which is that anytime you're taking data, you need to think about security. In some sense, DNA is very similar to digital data because it's discrete, because there's only, you know, the four bases. So it actually, there's a pretty close analog to digital data. And so you have a lot of control over DNA and what you can create. And so it gives you a lot of control over the types of input you can send to these systems. What are some of the bad things that people could potentially do,
Starting point is 00:13:26 you know, exploiting the things that you all have learned? I would just say, I would just start by saying that DNA analysis is getting fairly ubiquitous. And so we're seeing DNA sequencing being used in all sorts of domains like medicine, so genetic testing, personalized medicine, forensics, the new fields of sort of bioengineering, genetically modified organisms. And so there's a lot of different assets and things that attackers might be interested in manipulating or stealing or modifying. So you could imagine someone could use DNA as a vector to just steal sensitive sequencing data. So this could contain things like intellectual property
Starting point is 00:14:05 or just DNA sequences from individuals. They might also be able to modify in malicious ways, say genetic tests. So could they, you know, if you control a system that processes DNA data, you could use that to manipulate, say, genetic testing to make people look like they have genetic diseases they don't actually have, or the opposite, mask known genetic diseases. I think forensics is very interesting because if someone is able to create DNA that they know will eventually be sequenced, so you could think of it like a crime scene, and then that data is then sequenced through some particular workflow and processed through vulnerable programs, then you could imagine someone manipulating forensic systems, for example. We're going to enter a world in the near future where
Starting point is 00:14:54 pretty much everyone's genome is going to be sequenced. Sequencing is going to be a very routine procedure, especially as the price of DNA sequencing continues to drop. One of the things you discovered was a side effect, and that was information leakage. Can you describe that for us? Yeah. So, the way these machines work to get so cost-effective is that you typically don't just sequence one sample at a time. You actually sequence many samples at a time. And so, what actually happens is you take, let's say you have five different people whose genomes you want to sequence, you might take these five individuals and pool their genetic data together and sequence it all at once. But to actually figure out whose
Starting point is 00:15:37 sequence goes with which person, you actually, before sequencing, all of the samples have a unique DNA barcode that's added to each sample so that at the end of sequencing, you can actually kind of figure out which DNA went with which person. We kind of call this sample multiplexing. The problem is our ability to sort of demultiplex. So you pool all these samples and then, you know, try to separate them out, try to separate all the sequencing data at the end. The problem is that there is sort of a small but nonzero amount of data leakage that
Starting point is 00:16:12 happens between the samples. And so this is kind of, you can think of this like a side channel. So, you know, if an attacker is capable of sequencing a sample alongside other DNA samples, they might actually be able to influence those other samples in particular ways. So for example, if there is vulnerable sequencing software that's going to process this data, they could push malware. Or the other way, they could actually read data from other samples. So because we know that that data from other samples will end up in files that belong to the malicious actor. So in some sense, you have the ability to both kind of pull data and push data into other sequencing data files.
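The multiplexing and demultiplexing steps described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the barcode is modeled as a fixed-length prefix of each read, and the sample names, barcodes, and function name are all hypothetical. Real sequencing workflows handle barcodes in their own ways, and this sketch does not reproduce any particular tool.

```python
# Illustrative sketch only: barcode layout, names, and data are assumptions.
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by the sample whose barcode matches each read's prefix."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


if __name__ == "__main__":
    barcodes = {"AAAA": "alice", "CCCC": "mallory"}
    reads = [
        "AAAAGATTACA",   # alice's DNA with alice's barcode
        "CCCCTTTTTTT",   # mallory's DNA with mallory's barcode
        "AAAATTTTTTT",   # mallory's DNA that ended up with alice's barcode
    ]
    print(demultiplex(reads, barcodes, barcode_length=4))
```

In this toy data the third read stands in for the leakage described above: a sequence that originated in one sample is sorted into another sample's output because it carries the wrong barcode. The same mechanism works in both directions, which is why it can be thought of as a way to both push data into and pull data out of other samples' files.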
Starting point is 00:16:51 And in our experiments, we were able to find that there was some information leakage. So it's not clear how imminent this threat is, but I think it's something definitely to consider going into the future. Was that information leakage random or was it something that you were able to control? Yeah, at this point, it's fairly random. The thing is that if the attacker is able to make a particular DNA sequence and so that their entire sample was, say, made up of just one DNA sequence, then in some sense, while the particular DNA that it's bled over, you can't control that. But since it's all made up of one sequence, you'll end up knowing what sequence is going to move into the other samples. You might have control over it, but it is still a fairly random process.
Starting point is 00:17:35 So your ability to custom sequence DNA, is that at all a limiting factor in terms of access to that or price of that? Sequence or create? To create. Create, yeah. So it's a synthesis. I'm sorry. I'm sorry. The synthesis, yeah. So it is really easy, actually. So we actually used an outsourced synthesis service. There are many of these companies. And what you do is you basically go into their web form. So they have a web form with a big open box. You just paste in the DNA sequence you want to order and they'll ship it to you. So no, and it costs about $100 to order our sequence. Have these people never seen any 1950s science fiction movies? And so these synthesis services do look for, it's interesting though, they do look for known malicious biomolecules.
Starting point is 00:18:29 So say virus sequences. Interesting, yeah. So there are certain types of sequences they do look for, but they're certainly not looking for sequences that might contain computer code or computer data. What's been the reaction so far to your research? You know, I think it's been pretty much what we expected, which is, in some sense, what we demonstrated is really still a proof of concept. There were lots of challenges we encountered.
Starting point is 00:18:53 It was still really challenging just to make it work in sort of the most ideal circumstances. So we don't think it's sort of an imminent threat. But I do think we've gotten people to start thinking about, hey, we're doing all this DNA sequencing, we're sequencing all this really important data, we're going to be doing a lot more sequencing in the future. The technology is changing rapidly. We really need to start thinking about these sort of novel sort of vectors that data can start moving into these computer systems. And so I think it's really more just letting people start thinking about this and not so much that it's sort of imminent right now, but I'm hopeful that in five or ten years,
Starting point is 00:19:32 maybe when these threats are maybe more imminent, that we'll have at least had five or ten years to start shoring up the security of software that's doing all this DNA processing before more bad things happen. And one thing I'd mention too, which is really cool, there's some really interesting use cases of DNA sequencing that are on the horizon that really make this, I think, more relevant. So one really cool use of DNA sequencing is actually using DNA as a method to store digital data. And
Starting point is 00:20:03 the reason you would do this is because DNA is very stable and can last for hundreds or thousands of years. And it has very, very high density. So I've heard, for example, that you could store all the digital data in the world inside of a car if it was stored in DNA. So really what's happening is that we're actually going to be continuing to blur the line between biological and digital data. And so I think there's going to be some really interesting threats and vectors moving into the future. And what are your thoughts in terms of what needs to be done to protect against the types of exploits that you all have explored? I think the first and most obvious is just common security best practices: don't have buffer overflow vulnerabilities, do security audits of your
Starting point is 00:20:45 software, do some input validation. So kind of routine security practices and start thinking about DNA sequencing software in the same way people think about internet services, web servers, things like that. And I think that would go a long way. Because right now, I think these kinds of attacks are challenging, but the software security is so poor that they might actually be possible going into the future. So I think that's kind of, at least in my opinion, sort of like the first step for doing anything else. Our thanks to Peter Ney from the University of Washington for joining us. If you want to read the complete paper, it's available online.
Starting point is 00:21:26 It's called Computer Security, Privacy, and DNA Sequencing, Compromising Computers with Synthesized DNA, Privacy Leaks, and More.
Starting point is 00:22:07 The Cyber Wire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe, where they're co-building the next generation of cybersecurity teams and technologies. Our amazing Cyber Wire team is Elliot Peltzman, Puru Prakash, Stefan Vaziri, Kelsey Bond, Tim Nodar, Joe Kerrigan, Carol Terrio, Ben Yellen, Nick Valecki, Gina Johnson, Bennett Moe, Chris Russell, John Petrick, Jennifer Iben, Rick Howard, Peter Kilpie, and I'm Dave Bittner. Thanks for listening.
