CyberWire Daily - Synthesized DNA Malware with Peter Ney. [Research Saturday]
Episode Date: October 14, 2017
Peter Ney is a PhD candidate in the Allen School of Computer Science and Engineering at the University of Washington where he is advised by Professor Tadayoshi Kohno. His current research is focused on understanding computer security risks in emerging technologies like DNA synthesis and sequencing and the new threats posed by maliciously crafted, synthetic DNA. He and his team found that the security of DNA processing programs is poor and showed with a proof of concept that it is possible to attack computer systems with adversarial synthetic DNA.
Transcript
You're listening to the CyberWire Network, powered by N2K.
Hello, everyone, and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down threats and vulnerabilities and solving some of the hard problems of
protecting ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
DNA is a biological molecule that's designed to store information. All living things have DNA.
That's Peter Ney. He's a Ph.D. candidate in the
Allen School of Computer Science and Engineering at the University of Washington, where he's advised
by Professor Tadayoshi Kohno. His current research is focused on understanding computer security
risks in emerging technologies like DNA synthesis and sequencing, and the new threats posed by
maliciously crafted synthetic DNA. Along with his colleagues at the University of Washington,
he's one of the authors of the paper "Computer Security, Privacy, and DNA Sequencing:
Compromising Computers with Synthesized DNA, Privacy Leaks, and More."
DNA is made up of four types of molecules, adenine, cytosine, guanine, and thymine,
which we just shorten to A, C, G, and T. And so DNA molecules
are just basically a linear sequence of these A's, C's, G's, and T's. And so you can think of it
as being very similar to digital data, but instead of having binary data like zeros and ones, DNA is
actually made up of four different types: A's, C's, G's, and T's.
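To make the comparison with digital data concrete, here is a minimal Python sketch of one way to map bytes to bases and back, using two bits per base. The particular assignment of bits to bases and the function names `bytes_to_dna` and `dna_to_bytes` are assumptions chosen for illustration; this is not the encoding used in the research discussed here.

```python
# Hypothetical 2-bit mapping; any fixed assignment of the four bases would work.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(seq: str) -> bytes:
    """Decode a base string produced by bytes_to_dna back into bytes."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)


print(bytes_to_dna(b"A"))    # CAAC
print(dna_to_bytes("CAAC"))  # b'A'
```

With this mapping, every byte becomes exactly four bases, which is the sense in which a DNA strand behaves like a base-four string.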
So kind of like base four. DNA sequencing is just the process of when you're given a particular DNA
molecule, you want to know what is the actual order of these bases in the DNA strand. That's
pretty much at a high level what DNA sequencing is. It's been around for about 40 years since the early
1970s. And DNA sequencing at that time was a fairly slow and expensive process. But all this
changed in the early 2000s with the development of a new class of technologies, which are kind
of broadly referred to as next generation sequencers. And unlike their predecessors,
these sequencers are actually capable of sequencing a massive quantity of DNA all in parallel. So you
can do things like sequence an entire human genome, or maybe 100 human genomes all at once.
And so what's happened is that DNA sequencing has gotten really, really cheap since this started.
And so in about 2001, it cost around $100 million to sequence one human
genome. And today we can do it for about $1,000.
So contrast that for me. I think many of us are
familiar with some of the consumer DNA sequencing services, you know, that for $100, you can get
your DNA sequenced and find out your genetic background. And how does that compare
to what you're talking about with this kind of sequencing?
When I'm talking about sequencing,
I'm saying given a DNA molecule, I want to know every single base and the order of all the bases
in the molecule. So you can think of this as, you know, proper full sequencing.
I see.
There are other kinds of techniques that just sequence little individual bases,
but not all of the bases in a DNA molecule. And so, for example, if you've heard of 23andMe, that's the
kind of sequencing they do. What I'm talking about is typically referred to as full genome sequencing.
So it's actually trying to sequence, you know, every single base in the human genome.
And how much data are we talking about with a full sequence of, say, the human genome?
So the human genome itself is about 4 billion bases long. So typically when you generate sequencing data, you have lots of redundancy.
And so you sequence the same parts of the genome over and over again, maybe upwards of 20, 30 times.
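As a rough back-of-the-envelope sketch of what that redundancy implies, the Python below multiplies genome length by coverage and converts the result to a storage lower bound. The genome length and the 30x coverage are the figures quoted in the interview; the 2 bits per base is an assumption that ignores the per-base quality scores and metadata that real sequencing files carry, so actual files would be larger. The function names are hypothetical.

```python
def sequenced_bases(genome_length: int, coverage: int) -> int:
    """Total bases read when every position is sequenced `coverage` times."""
    return genome_length * coverage


def min_storage_bytes(num_bases: int, bits_per_base: int = 2) -> int:
    """Lower bound on storage, assuming bases only and no quality scores."""
    return (num_bases * bits_per_base + 7) // 8


genome_length = 4_000_000_000  # genome size as quoted in the interview
coverage = 30                  # upper end of the redundancy mentioned

total = sequenced_bases(genome_length, coverage)
print(total)                           # 120000000000
print(min_storage_bytes(total) / 1e9)  # 30.0 (gigabytes)
```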
So typically you're talking about generating hundreds of billions of DNA bases, which in terms of storage
might be upwards of 20 gigabytes or more. And some of the really high throughput sequencing machines,
which have been developed, can do, I think they might generate terabytes of data in a single
sequencing run.
So basically, DNA is a way to store information, and these systems are taking the biological thing that is DNA and turning it into computer data.
Describe to us your approach for trying to exploit that.
I would just add on to what you just said, which is that when you think about DNA sequencing, what's actually happening is that it's this kind of intermediary between biological data and digital data. And so, you know, I think
we've known for a long time that anytime computer systems process digital data, there's the
possibility that that data could be used maliciously to target vulnerabilities in that software.
And so since DNA sequencing is just taking these biomolecules and turning them into digital data, we were really wondering, can we actually start by making particular biological DNA samples
so that when they're sequenced, they would actually end up as malicious sequencing data files?
So how did you determine what you were going to target?
Am I right?
Reading through your research, did you sort of artificially
set up some vulnerabilities within the DNA sequencing software?
That's correct. So really, our research was kind of two phases. The first phase, we were interested
in kind of a proof of concept to see whether you could actually, starting all the way with
DNA molecules, end up with sequencing files that would target,
say, a vulnerability that was discovered in the software. So we were really more interested in
kind of trying to understand the limitations of both generating artificial DNA molecules
and the sequencing process. And then later on, we actually did kind of a security analysis of existing DNA analysis
utilities.
We have the ability, it's called de novo DNA synthesis.
So we have the ability to make completely artificial DNA molecules that don't derive
from biological sources.
So in some sense, you can think of us as having the ability to write kind of arbitrary DNA
sequences. The problem is that our
ability to make DNA molecules is somewhat constrained. You can't just make any DNA
sequence. There are limitations there, and the DNA sequencing process also has lots of noise
and randomness that is just inherent in how sequencing works. And so it's not totally clear up front whether you can actually have
enough control over the information that's flowing to digital data files to actually create malware.
And so that was really our research question.
In terms of creating DNA from whole cloth,
did you have to deal with the fact that the scanning software was expecting to see certain
things?
Yeah. So at the end of the day, what you're still getting out of the sequencer is
going to look like basically DNA sequences. But the thing is that these utilities are doing
all sorts of analysis on this data and all sorts of complicated algorithms to manipulate it in
particular ways. And so, for example, you might generate sequences so
that when they're sequenced, a particular algorithm gets into a weird state or processes them in a
particular way so that it would, say, be processing data that's larger than it would
expect. So you can think of like buffer overflow vulnerabilities or different things like that.
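The defensive side of that idea can be sketched in a few lines of Python: check every read against an explicit length limit and alphabet before any downstream code touches it. Python is memory safe, so this does not reproduce a buffer overflow; it only illustrates the kind of check that a fixed-size buffer in a C or C++ tool would rely on. `MAX_READ_LENGTH`, `VALID_BASES`, and `validate_read` are hypothetical names, and the limit of 300 is an arbitrary value for illustration.

```python
VALID_BASES = frozenset("ACGTN")  # N is the usual code for an unknown base
MAX_READ_LENGTH = 300             # hypothetical limit for this sketch


def validate_read(read: str, max_length: int = MAX_READ_LENGTH) -> str:
    """Return the read unchanged, or raise ValueError if it looks unsafe."""
    if len(read) > max_length:
        raise ValueError(f"read length {len(read)} exceeds limit {max_length}")
    invalid = set(read) - VALID_BASES
    if invalid:
        raise ValueError(f"unexpected characters in read: {sorted(invalid)}")
    return read


print(validate_read("ACGTTGCA"))  # ACGTTGCA
```

Rejecting oversized or malformed reads at the boundary means later stages never have to assume anything about input they did not check themselves.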
And I would also point out that you might ask, like, sort of what kind of analysis are you doing too?
And I think the idea is that the data
that comes out of these sequencers is really quite raw
and isn't very useful by itself
because what you're actually doing
is you're taking these long DNA molecules,
say, like a full chromosome from a human genome,
and you actually break it into little tiny pieces.
And so what you're
actually doing is sequencing, you know, hundreds of millions of really short DNA molecules.
And then to actually kind of reconstruct larger DNA sequences, or to ask particular biological
questions, you're going to do all sorts of analysis and complicated algorithms using these
short DNA fragments that you've sequenced.
And one of the things you all discovered in your research was that this software that
is used for the DNA analysis was lacking some basic security best practices.
Yeah, that would definitely be true.
I think it's helpful, too, to understand who is writing this software.
This whole space has been changing so much that a lot of the
utilities that are used by scientists to analyze DNA data have actually been written by either
biologists or maybe people with some bioinformatics background. A lot of them, though not all,
don't have formal software development experience. And a lot of these programs are written in languages like C and C++, where programs
oftentimes contain vulnerabilities if you're not careful.
You find that in some sense, because there probably hasn't been much adversarial pressure
on these programs so far, that they are somewhat lacking in security.
So we found, for example, many buffer overflow vulnerabilities
in these programs, which is mostly what we were looking for, but also that they were using a lot
of function calls that are known to create security problems. And just looking at a bunch of metrics,
it seems like a broad class of these programs doesn't seem to be written with security in mind.
Yeah, it's an interesting lesson, I think, for security professionals in particular, in that, you know, it seems like this was an attack surface that no one
had ever really considered before.
Yeah, I think so. I think people have probably thought about
the traditional security problems you get, like just sending, say, malicious files back
and forth, maybe even DNA sequencing data files.
But it is interesting that, you know, anytime you have information that eventually ends
up in a computer system, you have to consider who's generating that data, where it's coming from,
and try to design programs that are robust to it. And so I do think there is kind of a broader
lesson, which is that anytime you're taking data, you need to think about security.
In some sense, DNA is very similar to digital data because it's discrete, because there's only,
you know, the four bases. So there's a pretty close analog to digital data. And so you have a lot of control over DNA and what you can create, which gives you a lot of control over
the types of input you can send to these systems.
What are some of the bad things that people could potentially do,
you know, exploiting the things that you all have learned?
I would just start by saying that DNA analysis is getting fairly ubiquitous.
And so we're seeing DNA sequencing being used in all sorts of domains like medicine,
so genetic testing, personalized medicine, forensics, the new
fields of sort of bioengineering, genetically modified organisms. And so there's a lot of
different assets and things that attackers might be interested in manipulating or stealing or
modifying. So you could imagine someone could use DNA as a vector to just steal sensitive sequencing
data. So this could contain things like intellectual property
or just DNA sequences from individuals. They might also be able to modify, in malicious ways,
say, genetic tests. So, you know, if you control a system that processes DNA data,
you could use that to manipulate, say, genetic testing to make people look like they have genetic diseases
they don't actually have, or the opposite, mask known genetic diseases. I think forensics is
very interesting because if someone is able to create DNA that they know will eventually be
sequenced, so you could think of it like a crime scene, and then that data is then sequenced
through some particular workflow and processed through vulnerable programs, then you could imagine someone manipulating
forensic systems, for example. We're going to enter a world in the near future where
pretty much everyone's genome is going to be sequenced. Sequencing is going to be a very
routine procedure, especially as the price of DNA sequencing continues to drop.
One of the things you discovered was a side effect, and that was information leakage. Can
you describe that for us?
Yeah. So, the way these machines get so cost-effective
is that you typically don't just sequence one sample at a time. You actually sequence many
samples at a time. And so, what actually happens is, let's say you
have five different people whose genomes you want to sequence. You might take these five individuals
and pool their genetic data together and sequence it all at once. But to actually figure out whose
sequence goes with which person, before sequencing, a
unique DNA barcode is added to each sample so that at the end of sequencing, you can actually kind of figure out which
DNA went with which person.
We kind of call this sample multiplexing.
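A minimal Python sketch of the demultiplexing step might look like the following. It assumes each read begins with its sample's barcode and that barcodes must match exactly; real sequencing pipelines differ in where the barcode is read and how mismatches are handled. The barcodes, sample names, and the `demultiplex` function are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample, using the barcode at the start of each read.

    `barcodes` maps a barcode string to a sample name; all barcodes are
    assumed to have the same length. Reads with an unknown barcode are
    placed under "unassigned".
    """
    if not barcodes:
        raise ValueError("at least one barcode is required")
    barcode_length = len(next(iter(barcodes)))
    samples = defaultdict(list)
    for read in reads:
        prefix, rest = read[:barcode_length], read[barcode_length:]
        samples[barcodes.get(prefix, "unassigned")].append(rest)
    return dict(samples)


barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}  # made-up barcodes
reads = ["ACGTGGGG", "TGCAAAAA", "CCCCTTTT"]
print(demultiplex(reads, barcodes))
# {'sample_1': ['GGGG'], 'sample_2': ['AAAA'], 'unassigned': ['TTTT']}
```

Any read whose barcode is misread or swapped ends up filed under the wrong sample, which is the kind of cross-sample leakage this part of the conversation is about.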
The problem is our ability to sort of demultiplex.
So you pool all these samples and then, you know, try to separate
all the sequencing data at the end. The problem is that there is a small amount of data leakage that
happens between the samples. And so this is kind of, you can think of this like a side channel. So,
you know, if an attacker is capable of sequencing a sample alongside other DNA samples, they might
actually be able to influence those other samples in
particular ways. So for example, if there is vulnerable sequencing software that's going to
process this data, they could push malware. Or the other way, they could actually read data from
other samples, because we know that data from other samples will end up in files that
belong to the malicious actor. So in some sense,
you have the ability to both pull data from and push data into other sequencing data files.
And in our experiments, we were able to find that there was some information leakage. So it's not
clear how imminent this threat is, but I think it's definitely something to consider going into the
future.
Was that information leakage random or was it something that you were able to control?
Yeah, at this point, it's fairly random. The thing is that
if the attacker is able to make a particular DNA sequence so that their entire sample is, say,
made up of just one DNA sequence, then while you can't control which particular DNA
bleeds over, since it's all
made up of one sequence, you'll end up knowing what sequence is going to move into the other
samples. So you might have some control over it, but it is still a fairly random process.
So your ability to custom sequence DNA, is that at all a limiting factor in terms of
access to that or price of that?
Sequence or create?
To create.
Create, yeah. So it's synthesis.
I'm sorry. The synthesis, yeah.
So it is really easy,
actually. We actually used an outsourced synthesis service. There are many of these
companies. And what you do is you basically go to their web form. So they have a web form with a big open box. You just paste in the DNA sequence you want to order and they'll ship it to you. So no, and it cost about $100 to order our sequence.
Have these people never seen any 1950s science fiction movies?
It's interesting, though. These synthesis services do look for known malicious biomolecules.
So say virus sequences.
Interesting, yeah.
So there are certain types of sequences they do look for, but they're certainly not looking
for sequences that might contain computer code or computer data.
What's been the reaction so far to your research?
You know, I think it's been pretty much what we expected, which is, in some sense, what
we demonstrated is really still a proof of concept.
There were lots of challenges we encountered.
It was still really challenging just to make it work in sort of the most ideal circumstances.
So we don't think it's sort of an imminent threat.
But I do think we've gotten people to start thinking about,
hey, we're doing all this DNA sequencing, we're sequencing all this really important data,
we're going to be doing a lot more sequencing in the future. The technology is changing rapidly.
We really need to start thinking about these sort of novel vectors through which data can start moving into these computer systems. And so I think it's really more just letting people start thinking about this
and not so much that it's sort of imminent right now,
but I'm hopeful that in five or ten years,
maybe when these threats are maybe more imminent,
we'll have at least had five or ten years
to start shoring up the security of software
that's doing all this DNA processing before more bad things happen.
And one thing I'd mention too, which is really cool, there's some really interesting use cases
of DNA sequencing that are on the horizon that really make this, I think, more relevant. So
one really cool use of DNA sequencing is actually using DNA as a method to store digital data. And
the reason you would do this is because DNA is very
stable and can last for hundreds or thousands of years. And it has very, very high density.
So I've heard, for example, that you could store all the digital data in the world inside of a car
if it was stored in DNA. So really what's happening is that we're actually going to be
continuing to blur the line between biological and digital data.
And so I think there's going to be some really interesting threats and vectors moving into the future.
And what are your thoughts in terms of what needs to be done to protect against the
types of exploits that you all have explored?
I think the first and most obvious is just
common security best practices: don't have buffer overflow vulnerabilities, do security audits of your
software, do some input validation. So kind of routine security practices and start thinking
about DNA sequencing software in the same way people think about internet services, web servers,
things like that. And I think that would go a long way. Because right now, I think these kinds
of attacks are challenging, but the
software security is so poor that they might actually be possible going into the future.
So I think that's kind of, at least in my opinion, sort of like the first step for doing anything
else.
Our thanks to Peter Ney from the University of Washington for joining us. If you want to read
the complete paper, it's available online.
It's called "Computer Security, Privacy, and DNA Sequencing:
Compromising Computers with Synthesized DNA, Privacy Leaks, and More."
The CyberWire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe,
where they're co-building the next generation of cybersecurity teams and technologies.
Our amazing CyberWire team is Elliot Peltzman, Puru Prakash, Stefan Vaziri, Kelsey Bond, Tim Nodar, Joe Kerrigan, Carol Terrio, Ben Yellen, Nick Valecki, Gina Johnson, Bennett Moe, Chris Russell,
John Petrick, Jennifer Iben, Rick Howard, Peter Kilpie, and I'm Dave Bittner. Thanks for listening.