CyberWire Daily - Using bidirectionality override characters to obscure code. [Research Saturday]
Episode Date: November 20, 2021. Guests Nicholas Boucher and Ross Anderson from the University of Cambridge join Dave Bittner to discuss their research, "Trojan Source: Invisible Vulnerabilities." The researchers present a new type of attack in which source code is maliciously encoded so that it appears different to a compiler and to the human eye. This attack exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed, leading to vulnerabilities that cannot be perceived directly by human code reviewers. 'Trojan Source' attacks, as they call them, pose an immediate threat both to first-party software and to supply-chain compromise across the industry. They present working examples of Trojan Source attacks in C, C++, C#, JavaScript, Java, Rust, Go, and Python. They propose definitive compiler-level defenses, and describe other mitigating controls that can be deployed in editors, repositories, and build pipelines while compilers are upgraded to block this attack. The project website and research can be found here: the Trojan Source: Invisible Source Code Vulnerabilities project website and the Trojan Source: Invisible Vulnerabilities research paper.
Transcript
You're listening to the Cyber Wire Network, powered by N2K.
Hello everyone and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner and this is our weekly conversation with researchers and analysts
tracking down threats and vulnerabilities,
solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
So going back about a year ago, we started working on a totally distinct project where we were attempting to break natural language processing systems.
Joining us this week are Nicholas Boucher and Ross Anderson,
both from the University of Cambridge.
The research is titled Trojan Source: Invisible Vulnerabilities.
Our goal was to create adversarial examples that would cause NLP systems like toxic content
classifiers and machine translation systems to break when you gave specific inputs to these
systems. That's Nicholas Boucher. And there had been lots of work on this in the past, but one of
the criticisms we had of past work, or perhaps a shortcoming of prior work, was that all of these adversarial examples, they changed the way that the text
looked. That is to say that someone who was using an adversarial example against a natural language
processing system would see that it has been rephrased or misspelled or something along these
lines, and it was usually quite clear to the victim that they were given a
poisoned example, so to speak. And we thought that we could do better than this. And we stumbled
across this idea with a couple of other co-authors here at Cambridge and also in Canada, that we
could change the way that strings are encoded, that text is encoded in a way that would cause
natural language processing systems to more or less fall apart and give you very poor performance.
And once we had put out a paper on this, called Bad Characters, we started saying, well, gosh, we could probably use these malicious encodings to do other evil things in various domains of computer science.
And compilers and interpreters quickly
became our focus. And the story here is that we realized we could use very similar techniques.
We could modify the encoding, not of inputs to machine learning in this case, but of inputs to
compilers and interpreters in order to cause those compilers and interpreters to output binaries whose logic was different from what a developer might expect.
And that led us to the Trojan source work.
Well, can you describe for us what exactly is going on here?
I mean, how does this exploit work?
Yes, the idea is rather simple.
One simply encodes source code files in a way that will render differently to a human user, someone
who's using, say, a text editor on their computer, than to a compiler or an interpreter, which just
ingests the raw bytes of the source code file. And there are a couple of little tricks that we use to
pull this off, but the primary technique is that we use bidirectionality override control characters.
And these are things that exist in specifications like Unicode, for example, which are the most
common way to encode text these days. And they exist to allow you to override the direction of
a text, say from left to right and change it to right to left. And these exist because there are
many different languages in the world that use different directionality of text. And when you
are writing in a multilingual setting, you may choose to write words in a way that is different
than the default ordering, if you will. You may want to inject some specifically right to left
words into left to right text or change the standard order that something would
be rendered. And what we found is that we could use these bidirectionality override control
characters to change the way that text is presented on a screen, specifically source code text. And we
found that we could take these characters and we could inject them into comments and into strings
inside of source code files. And when we did this,
it would cause the source code of the program to be displayed differently than
it was actually encoded. And that ultimately leads us to the vulnerability where we craft
different logic at the encoding level than we do at the visualization level. And if that logic is
cleverly crafted, you could, for example,
take the opposite action when a compiler sees something than when a developer sees something.
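To make that mechanism concrete, here is a minimal Python sketch, assuming a bidi-aware editor or terminal. It is illustrative only and not one of the paper's proofs-of-concept, which place such characters inside comments and string literals of the target language; the example string is made up.

```python
# Minimal illustration of a bidirectional override: the logical (encoded)
# order of the characters differs from the order a bidi-aware editor or
# terminal renders them in. The character names and effects come from the
# Unicode standard; the example string itself is invented.
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING (ends the override)

s = "abc " + RLO + "def" + PDF + " ghi"

print(s)        # a bidi-aware renderer may display: abc fed ghi
print(repr(s))  # the logical order a compiler sees: 'abc \u202edef\u202c ghi'
```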
Now, Ross, one of the things that I think has captured the public's imagination with your
research here is how broadly this affects things. I mean, there are many, many languages that could fall victim to this
sort of attack. Well, this attack potentially affects almost every modern computer language,
with the possible exception of Haskell. So whether you're writing in Go or Java or Python or C or C++
or C Sharp or whatever, we've come up with examples of code that will look different to a human
reviewer than it will look to a compiler.
And this has led to a number of interesting effects as we've been disclosing the vulnerability
and trying to get the industry to fix it.
Some of the firms to which we disclosed it said, this isn't a language problem at all.
This is the responsibility of the people who sell code editors and development environments.
That was the attitude, for example, taken by Oracle, who have stewardship of the Java language.
Other language teams, such as the Rust compiler team, for example,
were very enthusiastic about fixing this problem in their language immediately.
As for the development environments, GitHub, GitLab, and Atlassian are all on the job,
but it's by no means obvious that everybody is.
And so now that the vulnerability has been disclosed, there is the real risk that some bad person would target programs that are written in a language that hasn't been fixed, such as Java, in a company that isn't using a fixed development environment, and might therefore be able to do something rather nasty.
And so for that reason, we thought it prudent to get as much publicity as possible
to get across to CIOs and CISOs worldwide that they'd better check their tool chain
and see to it that any code that they rely on isn't vulnerable to a supply chain attack.
Now, is my understanding correct that the notion of this had come up in the past?
I don't think anyone has dug into the depth that you all did here,
but this as a possibility had been brought up before.
Nicholas, is that correct?
There have been different ways that bidirectionality override characters
have been exploited across a number of domains in the past,
some of them being programming languages.
So to go through a couple of examples that we found in the wild,
one major use is for obfuscating interpreted languages.
So JavaScript, for example, is typically sent to client-side users in their browsers, and a reasonable person may be able to decipher what JavaScript code is doing, and therefore companies may try to obfuscate the code and will inject these bidirectionality override characters
in order to make it even harder to read the text. You really would need to either strip out these characters or just look at the raw bytes of the text to see what's going on. Now, there have
also been other more malicious uses of these bidirectionality overrides in the past. So,
for example, there have been use cases in smart
contracts. Particularly on the Ethereum blockchain, we've seen bidirectionality override characters used to swap the arguments passed to different functions, for example, to swap the sender and receiver in a particular payment. And this is very interesting.
We discovered this or came
across this example rather late in our work after we had assembled our paper. And it's a very,
very malicious use of this particular technique. And there are actually a variety of other people
who have proposed online, well, gosh, we could use bidirectionality override characters and,
say, comments to do precisely that, to swap the order
of different arguments. And I think what we are trying to present here is a rather systematic
overview of all of this, and we believe it injects some novel techniques that one can use with these bidirectionality override characters, particularly in the ways we propose of injecting them into strings and the ways we put them into comments. We break them into three different categories. We call them commenting-out, stretched strings, and, gosh, what is the other group that we came up with? We put them into three categories all the same.
And actually, going back even further, before the Ethereum example, there has been even prior work. One of the most interesting ones, I think,
is that bidirectionality overrides have been used to try and change the file extension or change the way that a file extension of, say, malware is displayed. So if I have some executable file,
some .exe that's sent via email, and I want a user to open it and not suspect that it's an executable file, I could, for example, inject a right-to-left override in the file name, and I could include .txt or some other relatively innocuous file extension in the name, and I could use that character to swap it and make it look like .txt is the overall extension of the file. And it turns out that this has been used to
disseminate malware across email going back well more than 10 years, which is perhaps a slightly different domain, but shows that within the security setting, it is certainly well known that bidirectionality overrides can cause problems.
But our goal is to present this systematic overview in the compiler setting.
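As a hedged illustration of that older filename trick, here is a small Python sketch; the filename is invented for the example, and how it renders depends on the file manager or mail client being bidi-aware.

```python
# Hypothetical illustration of the right-to-left-override filename trick
# described above; "report" is a made-up name.
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE

filename = "report" + RLO + "txt.exe"

print(filename)                   # a bidi-aware listing may display: reportexe.txt
print(filename.endswith(".exe"))  # True -- the operating system still treats it as an .exe
```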
And so, Ross, to what degree did your research find that this is
being used out there in the wild? How serious an issue is this today? Well, thanks to a number of
development environment maintainers such as GitHub and others, we got thousands and thousands of suspect examples of possible abuse of bidi characters that exist in public repositories. And we found that the great
majority of these were just people doing careless programming, which involved strings or comments
in Hebrew or Arabic. We discovered a significant amount of use for obfuscating JavaScript,
but we didn't find anything else of consequence.
So what appears to have happened is that up until now, various people had said, hey, you
could do bad things with bidi characters.
And then this kind of hadn't been followed through.
The people who designed bidi control characters into the Unicode set put in a very quiet warning saying this might be used to do
bad stuff. Nobody kind of followed through on that. There was also some work about 15 years ago
around the possible use of strange characters in domain names. And we've now got Punycode as a
standard for getting canonical expressions of domain names to stop this being used in phishing.
In other words, there was a substantial vulnerability there,
but various people had just looked at various small aspects of it,
like the five blind men and the elephant.
You know, one thought this is a tree and one thought this is a rope and so on and so forth.
And what we've basically contributed, we believe,
is, first, to trace out the whole shape of the beast and, second, to motivate the industry to roll up its sleeves and fix it.
Nicholas, you know, you bring up sort of a fascinating element of this, which is, you know, who takes responsibility for the fix here?
Is it the people making the development tools? Is it the developers themselves?
Is it, you know, do we go searching for these sorts of
things on the endpoint after the fact? What did you all explore as far as that element goes?
It's a very interesting question as to whose responsibility it is to fix this vulnerability.
So oftentimes we speak in terms of expecting compilers or interpreters to put out patches to mitigate this particular
attack, but that is not necessarily the only answer. And in many viewpoints, that may not even
be the correct place to patch this. So some may take the view that compilers exist to implement
a particular language specification. And for those following that view, for languages that have formal specifications, the place to fix this would be to add logic or add rules into the language specification,
which then would later be implemented by compilers. But still others may say, well,
you know, adversarial encodings really might not be the job of a compiler to defend against. That might be the job of, for example, a static code scanner, in which case you have a
variety of different security companies that sell services or even open source products online that
will do static code scanning and potentially be able to expose attacks like this. And perhaps
that is the place to prevent something like this. But a still different approach that one can take
is to say that this isn't perhaps even a problem with compilers. It's a problem with visualization. We have these, say, text editors or perhaps
repository front ends, websites that we use to view code online that are visualizing code in a
way that is misleading about what that code would actually do if it was ingested into a proper
compiler or interpreter. And because of that, perhaps the
answer is that we need to fix the way that text is displayed inside of text editors, and we need
to add warnings and make these directionality override characters visible in online platforms.
And any one of these techniques is a perfectly reasonable way to defend against these attacks.
But I think the important thing to keep in mind is, you know, if large scale attacks were to be launched using these techniques, your best strategy is probably a defense in depth strategy where you have mitigations in place at each of these layers.
Because even if we say we're to patch all of the compilers that we know are affected, it is very likely that there are compilers that we just haven't looked at that are indeed affected.
Or, you know, of course, the legacy versions of compilers that hang around in certain development environments. And because of this, we would look for things like static code scanners as well.
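As a rough sketch of the kind of static check being described here, the snippet below scans files for the Unicode bidirectional control characters the research highlights and reports each occurrence. It is illustrative only, not a tool from the paper; the function name and command-line usage are assumptions.

```python
import sys

# Unicode bidirectional control characters relevant to Trojan Source-style attacks.
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def scan(path: str) -> int:
    """Report every bidi control character in the file at `path`; return the count."""
    hits = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            for col, ch in enumerate(line, start=1):
                if ch in BIDI_CONTROLS:
                    hits += 1
                    print(f"{path}:{lineno}:{col}: U+{ord(ch):04X} ({BIDI_CONTROLS[ch]})")
    return hits

if __name__ == "__main__":
    total = sum(scan(p) for p in sys.argv[1:])
    sys.exit(1 if total else 0)  # nonzero exit lets a build pipeline reject flagged files
```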
You know, there are times when perhaps you want to obfuscate something for a legitimate security reason. And so do we eliminate that possibility here? It's really quite interesting, isn't it?
Well, one of the underlying issues that I think this exposes or draws attention to is that internationalized text is an inherently challenging problem in computer science. And I think that
we have these systems like Unicode, which do a great job of providing very thorough support for
a very large number of languages. But there are security issues that arise from these platforms.
But what is the answer to that? Certainly, we can't say that everyone needs to use ASCII for everything. Non-English languages have been disadvantaged in many computing contexts for a very long time, and certainly our solution can't be to regress and say that everyone needs to use a small number of Latin characters in all of their writing.
But at the same time, that means that if we are to use
these powerful internationalized text standards, like Unicode, for example, there are these nuances
that are very important for all parts of the development pipelines to take into account
lest security vulnerabilities arise. Ross, what are the next steps here? I mean,
in your research, you point out that you all went through proper responsible disclosure
to the various developers of the tools that are involved with these languages.
Where do you hope your research leads?
Well, first, we think there may be other similar vulnerabilities
that arise out of the insane amount of complexity
that has arisen around modern development environments.
And so we leave that as an open challenge to everybody to look for other stuff that
was put in to be helpful, but is now hiding unpleasant stuff under pretty stones.
The second thing that we're going to write up is the enormous diversity of the response
that we got to coordinated disclosure.
Because one of the really important things for information security
is the rate at which vulnerabilities are fixed once they get disclosed.
Because if they don't get fixed quickly, then lots of systems end up being vulnerable.
And we discovered that there was a very broad range of responses in the industry to our disclosure.
The disclosure was somewhat off the beaten track because we weren't in a position of saying,
you know, hey guys, here's a zero-day vulnerability that allows me to take remote control of one of your systems without human intervention.
We were saying, here is a vulnerability that allows a bad person to smuggle code into your system,
perhaps through the supply chain, in such
a way that humans won't notice it. And that's altogether more difficult to deal with. Now,
as time goes on, we'll have more and more vulnerabilities that are more conceptually
difficult to deal with. And the industrialized processes that a number of the big tech firms
and others have are going to be less and less able to cope. And it's particularly interesting
to see the relatively poor performance of some of the
big tech companies who had outsourced the vulnerability disclosure process, right?
Because if you've hired a subcontractor and told them, you know, you will pay the following
amount, dollars X for bugs of the type Y, and we will pay you so many dollars a month
to run the service for us,
then of course the responders don't have an incentive to put any effort into anything that's even slightly out of the ordinary.
So as a result, you may find that a number of companies have got the appearance of a
disclosure system, but without really the reality.
Nicholas, any thoughts from you there on where you're hoping this leads?
My hope is that as many compilers and interpreters as possible
be patched against this particular vulnerability.
And in addition to that, that we continue to see changes
in code visualization pipelines,
and perhaps even notes added to the Unicode standard
to very clearly and explicitly say that, you know, this attack
pattern is something that we need to watch out for. I think in the bigger picture, what worries
me is less the individual developer adversaries that want to exploit something like this, but,
you know, perhaps some of the more powerful advanced persistent threats, if you will. You
could imagine that should someone have an insider, or control of an insider, at a particular company or project, or simply have lots of time and opportunity to try lots of different techniques, if they are able to inject a particular backdoor
that goes unnoticed, a particular vulnerability, well, you know, we could find ourselves in a
situation perhaps similar to some of the supply chain attacks that we've seen in recent months.
SolarWinds, not that long ago.
And seeing something play out where it might not be immediately clear where the vulnerability is in code or how some backdoor got in place.
Or, of course, you know, if that bug happens to be ingested into a compiler's source code itself, you know, we could find ourselves with untrustworthy compilers floating around
and it not being immediately clear where these vulnerabilities are.
And it's those, you know, slightly more insidious, slightly more difficult to plan attack vectors that, to me, represent some of the scariest threats of the Trojan Source work.
Our thanks to Nicholas Boucher and Ross Anderson from the University of Cambridge for joining us.
The research is titled Trojan Source: Invisible Vulnerabilities.
We'll have a link in the show notes.
Our amazing CyberWire team is Elliot Peltzman, Trey Hester, Brandon Karp, Puru Prakash, Justin Sabey, Tim Nodar, Joe Kerrigan, Carol Terrio, Ben Yellen,
Nick Vilecki, Gina Johnson, Bennett Moe, Chris Russell, John Petrick,
Jennifer Iben, Rick Howard, Peter Kilby, and I'm Dave Bittner.
Thanks for listening. We'll see you back here next week.