CyberWire Daily - Using bidirectionality override characters to obscure code. [Research Saturday]
Episode Date: November 20, 2021. Guests Nicholas Boucher and Ross Anderson from the University of Cambridge join Dave Bittner to discuss their research, "Trojan Source: Invisible Vulnerabilities." The researchers present a new type of attack in which source code is maliciously encoded so that it appears different to a compiler and to the human eye. This attack exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed, leading to vulnerabilities that cannot be perceived directly by human code reviewers. 'Trojan Source' attacks, as they call them, pose an immediate threat both to first-party software and to supply-chain compromise across the industry. They present working examples of Trojan Source attacks in C, C++, C#, JavaScript, Java, Rust, Go, and Python. They propose definitive compiler-level defenses, and describe other mitigating controls that can be deployed in editors, repositories, and build pipelines while compilers are upgraded to block this attack. The project website and research can be found here: the Trojan Source: Invisible Source Code Vulnerabilities project website and the Trojan Source: Invisible Vulnerabilities research paper.
Transcript
You're listening to the Cyber Wire Network, powered by N2K.
Hello everyone and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner and this is our weekly conversation with researchers and analysts
tracking down threats and vulnerabilities,
solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
So going back about a year ago, we started working on a totally distinct project where we were attempting to break natural language processing systems.
Joining us this week are Nicholas Boucher and Ross Anderson,
both from the University of Cambridge.
The research is titled Trojan Source: Invisible Vulnerabilities.
Our goal was to create adversarial examples that would cause NLP systems like toxic content
classifiers and machine translation systems to break when you gave specific inputs to these
systems. That's Nicholas Boucher. And there had been lots of work on this in the past, but one of
the criticisms we had of past work, or perhaps a shortcoming of prior work, was that all of these adversarial examples, they changed the way that the text
looked. That is to say that someone who was using an adversarial example against a natural language
processing system would see that it has been rephrased or misspelled or something along these
lines, and it was usually quite clear to the victim that they were given a
poisoned example, so to speak. And we thought that we could do better than this. And we stumbled
across this idea with a couple of other co-authors here at Cambridge and also in Canada, that we
could change the way that strings are encoded, that text is encoded in a way that would cause
natural language processing systems to more or less fall apart and give you very poor performance.
And once we had put out a paper on this, called Bad Characters, we started saying, well, gosh, we could probably use these malicious encodings to do other evil things in various domains of computer science.
And compilers and interpreters quickly
became our focus. And the story here is that we realized we could use very similar techniques.
We could modify the encoding, not of inputs to machine learning in this case, but of inputs to
compilers and interpreters in order to cause those compilers and interpreters to output binaries whose logic was different from what a developer might expect.
And that led us to the Trojan source work.
Well, can you describe for us what exactly is going on here?
I mean, how does this exploit work?
Yes, the idea is rather simple.
One simply encodes source code files in a way that will render differently to a human user, someone
who's using, say, a text editor on their computer, than to a compiler or an interpreter, which just
ingests the raw bytes of the source code file. And there are a couple of little tricks that we use to
pull this off, but the primary technique is that we use bidirectionality override control characters.
And these are things that exist in specifications like Unicode, for example, which are the most
common way to encode text these days. And they exist to allow you to override the direction of
a text, say from left to right and change it to right to left. And these exist because there are
many different languages in the world that use different directionality of text. And when you
are writing in a multilingual setting, you may choose to write words in a way that is different
than the default ordering, if you will. You may want to inject some specifically right to left
words into left to right text or change the standard order that something would
be rendered. And what we found is that we could use these bidirectionality override control
characters to change the way that text is presented on a screen, specifically source code text. And we
found that we could take these characters and we could inject them into comments and into strings
inside of source code files. And when we did this,
it would cause the source code of the program to be displayed differently than
it was actually encoded. And that ultimately leads us to the vulnerability where we craft
different logic at the encoding level than we do at the visualization level. And if that logic is
cleverly crafted, you could, for example,
take the opposite action when a compiler sees something than when a developer sees something.
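To make that mechanism concrete, here is a minimal Python sketch, assuming a bidi-aware editor or terminal. It is illustrative only and not one of the paper's proofs-of-concept, which place such characters inside comments and string literals of the target language; the example string is made up.

```python
# Minimal illustration of a bidirectional override: the logical (encoded)
# order of the characters differs from the order a bidi-aware editor or
# terminal renders them in. The character names and effects come from the
# Unicode standard; the example string itself is invented.
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING (ends the override)

s = "abc " + RLO + "def" + PDF + " ghi"

print(s)        # a bidi-aware renderer may display: abc fed ghi
print(repr(s))  # the logical order a compiler sees: 'abc \u202edef\u202c ghi'
```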
Now, Ross, one of the things that I think has captured the public's imagination with your
research here is how broadly this affects things. I mean, there are many, many languages that could fall victim to this
sort of attack. Well, this attack potentially affects almost every modern computer language,
with the possible exception of Haskell. So whether you're writing in Go or Java or Python or C or C++
or C Sharp or whatever, we've come up with examples of code that will look different to a human
reviewer than it will look to a compiler.
And this has led to a number of interesting effects as we've been disclosing the vulnerability
and trying to get the industry to fix it.
Some of the firms to which we disclosed it said, this isn't a language problem at all.
This is the responsibility of the people who sell code editors and development environments.
That was the attitude, for example, taken by Oracle, who have stewardship of the Java language.
Other language teams, such as the Rust compiler team, for example,
were very enthusiastic about fixing this problem in their language immediately.
As for the development environments, GitHub, GitLab, and Atlassian are all on the job,
but it's by no means obvious that everybody is.
And so now that the vulnerability has been disclosed, there is the real risk that some bad person would target programs that are written in a language that hasn't been fixed, such as Java, in a company that isn't using a fixed development environment, and might therefore be able to do something rather nasty.
And so for that reason, we thought it prudent to get as much publicity as possible
to get across to CIOs and CISOs worldwide that they'd better check their tool chain
and see to it that any code that they rely on isn't vulnerable to a supply chain attack.
Now, is my understanding correct that the notion of this had come up in the past?
I don't think anyone has dug into the depth that you all did here,
but this as a possibility had been brought up before.
Nicholas, is that correct?
There have been different ways that bidirectionality override characters
have been exploited across a number of domains in the past,
some of them being programming languages.
So to go through a couple of examples that we found in the wild,
one major use is for obfuscating interpreted languages.
So JavaScript, for example, is typically sent to client-side users in their browsers, and a reasonable person may be able to decipher what JavaScript code is doing, and therefore companies may try to obfuscate the code and will inject these bidirectionality override characters
in order to make it even harder to read the text. You really would need to either strip out these characters or just look at the raw bytes of the text to see what's going on. Now, there have
also been other more malicious uses of these bidirectionality overrides in the past. So,
for example, there have been use cases in smart
contracts. Particularly on the Ethereum blockchain, we've seen bidirectionality override characters used to swap the arguments passed to different functions, for example, to swap the sender and receiver in a particular payment. And this is very interesting.
We discovered this or came
across this example rather late in our work after we had assembled our paper. And it's a very,
very malicious use of this particular technique. And there are actually a variety of other people
who have proposed online, well, gosh, we could use bidirectionality override characters and,
say, comments to do precisely that, to swap the order
of different arguments. And I think what we are trying to present here is a rather systematic
overview of all of this, and we believe it injects some novel techniques that one can use with these bidirectionality override characters, particularly in the ways we propose of injecting them into strings and the ways we put them into comments. We break them into three different categories. We call them commenting-out, stretched strings, and, gosh, what is the other group that we came up with? We put them into three categories all the same.
And actually, going back even further, before the Ethereum example, there has been even prior work. One of the most interesting ones, I think,
is that bidirectionality overrides have been used to try and change the file extension or change the way that a file extension of, say, malware is displayed. So if I have some executable file,
some .exe that's sent via email, and I want a user to open it and not suspect that it's an executable file, I could, for example, inject a right-to-left override in the file name, and I could include .txt or some other relatively innocuous file extension in the name, and I could use that character to swap it and make it look like .txt is the overall extension of the file. And it turns out that this has been used to
disseminate malware across email going back well more than 10 years, which is perhaps a slightly different domain, but shows that within the security setting, it is certainly well known that bidirectionality overrides can cause problems.
But our goal is to present this systematic overview in the compiler setting.
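As a hedged illustration of that older filename trick, here is a small Python sketch; the filename is invented for the example, and how it renders depends on the file manager or mail client being bidi-aware.

```python
# Hypothetical illustration of the right-to-left-override filename trick
# described above; "report" is a made-up name.
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE

filename = "report" + RLO + "txt.exe"

print(filename)                   # a bidi-aware listing may display: reportexe.txt
print(filename.endswith(".exe"))  # True -- the operating system still treats it as an .exe
```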
And so, Ross, to what degree did your research find that this is
being used out there in the wild? How serious an issue is this today? Well, thanks to a number of
development environment maintainers such as GitHub and others, we got thousands and thousands of suspect examples of possible abuse of bidi characters that exist in public repositories. And we found that the great
majority of these were just people doing careless programming, which involved strings or comments
in Hebrew or Arabic. We discovered a significant amount of use for obfuscating JavaScript,
but we didn't find anything else of consequence.
So what appears to have happened is that up until now, various people had said, hey, you
could do bad things with bidi characters.
And then this kind of hadn't been followed through.
The people who designed bidi control characters into the Unicode set put in a very quiet warning saying this might be used to do
bad stuff. Nobody kind of followed through on that. There was also some work about 15 years ago
around the possible use of strange characters in domain names. And we've now got Punycode as a
standard for getting canonical expressions of domain names to stop this being used in phishing.
In other words, there was a substantial vulnerability there,
but various people had just looked at various small aspects of it,
like the five blind men and the elephant.
You know, one thought this is a tree and one thought this is a rope and so on and so forth.
And what we've basically contributed, we believe,
is, first, to trace out the whole shape of the beast and, second, to motivate the industry to roll up its sleeves and fix it.
Nicholas, you know, you bring up sort of a fascinating element of this, which is, you know, who takes responsibility for the fix here?
Is it the people making the development tools? Is it the developers themselves?
Is it, you know, do we go searching for these sorts of
things on the endpoint after the fact? What did you all explore as far as that element goes?
It's a very interesting question as to whose responsibility it is to fix this vulnerability.
So oftentimes we speak in terms of expecting compilers or interpreters to put out patches to mitigate this particular
attack, but that is not necessarily the only answer. And in many viewpoints, that may not even
be the correct place to patch this. So some may take the view that compilers exist to implement
a particular language specification. And for those following that view, for languages that have formal specifications, the place to fix this would be to add logic or add rules into the language specification,
which then would later be implemented by compilers. But still others may say, well,
you know, adversarial encodings really might not be the job of a compiler to defend against. That might be the job of, for example, a static code scanner, in which case you have a
variety of different security companies that sell services or even open source products online that
will do static code scanning and potentially be able to expose attacks like this. And perhaps
that is the place to prevent something like this. But a still different approach that one can take
is to say that this isn't perhaps even a problem with compilers. It's a problem with visualization. We have these, say, text editors or perhaps
repository front ends, websites that we use to view code online that are visualizing code in a
way that is misleading about what that code would actually do if it was ingested into a proper
compiler or interpreter. And because of that, perhaps the
answer is that we need to fix the way that text is displayed inside of text editors, and we need
to add warnings and make these directionality override characters visible in online platforms.
And any one of these techniques is a perfectly reasonable way to defend against these attacks.
But I think the important thing to keep in mind is, you know, if large scale attacks were to be launched using these techniques, your best strategy is probably a defense in depth strategy where you have mitigations in place at each of these layers.
Because even if we say we're to patch all of the compilers that we know are affected, it is very likely that there are compilers that we just haven't looked at that are indeed affected.
Or, you know, of course, the legacy versions of compilers that hang around in certain development environments. And because of this, we would look for things like static code scanners as well.
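As a rough sketch of the kind of static check being described here, the snippet below scans files for the Unicode bidirectional control characters the research highlights and reports each occurrence. It is illustrative only, not a tool from the paper; the function name and command-line usage are assumptions.

```python
import sys

# Unicode bidirectional control characters relevant to Trojan Source-style attacks.
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def scan(path: str) -> int:
    """Report every bidi control character in the file at `path`; return the count."""
    hits = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            for col, ch in enumerate(line, start=1):
                if ch in BIDI_CONTROLS:
                    hits += 1
                    print(f"{path}:{lineno}:{col}: U+{ord(ch):04X} ({BIDI_CONTROLS[ch]})")
    return hits

if __name__ == "__main__":
    total = sum(scan(p) for p in sys.argv[1:])
    sys.exit(1 if total else 0)  # nonzero exit lets a build pipeline reject flagged files
```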
You know, there are times when perhaps you want to obfuscate something for a legitimate security reason. And so do we eliminate that possibility here? It's really quite interesting, isn't it?
Well, one of the underlying issues that I think this exposes or draws attention to is that internationalized text is an inherently challenging problem in computer science. And I think that
we have these systems like Unicode, which do a great job of providing very thorough support for
a very large number of languages. But there are security issues that arise from these platforms.
But what is the answer to that? Certainly, we can't say that everyone needs to use ASCII for everything. Non-English languages have been disadvantaged in many computing contexts for a very long time, and certainly our solution can't be to regress and say that everyone needs to use a small number of Latin characters in all of their writing.
But at the same time, that means that if we are to use
these powerful internationalized text standards, like Unicode, for example, there are these nuances
that are very important for all parts of the development pipelines to take into account
lest security vulnerabilities arise. Ross, what are the next steps here? I mean,
in your research, you point out that you all went through proper responsible disclosure
to the various developers of the tools that are involved with these languages.
Where do you hope your research leads?
Well, first, we think there may be other similar vulnerabilities
that arise out of the insane amount of complexity
that has arisen around modern development environments.
And so we leave that as an open challenge to everybody to look for other stuff that
was put in to be helpful, but is now hiding unpleasant stuff under pretty stones.
The second thing that we're going to write up is the enormous diversity of the response
that we got to coordinated disclosure.
Because one of the really important things for information security
is the rate at which vulnerabilities are fixed once they get disclosed.
Because if they don't get fixed quickly, then lots of systems end up being vulnerable.
And we discovered that there was a very broad range of responses in the industry to our disclosure.
The disclosure was somewhat off the beaten track because we weren't in a position of saying,
you know, hey guys, here's a zero-day vulnerability that allows me to take remote control of one of your systems without human intervention.
We were saying, here is a vulnerability that allows a bad person to smuggle code into your system,
perhaps through the supply chain, in such
a way that humans won't notice it. And that's altogether more difficult to deal with. Now,
as time goes on, we'll have more and more vulnerabilities that are more conceptually
difficult to deal with. And the industrialized processes that a number of the big tech firms
and others have are going to be less and less able to cope. And it's particularly interesting
to see the relatively poor performance of some of the
big tech companies who had outsourced the vulnerability disclosure process, right?
Because if you've hired a subcontractor and told them, you know, you will pay the following
amount, dollars X for bugs of the type Y, and we will pay you so many dollars a month
to run the service for us,
then of course the responders don't have an incentive to put any effort into anything that's even slightly out of the ordinary.
So as a result, you may find that a number of companies have got the appearance of a
disclosure system, but without really the reality.
Nicholas, any thoughts from you there on where you're hoping this leads?
My hope is that as many compilers and interpreters as possible
be patched against this particular vulnerability.
And in addition to that, that we continue to see changes
in code visualization pipelines,
and perhaps even notes added to the Unicode standard
to very clearly and explicitly say that, you know, this attack
pattern is something that we need to watch out for. I think in the bigger picture, what worries
me is less the individual developer adversaries that want to exploit something like this, but,
you know, perhaps some of the more powerful advanced persistent threats, if you will. You
could imagine that should someone have an insider, or control of an insider, at a particular company or project, or simply have lots of time and opportunity to try lots of different techniques, if they are able to inject a particular backdoor
that goes unnoticed, a particular vulnerability, well, you know, we could find ourselves in a
situation perhaps similar to some of the supply chain attacks that we've seen in recent months.
SolarWinds, not that long ago.
And seeing something play out where it might not be immediately clear where the vulnerability is in code or how some backdoor got in place.
Or, of course, you know, if that bug happens to be ingested into a compiler's source code itself, you know, we could find ourselves with untrustworthy compilers floating around
and it not being immediately clear where these vulnerabilities are.
And it's those, you know, slightly more insidious, slightly more difficult to plan attack vectors that, to me, represent some of the scariest threats of the Trojan Source work.
Our thanks to Nicholas Boucher and Ross Anderson from the University of Cambridge for joining us.
The research is titled Trojan Source: Invisible Vulnerabilities.
We'll have a link in the show notes.
Our amazing CyberWire team is Elliot Peltzman, Trey Hester, Brandon Karp, Puru Prakash, Justin Sabey, Tim Nodar, Joe Kerrigan, Carol Terrio, Ben Yellen,
Nick Vilecki, Gina Johnson, Bennett Moe, Chris Russell, John Petrick,
Jennifer Iben, Rick Howard, Peter Kilby, and I'm Dave Bittner.
Thanks for listening. We'll see you back here next week.