CyberWire Daily - Malware sometimes changes its behavior. [Research Saturday]
Episode Date: October 30, 2021
Dr. Tudor Dumitras from the University of Maryland joins Dave Bittner to share a research study conducted in collaboration with industry partners from Facebook, NortonLifeLock Research Group and EURECOM. The project is called: "When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World." In the study, the team analyzed how malware samples change their behavior when executed on different hosts or at different times. Such "split personalities" may confound the current techniques for malware analysis and detection. Malware execution traces are typically collected by executing the samples in a controlled environment (a "sandbox"), and the techniques created and tested using such traces do not account for the broad range of behaviors observed in the wild. In the paper, the team shows how behavior variability can make those techniques appear more effective than they really are, and they make some recommendations for dealing with the variability. The research and executive summary can be found here: "When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World" and "Analysing malware variability in the real world." Learn more about your ad choices. Visit megaphone.fm/adchoices
Transcript
You're listening to the Cyber Wire Network, powered by N2K.
Like many of you, I was concerned about my data being sold by data brokers. So I decided to try Delete.me. I have to say, Delete.me is a game changer. Within days of signing up, they started removing my personal information from hundreds of data brokers. I finally have peace of mind knowing my data privacy is protected. Delete.me's team does all the work for you with detailed reports so you know exactly what's been done. Take control of your data and keep your private life private.
Hello everyone and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner and this is our weekly conversation with researchers and analysts
tracking down threats and vulnerabilities,
solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
So it started from the observation, which goes back 10 to 15 years, that malware, when executed on different hosts, or on the same host but at different times, sometimes changes its behavior.
That's Tudor Dumitras.
He's a professor and researcher at the University of Maryland and the Maryland Cybersecurity Center.
The research we're discussing today is titled, When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World.
And now, a message from our sponsor, Zscaler, the leader in cloud security.
Enterprises have spent billions of dollars on firewalls and VPNs, yet breaches continue to rise, with an 18% year-over-year increase in ransomware attacks and a record $75 million ransom payout in 2024.
These traditional security tools expand your attack surface with public-facing IPs
that are exploited by bad actors more easily than ever with AI tools. It's time to rethink your security. Zscaler Zero Trust plus AI stops attackers by hiding your attack surface, eliminating lateral movement, and connecting users only to specific apps, not the entire network. Continuously verifying every request based on identity and context. Simplifying security management with AI-powered automation. And detecting threats using AI to analyze over 500 billion daily transactions. Hackers can't attack what they can't see. Protect your organization with Zscaler Zero Trust and AI. Learn more at zscaler.com/security.
This is something that I've wanted to do for a long time, but only recently was I able to collect, in collaboration with an industry partner, a large enough data set to analyze this. And this is important to understand, because malware researchers typically collect execution traces in a sandbox. That's a controlled lab environment. And they do this to understand what the malware does, to analyze the malware, to figure out if it belongs to a known family, and also to create detection signatures, behavior-based detection signatures.
But the problem is that the same sample may behave differently in the sandbox than it does on different hosts.
This will affect the effectiveness of the conclusions of the malware analysis and the effectiveness of the signatures created for detection.
Some people call these split personalities.
When malware does different things on different hosts, these behaviors are often implemented by the malware authors with the intention of evading sandboxes, so that the malicious behavior is not performed in the sandbox.
And we just wanted to understand this.
There hadn't really been a large-scale measurement
of just how much this behavior changes in the real world.
What exactly changes, and which components of the behavior are more likely to change?
How does malware's variability compare to that of benign software?
And how does this affect malware detection and malware analysis?
In this research, in this paper, we worked with a partner to analyze a data set
that was recorded on five and a half million real hosts out there
that included multiple executions for each sample,
multiple traces for each sample.
Well, let's walk through the methodology together. I mean, at the outset, how did you all
decide to come at this? When you've got malware that you suspect is trying to avoid you taking
a closer look at it, where do you begin?
Right.
So this is the core of the matter.
When you know that malware is likely to evade,
has the intention and the incentive
to evade detection and analysis,
how do you go about selecting a sample
that is representative?
And really the only way to do this is to look at what happens on real hosts, which is also
what makes this difficult to do.
But we worked with an industry partner which has an antivirus product that runs on end
hosts.
It collects execution traces.
It monitors for a little while what the malware is trying to do and collects these things in order to perform further analysis to try to figure out how these things actually happen in the real world. But this is only done as a last line of defense. So if they can detect the malware through any other means, any other engine, they would just detect it.
They would not let it run at all.
And similarly, if something is clearly benign or is known to be benign, they would not do anything to it.
They would exonerate it.
But there is always a gray area, a set of binaries that are suspicious but that you still cannot be completely certain about. So they execute them. And they also stop the execution as soon as the malware tries to
do something nasty, as soon as it becomes clear that something bad is happening. But a lot of the
initial setup and initial behaviors of the malware are recorded.
And this gives us a wealth of data to look at, in particular, the differences in the behavior of the same malware sample across different hosts.
And, importantly, we never tried to distribute the malware to the hosts. We never tried to do this in a lab.
This is in the real world, and these are actual hosts that are under attack from the malware.
So this is what gives us some confidence that these results are representative.
Well, can you take us through some examples here of the types of things you were looking for
and some of the conclusions you were able to make?
Let me give you one example. The Ramnit worm, for example, is a well-known piece of malware.
And in the particular variant that we had in our data set, it tries to exploit a vulnerability.
It's an older vulnerability, CVE-2013-3660. And it does this in order to gain privilege escalation on Windows 7 in particular. When it launches this exploit, what you see in the execution trace is that it creates hundreds of mutexes until the exploit succeeds.
So this is part of the exploit execution process. But the worm is smart, so it tries to profile the target.
If it figures out that the target does not include the vulnerability
or it's already running in admin mode,
so it doesn't need to do privilege escalation,
it doesn't launch the exploit.
Should an analyst run this malware in a sandbox,
they would only observe one of these behaviors,
depending on what the environment was.
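To make those two personalities concrete, here is a minimal, purely illustrative sketch in Python of the kind of target-profiling branch described above. The function names and checks are hypothetical stand-ins (Ramnit itself is native Windows code, and its real logic is not reproduced here); the point is simply that the trace an analyst sees depends on what the profiling step finds.

```python
import ctypes
import platform


def is_admin() -> bool:
    """Best-effort admin check; only meaningful on Windows."""
    windll = getattr(ctypes, "windll", None)
    if windll is None:
        return False  # not on Windows
    try:
        return bool(windll.shell32.IsUserAnAdmin())
    except OSError:
        return False


def looks_exploitable() -> bool:
    """Hypothetical profiling step: only Windows 7 hosts get the exploit path."""
    return platform.system() == "Windows" and platform.release() == "7"


def observed_behavior() -> str:
    """Which 'personality' would show up in the execution trace on this host."""
    if is_admin() or not looks_exploitable():
        # Quiet trace: privilege escalation is unnecessary or impossible, so it is skipped.
        return "payload-only trace (few actions, no exploit)"
    # Noisy trace: the exploit loop creates hundreds of mutexes until it succeeds.
    return "exploit trace (hundreds of mutex creations, then payload)"


if __name__ == "__main__":
    print(observed_behavior())
```

On a patched or already-privileged host, only the quiet branch and its handful of actions would ever appear in the trace.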
In general, if you look at executions on different hosts,
or even on the same host, but maybe a few weeks later or a few months later,
you may see different behaviors.
So malware performing different registry operations, making different or additional API calls, omitting certain API calls, or, in some cases, exiting without doing anything at all.
So like I said, the existence of these split personalities was known,
and researchers and practitioners also developed methods to try to discover the
existence of these evasive behaviors in malware samples. But it was never measured at scale in the wild before. And because of that, it is hard to tell what impact this really has on the way we do malware analysis and malware detection.
Now, you all were able to gather, again, as you mentioned, with your industry partners, quite a data set here.
Can you describe to us how big was it?
How did you go about gathering it and having that large a data set?
What does that provide for you as a researcher?
Absolutely.
So the data set, as I mentioned,
comes from a collaboration that we had with an industry partner.
I personally, my background is in the industry.
I used to work at Symantec Research Labs before I became an academic.
So back in the day when Symantec was the largest security vendor.
And I really like to work with folks in the industry to understand what the biggest problems there are that they are facing.
And also to help them with large data analysis projects like this one.
So the data set was collected on 5.4 million real hosts,
and it includes multiple execution traces for the same sample.
So I think in total we have about 7.5 million
execution traces.
In some cases we have hundreds of execution traces
per sample.
And what these traces are is API traces. So these are Windows malware that perform API calls in order to download files, to connect to the internet, or to set certain registry entries or mutexes. So we recorded the actions that the malware was trying to perform, for example, when the malware tries to create a new file, as ransomware would do.
And each of these actions has a certain parameter.
So, for example, the file name or the registry path that's being accessed.
We have collected this data set.
We parsed it into these actions and parameters.
So each execution is attributed to the process ID that triggered it, accounting for things like thread injection and the launching of new processes, so we can figure out which executable started all of this. We then look at the hash of that executable, and when we have multiple traces of that same hash, we analyze the variability in their behavior. So
we do this in a couple of ways. We look at how the number of actions and the types of actions
differ. And also we look at the differences in the parameters. We try to break
this down into variability that occurs across hosts and also variability that occurs across
time. And then we also try to see if there is something invariant, something that doesn't
really change between executions that, for example, a malware signature could be based on in order to be reliable,
and how many executions you would need to observe in order to derive such a reliable signature.
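As a rough illustration of how traces like these can be organized for that kind of analysis, the sketch below groups hypothetical (action, parameter) events by executable hash and picks out the actions that appear in every trace of a sample, the sort of invariant a reliable signature could be built on. The field names and layout are assumptions for illustration, not the partner's actual telemetry format.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sample_hash: str   # hash of the executable that started the process tree
    trace_id: str      # one execution of that sample on one host at one time
    action: str        # e.g. "create_file", "set_registry", "create_mutex"
    parameter: str     # e.g. the file name or registry path being accessed


def group_traces(events):
    """Map sample_hash -> trace_id -> Counter of action type -> count."""
    traces = defaultdict(lambda: defaultdict(Counter))
    for e in events:
        traces[e.sample_hash][e.trace_id][e.action] += 1
    return traces


def invariant_actions(traces_for_sample):
    """Action types present in every trace of one sample: candidates for a reliable signature."""
    per_trace = [set(counter) for counter in traces_for_sample.values()]
    return set.intersection(*per_trace) if per_trace else set()
```

With traces grouped per sample like this, the cross-host and cross-time variability can then be computed, as in the delta metric discussed further below.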
And then we also looked at,
we tried to conduct sort of an experiment
to demonstrate what would happen
if you try to draw conclusions from a single execution.
That is typically the way things are done
when traces are collected in a sandbox.
So what were the results then?
I mean, what did you find?
So let's start with the behavioral differences themselves.
So first of all, it is interesting that there are many reasons
for these behavioral differences.
Researchers previously focused primarily on this sandbox evasion behavior as a cause of behavioral differences, but there are many, many other root causes.
There are, for example, differences in operating systems and the libraries available, as in the example that I gave you with the Ramnit worm.
We also saw that malware may attempt to perform some risky operations that
fail on some hosts. And because of that, the subsequent actions will be different on those
hosts. Malware may receive different commands from their CNC channels. So then at different
points in time, they may do different things or may not do anything at all. We also saw that many of these perform an initial installation.
So when you run the malware for the first time,
you are likely to see different trace than when you run it the second time and the third time.
And that's because the initial installation will perform some one-time operations,
such as setting certain registry keys, for example.
Perhaps not very surprisingly, malware often creates very random file names.
So the file name itself may differ quite a bit from one host to another.
So the interesting thing about this is that even if you are somehow able to catch sandbox
evasion and deal with this in a sandbox,
the traces will still not reflect the full range of behaviors
that you're likely to encounter in the wild
because there are all these additional reasons for variability.
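One common way to cope with sources of variability like the randomized file names mentioned above is to normalize trace parameters before comparing traces. The sketch below uses a few assumed rewrite rules; it is an illustrative technique, not necessarily the normalization used in the paper.

```python
import re

# Hypothetical normalizers: collapse values that are expected to vary randomly
# (GUIDs, temp-file names, long hex blobs) into stable placeholder tokens
# before comparing the parameters of two traces.
_RULES = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "<GUID>"),
    (re.compile(r"\\Temp\\[^\\]+", re.IGNORECASE), r"\\Temp\\<RANDOM_NAME>"),
    (re.compile(r"\b[0-9a-fA-F]{16,}\b"), "<HEX_BLOB>"),
]


def normalize_parameter(value: str) -> str:
    """Apply each rewrite rule in turn and return the normalized parameter."""
    for pattern, token in _RULES:
        value = pattern.sub(token, value)
    return value


print(normalize_parameter(r"C:\Users\x\AppData\Local\Temp\qz81kd0a.exe"))
# -> C:\Users\x\AppData\Local\Temp\<RANDOM_NAME>
```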
And we also saw that benign software also exhibits variability.
If you think about it,
a Windows update will perform different operations
for each update because it receives different things to install.
Also, it will differ from one host to another because of differences in the patch
levels of those hosts. So we see variability in benign software
as well. However, malware varies more.
And the variability is significantly higher in terms of the number of actions.
And when I say variability, I mean the delta.
So the fact that the sample performs 100 actions on one host versus 12 actions on another.
So not the length of the trace itself, but the variability within the traces of the same sample.
So this is what we looked at. This was our main metric that we measured, the variability,
the per sample variability. And this varies significantly more for malware in terms of the
number of actions that are performed. And the biggest contributor to this are the file creations.
So on some hosts, there are many file creations. On some hosts, there are much fewer file creations.
So that's across hosts. And if we look across time, the main way that things vary is in missing actions. So some actions that you see at one point are not going to be there several weeks later. In many cases, this is because the malware just stops doing anything.
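A minimal sketch of that per-sample delta, reusing the grouped-trace layout from the earlier snippet; the spread in action counts used here is an illustrative metric and may differ from the exact measure used in the study.

```python
from collections import Counter


def action_count_delta(traces_for_sample, action=None):
    """Spread (max - min) in the number of actions across the traces of one sample.

    `traces_for_sample` maps trace_id -> Counter of action type -> count, as
    produced by the grouping sketch above. If `action` is given, e.g.
    "create_file", only that action type is counted.
    """
    counts = [
        trace[action] if action else sum(trace.values())
        for trace in traces_for_sample.values()
    ]
    return max(counts) - min(counts) if counts else 0


# Toy example: the same sample observed on three hosts.
sample_traces = {
    "host_a": Counter({"create_file": 95, "set_registry": 5}),
    "host_b": Counter({"create_file": 30, "set_registry": 7}),
    "host_c": Counter({"create_file": 8, "set_registry": 4}),
}
print(action_count_delta(sample_traces))                 # 88 (100 vs 12 total actions)
print(action_count_delta(sample_traces, "create_file"))  # 87 (file creations only)
```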
This variability, is it inherently risky? In other words, the fact that it has so many options and tries to do many things, does that make it noisier and increase the possibility that it will be detected?
Absolutely. I think that the variability itself, and this is actually one of the conclusions of our paper, can be a useful signal for detection. As far as I know, nobody uses it in this way right now, but it could potentially be one useful signal for figuring out whether something is likely to be malicious or not. In terms of
the danger that this poses to the way we conduct business today in terms of malware analysis and
malware detection, we conducted one experiment in our paper with a malware clustering technique,
which is often used to determine if an individual sample belongs to a known family.
So companies do this clustering in order to group samples into
families based on this behavioral similarity. And the assumption here is that if you observe
a certain behavior, then all the other behaviors will also fall into the same cluster,
the cluster of the family. Otherwise, you cannot really conclude that it's the same family if the behaviors are so different.
And this is, in fact, what we observed.
So typically, when you do clustering,
you use only one trace per sample,
and then the resulting clusters, at least the most obvious ones,
indicate the malware families in your data set.
In our case, we use a clustering technique that is pretty seminal
from, I think, maybe 10 years ago.
We try to do the same thing, but with multiple traces for each sample.
And we just threw these in without telling the algorithm
that these actually belong to the same sample,
so they are the same malware.
And then what happened was that in 33% of the samples,
there was enough variability across these four traces
that traces of the same sample ended up in different clusters.
So that's as if they belong to different families.
And in fact, for 1% of the samples, each of the four traces was in a different cluster.
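As a rough illustration of how that kind of fragmentation could be measured, the sketch below assumes you already have a cluster label for each trace and know which sample each trace came from; the behavioral clustering algorithm itself is not reproduced here.

```python
from collections import defaultdict


def fragmentation_stats(trace_to_cluster, trace_to_sample):
    """For each sample, count how many distinct clusters its traces landed in."""
    clusters_per_sample = defaultdict(set)
    for trace_id, cluster_id in trace_to_cluster.items():
        clusters_per_sample[trace_to_sample[trace_id]].add(cluster_id)

    split = {s for s, clusters in clusters_per_sample.items() if len(clusters) > 1}
    total = len(clusters_per_sample)
    return {
        "samples": total,
        "split_across_clusters": len(split),
        "split_fraction": len(split) / total if total else 0.0,
    }


# Toy example: sample "s1" has traces in two different clusters, "s2" does not.
print(fragmentation_stats(
    {"t1": "c1", "t2": "c2", "t3": "c1", "t4": "c1"},
    {"t1": "s1", "t2": "s1", "t3": "s2", "t4": "s2"},
))
# -> {'samples': 2, 'split_across_clusters': 1, 'split_fraction': 0.5}
```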
This doesn't necessarily mean that clustering is useless,
but this really indicates that you should be very careful when you draw conclusions from experiments conducted
with a single trace per sample
because this kind of behavior of samples
that end up clustered in different clusters, different families
this would not be observed if you only use one trace per sample.
This suggests that the reported accuracy of mapping samples to families through behavioral clustering is really lower than previously believed because of this variability.
This is just one example, one concrete experiment that I'm telling you about, but this actually has broader implications also for malware detection
and for malware analysis.
The accuracy of these things is likely to be lower than you might expect
if you only look at one trace per sample.
So where do you all go next with this?
I mean, obviously you're partnering with industry,
and they will reap some of the benefits of the things that you've found here, no doubt.
Are there areas, I mean, has this piqued your curiosity?
Is there more to be done?
I think there's a lot more to be done.
Like I said, I love working with the industry.
And so, first of all, the results of our study are public.
We have published our paper in a leading academic research conference.
The paper is available.
And we are also available to answer more questions if anybody is
interested. So beyond our particular collaborator that worked with us on this study, the broader conclusion, the bigger picture if you want, is that this is something that really should be taken into account when doing malware analysis and malware detection: these behaviors that you can extract from multiple traces.
In general, companies, organizations that
have antivirus products or do malware detection on end hosts in some form,
they tend to collect very similar data to the one that we analyzed in this paper.
As far as I know, they don't do much with it.
But here we try to show just what could be done with it, what you could learn, and how it might affect your bottom line if you don't understand how this variability, which is a real thing in the wild, is likely to affect your experimental results.
I think in terms of going forward,
I think one thing that I'm really interested in, in the bigger picture, is this problem that malware experiments can give a false sense of security. And what I mean by that is that we see a lot of academic papers and industry evaluations discussing new malware detection techniques that often report detection rates above 90%. And then invariably this high level of performance is hard to reach in the real world.
The question is why? Why is that?
Part of the answer is that when these techniques are developed
and also tested using traces from a sandbox, then they may seem
that they work better than they really do, right? Because they don't capture this broad range of
behaviors that happen in the wild. So this is one reason for this false sense of security. But
ultimately, I would like to understand the full picture. There are other factors, of course, that contribute to this, and I'd like to understand how much each of them contributes to this accuracy degradation that folks observe between experiments conducted in the lab and when they deploy their tools in the real world.
Our thanks to Tudor Dumitras from the University of Maryland for joining us.
The research is titled When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World.
We'll have a link in the show notes.
And now, a message from Black Cloak.
Did you know the easiest way for cyber criminals to bypass your company's defenses
is by targeting your executives and their families at home?
Black Cloak's award-winning
digital executive protection platform secures their personal devices, home networks, and connected lives. Protect your executives and their families 24-7, 365 with Black Cloak.
Learn more at blackcloak.io.
The CyberWire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe,
where they're co-building the next generation of cybersecurity teams and technologies.
Our amazing CyberWire team is Elliott Peltzman, Trey Hester, Brandon Karpf, Puru Prakash, Justin Sabey, Tim Nodar, Joe Carrigan, Carole Theriault, Ben Yelin, Nick Vilecki, Gina Johnson, Bennett Moe, Chris Russell, John Petrik, Jennifer Eiben, Rick Howard, Peter Kilpe, and I'm Dave Bittner.
Thanks for listening. We'll see you back here next week.