CyberWire Daily - Malware sometimes changes its behavior. [Research Saturday]
Episode Date: October 30, 2021
Dr. Tudor Dumitras from the University of Maryland joins Dave Bittner to share a research study conducted in collaboration with industry partners from Facebook, NortonLifeLock Research Group and EURECOM. The project is called: "When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World." In the study, the team analyzed how malware samples change their behavior when executed on different hosts or at different times. Such "split personalities" may confound the current techniques for malware analysis and detection. Malware execution traces are typically collected by executing the samples in a controlled environment (a "sandbox"), and the techniques created and tested using such traces do not account for the broad range of behaviors observed in the wild. In the paper, the team shows how behavior variability can make those techniques appear more effective than they really are, and they make some recommendations for dealing with the variability. The research and executive summary can be found here: "When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World" and "Analysing malware variability in the real world." Learn more about your ad choices. Visit megaphone.fm/adchoices
Transcript
You're listening to the Cyber Wire Network, powered by N2K.
Like many of you, I was concerned about my data being sold by data brokers. So I decided to try Delete.me. I have to say, Delete.me is a game changer. Within days of signing up, they started removing my personal information from hundreds of data brokers. I finally have peace of mind knowing my data privacy is protected. Delete.me's team does all the work for you with detailed reports so you know exactly what's been done. Take control of your data and keep your private life private.
Hello everyone and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner and this is our weekly conversation with researchers and analysts
tracking down threats and vulnerabilities,
solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
So it started from the observation, which goes back 10 to 15 years, that malware, when executed on different hosts, or on the same host but at different times, sometimes changes its behavior.
That's Tudor Dumitras.
He's a professor and researcher at the University of Maryland and the Maryland Cybersecurity Center.
The research we're discussing today is titled, When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World.
And now, a message from our sponsor, Zscaler, the leader in cloud security.
Enterprises have spent billions of dollars on firewalls and VPNs, yet breaches continue to rise, with an 18% year-over-year increase in ransomware attacks and a record $75 million ransom payout in 2024.
These traditional security tools expand your attack surface with public-facing IPs
that are exploited by bad actors more easily than ever with AI tools. It's time to rethink your security. Zscaler Zero Trust plus AI stops attackers by hiding your attack surface, eliminating lateral movement, and connecting users only to specific apps, not the entire network. Continuously verifying every request based on identity and context. Simplifying security management with AI-powered automation. And detecting threats using AI to analyze over 500 billion daily transactions. Hackers can't attack what they can't see. Protect your organization with Zscaler Zero Trust and AI. Learn more at zscaler.com/security.
This is something that I've wanted to do for a long time, but only recently was I able to collect, in collaboration with an industry partner, a large enough data set to analyze this. And this is important to understand, because malware researchers typically collect execution traces in a sandbox. That's a controlled lab environment. And they do this to understand what the malware does, to analyze the malware, to figure out if it belongs to a known family, and also to create detection signatures, behavior-based detection signatures.
But the problem is that the same sample may behave differently in the sandbox than it does on different hosts.
This will affect the effectiveness of the conclusions of the malware analysis and the effectiveness of the signatures created for detection.
Some people call these split personalities.
When malware does different things on different hosts, these behaviors are often implemented by the malware authors with the intention of evading sandboxes, so that the malicious behavior is not performed in the sandbox.
And we just wanted to understand this.
There hadn't really been a large-scale measurement
of just how much this behavior changes in the real world.
What exactly changes, and which components of the behavior are more likely to change?
How does malware's variability compare to that of benign software?
And how does this affect malware detection and malware analysis?
In this research, in this paper, we worked with a partner to analyze a data set
that was recorded on five and a half million real hosts out there
that included multiple executions for each sample,
multiple traces for each sample.
Well, let's walk through the methodology together. I mean, at the outset, how did you all
decide to come at this? When you've got malware that you suspect is trying to avoid you taking
a closer look at it, where do you begin?
Right.
So this is the core of the matter.
When you know that malware is likely to evade,
has the intention and the incentive
to evade detection and analysis,
how do you go about selecting a sample
that is representative?
And really the only way to do this is to look at what happens on real hosts, which is also
what makes this difficult to do.
But we worked with an industry partner which has an antivirus product that runs on end
hosts.
It collects execution traces.
It monitors for a little while what the malware is trying to do and collects these things in order to perform further analysis to try to figure out how these things actually happen in the real world. But this is only done as a last line of defense. So if they can detect the malware through any other means, any other engine, they would just detect it.
They would not let it run at all.
And similarly, if something is clearly benign or is known to be benign, they would not do anything to it.
They would exonerate it.
But there is always a gray area, a set of binaries that are suspicious but that you still cannot be completely certain about. So they execute them. And they also stop the execution as soon as the malware tries to
do something nasty, as soon as it becomes clear that something bad is happening. But a lot of the
initial setup and initial behaviors of the malware are recorded.
And this gives us a wealth of data to look at, in particular, the differences in the behavior of the same malware sample across different hosts.
And, importantly, we never tried to distribute the malware to the hosts. We never tried to do this in a lab.
This is in the real world, and these are actual hosts that are under attack from the malware.
So this is what gives us some confidence that these results are representative.
Well, can you take us through some examples here of the types of things you were looking for
and some of the conclusions you were able to make?
Let me give you one example. The Ramnit worm, for example, is a well-known piece of malware.
And in the particular variant that we had in our data set, it tries to exploit a vulnerability.
It's an older vulnerability, CVE-2013-3660. And it does this in order to gain privilege escalation on Windows 7 in particular. When it launches this exploit, what you see in the execution trace is that it creates hundreds of mutexes until the exploit succeeds.
So this is part of the exploit execution process. But the worm is smart, so it tries to profile the target.
If it figures out that the target does not include the vulnerability
or it's already running in admin mode,
so it doesn't need to do privilege escalation,
it doesn't launch the exploit.
Should an analyst run this malware in a sandbox,
they would only observe one of these behaviors,
depending on what the environment was.
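To make those two personalities concrete, here is a minimal, purely illustrative sketch in Python of the kind of target-profiling branch described above. The function names and checks are hypothetical stand-ins (Ramnit itself is native Windows code, and its real logic is not reproduced here); the point is simply that the trace an analyst sees depends on what the profiling step finds.

```python
import ctypes
import platform


def is_admin() -> bool:
    """Best-effort admin check; only meaningful on Windows."""
    windll = getattr(ctypes, "windll", None)
    if windll is None:
        return False  # not on Windows
    try:
        return bool(windll.shell32.IsUserAnAdmin())
    except OSError:
        return False


def looks_exploitable() -> bool:
    """Hypothetical profiling step: only Windows 7 hosts get the exploit path."""
    return platform.system() == "Windows" and platform.release() == "7"


def observed_behavior() -> str:
    """Which 'personality' would show up in the execution trace on this host."""
    if is_admin() or not looks_exploitable():
        # Quiet trace: privilege escalation is unnecessary or impossible, so it is skipped.
        return "payload-only trace (few actions, no exploit)"
    # Noisy trace: the exploit loop creates hundreds of mutexes until it succeeds.
    return "exploit trace (hundreds of mutex creations, then payload)"


if __name__ == "__main__":
    print(observed_behavior())
```

On a patched or already-privileged host, only the quiet branch and its handful of actions would ever appear in the trace.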
In general, if you look at executions on different hosts,
or even on the same host, but maybe a few weeks later or a few months later,
you may see different behaviors.
So malware performing different registry operations, making different or additional API calls, omitting certain API calls, or, in some cases, exiting without doing anything at all.
So like I said, the existence of these split personalities was known,
and researchers and practitioners also developed methods to try to discover the
existence of these evasive behaviors in malware samples. But it was never measured at scale in the wild before. And because of that, it is hard to tell what impact this really has on the way we do malware analysis and malware detection.
Now, you all were able to gather, again, as you mentioned, with your industry partners, quite a data set here.
Can you describe to us how big was it?
How did you go about gathering it and having that large a data set?
What does that provide for you as a researcher?
Absolutely.
So the data set, as I mentioned,
comes from a collaboration that we had with an industry partner.
I personally, my background is in the industry.
I used to work at Symantec Research Labs before I became an academic.
So back in the day when Symantec was the largest security vendor.
And I really like to work with folks in the industry to understand what the biggest problems there are that they are facing.
And also to help them with large data analysis projects like this one.
So the data set was collected on 5.4 million real hosts,
and it includes multiple execution traces for the same sample.
So I think in total we have about 7.5 million
execution traces.
In some cases we have hundreds of execution traces
per sample.
And what these traces are is API traces. So these are Windows malware that perform API calls in order to download files, to connect to the internet, or to set certain registry entries or mutexes. So we recorded the actions that the malware was trying to perform, for example, when the malware tries to create a new file, as ransomware would do.
And each of these actions has a certain parameter.
So, for example, the file name or the registry path that's being accessed.
We have collected this data set.
We parsed it into these actions and parameters.
So each execution is attributed to the process ID that triggered it, accounting for things like thread injection and the launching of new processes, so we can figure out which executable started all of this. We then look at the hash of that executable, and when we have multiple traces of that same hash, we analyze the variability in their behavior. So
we do this in a couple of ways. We look at how the number of actions and the types of actions
differ. And also we look at the differences in the parameters. We try to break
this down into variability that occurs across hosts and also variability that occurs across
time. And then we also try to see if there is something invariant, something that doesn't
really change between executions that, for example, a malware signature could be based on in order to be reliable,
and how many executions you would need to observe in order to derive such a reliable signature.
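As a rough illustration of how traces like these can be organized for that kind of analysis, the sketch below groups hypothetical (action, parameter) events by executable hash and picks out the actions that appear in every trace of a sample, the sort of invariant a reliable signature could be built on. The field names and layout are assumptions for illustration, not the partner's actual telemetry format.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sample_hash: str   # hash of the executable that started the process tree
    trace_id: str      # one execution of that sample on one host at one time
    action: str        # e.g. "create_file", "set_registry", "create_mutex"
    parameter: str     # e.g. the file name or registry path being accessed


def group_traces(events):
    """Map sample_hash -> trace_id -> Counter of action type -> count."""
    traces = defaultdict(lambda: defaultdict(Counter))
    for e in events:
        traces[e.sample_hash][e.trace_id][e.action] += 1
    return traces


def invariant_actions(traces_for_sample):
    """Action types present in every trace of one sample: candidates for a reliable signature."""
    per_trace = [set(counter) for counter in traces_for_sample.values()]
    return set.intersection(*per_trace) if per_trace else set()
```

With traces grouped per sample like this, the cross-host and cross-time variability can then be computed, as in the delta metric discussed further below.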
And then we also looked at,
we tried to conduct sort of an experiment
to demonstrate what would happen
if you try to draw conclusions from a single execution.
That is typically the way things are done
when traces are collected in a sandbox.
So what were the results then?
I mean, what did you find?
So let's start with the behavioral differences themselves.
So first of all, it is interesting that there are many reasons
for these behavioral differences.
Researchers previously focused primarily on this sandbox evasion behavior as a cause of behavioral differences, but there are many, many other root causes.
There are, for example, differences in operating systems and the libraries available, as in the example that I gave you with the Ramnit worm.
We also saw that malware may attempt to perform some risky operations that
fail on some hosts. And because of that, the subsequent actions will be different on those
hosts. Malware may receive different commands from their CNC channels. So then at different
points in time, they may do different things or may not do anything at all. We also saw that many of these perform an initial installation.
So when you run the malware for the first time,
you are likely to see different trace than when you run it the second time and the third time.
And that's because the initial installation will perform some one-time operations,
such as setting certain registry keys, for example.
Perhaps not very surprisingly, malware often creates very random file names.
So the file name itself may differ quite a bit from one host to another.
So the interesting thing about this is that even if you are somehow able to catch sandbox
evasion and deal with this in a sandbox,
the traces will still not reflect the full range of behaviors
that you're likely to encounter in the wild
because there are all these additional reasons for variability.
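One common way to cope with sources of variability like the randomized file names mentioned above is to normalize trace parameters before comparing traces. The sketch below uses a few assumed rewrite rules; it is an illustrative technique, not necessarily the normalization used in the paper.

```python
import re

# Hypothetical normalizers: collapse values that are expected to vary randomly
# (GUIDs, temp-file names, long hex blobs) into stable placeholder tokens
# before comparing the parameters of two traces.
_RULES = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "<GUID>"),
    (re.compile(r"\\Temp\\[^\\]+", re.IGNORECASE), r"\\Temp\\<RANDOM_NAME>"),
    (re.compile(r"\b[0-9a-fA-F]{16,}\b"), "<HEX_BLOB>"),
]


def normalize_parameter(value: str) -> str:
    """Apply each rewrite rule in turn and return the normalized parameter."""
    for pattern, token in _RULES:
        value = pattern.sub(token, value)
    return value


print(normalize_parameter(r"C:\Users\x\AppData\Local\Temp\qz81kd0a.exe"))
# -> C:\Users\x\AppData\Local\Temp\<RANDOM_NAME>
```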
And we also saw that benign software also exhibits variability.
If you think about it,
a Windows update will perform different operations
for each update because it receives different things to install.
Also, it will differ from one host to another because of differences in the patch
levels of those hosts. So we see variability in benign software
as well. However, malware varies more.
And the variability is significantly higher in terms of the number of actions.
And when I say variability, I mean the delta.
So the fact that the sample performs 100 actions on one host versus 12 actions on another.
So not the length of the trace itself, but the variability within the traces of the same sample.
So this is what we looked at. This was our main metric that we measured, the variability,
the per sample variability. And this varies significantly more for malware in terms of the
number of actions that are performed. And the biggest contributor to this are the file creations.
So on some hosts, there are many file creations. On some hosts, there are much fewer file creations.
So that's across hosts. And if we look across time, the main way that things vary is in missing actions. So some actions that you see at one point are not going to be there several weeks later. In many cases, this is because the malware just stops doing anything.
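A minimal sketch of that per-sample delta, reusing the grouped-trace layout from the earlier snippet; the spread in action counts used here is an illustrative metric and may differ from the exact measure used in the study.

```python
from collections import Counter


def action_count_delta(traces_for_sample, action=None):
    """Spread (max - min) in the number of actions across the traces of one sample.

    `traces_for_sample` maps trace_id -> Counter of action type -> count, as
    produced by the grouping sketch above. If `action` is given, e.g.
    "create_file", only that action type is counted.
    """
    counts = [
        trace[action] if action else sum(trace.values())
        for trace in traces_for_sample.values()
    ]
    return max(counts) - min(counts) if counts else 0


# Toy example: the same sample observed on three hosts.
sample_traces = {
    "host_a": Counter({"create_file": 95, "set_registry": 5}),
    "host_b": Counter({"create_file": 30, "set_registry": 7}),
    "host_c": Counter({"create_file": 8, "set_registry": 4}),
}
print(action_count_delta(sample_traces))                 # 88 (100 vs 12 total actions)
print(action_count_delta(sample_traces, "create_file"))  # 87 (file creations only)
```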
This variability, is it inherently risky? In other words, the fact that it has so many options and tries to do many things, does that make it noisier and increase the possibility that it will be detected?
Absolutely. I think that the variability itself, and this is actually one of the conclusions of our paper, can be a useful signal for detection. As far as I know, nobody uses it in this way right now, but it could potentially be one useful signal for figuring out whether something is likely to be malicious or not. In terms of
the danger that this poses to the way we conduct business today in terms of malware analysis and
malware detection, we conducted one experiment in our paper with a malware clustering technique,
which is often used to determine if an individual sample belongs to a known family.
So companies do this clustering in order to group samples into
families based on this behavioral similarity. And the assumption here is that if you observe
a certain behavior, then all the other behaviors will also fall into the same cluster,
the cluster of the family. Otherwise, you cannot really conclude that it's the same family if the behaviors are so different.
And this is, in fact, what we observed.
So typically, when you do clustering,
you use only one trace per sample,
and then the resulting clusters, at least the most obvious ones,
indicate the malware families in your data set.
In our case, we use a clustering technique that is pretty seminal
from, I think, maybe 10 years ago.
We try to do the same thing, but with multiple traces for each sample.
And we just threw these in without telling the algorithm
that these actually belong to the same sample,
so they are the same malware.
And then what happened was that in 33% of the samples,
there was enough variability across these four traces
that traces of the same sample ended up in different clusters.
So that's as if they belong to different families.
And in fact, for 1% of the samples, each of the four traces was in a different cluster.
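As a rough illustration of how that kind of fragmentation could be measured, the sketch below assumes you already have a cluster label for each trace and know which sample each trace came from; the behavioral clustering algorithm itself is not reproduced here.

```python
from collections import defaultdict


def fragmentation_stats(trace_to_cluster, trace_to_sample):
    """For each sample, count how many distinct clusters its traces landed in."""
    clusters_per_sample = defaultdict(set)
    for trace_id, cluster_id in trace_to_cluster.items():
        clusters_per_sample[trace_to_sample[trace_id]].add(cluster_id)

    split = {s for s, clusters in clusters_per_sample.items() if len(clusters) > 1}
    total = len(clusters_per_sample)
    return {
        "samples": total,
        "split_across_clusters": len(split),
        "split_fraction": len(split) / total if total else 0.0,
    }


# Toy example: sample "s1" has traces in two different clusters, "s2" does not.
print(fragmentation_stats(
    {"t1": "c1", "t2": "c2", "t3": "c1", "t4": "c1"},
    {"t1": "s1", "t2": "s1", "t3": "s2", "t4": "s2"},
))
# -> {'samples': 2, 'split_across_clusters': 1, 'split_fraction': 0.5}
```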
This doesn't necessarily mean that clustering is useless,
but this really indicates that you should be very careful when you draw conclusions from experiments conducted
with a single trace per sample
because this kind of behavior of samples
that end up clustered in different clusters, different families
this would not be observed if you only use one trace per sample.
This suggests that the reported accuracy of mapping samples to families through behavioral clustering is really lower than previously believed because of this variability.
This is just one example, one concrete experiment that I'm telling you about, but this actually has broader implications also for malware detection
and for malware analysis.
The accuracy of these things is likely to be lower than you might expect
if you only look at one trace per sample.
So where do you all go next with this?
I mean, obviously you're partnering with industry,
and they will reap some of the benefits of the things that you've found here, no doubt.
Are there areas, I mean, has this piqued your curiosity?
Is there more to be done?
I think there's a lot more to be done.
Like I said, I love working with the industry.
And so, first of all, the results of our study are public.
We have published our paper in a leading academic research conference.
The paper is available.
And we are also available to answer more questions if anybody is
interested. So beyond our particular collaborator that worked with us on this study, the broader conclusion, the bigger picture if you want, is that this is something that really should be taken into account when doing malware analysis and malware detection: these behaviors that you can extract from multiple traces.
In general, companies, organizations that
have antivirus products or do malware detection on end hosts in some form,
they tend to collect very similar data to the one that we analyzed in this paper.
As far as I know, they don't do much with it.
But here we try to show just what could be done with it, what you could learn, and how it might affect your bottom line if you don't understand how this variability, which is a real thing in the wild, is likely to affect your experimental results.
I think in terms of going forward,
I think one thing that I'm really interested in, in the bigger picture, is this problem that malware experiments can give a false sense of security. And what I mean by that is that we see a lot of academic papers and industry evaluations discussing new malware detection techniques that often report detection rates above 90%. And then invariably this high level of performance is hard to reach in the real world.
The question is why? Why is that?
Part of the answer is that when these techniques are developed
and also tested using traces from a sandbox, then they may seem
that they work better than they really do, right? Because they don't capture this broad range of
behaviors that happen in the wild. So this is one reason for this false sense of security. But
ultimately, I would like to understand the full picture. There are other factors, of course, that contribute to this, and I'd like to understand how much each of them contributes to this accuracy degradation that folks observe between experiments conducted in the lab and when they deploy their tools in the real world.
Our thanks to Tudor Dumitras from the University of Maryland for joining us.
The research is titled When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World.
We'll have a link in the show notes.
And now, a message from Black Cloak.
Did you know the easiest way for cyber criminals to bypass your company's defenses
is by targeting your executives and their families at home?
Black Cloak's award-winning
digital executive protection platform secures their personal devices, home networks, and connected lives. Protect your executives and their families 24-7, 365 with Black Cloak.
Learn more at blackcloak.io.
The CyberWire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe,
where they're co-building the next generation of cybersecurity teams and technologies.
Our amazing CyberWire team is Elliott Peltzman, Trey Hester, Brandon Karpf, Puru Prakash, Justin Sabey, Tim Nodar, Joe Carrigan, Carole Theriault, Ben Yelin, Nick Vilecki, Gina Johnson, Bennett Moe, Chris Russell, John Petrik, Jennifer Eiben, Rick Howard, Peter Kilpe, and I'm Dave Bittner.
Thanks for listening. We'll see you back here next week.