(I previously published this article as an answer to a question posted on the Information Security website, under a profile I have since deleted.)
First, we need to dive into the methods commonly used by JS malware:
- Server side polymorphism
Literally meaning "many shapes", polymorphism is a technique used by
malware authors to evade signature-based detectors. Polymorphism is
said to be server-side when the engine that produces many different
copies of the malware is hosted on a compromised web server
(Server-Side Polymorphism: Crime-Ware as a Service Model (CaaS)). The
Simulated Metamorphic Encryption Generator (SMEG) version 1.0 was one
of the early engines developed to implement polymorphism for computer
viruses, in the early 1990s
(Parallel analysis of polymorphic viral code using automated deduction system).
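To see why exact signatures struggle against this, here is a minimal sketch in Python. It is not the engine described in the cited papers: `make_variant` is a hypothetical stand-in that only renames one variable in a harmless snippet, and the "signature" is assumed to be a plain SHA-256 digest of the file.

```python
import hashlib
import random
import string

# Harmless stand-in for a script served to visitors (an assumption for illustration).
TEMPLATE = "var NAME = 'hello'; console.log(NAME);"

def make_variant(rng: random.Random) -> str:
    """Return a functionally identical copy with a freshly named variable."""
    name = "v_" + "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return TEMPLATE.replace("NAME", name)

def signature(code: str) -> str:
    """A naive signature: the SHA-256 digest of the exact bytes."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

rng = random.Random(0)
variants = [make_variant(rng) for _ in range(5)]
signatures = {signature(v) for v in variants}
```

Every distinct variant produces a different digest, so a detector that only knows the digest of one copy will not recognize the next one, even though the behavior is unchanged.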
- Code obfuscation
The other common feature of malicious JavaScript code is that it is
almost always obfuscated. This does not make detection any simpler,
because innocuous JavaScript code is often obfuscated too (for
instance, some developers do not want their own JavaScript functions
to be understood by others, since the HTML and JS source of a page can
easily be read). Along with server-side polymorphism, code obfuscation
is widely used by malware authors to circumvent antivirus scanners.
Many techniques can be used to obfuscate JavaScript code, such as
string reversing, Unicode and Base64 encoding, string splitting and
document object model (DOM) interaction
(Malware with your Mocha? Obfuscation and anti-emulation tricks in malicious JavaScript).
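From the detector's side, each of the string tricks listed above can be undone mechanically once it has been recognized. The following Python sketch shows one small helper per trick; the function names are mine, the inputs are harmless, and real samples usually stack several layers, so this is an illustration rather than a usable de-obfuscator.

```python
import base64
import re

def unreverse(s: str) -> str:
    """Undo string reversing."""
    return s[::-1]

def decode_base64(s: str) -> str:
    """Undo Base64 encoding (assumes the payload is UTF-8 text)."""
    return base64.b64decode(s).decode("utf-8")

def decode_unicode_escapes(s: str) -> str:
    """Replace each \\uXXXX escape by the character it denotes."""
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

def join_split_literals(js: str) -> str:
    """Merge adjacent double-quoted literals joined by '+': "ev" + "al" -> "eval"."""
    pattern = re.compile(r'"([^"\\]*)"\s*\+\s*"([^"\\]*)"')
    previous = None
    while previous != js:  # repeat until nothing is left to merge
        previous = js
        js = pattern.sub(lambda m: '"' + m.group(1) + m.group(2) + '"', js)
    return js
```

For example, `join_split_literals('"ev" + "al"')` returns `'"eval"'`, which a plain keyword search can then match. DOM interaction is not covered here, because undoing it requires executing or emulating the page.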
- Code unfolding
Code unfolding is the mechanism by which new code is introduced at run
time. In JavaScript, this is done by calling functions such as
document.write() and eval() in order to execute obfuscated portions of
code and functions (Weaknesses in Defenses Against Web-Borne Malware).
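A very rough static indicator of code unfolding is how often a script calls these functions. The Python sketch below counts call sites of the two functions named above with a regular expression; the helper name is hypothetical, and the approach is deliberately naive (it does not see aliases such as `var e = eval`, nor calls assembled at run time).

```python
import re

# The two sinks named in the text; adding others (for example the Function
# constructor) would be an assumption beyond what the article lists.
SINKS = ("eval", "document.write")

def count_unfolding_sinks(js: str) -> dict:
    """Count textual call sites of each sink in a piece of JavaScript source."""
    counts = {}
    for sink in SINKS:
        # Reject matches that are the tail of a longer identifier (e.g. "medieval(").
        pattern = r"(?<![\w$])" + re.escape(sink) + r"\s*\("
        counts[sink] = len(re.findall(pattern, js))
    return counts

sample = 'var s = "al" + "ert(1)"; eval(s); document.write("<p>x</p>"); medieval(1);'
sink_counts = count_unfolding_sinks(sample)
```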
- Heap spray
This attack mainly targets web browsers. The attacker's script fills
the heap with many copies of attacker-controlled data (typically a NOP
sled followed by shellcode), so that when a separate memory corruption
vulnerability redirects execution into the heap, it is likely to land
on that data and lead to remote code execution
(BuBBle: A Javascript Engine Level Countermeasure against Heap-Spraying Attacks)
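As a purely illustrative static heuristic (it is not the BuBBle countermeasure, which works inside the JavaScript engine), one can look for string literals that already contain a run of NOP-like escapes. The pattern and the minimum run length of four are assumptions chosen for the example.

```python
import re

# Four or more consecutive x86 NOP (0x90) escapes in any of a few common spellings.
NOP_RUN = re.compile(r"(?:%u9090|\\u9090|%90|\\x90){4,}", re.IGNORECASE)

def has_nop_sled_literal(js: str) -> bool:
    """Return True if the source contains a literal run of NOP-like escapes."""
    return NOP_RUN.search(js) is not None
```

Sprays that build the sled in a loop from a single short unit will not be caught by this check, which is one reason run-time approaches exist.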
- Drive-by download
Drive-by download attacks consist of downloading and executing or
installing malicious programs without the user's consent. Such attacks
work by exploiting vulnerabilities in browsers, in their add-ons or
plugins such as ActiveX controls, or in unpatched popular software such
as Acrobat Reader and Adobe Flash Player
(Drive-by download attacks: effect and detection methods, MSc Information Security)
- Multi execution paths
It is possible to trigger an action only if certain conditions are
fulfilled. Such a condition could be the arrival of a given date or
the existence of a file on the system on which the malware is intended
to run. Another well-known example is a denial of service attack that
is launched only once the number of nodes in the botnet has reached a
certain value. That is the notion of multi execution paths
(Exploring Multiple Execution Paths for Malware Analysis)
- Implicit conditionals
This technique is mainly used against detectors based on dynamic
analysis. The main idea is to execute a set of instructions while
hiding the condition that triggers it
(Weaknesses in Defenses Against Web-Borne Malware)
Given these common features and tactics used by JavaScript malware, if
you want to detect this type of malware as you asked, you first need to
study the state of the art in detection methods. Various methods have
been developed to detect web (JavaScript) malware. We can divide them
into two main categories as follows:
- Machine learning based classifiers
- Features: Distinguishing features are extracted from the HTML and
JavaScript code. These features are then used to train a machine
learning classifier. The premise of this approach is that malicious
webpages are likely to differ from benign ones
(Thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages)
- Advantages: Lightweight approach, well suited to analyzing websites in bulk.
- Drawbacks: Ineffective against obfuscated JavaScript code and useless against new malicious code patterns or zero-day attacks.
- Dynamic methods
- Features: Based on dynamic behavior analysis, these techniques are
implemented using either proxies, where a page is rendered to the
visitor only after its safety has been checked, or a sandboxing
environment relying on honeyclients (Same thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
- Advantages: Efficient against zero-day attacks and obfuscated code.
- Drawbacks: Resource and time consuming. Sandboxing environments rely
on low interaction honeyclients, which are themselves based on virus
signatures and thus suffer from the same disadvantages as the static
methods.
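To make the first category more concrete, here is a minimal Python sketch of the feature extraction step. The features chosen (script and iframe counts, calls to eval(), longest string literal, character entropy) are illustrative assumptions on my part, not the feature set of the cited thesis or of any particular tool.

```python
import math
import re
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())

def extract_features(page: str) -> dict:
    """Turn one HTML page into a small dictionary of numeric features."""
    scripts = re.findall(r"<script\b[^>]*>(.*?)</script>", page,
                         flags=re.IGNORECASE | re.DOTALL)
    js = "\n".join(scripts)
    # Naive literal matching: good enough for a sketch, not a real JS lexer.
    literals = re.findall(r'"([^"\\]*)"', js) + re.findall(r"'([^'\\]*)'", js)
    return {
        "script_count": len(scripts),
        "iframe_count": len(re.findall(r"<iframe\b", page, flags=re.IGNORECASE)),
        "eval_count": len(re.findall(r"\beval\s*\(", js)),
        "longest_string": max((len(s) for s in literals), default=0),
        "js_entropy": shannon_entropy(js),
    }
```

Each page becomes one row of numbers, and a collection of labeled rows is what the classifier is trained on. The drawback mentioned above is visible here: once the script is obfuscated, counts such as `eval_count` can say very little about what the code really does.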
What you have tried to do belongs to the first category.
Now that you are familiar with all this, it can be useful to study
some of the available tools dedicated to this purpose before
implementing your own technique. Let me mention four important tools
among many others:
- Zozzle
Zozzle relies on Bayesian classification of features taken from the
JavaScript abstract syntax tree (AST). It is best described as a mostly
static web malware detector: "mostly" because it also embeds an engine
that observes the JavaScript code at run time. Its authors claim that
it has a very low false positive rate of 0.0003% and is able to process
over one megabyte of HTML and JavaScript code per second. This tool is
intended to be used as a browser plugin; its aim is to protect browsers
against heap spray attacks.
How does ZOZZLE operate? Its core, which the paper summarizes in a
figure (ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection),
consists of the following phases:
- Extraction and labeling phase: The classifier needs training data.
This data is extracted from obfuscated JavaScript code. Instead of
developing an efficient de-obfuscation technique, calls to the Compile
function, located in the jscript.dll library, are intercepted. This is
a smart way to obtain plain JavaScript code, because Compile is called
each time a <SCRIPT> or <IFRAME> tag is processed or eval() or
document.write() is called, which also defines the code context.
Each code context is saved on the hard drive for further analysis.
- Feature selection: The JavaScript AST of each labeled code context
(malicious or benign) is used to derive candidate features. The
features are pre-selected using a formula computed from four counts,
where:
- A: malicious context with feature
- B: benign context with feature
- C: malicious context without feature
- D: benign context without feature
- Classification: A Bayesian classifier is used because, even though
it may seem dated, in practice it gives good results and it is not
time consuming.
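The feature selection and classification phases can be sketched in a few lines of Python. Two assumptions are made loudly here. First, the pre-selection formula is not reproduced above, so the sketch assumes it is the standard chi-squared statistic for a 2x2 table built from the counts A, B, C and D, that is N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)) with N = A+B+C+D; the tool may use a variant. Second, features are plain names that are either present or absent in a context, whereas ZOZZLE derives its features from the AST.

```python
import math

def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Assumed 2x2 chi-squared score: a/b = malicious/benign contexts WITH the
    feature, c/d = malicious/benign contexts WITHOUT it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # the feature never varies, so it carries no information
    return n * (a * d - b * c) ** 2 / denom

LABELS = ("malicious", "benign")

def train(samples, features):
    """samples: list of (set of features present, label); both labels must occur."""
    model = {"prior": {}, "likelihood": {}}
    for label in LABELS:
        in_class = [present for present, y in samples if y == label]
        model["prior"][label] = len(in_class) / len(samples)
        for feature in features:
            hits = sum(1 for present in in_class if feature in present)
            # Laplace smoothing keeps every probability strictly between 0 and 1.
            model["likelihood"][(feature, label)] = (hits + 1) / (len(in_class) + 2)
    return model

def classify(model, features, present) -> str:
    """Return the label with the highest naive Bayes log-score."""
    scores = {}
    for label in LABELS:
        score = math.log(model["prior"][label])
        for feature in features:
            p = model["likelihood"][(feature, label)]
            score += math.log(p if feature in present else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)
```

A feature that appears in every malicious context and in no benign one gets a high score (`chi_squared(10, 0, 0, 10)` is 20.0), while one that is evenly spread gets 0.0 and would be dropped before training. Classification is then a handful of additions per feature, which is consistent with the point that the Bayesian approach is not time consuming.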
- Prophiler
Prophiler follows the static approach to detecting web malware. It
analyzes static features of the HTML and JavaScript code, including
uniform resource locators (URLs), then uses machine learning
techniques to train a classifier that decides whether a webpage is
likely to embed malicious content. Prophiler acts as a fast filter: it
does not analyze suspicious webpages in depth itself, but forwards
them to third-party technologies such as Wepawet
(Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages)
- SpyProxy
SpyProxy follows the dynamic analysis principles. It monitors the
active content of webpages within a virtual machine before deciding to
render them to the visitor or not. The architecture of SpyProxy is
illustrated through this figure (
SpyProxy: Execution-based Detection of Malicious Web Content):
- (a): The proxy performs a static analysis of the requested page. If
it judges the page likely to be malicious, it forwards it to the
virtual machine (VM). Basically, only pages with active content are
forwarded to the VM.
- (b): The virtual machine loads the suspicious pages to monitor their activities.
- (c): Only benign pages are returned to the proxy, which in turn forwards them to the user's browser.
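The three steps above amount to a short decision flow, sketched here in Python. The function and its two callbacks are hypothetical names introduced for illustration; in the real system step (b) means loading the page in a browser inside a virtual machine, which is replaced here by a stub.

```python
from typing import Callable, Optional

def spyproxy_flow(page: str,
                  static_check: Callable[[str], bool],
                  vm_check: Callable[[str], bool]) -> Optional[str]:
    """Return the page if it may be shown to the user, or None if it is blocked."""
    # (a) Static analysis: pages that do not look risky skip the VM entirely.
    if not static_check(page):
        return page
    # (b) The VM loads the page and reports whether it misbehaved.
    if vm_check(page):
        return None
    # (c) Benign pages go back through the proxy to the user's browser.
    return page

# Stub checks, for illustration only.
vm_visits = []

def has_active_content(page: str) -> bool:
    return "<script" in page.lower()

def misbehaves_in_vm(page: str) -> bool:
    vm_visits.append(page)      # record that the VM was actually used
    return "evil" in page       # placeholder for real behavior monitoring
```

The design choice worth noticing is the cheap static check in front of the expensive VM: pages without active content never pay the cost of dynamic analysis.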
- IceShield
ICESHIELD performs in-line dynamic code analysis using a set of
heuristics to identify attack attempts. Its authors take an inventory
of the attacks that usually target the DOM properties of a website and
that are performed by injecting JavaScript into the website's source
code. ICESHIELD supervises the running JavaScript code with a
predefined set of rules related to function calls, and applies
heuristics to them in order to determine whether the script is
malicious or not
(IceShield: Detection and Mitigation of Malicious Websites with a Frozen DOM).
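The idea of rules over function calls can be illustrated with a small Python sketch that inspects a recorded trace of calls. The rules, the threshold and the trace format are all assumptions made for the example; they are not ICESHIELD's actual heuristics, and the real tool runs inside the page rather than on a log.

```python
# Each rule: (function name, predicate on the argument list, human-readable reason).
# These rules are hypothetical examples.
RULES = [
    ("unescape", lambda args: any(len(str(a)) > 1000 for a in args),
     "very long unescape argument"),
    ("createElement", lambda args: "iframe" in [str(a).lower() for a in args],
     "dynamic iframe creation"),
    ("eval", lambda args: True,
     "dynamic code evaluation"),
]

def inspect_trace(trace, threshold: int = 2):
    """trace: list of (function name, argument list).
    Returns (is_suspicious, reasons), suspicious once `threshold` rules fire."""
    reasons = []
    for name, args in trace:
        for rule_name, predicate, reason in RULES:
            if name == rule_name and predicate(args):
                reasons.append(reason)
    return len(reasons) >= threshold, reasons
```

For instance, a trace containing one call to eval() and one creation of an iframe triggers two rules and is flagged, while a trace that only creates a div triggers none.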