(I previously published this article as an answer to a question posted on the Information Security website, under a profile I have since deleted)
First, we need to dive into the methods commonly used by JS malware:
- Server side polymorphism
- Code obfuscation
- Code unfolding: document.write() and eval() are used in order to execute obfuscated portions of code and functions (Weaknesses in Defenses Against Web-Borne Malware).
- Heap spray
- Drive-by download
- Multi execution paths
- Implicit conditionals
Given these common features and tactics used by JavaScript malware, if you want to detect this type of malware, as you asked, you first need to study the state of the art in detection methods. Various methods have been developed to detect web (JavaScript) malware. We can divide them into two main categories:
- Machine learning based classifiers
  - Features: Distinguishing features are extracted from HTML and JavaScript code, then used to train a machine learning classifier. The premise of this approach is that malicious webpages are likely to differ from benign ones (Thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
  - Advantages: Lightweight approach, useful for analyzing websites in bulk.
  - Drawbacks: Ineffective against obfuscated JavaScript code and useless against new malicious code patterns or zero-day attacks.
- Dynamic methods
  - Features: Based on dynamic behavior analysis. These techniques are implemented using either proxies, where a page is rendered to the visitor only after its safety is checked, or a sandboxing environment relying on honeyclients (same thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
  - Advantages: Effective against zero-day attacks and obfuscated code.
  - Drawbacks: Resource and time consuming. Sandboxing environments rely on low-interaction honeyclients, which are themselves based on virus signatures and thus suffer from the same disadvantages as the static methods.
What you have tried to do belongs to the first category.
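To make the first category concrete, here is a minimal sketch of static feature extraction. It assumes a hypothetical feature set: the names `SUSPICIOUS_CALLS` and `extract_features`, and the choice of features, are illustrative only and are not taken from any of the tools or papers cited here.

```python
import re

# Hypothetical list of calls often associated with code unfolding.
# A real detector would use a much richer, learned feature set.
SUSPICIOUS_CALLS = ("eval", "document.write", "unescape", "fromCharCode")

# Matches single- or double-quoted JavaScript string literals (simplified).
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'')


def extract_features(js_source: str) -> dict:
    """Return a small dictionary of static features for one script."""
    features = {}
    for name in SUSPICIOUS_CALLS:
        # Count occurrences of the name followed by "(", e.g. "eval(".
        pattern = r"\b" + re.escape(name) + r"\s*\("
        features["calls_" + name] = len(re.findall(pattern, js_source))
    # Very long string literals are a common sign of packed payloads.
    features["longest_string"] = max(
        (len(m.group(0)) - 2 for m in STRING_LITERAL.finditer(js_source)),
        default=0,
    )
    return features


sample = 'var p = unescape("%u9090%u9090"); eval(p); document.write("<b>x</b>");'
print(extract_features(sample))
```

The resulting dictionary would be fed to a classifier. As noted in the drawbacks above, such surface features stop being reliable as soon as the script hides these calls behind obfuscation.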
Now that you are informed about all this, it can be useful to study some of the available tools dedicated to this purpose before implementing your own technique. Let me mention a few important tools among many others:
- Zozzle
How does ZOZZLE operate? The following figure summarizes its core (ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection):
- Extraction and labeling phase: The classifier needs training data. This data is extracted from obfuscated JavaScript code. Instead of developing an efficient de-obfuscation technique, calls to the Compile function are intercepted. The Compile function is located in the jscript.dll library. This is a smart way to obtain plain JavaScript code, because Compile is called each time a <SCRIPT> or <IFRAME> tag is encountered, or each time eval() or document.write() is called, which also defines the code context. Each code context is saved on the hard drive for further analysis.
- Feature selection: The JavaScript abstract syntax tree (AST) is used to tag each labeled code context as benign or malicious. The features are pre-selected using a formula defined over the following four counts:
- A: malicious context with feature
- B: benign context with feature
- C: malicious context without feature
- D: benign context without feature
- Classification: A Bayesian classifier is used for classification because, even if it seems obsolete, in practice it gives good results and is not time consuming.
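The feature selection and classification steps above can be sketched as follows. The ZOZZLE paper describes its pre-selection as a chi-squared test over the counts A, B, C and D. The code below assumes the standard Pearson chi-squared statistic for a 2x2 table, so the exact normalization used in the paper may differ. The threshold 10.83 is the 99.9% critical value for one degree of freedom, and using it here is an assumption. The `NaiveBayes` class, its method names, and the plain-string feature names are hypothetical and much simpler than ZOZZLE's AST-based features.

```python
import math
from collections import Counter


def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared for a 2x2 table.

    a: malicious contexts with the feature
    b: benign contexts with the feature
    c: malicious contexts without the feature
    d: benign contexts without the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: the feature carries no information
    return n * (a * d - b * c) ** 2 / denom


def keep_feature(a: int, b: int, c: int, d: int, threshold: float = 10.83) -> bool:
    """Keep a feature only if it correlates strongly with the label."""
    return chi_squared(a, b, c, d) >= threshold


class NaiveBayes:
    """Naive Bayes over binary features (present or absent), Laplace-smoothed."""

    LABELS = ("malicious", "benign")

    def __init__(self, features):
        self.features = list(features)
        self.class_counts = Counter()
        self.feature_counts = {label: Counter() for label in self.LABELS}

    def train(self, present: set, label: str) -> None:
        self.class_counts[label] += 1
        for feature in self.features:
            if feature in present:
                self.feature_counts[label][feature] += 1

    def log_score(self, present: set, label: str) -> float:
        total = sum(self.class_counts.values())
        n = self.class_counts[label]
        score = math.log((n + 1) / (total + 2))  # smoothed class prior
        for feature in self.features:
            p = (self.feature_counts[label][feature] + 1) / (n + 2)
            score += math.log(p if feature in present else 1 - p)
        return score

    def classify(self, present: set) -> str:
        return max(self.LABELS, key=lambda label: self.log_score(present, label))


clf = NaiveBayes(["eval", "unescape"])
for _ in range(3):
    clf.train({"eval", "unescape"}, "malicious")
    clf.train(set(), "benign")
clf.train({"eval"}, "benign")
print(clf.classify({"eval", "unescape"}), clf.classify(set()))
```

Working in log space avoids numeric underflow when many features are multiplied together, and the smoothing keeps a feature never seen in one class from forcing a zero probability.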
- Prophiler
Prophiler follows the static schema to detect web malware. It combines static analysis of features from HTML and JavaScript code, including uniform resource locators (URLs). It then uses machine learning techniques to train a classifier that decides whether a webpage embeds malicious content. Suspicious webpages are not processed further by this tool; instead, it forwards them to third-party technologies such as Wepawet (Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages).
- SpyProxy
- (a): The proxy performs a static analysis of the requested page. If it judges the page likely to be malicious, it forwards it to the virtual machine (VM). Basically, only pages with active content are forwarded to the VM.
- (b): The virtual machine loads the forwarded pages to monitor their activities.
- (c): Only benign pages are returned to the proxy, which in turn forwards them to the user's browser.
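Steps (a) to (c) can be sketched as a small decision function. This is only an illustration of the control flow, not SpyProxy's implementation: the marker list, the function names, and the `vm_is_benign` callback, which stands in for the whole virtual machine analysis, are all assumptions.

```python
from typing import Callable, Optional

# Hypothetical markers of "active content"; a real proxy would parse the page.
ACTIVE_CONTENT_MARKERS = ("<script", "<iframe", "<object", "<embed")


def has_active_content(html: str) -> bool:
    """Step (a): cheap static check performed by the proxy."""
    lowered = html.lower()
    return any(marker in lowered for marker in ACTIVE_CONTENT_MARKERS)


def proxy_decide(html: str, vm_is_benign: Callable[[str], bool]) -> Optional[str]:
    """Return the page to send to the browser, or None if it is blocked."""
    if not has_active_content(html):
        return html  # passive page: no need to involve the VM
    if vm_is_benign(html):  # step (b): the VM loads and monitors the page
        return html  # step (c): benign pages go back to the user
    return None


def fake_vm(html: str) -> bool:
    """Stand-in for the VM verdict, used only for this demonstration."""
    return "heap_spray" not in html


print(proxy_decide("<p>hello</p>", fake_vm))
print(proxy_decide("<script>heap_spray()</script>", fake_vm))
```

Keeping the static check in front of the VM matters for latency: most pages never pay the cost of being rendered in the sandbox.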
- Iceshield