R2-D2

http://r2d2.twman.org

This work would not have been possible without the valuable dataset offered by the Security Research Lab at Cheetah Mobile, Leopard Mobile, and the Android app Security Master (CM Security).

This research was accepted at OWASP AppSec USA 2017, September 19th-22nd, 2017 | Orlando, FL | September 22, 10:30am~11:15am (GMT -4)


Traditional Chinese: http://www.cmcm.com/blog/tw/security/2017-05-31/1062.html | Simplified Chinese: http://www.cmcm.com/blog/cn/security/2017-05-31/1063.html


English: http://www.cmcm.com/blog/en/security/2017-05-31/1061.html


Convolutional Neural Network / Deceptive Advertising | Deep Neural Network / Phone Scams | Deep Neural Network / Notification Wars: The Tenderness Awakens



R2-D2: ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections



TonTon Huang*, Chia-Mu Yu**, and Hung-Yu Kao***

Security Research Lab., Cheetah Mobile Inc.*
Department of Computer Science and Engineering, National Chung Hsing University, Taiwan**
Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan* ***

TonTon (at) TWMAN.ORG*

Abstract

Machine learning (ML) has proven particularly useful in malware detection. However, because malware evolves very fast, the stability of the features extracted from malware is a critical issue in malware detection. The recent success of deep learning in image recognition, natural language processing, and machine translation indicates a potential solution for stabilizing malware detection effectiveness. We present R2-D2, a color-inspired convolutional neural network-based Android malware detection system that can detect malware without extracting pre-selected features (e.g., the control flow of opcodes, classes, methods of functions, and the timing at which they are invoked) from Android apps. In particular, we develop a color representation that translates Android apps into RGB color code and transforms them into a fixed-size encoded image. The encoded image is then fed to a convolutional neural network for automatic feature extraction and learning, reducing the need for expert intervention. We have run our system over 1 million malware samples and 1 million benign samples through our back-end (600 million monthly active users and 10k new malware samples per day), showing that R2-D2 can effectively detect malware.


Smartphones have gained widespread popularity worldwide. Among them, Android is the most commonly used operating system (OS), and its territory is still expanding. The International Data Corporation (IDC) 2016 Q3 smartphone OS market share report states that 86.8% of smartphones are Android phones, indicating steady growth in market share compared to 84.3% in 2015 Q2 (see Fig. 1) [1]. Android is characterized by its openness; users can choose to download apps from Google Play and third-party marketplaces. However, due to this popularity and openness, Android has attracted attackers' attention. In particular, malicious software (malware) can easily spread and infect benign devices. The Security Report of the AV-TEST Institute states that while the number of malware samples increased from 17 million in 2005 to over 600 million in 2016, the percentage of Android malware rose significantly from 3.19% in 2015 to 7.48% in 2016 Q2. Among them, Trojans that aim to steal user data account for 97.49%.

We can also see that Android malware dominates the mobile platform, accounting for 99.87% of mobile malware [2]. Fig. 2 shows the statistics collected from our back-end system during January 2017; in countries such as the US, the UK, and France, more than 50,000 users are infected every day. Moreover, in our experience, the number of Android malware samples increased sharply from 1 million in 2012 to 2.8 million in 2014. Because of the serious security problem caused by Android malware, we propose R2-D2, a color-inspired convolutional neural network (CNN)-based Android malware detection system. Unlike existing solutions, R2-D2 features an end-to-end learning process. More specifically, in contrast to prior solutions that require manual feature selection and parameter configuration, R2-D2 takes the training samples as input and outputs the malware detection model without human intervention. Fig. 3 shows the trend of different malware families, particularly in China and India. Fig. 4 shows the market shares of different cell-phone models. Based on these statistics, we can confirm that even the same malware family exhibits different behaviours in different geographic regions. Our dataset can account for this variation.

 
Figure 1 / Figure 2

 
Figure 3 / Figure 4

As mentioned above, the majority of malware detection still relies on static analysis of source code and on dynamic analysis that monitors the execution of malware. However, these approaches are known to achieve good detection accuracy only within the same malware family. In reality, Android malware has grown dramatically in number and mutates quickly, using various anti-analysis techniques. All of these characteristics make accurate detection extremely difficult. Thus, we take advantage of deep learning to uncover the hidden relationship between the program execution logic and the order of function calls behind the malware, in order to accurately detect known and even unknown malware.

In fact, a huge amount of human labor is otherwise spent on feature engineering and modeling before the detection model is built. To ease model training, we adopt deep learning to construct an end-to-end learning-based Android malware detection system, termed R2-D2 (ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections). Each app is translated into a color image (shown in Fig. 5 and 6), and the images are fed to a CNN to train a model that detects Android malware. Fig. 7 shows our system architecture.
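
The translation from app bytes to a fixed-size color image might look roughly like the following sketch. The page does not spell out the exact byte-to-color mapping, so everything here is an assumption made for illustration: three consecutive bytes of classes.dex become one (R, G, B) pixel, and the input is truncated or zero-padded to fill a square image. The function names `read_classes_dex` and `bytes_to_rgb_image` and the tiny 4x4 size are hypothetical, not part of R2-D2.

```python
import io
import zipfile


def read_classes_dex(apk):
    """Return the raw bytes of classes.dex; an APK is a ZIP archive."""
    with zipfile.ZipFile(apk) as archive:
        return archive.read("classes.dex")


def bytes_to_rgb_image(data, side=4):
    """Map raw bytes to a side x side grid of (R, G, B) tuples.

    Assumption (not from the source): three consecutive bytes form one
    pixel, and the input is truncated or zero-padded to fit the grid.
    """
    needed = side * side * 3
    padded = data[:needed].ljust(needed, b"\x00")
    pixels = [tuple(padded[i:i + 3]) for i in range(0, needed, 3)]
    return [pixels[row * side:(row + 1) * side] for row in range(side)]


# Build an in-memory stand-in for an APK so the sketch runs without a real app.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("classes.dex", b"dex\n035\x00" + bytes(range(10)))
buffer.seek(0)

image = bytes_to_rgb_image(read_classes_dex(buffer), side=4)
print(image[0][0])  # first pixel, built from the bytes "d", "e", "x"
```

A real pipeline would use a much larger image and would write the pixels out in an image format that the CNN framework can read; that part is omitted here.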

Our proposed R2-D2 possesses the following advantages:

  • R2-D2 translates classes.dex, the core of the execution logic of Android apps, into RGB color images, without modifying the original Android apps and without extracting features from them. In our experiments, 0.4 seconds suffice to translate an app into a color image. The translation also preserves more information from the app in a color image than a grayscale image would.
  • Though the fully connected layers of a DNN can be used to handle the fast mutation, the CNN in R2-D2 is more suitable for capturing malware, because features such as local receptive fields and shared weights not only significantly reduce the number of model parameters but also represent the complex structure of Android malware.
  • No features need to be extracted from the image by hand for pattern recognition. Instead, the raw pixels are represented as multi-dimensional matrices and processed by the filter, pooling, and non-linear activation functions in the CNN.
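
To make the parameter-count argument for local receptive fields and shared weights concrete, here is a small back-of-the-envelope sketch. The 224x224x3 input size, the 3x3 kernel, and the 64 output units/filters are hypothetical numbers chosen only for illustration; they are not taken from the R2-D2 configuration.

```python
def dense_params(height, width, channels, units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return (height * width * channels + 1) * units


def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a convolutional layer with square kernels.

    The count does not depend on the image size, because the same
    kernel weights are shared across every spatial position.
    """
    return (kernel * kernel * in_channels + 1) * out_channels


dense = dense_params(224, 224, 3, 64)  # hypothetical input size and width
conv = conv_params(3, 3, 64)           # hypothetical 3x3 kernels, 64 filters
print(dense, conv, dense // conv)
```

Under these assumed sizes, the first fully connected layer needs several thousand times more parameters than the convolutional layer.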
 
Figure 5 / Figure 6


Figure 7

Based on our collected data, we evaluate the detection accuracy and performance with different model optimization techniques. Note that the learning rate is fixed at 0.01. The model optimization techniques are stochastic gradient descent (SGD), Nesterov Accelerated Gradient (NAG), AdaDelta, and AdaGrad. From our experiments, we find that Inception-v3 (shown in Fig. 9) is almost always better than AlexNet (shown in Fig. 10). With this observation, we further compare Inception-v3 with AdaDelta and Inception-v3 with AdaGrad (see Fig. 18), and find that SGD is best suited for our use. In particular, it results in the sharpest increase in accuracy and the sharpest decrease in loss. In the end, we reach 98.4225% and 97.7081% accuracy.
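
For readers unfamiliar with the four optimizers, the sketch below applies their textbook update rules to a toy one-dimensional loss f(w) = w**2. It is not the training setup used for R2-D2: the toy loss, the momentum of 0.9, the decay rate of 0.95, and the epsilon values are all assumptions. Only the learning rate of 0.01 comes from the text. AdaDelta is shown in its original formulation, which has no global learning rate, although some frameworks expose one.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w


def run_sgd(w, lr=0.01, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_nag(w, lr=0.01, momentum=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead position.
        v = momentum * v - lr * grad(w + momentum * v)
        w += v
    return w


def run_adagrad(w, lr=0.01, eps=1e-8, steps=100):
    g2_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_sum += g * g  # accumulate squared gradients
        w -= lr * g / (math.sqrt(g2_sum) + eps)
    return w


def run_adadelta(w, rho=0.95, eps=1e-6, steps=100):
    avg_g2 = avg_dx2 = 0.0
    for _ in range(steps):
        g = grad(w)
        avg_g2 = rho * avg_g2 + (1 - rho) * g * g
        dx = -math.sqrt(avg_dx2 + eps) / math.sqrt(avg_g2 + eps) * g
        avg_dx2 = rho * avg_dx2 + (1 - rho) * dx * dx
        w += dx
    return w


optimizers = {"SGD": run_sgd, "NAG": run_nag,
              "AdaGrad": run_adagrad, "AdaDelta": run_adadelta}
results = {name: run(1.0) for name, run in optimizers.items()}
for name, w in results.items():
    print(f"{name}: w={w:.4f}, loss={w * w:.6f}")
```

The ranking of optimizers on this toy problem says nothing about their ranking on real image data; the sketch is meant only to show how the update rules differ.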


 
Figure 9 / Figure 10

 Google Play Sample Analysis Results | VirusTotal Benign Sample Analysis Results | CM Security Benign Sample Analysis Results

 Google Play Sample Analysis Results | VirusTotal Malware Sample Analysis Results | CM Security Malware Sample Analysis Results

  


 
Figure 11 / Figure 12

The evaluation metrics in our experiment include True Positive (TP), False Positive (FP), False Negative (FN), True Negative (TN), Accuracy (Acc), Precision (Prec), Recall (Detection Rate, DR), False Positive Rate (FPR), and F1-score (F-measure). The evaluation results are shown in Figure 13.


Figure 13
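
The metrics listed above follow their standard definitions in terms of the four confusion-matrix counts. A minimal sketch is given below; the function name `detection_metrics` and the example counts are made up for illustration and are not results from our experiments.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from raw counts."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called detection rate (DR)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fpr": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = detection_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```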

This research adopts deep learning to construct an end-to-end learning-based Android malware detection system and proposes a color-inspired convolutional neural network (CNN)-based Android malware detection, termed R2-D2. The proposed proof-of-concept system has been tested in our internal environment. The results show that our detection system works very well in detecting known and even unknown Android malware. We have also published it to our core product to provide convenient usage scenarios for end users and enterprises. Future work is to reduce the complexity of the task and to train a high-performance model that can handle Android malware without a huge computational burden. Furthermore, we will post any updates to our research results at http://R2D2.TWMAN.ORG.