R2-D2


R2-D2: ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections

TonTon Huang*, and Hung-Yu Kao**

Leopard Mobile Inc.*
Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan* **

TonTon (at) TWMAN.ORG*

TonTon Huang*, and Hung-Yu Kao, R2-D2: ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections, arXiv:1705.04448v1

For the latest version, please visit: https://arxiv.org/abs/1705.04448

This work would not have been possible without the valuable dataset offered by Leopard Mobile, Cheetah Mobile and Security Master (a.k.a. CM Security, an Android application).
Special thanks to Dr. Chia-Mu Yu for his support of this research.

Our example Android Color images: Malware / Benign | The materials of our research and experiment dataset



This research was presented at the industry security conference OWASP AppSec USA 2017, September 19th-22nd, 2017, Orlando, FL | 2017/09/22, 10:30am~11:15am (GMT -4)

R2-D2: ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections

Machine Learning (ML) has proven particularly useful in malware detection. However, because malware evolves very fast, the stability of the features extracted from malware is a critical issue in malware detection. The recent success of deep learning in image recognition, natural language processing, and machine translation indicates a potential solution for stabilizing malware detection effectiveness. We present a color-inspired convolutional neural network-based Android malware detection system, R2-D2, which can detect malware without extracting pre-selected features (e.g., the control flow of opcodes, classes, methods of functions, and the timing of their invocation) from Android apps. In particular, we develop a color representation that translates Android apps into RGB color codes and transforms them into a fixed-size encoded image. The encoded image is then fed to a convolutional neural network for automatic feature extraction and learning, reducing the need for expert intervention. We have run our system over 800k malware samples and 800k benign samples through our back-end (60 million monthly active users and 10k new malware samples per day), showing that R2-D2 can effectively detect malware. Furthermore, we will post our research results on http://R2D2.TWMAN.ORG if there are any updates.

The latest version of this research was presented at the industry security conference RuxCon 2017, October 21-22, Melbourne, Australia | 2017/10/22, 15:00~16:00 (GMT +11)

Look Ransomware is There: Large Scale Ransomware Detection with Naked Eye

Ransomware such as WannaCrypt and Petya has caused significant financial loss and has even endangered human life (e.g., the ransomware attack on UK hospitals). Ransomware on desktops has gained much attention from academia and industry. However, we see that the amount of ransomware on Android phones keeps increasing steadily while gaining much less attention. As Android is the most popular smartphone OS and a substantial number of credentials are kept only on smartphones, data loss causes serious inconvenience and damage. Here, we present our deep learning-based ransomware detection system, coloR-inspired convolutional neuRal network-based androiD ransomware Detection (R2D2). R2D2 was originally developed to sweep for malware, but we found it particularly useful in detecting ransomware. A unique feature is its end-to-end training, without human intervention. Such end-to-end training points to a direction in which we no longer need a tedious search for robust ransomware features for detection. Most importantly, based on R2D2, we develop techniques to encode ransomware as a so-called ransomware image, such that ransomware from the same family exhibits the same pattern and even non-experts can detect ransomware, and even determine its family, with the naked eye.



Nowadays, the smartphone has become a daily necessity. In the smartphone market, Android is the most commonly used operating system (OS), and it is still expanding its market share. According to a 2016 report by International Data Corporation (IDC), Android's share of the smartphone market increased from 84.3% in 2015 Q2 to 86.8% in 2016 Q3. Android is characterized by its openness: users can download apps from Google Play or from third-party marketplaces. However, this popularity and openness have attracted attackers' attention. According to data provided by our research partner Leopard Mobile Inc. (Cheetah Mobile Taiwan Agency), collected from its core products Security Master and Clean Master, the number of Android malware samples increased sharply from 1 million in 2012 to 2.8 million in 2014. In 2015, the number of Android malware samples detected was more than three times the 2014 figure (over 9.5 million). In 2016, the number exceeded 17 million. In the first half of 2017, more than 10 million Android malware samples were found by Security Master and Clean Master.

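According to the International Data Corporation (IDC) 2016 Q3 Smartphone OS Market Share survey, Android, the market leader, kept growing from 84.3% in 2015 Q3 to 86.8% in 2016 Q3, so more than 80% of smartphones worldwide run Android. The Security Report 2015/16 from AV-TEST Institute, an independent German IT security research institute, states that the amount of malware grew from about 17 million in 2005 to an estimated more than 600 million in 2016. Android's share grew from 3.19% in 2015 to 7.48% as of 2016 Q2, of which Trojans, which mainly steal user data, account for 97.49%, and Android malware accounts for 99.87% of the malware on all mobile platforms. Statistics from our back-end system for January 2017 on data-stealing Android Trojans and backdoors show that in the United States, the United Kingdom, France, Germany, Russia, China, and other countries, more than 50,000 users are infected every day. In addition, since 2012, the number of Android malware samples detected by our core products rose sharply from 100 thousand to 2.8 million in 2014; in 2015 it was more than three times the 2014 figure, at over 9.5 million; it grew to 17 million in 2016; and it already exceeded 10 million in the first half of 2017. To cope with Android fragmentation, we also analyze Android malware from countries around the world. These data show that different malware may behave differently on different smartphone models and in different countries. With Android malware growing this fast, how can we solve this problem quickly?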

As mentioned above, the majority of Android malware detection still relies on static analysis of source code and on dynamic analysis that monitors the execution of malware. However, these approaches are known to achieve good detection accuracy only within the same malware family. This means that extensive reverse engineering is required during feature engineering to find the parts most suitable for identifying Android malware. These methods rely on looking for API invocations in classes.dex or at the permissions of Android apps. In reality, Android malware is growing dramatically in number, mutates quickly, and uses various anti-analysis techniques. All of these characteristics make accurate detection extremely difficult. In particular, in our samples we found that the number of unknown samples (called gray samples) that cannot be detected by current methods is twice the number of samples that can be identified. If gray samples can be detected fast enough, we can reduce the likelihood of Android devices being attacked.

We attempt to uncover the hidden relationship between the program execution logic and the order of function calls behind the malware by taking advantage of the CNN, in order to accurately detect known and even unknown malware. Since the bytecode in classes.dex and RGB color codes are both hexadecimal, we propose a transformation algorithm that maps the bytecode to RGB color codes and then to a color image (called an Android color image). After that, we apply a CNN to the Android color images for Android malware detection. Our proposed system (R2-D2: coloR-inspired convolutional neuRal network-based androiD malware Detection) works by only decompressing Android apps (APK) and then translating the Dalvik executable classes.dex into Android color images. R2-D2 has the following advantages:

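At present, fighting malware still relies mainly on security experts who reverse-engineer malware source code, or on automated analysis mechanisms such as monitoring malware behavior in an emulator, to produce the features used for detection. If machine learning or deep learning could learn autonomously, finding the regularities of programs and their execution logic in malware data and making predictions, efficiency would improve and more unknown malware would be found. However, machine and deep learning also face manpower and performance problems such as feature-engineering preprocessing, model tuning, and computation time, so they do not necessarily speed up the whole process or save computing resources. Therefore, to address the increasingly serious Android malware problem, we targeted the shortcomings of both traditional detection and machine/deep learning and developed a technique that differs from existing related research. Our method requires no feature-extraction preprocessing of Android apps, which reduces the manpower and computing resources invested: the R2-D2: coloR-inspired convolutional neuRal network (cnn)-based Large Scale androiD malware Detections framework. It lowers computation and labor costs, speeds up the process, and finds unknown Android malware earlier. In addition, it has the following advantages: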

1. R2-D2 translates classes.dex, the core of the execution logic of Android apps, into RGB color images, without modifying the original Android apps and without manually extracting features from the apps in advance. It can complete a translation from execution code to image within 0.4 seconds. Such a translation also preserves more of the complex information in the Android apps, because a color image has 16777216 colors (24 bits per pixel) whereas a grayscale image has only 256 (8 bits per pixel).

2. A DNN with a fully connected architecture can cope with fast-changing malware thanks to its large number of parameters; however, the local receptive fields and shared weights of a CNN make it better suited to more complex structures. A CNN not only reduces the number of parameters but also reflects the complexity of Android malware, saving the heavy computation that current methods require.
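The translation from classes.dex to a color image described above can be sketched as follows. This is a minimal illustration, not the authors' published algorithm: it assumes that consecutive bytes are grouped in threes as one (R, G, B) pixel, that the pixels are laid out row by row, and that the image is truncated or zero-padded to a fixed square size. The function names and the `side` parameter are hypothetical.

```python
import zipfile
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def read_classes_dex(apk_path: str) -> bytes:
    """An APK is a ZIP archive; classes.dex sits at its root."""
    with zipfile.ZipFile(apk_path) as apk:
        return apk.read("classes.dex")


def bytes_to_rgb_pixels(data: bytes) -> List[Pixel]:
    """Assumed mapping: every three consecutive bytes form one (R, G, B) pixel."""
    padded = data + b"\x00" * (-len(data) % 3)  # zero-pad to a multiple of 3
    return [(padded[i], padded[i + 1], padded[i + 2])
            for i in range(0, len(padded), 3)]


def to_fixed_image(pixels: List[Pixel], side: int) -> List[List[Pixel]]:
    """Lay pixels out row by row in a side x side grid (truncate or pad with black)."""
    total = side * side
    fitted = pixels[:total] + [(0, 0, 0)] * max(0, total - len(pixels))
    return [fitted[r * side:(r + 1) * side] for r in range(side)]


# Demonstration on an 8-byte DEX-style header instead of a real APK.
demo = b"dex\n035\x00"
image = to_fixed_image(bytes_to_rgb_pixels(demo), side=2)

# One byte per channel gives 256 values; three channels give 256**3 colors.
colors_gray = 2 ** 8
colors_rgb = 2 ** 24
print(image[0][0], colors_gray, colors_rgb)
```

For a real app, `read_classes_dex("app.apk")` would supply the bytes (the path is a placeholder). Grouping three bytes per pixel keeps every byte of the executable in the image, which is the property the 24-bit versus 8-bit comparison relies on.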

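1. Besides encoding Android apps as images without extracting features in advance, and in order to fully preserve the original complex information of the apps, we do not use grayscale images with only 256 colors. Instead we use color images with 16777216 colors (see Figures 6 and 7), hoping to represent rapidly iterating Android malware more clearly.

2. The local receptive fields and weight sharing of deep convolutional neural networks greatly reduce the number of parameters to compute, suit the complex structure of Android malware, and effectively save computation and time.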
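The parameter saving from local receptive fields and shared weights can be illustrated with a simple count. The layer sizes below (a 64 x 64 RGB input, 128 dense units, 32 filters of size 3 x 3) are illustrative assumptions and are not taken from the R2-D2 model.

```python
def dense_params(inputs: int, units: int) -> int:
    """Fully connected layer: one weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units


def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Convolutional layer: each filter shares one kernel x kernel x in_channels
    weight block across all image positions, plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


side = 64                      # hypothetical image side length
inputs = side * side * 3       # every RGB channel value is one input
dense = dense_params(inputs, units=128)
conv = conv2d_params(kernel=3, in_channels=3, filters=32)
print(dense, conv)
```

The convolutional count does not depend on the image size at all, which is why the gap widens as the encoded images grow.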

Moreover, we ran an image distance test on our Android malware images. Samples of the same Android malware family share similar visual patterns and are close to each other under the Levenshtein, RMS deviation, and MSE image distances, which validates that our proposed Android color image algorithm is suitable for classification with a CNN (e.g., 00ee9561c5830690661467cc90b116de of Android:Jisut-JY [Trj] from AVAST and cdaf446c3e1076ab48540e02283595ac of Android.Trojan.SLocker.IS from BitDefender are 54.59% similar; 0cc0908c2fbd8f9b31da3afe05cb3427 of Trojan-Ransom.AndroidOS.Congur.aa from Kaspersky and ce2fedd0ca9327b8d52388c11ff3b4ca of Android.Trojan.SLocker.IS from BitDefender are 54.94% similar; etc.). Although this similarity is not accurate in a fine-grained sense, the similarity of the visual patterns of Android malware images helps us classify Android malware quickly, greatly reducing labor costs.
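The three distances named above can be sketched in plain Python. How the reported similarity percentages were derived from these distances is not stated here, so the normalized `levenshtein_similarity` below is an assumption for illustration only, and all function names are hypothetical. Images are treated as flat sequences of channel values.

```python
import math
from typing import Sequence


def mse(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean squared error between two equally sized sequences of channel values."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rms_deviation(a: Sequence[int], b: Sequence[int]) -> float:
    """Root-mean-square deviation: the square root of the MSE."""
    return math.sqrt(mse(a, b))


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance (insert, delete, substitute), computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Assumed normalization: 1 - distance / longer length, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Two tiny two-pixel "images": red+green versus red+blue.
img_a = [255, 0, 0, 0, 255, 0]
img_b = [255, 0, 0, 0, 0, 255]
print(mse(img_a, img_b), rms_deviation(img_a, img_b),
      levenshtein_similarity(img_a, img_b))
```

MSE and RMS deviation compare values position by position, so they need images of equal size; the edit distance tolerates shifted content, which may matter when samples of one family differ by inserted code.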

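While converting apps to images, we found that Android malware classified as the same type, even when named differently because each antivirus vendor uses its own malware signatures, shows specific image features that can be recognized directly with the naked eye. We therefore took malware samples defined as Trojan:Android/SLocker.CF by F-Secure, Android.Trojan.SLocker.IS by BitDefender, and HEUR:Trojan-Ransom.AndroidOS.Congur.aa by Kaspersky and verified them with the Levenshtein, RMS, and MSE image distance similarity algorithms. The results confirmed our hypothesis that most malware families have highly similar image features that the naked eye can recognize.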

We collected our samples through the core products of our research partner (Leopard Mobile Inc., Cheetah Mobile Taiwan Agency): Security Master and Clean Master, which reached 3,810 million installations globally with 623 million monthly active users by December 2016. We can collect 10k benign and 10k malicious samples per day on average. The data was collected from January to May 2017, yielding approximately 1.5 million benign and malicious Android apps for our experiments. We keep our research results and data on the website http://R2D2.TWMAN.ORG (72 GB). So far, we have accumulated 2 million usable samples, and we continue to train and adjust our model.

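We use TensorFlow as the training tool; the execution flow of the whole system is shown in the figure. Our method differs from existing related research and machine/deep learning techniques in that it needs no feature-extraction preprocessing of Android apps, which reduces the manpower and computing resources required, speeds up analysis, and improves the efficiency of handling Android malware. Moreover, continually enlarging the training set is one of the most effective ways to improve the accuracy of deep learning. Our R2-D2 research started in January 2017 and has so far accumulated more than 2 million Android apps, which are classified by our existing automatic detection engine to strengthen the accuracy of the detection results.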

This research adopts deep learning to construct an end-to-end learning-based Android malware detection system and proposes a color-inspired convolutional neural network (CNN)-based Android malware detection method, labelled R2-D2. The results show that our detection system works well in detecting known and even unknown Android malware. We have deployed the system in our core product to provide convenient usage scenarios for end users and enterprises. Future work is to simplify the task and train for higher performance against Android malware while avoiding a heavy computational burden. The experiment material and research results are available on the website http://R2D2.TWMAN.ORG, which will carry any updates.

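Finally, we also evaluated and fine-tuned the various deep learning parameters; the final results and related data are shown in Figure 10 and Figure 11. To the best of our knowledge, our Android malware detection method, based on a deep learning convolutional neural network recognition model, is the first to remove the need of traditional machine learning methods to decompile and analyze source code to extract features for Android malware detection. Using our own dataset of over 2 million samples, larger than those of previous related studies, we show that regardless of phone specifications, Android version, or country or region, our deep learning model remains effective, surpasses the efficiency of the feature engineering used in traditional and machine learning methods, and obtains more accurate and more stable detection results than previous studies. In the future we will keep publishing the latest results of this research and refine our data and model to reduce the computational burden of training, so that Android malware detection can run directly on Android phones, improving both its accuracy and its efficiency. We will keep updating the experimental results at http://R2D2.TWMAN.ORG and will try to release the related Android color images, hoping to alleviate the increasingly serious Android malware problem.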