
Linux Dataset Acquisition

Linux is the operating system of choice for a number of users who typically work in core ICT fields. The environment, applications, and the toolchains used, however, vary significantly from user to user. As such, acquiring benign samples was a challenging task as tools and applications used by the average Linux user cannot be easily defined. Hence, to mitigate this issue, the selected samples attempt to conform to a particular category of applications, specifically, applications used by developers for software development.

Table 4.1: Distribution of Android ransomware based on family

Ransomware Family    Sample Size
Simplelocker         64
Pletor               6
Filecoder            5
Black Rose Lucy      17
Wipelocker           70
Wannalocker          51
Total                213

Based on the aforementioned category, we have selected 170 benign samples to reproduce a similar distribution of samples to the Android Applications dataset.

Similarly, two cache-cleaning applications were included in the 170 samples. The acquired dataset contains command-line interface (CLI) and graphical user interface (GUI) applications, which were specifically selected to replicate different operations. It should be noted that each operation performed is counted as an individual sample (i.e., one application can be represented as multiple samples based on the operations performed). For example, Docker has multiple command-line options available, such as docker create, docker pull, and docker remove. Similarly, Visual Studio Code (VSCode) (Microsoft, n.d.), a GUI application for editing code and text, can perform operations such as file open, file delete, and file rename. Different operations often produce distinct system call traces. Hence, the operations of applications with multiple options were also evaluated. Some applications have been utilised multiple times to replicate different operations. For example, VSCode was used to replicate different file operations, such as file reading, file deletion, and file opening. Within the Linux benign dataset, different operations were treated as separate samples. The list of applications and their application type is as follows:

• Apache CLI

• apt CLI

• bash CLI

• bleachbit GUI

• cat CLI

• conda CLI

• cp CLI

• diff CLI

• docker CLI

• emacs GUI

• enpass GUI

• find CLI

• fslint GUI

• gcc CLI

• gedit GUI

• gimp GUI

• git CLI

• keepass GUI

• ls CLI

• make CLI

• mv CLI

• mysql CLI

• netbeans GUI

• netcat CLI

• nimstall CLI

• nmon CLI

• npm CLI

• ping CLI

• python CLI

• rename CLI

• rm CLI

• ssh CLI

• tar CLI

• vagrant CLI

• vim CLI

• virtualbox CLI

• vscode GUI

• wget CLI
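The rule that each operation counts as its own sample can be sketched as follows. This is a minimal illustration, not the procedure used to build the dataset: the `BenignSample` record and the `expand_samples` helper are hypothetical names, and only the Docker and VSCode operations named in the text are listed, since the full operation list per application is not given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenignSample:
    """One benign sample: a single operation performed with one application."""
    application: str
    interface: str  # "CLI" or "GUI"
    operation: str


# Only the operations named in the text; other applications are omitted
# because their operations are not listed (illustrative subset).
OPERATIONS = {
    ("docker", "CLI"): ["create", "pull", "remove"],
    ("vscode", "GUI"): ["file open", "file delete", "file rename"],
}


def expand_samples(operations):
    """Turn an application-to-operations mapping into one sample per operation."""
    return [
        BenignSample(app, interface, op)
        for (app, interface), ops in operations.items()
        for op in ops
    ]


samples = expand_samples(OPERATIONS)
for sample in samples:
    print(sample.application, sample.interface, sample.operation)
```

Under this scheme, two applications with three operations each contribute six samples, which is how a list of 38 applications can yield 170 benign samples.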

To acquire the malicious encryption-type Linux samples, two malware repositories were utilised: VirusShare (VirusShare, n.d.) and Malware Bazaar (Malware Bazaar, n.d.). VirusShare provides over 45 million malware samples identified in the wild and relies on VirusTotal (Sood, 2017) reports to categorise the samples.

Furthermore, VirusShare provides a curated dataset consisting of ELF binaries (the Linux equivalent of Windows executables) observed from 2014 to 2020. This dataset, which consists of 43,553 samples, was also considered. Malware Bazaar, in contrast, has a corpus of over 500,000 malware samples with associated classification tags, such as family name or malware type.

Both VirusShare and Malware Bazaar repositories were searched for Linux binaries and bash scripts. The repositories’ search functions were used to conduct the search by finding potentially malicious hashes (SHA256 or MD5) associated with specific samples from different anti-virus vendors, specific ransomware family names, or unique naming conventions used by anti-virus vendors to identify ransomware. From this search, 46 Linux binaries and bash scripts with working encryption capabilities were identified.
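Because the repositories are indexed by SHA256 or MD5 hash, a downloaded sample can be matched against a search result by recomputing its digests. The snippet below is a minimal sketch using Python's standard `hashlib` module; the `file_digests` helper is a hypothetical name and is not part of either repository's tooling.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return the SHA256 and MD5 hex digests of a file, read in chunks."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large binaries are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}
```

A sample whose recomputed digest differs from the hash listed by the repository is likely corrupted or mislabelled, so it would need to be checked before use.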

For the VirusShare dataset, regular expressions were applied to common substrings found in the classification names of anti-virus vendors, such as rans, coder, and crypt, to automatically identify potential encryption-type ransomware. After searching through 20,000 samples using this method, only one new sample was discovered. Hence, the remainder of the dataset was not examined further, as the probability of identifying additional samples in it was very low.
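The substring filter described above can be sketched with a single case-insensitive regular expression. This is an illustration under stated assumptions: the pattern uses only the three substrings named in the text, the `flag_candidates` helper is a hypothetical name, and the hashes and vendor labels in `example` are invented placeholders rather than real detections.

```python
import re

# Substrings named in the text; the actual expressions used may have differed.
RANSOM_HINT = re.compile(r"rans|coder|crypt", re.IGNORECASE)


def flag_candidates(detections):
    """Return hashes whose anti-virus labels contain any ransomware hint.

    `detections` maps a sample hash to a list of vendor classification names.
    """
    return sorted(
        sample_hash
        for sample_hash, labels in detections.items()
        if any(RANSOM_HINT.search(label) for label in labels)
    )


# Hypothetical labels, for illustration only.
example = {
    "hash_a": ["Trojan.Linux.Filecoder.x", "Generic.Malware"],
    "hash_b": ["Linux.Mirai.Botnet"],
    "hash_c": ["Ransom.Linux.Example"],
    "hash_d": ["Backdoor.Linux.Decoder"],
}

print(flag_candidates(example))  # ['hash_a', 'hash_c', 'hash_d']
```

Note that `hash_d` is flagged only because "Decoder" contains "coder". A substring match therefore yields candidates, not confirmed ransomware, which is consistent with each candidate still having to be tested for working encryption behaviour.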

The aforementioned searches produced a total of 47 samples consisting of Linux binaries and bash scripts that exhibited working encryption capabilities. The limited number of encryption samples was a result of several issues encountered while testing the samples. Similar to the Android Applications dataset, some samples were unable to run due to corrupted files or other code issues. Other samples, which specifically target ESXi servers, did not encrypt the emulated environment because the specific file or key they require was not found in the system. Further inspection of the 47 samples showed that they were split between 15 different families. This can be seen in Table 4.2, which shows the distribution of samples based on family. It is important to note that some families shown in Table 4.2, such as Sodinokibi and REvil, can often be classified as the same family due to the stark similarities they share with each other. However, the classification of families was derived from various anti-virus engines using VirusTotal. Hence, in our Linux malicious dataset, these families were considered separate.

Due to the limited number of samples acquired from the aforementioned repositories, open-source ransomware projects on GitHub (n.d.) were also included to increase the sample size of the dataset. In total, 11 open-source ransomware projects were explored. Similarly, Table 4.3 shows the distribution of open-source ransomware by project name.

A notable open-source project was RAASNet (leonv024 and HugoLB, 2019), which can be configured differently based on the functionalities selected. In total, RAASNet contained four different forms of encryption: Ghost, Wiper, Pycrypto, and PyAES. The open-source project also provides two different file removal behaviours: "unlink and remove" and "overwrite and rename". In addition to these behaviours, there is an option to automatically remove the payload after the encryption process. A total of 16 different payloads were generated using different configurations of these functionalities. With the inclusion of the 16 samples from RAASNet and the aforementioned open-source ransomware, the number of samples increased by a further 26, which yielded a dataset consisting of 73 ransomware samples from 26 different families exhibiting encryption behaviours.
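The count of 16 payloads is consistent with taking every combination of the three options described: four encryption forms, two file removal behaviours, and the payload self-removal option switched on or off (4 × 2 × 2 = 16). The sketch below enumerates those combinations. It assumes a full cross-product, which the text does not state explicitly, and it reads the two removal behaviours as "unlink and remove" and "overwrite and rename". The dictionary keys are hypothetical and do not correspond to RAASNet's own configuration fields.

```python
from itertools import product

ENCRYPTION_MODES = ["Ghost", "Wiper", "Pycrypto", "PyAES"]
REMOVAL_BEHAVIOURS = ["unlink and remove", "overwrite and rename"]
SELF_REMOVE_OPTIONS = [False, True]  # remove the payload after encryption?


def enumerate_payload_configs():
    """List every combination of the three options (assumed full cross-product)."""
    return [
        {"encryption": enc, "removal": rem, "self_remove": self_remove}
        for enc, rem, self_remove in product(
            ENCRYPTION_MODES, REMOVAL_BEHAVIOURS, SELF_REMOVE_OPTIONS
        )
    ]


configs = enumerate_payload_configs()
print(len(configs))  # 16

# Sample totals as reported in the text.
REPOSITORY_SAMPLES = 47   # from VirusShare and Malware Bazaar
OPEN_SOURCE_SAMPLES = 26  # RAASNet payloads plus other open-source projects
print(REPOSITORY_SAMPLES + OPEN_SOURCE_SAMPLES)  # 73
```

Of the 26 open-source samples, 16 are RAASNet payloads, which leaves 10 samples from the other open-source projects.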

The large number of RAASNet variants may introduce bias in the dataset. However, these samples were included to offset the small number of samples available.