Malware Detection Using K-Nearest Neighbor Algorithm and Feature Selection
Catur Supriyanto1, Fauzi Adi Rafrastara1,*, Afinzaki Amiral1, Syafira Rosa Amalia1, Muhammad Daffa Al Fahreza1, Mohd.Faizal bin Abdollah2
1Faculty of Computer Science, Informatics Engineering, Universitas Dian Nuswantoro, Semarang, Indonesia
2Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
Email: 1[email protected], 2,*[email protected], 3[email protected], 4[email protected], 5[email protected], [email protected]
Corresponding Author Email: [email protected]
Abstract−Malware is one of the biggest threats in today’s digital era. Malware detection is crucial because it protects devices and systems from the dangers posed by malware, such as data loss or damage, data theft, account break-ins, and intruders gaining full access to the system. Considering that malware has evolved from its traditional form (monomorphic) to modern forms (polymorphic, metamorphic, and oligomorphic), a malware detection system is needed that is no longer signature-based but machine learning-based. This research discusses malware detection by classifying files as either malware or goodware using one of the classification algorithms in machine learning, namely k-Nearest Neighbor (kNN). To improve the performance of kNN, the number of features was reduced using the Information Gain and Principal Component Analysis (PCA) feature selection methods. The performance of kNN with PCA and with Information Gain was then compared to find the best configuration. Using the PCA method, with the features reduced to 32 principal components (PCs), the kNN algorithm maintained its classification performance with an accuracy of 95.6% and an F1-Score of 95.6%. Using the same number of features as a baseline, the Information Gain method was applied by ranking the features by Information Gain score and taking the 32 best features. With Information Gain, the classification performance of the kNN algorithm increased to 96.9% for both accuracy and F1-Score.
Keywords: Classification; Feature Selection; Information Gain; k-Nearest Neighbor; Malware Detection
1. INTRODUCTION
Malware comes from the words "malicious" and "software." A piece of software is considered malicious when it intentionally performs something abnormal and has a negative impact on the victim's computer for personal gain [1], [2]. When a user sends a file from one computer to another through legal means, this activity is referred to as genuine activity. However, if it happens without the user's knowledge, it is called malicious activity [3], [4]. If such malicious activity is carried out automatically by software, that software is considered malicious software, or malware. Detrimental activities of malware include data manipulation, data encryption, and monitoring the victim's activities without their knowledge [5], [6]. For a cybercriminal, malware is a tool to exploit security vulnerabilities in the target system in order to infiltrate the system or device. Malware works according to its respective type, including viruses, worms, trojan horses, adware, spyware, rootkits, bots, ransomware, etc. [2], [7].
Various efforts to defend against malware attacks have been developed over a long period. The first known computer virus was called 'Elk Cloner' [8]. It was created by Rich Skrenta, a 15-year-old teenager from the United States. The virus could infect Apple II computers and replicate itself for wider distribution. The first antivirus program was created in 1987 by Bernd Robert, a computer security expert from Germany [8], [9]. This antivirus could combat a virus called Vienna, which infected *.com files on DOS-based systems. The method used by these early antiviruses is the classic one: detection is done by computing the fingerprint of a file and comparing it against the list of virus fingerprints already recorded in a database. Once the antivirus finds the fingerprint in the database, the file is identified as a virus; otherwise, it is considered a normal file, or goodware.
However, virus detection nowadays is not solely based on fingerprinting but has expanded to encompass virus behavior. This development is a response to the evolution of malware, which has moved from being monomorphic to polymorphic, metamorphic, and oligomorphic [6], [10]. Behavior-based malware detection is not as straightforward as fingerprint-based detection. It involves analyzing a vast number of features, reaching thousands of attributes or features. If there are too many features to analyze, it can impact the performance of machine learning algorithms, especially in terms of execution time. Therefore, the focus of this research is on feature selection in malware datasets. By doing so, a reduced number of features can be obtained without compromising accuracy and F1-Score performance, and in some cases, it may even lead to improvements.
Malware detection continues to be a hot topic in the field of information security and computer security.
Research on malware is not limited to Windows-based devices; it has also extended to other platforms, including Android. In their publications, [11] and [12] applied machine learning algorithms to detect malware on Android-based operating systems. In the study presented in [11], researchers compared the performance of five machine learning algorithms (Logistic Regression, kNN, Support Vector Classifier, Decision Tree, and MLP) in classifying files as either malware or goodware. The results showed that kNN achieved the highest F1-Score at 85%. Meanwhile, in [12], a broader range of machine learning algorithms was compared for Android malware cases, including Linear SVM, Naïve Bayes, kNN, Decision Tree, Boosted Tree, Extra Trees, Random Forest, XGBoost, and Stacking. Testing against two different datasets showed that the Stacking algorithm outperformed the other eight algorithms with an F1-Score reaching 95%.
Regarding malware detection on the Windows operating system, the researchers in [13] utilized the Random Forest algorithm and achieved an accuracy score of 95.26%. The paper [14] also explores the use of machine learning for malware detection on the Windows operating system. The highest accuracy score was again obtained by the Random Forest algorithm, with 96.8% for binary classification and 95.69% for multiclass classification; however, the highest F1-Score, 92%, was achieved by the Hard Voting algorithm. The compared algorithms included kNN, SVM, Bernoulli Naïve Bayes, Random Forest, Hard Voting, Logistic Regression, and Decision Tree.
In the studies mentioned above, high accuracy scores above 95% have been achieved. Unfortunately, for malware cases, recall and precision are more critical than accuracy. Errors in detection, both false negatives and false positives, can have adverse effects on the system and the user. Furthermore, recall and precision values must be improved since even a single false negative can lead to a computer being infected. Hence, zero tolerance for false negatives is crucial. One way to enhance the performance of machine learning algorithms to reach this goal is by feature selection. This research focuses on how to improve the performance of machine learning algorithms by applying feature selection.
2. RESEARCH METHODOLOGY
This research involves several stages that are carried out sequentially. Figure 1 illustrates the process of conducting this research. The first step is dataset collection and preparation. The collected data still has some issues that need to be addressed, so it must be prepared before being sent to the pre-processing stage. An in-depth discussion of dataset collection and preparation can be found in Subchapter 2.2. After the dataset is collected and prepared, it moves to the pre-processing stage. Pre-processing is a crucial stage in data analysis in which raw data is cleaned and transformed, ensuring it is in the optimal format for further analysis or model building. This stage includes class balancing, constant features removal, normalization, and feature selection; it is explained in detail in Subchapter 2.3. Once the dataset has been pre-processed, it can be used for modelling. In this stage, the kNN algorithm is applied, as discussed in Subchapter 2.4. Before evaluating the performance of the kNN algorithm, validation must be conducted; to reduce the risk of overfitting, 10-fold cross-validation is employed. Validation and evaluation are discussed in Subchapters 2.5 and 2.6 respectively.
Figure 1. The experimental procedure

2.1. Hardware and Software
One of the factors that determine the smoothness of the research process is the supporting instruments, which include both hardware and software. Good software without adequate hardware support cannot perform optimally. Conversely, high-spec hardware without the right software will not be very helpful either. Therefore, both hardware and software play crucial roles in supporting the smooth progress of research. In this research, the computer used has the following specifications:
Processor : Intel Xeon E5620
RAM : 16 GB
Harddisk : 3 TB
VGA : Radeon RX550
Meanwhile, the software used in this research includes Microsoft Excel and Orange Data Mining (https://orangedatamining.com/). Microsoft Excel plays a crucial role in the preprocessing stage, especially in Class Balancing and Constant Features Removal. On the other hand, Orange software is used in the stages of Normalization, Features Selection, Modeling, Validation, and Evaluation.
2.2. Dataset Collection & Preparation
In this stage, the malware dataset was downloaded from the UCI Machine Learning Repository with details as seen in Table 1. In the downloaded file, there are three dataset files, consisting of a goodware dataset, a malware dataset from VirusTotal, and a malware file dataset from VxHeaven. Each of these datasets contains records of file’s activities that have been executed in a virtual or sandboxed environment. The results of these activity recordings are organized in a tabular format, resulting in 1085 features or more.
Table 1. Details of the dataset used

Dataset Name        Malware static and dynamic features VxHeaven and VirusTotal Data Set
Number of Files     3 (consisting of goodware, malware from VirusTotal, and malware from VxHeaven files)
Number of Rows      Goodware: 595; VirusTotal: 2955; VxHeaven: 2698
Number of Features  Goodware: 1085; VirusTotal: 1087; VxHeaven: 1087 (excluding labels)
Missing Values      None
In the goodware dataset, there are 595 records of file activities that are not indicated as malware. In this first dataset, the number of successfully extracted features is 1085. Then, in the second dataset, which is the malware from VirusTotal, there are 2955 records of malware activity with 1087 features. Meanwhile, in the third dataset, which is malware from VxHeaven, there are 2698 malware data with 1087 successfully extracted features.
These three datasets cannot be merged yet due to the difference in the number of features. Therefore, an intervention is needed, such as the removal of certain features, to ensure that each dataset shares similarity in terms of feature names, feature order, and the number of features. Some features removed from the malware dataset are vbaVarIndexLoad and SafeArrayPtrOfIndex. Both features have a constant value of ‘0’ and are not present in the goodware dataset. By removing these two features from the malware dataset, the number of features in both the malware and goodware datasets is now 1085. From each dataset, there is still one feature that can be removed, called filename, as this feature is not significant for the classification task to be performed. Finally, the remaining number of features ready for further processing is 1084.
The next step is to add labels to each of these datasets. All data in the goodware dataset is labeled as '0,' and all data in the malware dataset is labeled as '1.' After ensuring that all data in these three datasets have the same features and have had labels added, the three datasets can be merged into a single dataset, resulting in a total of 6248 data points with 1084 features and one label.
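The preparation steps above (dropping the two extra features, removing filename, labeling, and merging) can be sketched in Python with pandas; the tiny DataFrames here are placeholders for the actual UCI files, and the column names besides the two named features are illustrative:

```python
import pandas as pd

# Hypothetical stand-ins for the goodware and malware CSV files.
goodware = pd.DataFrame({"filename": ["a.exe", "b.exe"], "feat_1": [0, 1]})
malware_vt = pd.DataFrame({"filename": ["m1.exe"], "feat_1": [1],
                           "vbaVarIndexLoad": [0], "SafeArrayPtrOfIndex": [0]})

# Drop the two constant features absent from the goodware dataset,
# plus the uninformative 'filename' column from both datasets.
malware_vt = malware_vt.drop(columns=["vbaVarIndexLoad", "SafeArrayPtrOfIndex"])
goodware = goodware.drop(columns=["filename"])
malware_vt = malware_vt.drop(columns=["filename"])

# Label: 0 = goodware, 1 = malware, then merge into a single dataset.
goodware["label"] = 0
malware_vt["label"] = 1
merged = pd.concat([goodware, malware_vt], ignore_index=True)
```

On the real files this produces the single 6248-row table with 1084 features plus one label described above.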
2.3. Pre-Processing
In this stage, the dataset prepared earlier will undergo further processing before being used for modeling. In the Pre-Processing stage, there are four steps that need to be taken, namely Class Balancing, Constant Features Removal, Normalization, and Feature Selection.
Class Balancing is a way to balance the number of classes in a dataset. Unaddressed class imbalances may result in incorrect predictions for the minority class. It’s crucial to accurately predict the minority class as inaccuracies can lead to severe consequences or substantial costs. The issue of class imbalance can be tackled using data-level and algorithm-level strategies, including oversampling and undersampling [15], [16].
Oversampling is a technique that adjusts the class distribution of a dataset by increasing the size of the infrequent class, which improves the model's performance on the minority class. Undersampling, by contrast, adjusts the class distribution by reducing the size of the abundant class; this prevents the majority class from dominating the learning algorithm.
In the malware dataset used in this research, there is a significant difference in the number of classes, with 5653 for malware and 595 for goodware, resulting in an Imbalanced Ratio (IR) of 1:9.5. According to [17], [18], an IR > 9 falls into the category of Medium Imbalance and therefore needs to be balanced. The class balancing method used in this research is Random Under Sampling (RUS).
In the journal of [17], the RUS method yielded quite good performance when applied with various machine learning algorithms for malware detection. With RUS, the number of malware data is significantly reduced to match the number of goodware data, which is 595. The reduction in the number of malware data is done randomly.
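Random Under Sampling can be sketched as follows (the paper performed this step in Microsoft Excel; this numpy version is an illustrative equivalent, with a fixed seed so the random reduction is reproducible):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def random_under_sample(X, y, minority_label):
    """Randomly drop majority-class rows until both classes
    have the same count, as RUS does."""
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.sort(np.concatenate([minority_idx, keep_majority]))
    return X[keep], y[keep]

# Toy illustration: 10 'malware' rows reduced to match 3 'goodware' rows.
X = np.arange(13).reshape(-1, 1)
y = np.array([0, 0, 0] + [1] * 10)
X_bal, y_bal = random_under_sample(X, y, minority_label=0)
```

On the actual dataset this reduces the 5653 malware rows to 595, matching the goodware class.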
After the dataset has a balanced number of data points for each class, the next step is to remove features with constant values (Constant Features Removal). A constant feature takes only one value across all records, so it carries no information and provides no discriminative power for the classification task. Hence, features with constant values can be removed. Out of the 1084 features ready for processing, 933 features turned out to have constant values and were removed. Thus, 151 features remain for the next process, which is Normalization.
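Constant-feature removal amounts to dropping every column with a single distinct value; a minimal numpy sketch (the paper used Excel for this step):

```python
import numpy as np

def remove_constant_features(X):
    """Keep only columns that take more than one distinct value."""
    varying = np.array([len(np.unique(X[:, j])) > 1 for j in range(X.shape[1])])
    return X[:, varying], varying

# Toy example: the middle column is constant and gets dropped.
X = np.array([[1.0, 7.0, 0.0],
              [2.0, 7.0, 1.0],
              [3.0, 7.0, 0.0]])
X_reduced, mask = remove_constant_features(X)
```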
Normalization is one of the most important steps in pre-processing. In the normalization phase, features are transformed to have the same range of values [19]. Consequently, features with large numerical values cannot dominate features with smaller numerical values. The main goal of normalization is to minimize bias [20]. There are three types of normalization, depending on the type of values [21]. (1) The MinMax Scaler, also known as MinMax Normalization, is suitable for numerical values; it maps all numeric values into a range between 0 and 1. (2) Binarization is suitable for columns with two categorical variables, converting them into Boolean variables; for instance, a light sensor state is encoded such that OFF becomes 0 and ON becomes 1. (3) One-hot encoding is used for columns with multiple categories. In this research, the normalization method used is MinMax Normalization.
With MinMax Normalization, all values of a feature will be transformed into values within the range of 0 and 1. Equation 1 is used to apply normalization using MinMax Normalization.
vi = ((xi − xmin) / (xmax − xmin)) (vmax − vmin) + vmin (1)
In Equation 1 above, vi refers to the new value formed in row i, xi is the value in row i that will be normalized, xmin represents the minimum or smallest value of a feature, while xmax represents the maximum value of a feature. In addition, vmax is the new maximum value limit for a feature (which is 1), while vmin becomes the new minimum value limit for a feature (which is 0). The before and after implementation of MinMax Normalization on the dataset can be seen in Figure 2. All values are converted into new values within the range of 0 to 1.
Figure 2. Before (left) and after (right) MinMax Normalization
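Equation 1 can be applied per feature column as a one-line function; a minimal numpy sketch with the default bounds vmin = 0 and vmax = 1 used in this research:

```python
import numpy as np

def minmax_normalize(x, v_min=0.0, v_max=1.0):
    """Equation 1: rescale a feature column into [v_min, v_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (v_max - v_min) + v_min

# Toy feature column: 10 -> 0.0, 20 -> 1/3, 40 -> 1.0
values = np.array([10.0, 20.0, 40.0])
scaled = minmax_normalize(values)
```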
The final step in the pre-processing of this research is Feature Selection. Through the earlier Constant Features Removal step, the number of features was reduced from the original 1084 to 151 features. Because this
number was still considered too high, feature selection was performed to optimize the performance of machine learning algorithms. The Principal Component Analysis (PCA) method was used previously in [22] with the aim to reduce features in malware dataset. As a result, with 32 Principal Components (PC), the best accuracy and recall values were achieved in malware detection. The variance value was also ideal, exceeding 80%. The 32 PC obtained using the PCA method will be used as a reference in this research, which involves reducing features using other feature selection methods to retain the top 32 features. The following are the steps for PCA feature selection:
1. Calculate the covariance matrix (Equation 2).
Cov(x, y) = (∑ xy) / n − (x̄)(ȳ) (2)
2. Compute eigenvalues (Equation 3).
det(A − λI) = 0 (3)
3. Calculate eigenvectors (Equation 4).
[𝐴 − 𝜆𝐼][𝑋] = [0] (4)
4. Determine new variables (Principal Components) by multiplying the original variables by the eigenvector matrix.
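The four PCA steps above can be sketched directly with numpy's linear algebra routines (this is an illustrative implementation, not the Orange widget actually used in the experiments):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Steps 1-4: covariance matrix, eigenvalues/eigenvectors,
    then project onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)            # step 1
    eigvals, eigvecs = np.linalg.eigh(cov)            # steps 2-3
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    return X_centered @ eigvecs[:, order]             # step 4

# Toy data: 5 samples, 3 features reduced to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X_pca = pca_reduce(X, n_components=2)
```

In this research the same projection reduced the 151 remaining features to 32 PCs.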
Next, using Information Gain, the top 32 features with the highest scores will be selected. Information Gain is one of the most common and frequently used feature selection methods [23]. The best features are determined based on their entropy values (Equation 5).
Entropy(S) = ∑i −Pi log2 Pi (5)
After obtaining the entropy value, Information Gain can be calculated using Equation 6. The top 32 features with the highest Information Gain values can be seen in Table 2. The highest IG score, 0.761, is obtained by Minor_image_version. The highest IG score indicates the attribute that provides the most useful information for classification: it reduces uncertainty (entropy) more than any other attribute and is therefore the best at splitting the data into separate classes. In other words, Minor_image_version is the most informative and discriminative feature among all those considered for the malware classification task, and it contributes the most to the decision-making process in models such as decision trees and kNN.
Gain(S, A) = Entropy(S) − ∑v∈Values(A) (|Sv| / |S|) Entropy(Sv) (6)
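Equations 5 and 6 can be implemented in a few lines of plain Python; this sketch computes IG for a single categorical feature (continuous features such as those in Table 2 would first be discretized):

```python
import math
from collections import Counter

def entropy(labels):
    """Equation 5: Shannon entropy of a label list."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Equation 6: entropy reduction from splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy example: a feature that perfectly separates the two classes
# has IG equal to the full class entropy (here 1.0 bit).
labels = [0, 0, 1, 1]
perfect_feature = ["a", "a", "b", "b"]
ig = information_gain(perfect_feature, labels)  # 1.0
```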
Table 2. List of the top 32 features with the highest Information Gain values
No. Features IG Score
1. Minor_image_version 0.761
2. Minor_operating_system_version 0.716
3. Major_operating_system_version 0.639
4. Size_of_stack_reverse 0.626
5. Compile_date 0.593
6. Minor_linker_version 0.566
7. Major_image_version 0.555
8. Major_subsystem_version 0.523
9. Dll_characteristics 0.485
10. Minor_subsystem_version 0.477
11. CheckSum 0.408
12. Major_linker_version 0.395
13. Characteristics 0.389
14. Number_of_IAT_entires 0.257
15. Number_of_IAT_entires.1 0.257
16. Pushf 0.243
17. Size_of_stack_commit 0.239
18. Files_operations 0.221
19. .text: 0.205
20. Count_dll_loaded 0.202
21. SizeOfHeaders 0.195
22. Size_of_headers 0.195
23. SizeOfHeaders.1 0.195
24. Number_of_sections.1 0.194
25. Not 0.192
26. Count_file_opened 0.180
27. Bt 0.131
28. Nop 0.129
29. Count_file_read 0.119
30. Number_of_imports.1 0.113
31. Int 0.110
32. Rol 0.109
2.4. Modelling
After the Pre-Processing stage is completed, the next step is to implement the machine learning algorithm. The algorithm used in this research is k-Nearest Neighbor (kNN). kNN is a supervised machine learning algorithm applicable to both classification and regression problems [24]. It is non-parametric, meaning it makes no assumptions about the underlying data distribution. kNN is often referred to as a lazy learner because it does not learn during the training phase; instead, it stores the data points and defers computation to the testing phase, classifying new data points based on their distance to known data points. The steps involved in the kNN algorithm are:
1. Selection of the K value.
2. Calculation of the distance between all the training points and new data points.
3. Sorting of the computed distances in ascending order.
4. Selection of the first K distances from the sorted list.
5. Computation of the mode (for classification problems) or the mean (for regression problems) of the classes associated with these distances.
The most commonly used distance metric in kNN is Euclidean distance (Equation 7), where p and q are two points in the dataset:

d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²) (7)

The kNN algorithm can use various distance metrics, including Minkowski, Manhattan, Euclidean, Cosine, and Jaccard distance. However, in this research, the distance metric used is Euclidean distance.
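Steps 1-5 can be sketched compactly with numpy; this illustrative implementation (the experiments themselves used Orange's kNN widget) performs the distance computation of Equation 7, sorting, and majority vote:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Steps 1-5: Euclidean distances (Equation 7), sort ascending,
    take the k nearest, and return the majority class."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D example: the new point sits among class-1 neighbors.
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([1.0, 0.95]), k=3)  # -> 1
```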
The choice of the k value is crucial. A low value of k may lead to overfitting, while a high value may lead to underfitting. In this malware detection experiment, the kNN algorithm with k = 3 produced the best performance compared to k = 5 and k = 7. When determining the value of k, it is recommended to use odd numbers or prime numbers greater than 2 [25]. The results of testing kNN with various k values are discussed in Chapter 3. In this experiment, kNN is used because it performed better than two other classification algorithms, namely Naïve Bayes and Logistic Regression (see Table 3). With k = 3, kNN outperformed Naïve Bayes and Logistic Regression, achieving 95.8% for both accuracy and F1-Score. Naïve Bayes performed the worst, obtaining only 92% for both accuracy and F1-Score.
Table 3. The performance comparison among 3 classification algorithms

Algorithm            Accuracy  F1-Score
k-Nearest Neighbor   95.8 %    95.8 %
Naïve Bayes          92.0 %    92.0 %
Logistic Regression  95.6 %    95.6 %

2.5. Validation
In the validation stage, the type of validation used is Cross-Validation, with a quantity of k = 10 (10-fold cross- validation). Cross-validation is useful for maintaining the performance of the classification algorithm. Cross- validation is a robust technique used in machine learning to assess the performance of a model on an independent data set and to tune model parameters. It’s particularly useful when the available data is limited. In k-fold cross- validation, the data is divided into ‘k’ subsets of roughly equal size. The model is trained on ‘k-1’ subsets, and the remaining subset is used as a validation set. This process is repeated ‘k’ times, with each subset serving as the validation set once. The model’s performance is then averaged over the ‘k’ trials to provide a more accurate measure of its effectiveness. When k equals 10, it’s known as 10-fold cross-validation. This method helps to give a more generalized model and prevents overfitting by providing a more reliable estimate of model performance [26], [27].
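The fold construction described above can be sketched as an index generator; each sample lands in exactly one validation fold (this is an illustrative version of what Orange's cross-validation does internally):

```python
import numpy as np

def kfold_indices(n_samples, k=10):
    """Split sample indices into k roughly equal folds; each fold serves
    once as the validation set, the remaining k-1 folds as training data."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# 10-fold split of 100 samples: 10 folds of 90 train / 10 validation indices.
splits = list(kfold_indices(100, k=10))
```

In practice the model is trained and scored once per split, and the ten scores are averaged.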
2.6. Evaluation
To measure the performance of the classification algorithm, a confusion matrix is used to obtain the performance scores (Table 4). A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. It comprises four components: True Positives, True Negatives, False Positives, and False Negatives, which represent the correctly and incorrectly classified instances. By comparing the actual and predicted classes, it provides a comprehensive summary of model performance. In machine learning, it is an essential tool for evaluating metrics such as accuracy, precision, recall, and F1-Score. In this research, the evaluation metrics used are Accuracy and F1-Score.
Table 4. Confusion Matrix

                          Actual Class
                          Positive  Negative
Predicted   Positive      TP        FP
Class       Negative      FN        TN
TP (True Positive) is the number of times a model correctly predicts the positive class. Meanwhile, TN (True Negative) is how many times a model correctly predicts the negative class. On the other hand, FP (False Positive) refers to how many times a model is wrong in predicting the positive class, and FN (False Negative) is about how many times a model is wrong in predicting the negative class. Accuracy is used to determine how often the classification algorithm predicts correctly [28], [29]. To calculate the accuracy score, the formula can be seen in Equation 8.
Accuracy = (TP + TN) / (TP + FP + TN + FN) (8)
The second evaluation metric is the F1-Score. F1-Score is used to obtain a balance between Precision (Equation 9) and Recall (Equation 10) [28], [29]. The formula used to calculate the F1-Score can be seen in Equation 11.
Precision = TP / (TP + FP) (9)

Recall = TP / (TP + FN) (10)

F1-Score = 2 × (Precision × Recall) / (Precision + Recall) (11)
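Equations 8-11 translate directly from the confusion-matrix counts; the counts in this toy call are made up for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations 8-11 applied to confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Toy counts: 90 TP, 85 TN, 10 FP, 15 FN.
acc, prec, rec, f1 = classification_metrics(90, 85, 10, 15)
```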
3. RESULT AND DISCUSSION
In the downloaded dataset (from UCI machine learning repository), there are three dataset files that need to be merged. Those three files consist of a goodware dataset file, a malware dataset file from VirusTotal, and a malware dataset file from VxHeaven. Those files contain records of file’s activities that have been executed in a virtual environment, called sandbox. The results of these activity recordings are organized in a tabular format, resulting in 1085 (for goodware) and 1087 (for malware) features. Since those 3 datasets have different numbers of features, they cannot be merged. As a solution, two features (vbaVarIndexLoad and SafeArrayPtrOfIndex) from malware datasets were removed. Both features have a constant value of ‘0’ and are not present in the goodware dataset.
Once the three dataset files have the same number of features, they can be merged. In this stage, the total number of records is 6248, consisting of 595 records of goodware and 5653 records of malware. Because the number of classes is not balanced, it is necessary to carry out class balancing, which will be done at the pre-processing stage.
In the initial pre-processing stage, class balancing was performed to balance the classes between malware and goodware. Using the Random Under Sampling method, the 5653 data points with the malware label were reduced to match the number of data points labeled as goodware, which is 595. Therefore, the total number of data points after the class balancing process amounts to 1190.
Next, the process of constant features removal was executed to eliminate features with constant values. In the dataset used in this research, there are 1084 features (excluding labels/classes). In this constant feature removal stage, 936 features were successfully detected and reduced. Therefore, there are 148 features remaining, all of which have numerical values.
This total of 148 features needs to be further reduced to enhance the performance of the machine learning algorithm. The fewer features processed will naturally result in an improvement in the algorithm's speed.
Therefore, two feature selection algorithms were compared to obtain the best feature selection algorithm to be combined with the k-Nearest Neighbor classification algorithm.
The two feature selection algorithms used are: Principal Component Analysis (PCA) and Information Gain.
By using PCA, the best performance was achieved when the number of features was reduced to 32 features.
Therefore, this set of 32 features will be retained and serve as a reference when conducting feature selection using Information Gain.
The experimental results can be seen in Table 5 and Table 6. Testing was conducted using several variations of the kNN algorithm, with k = 3, k = 5, and k = 7. kNN was chosen because it has the best performance among the three classification algorithms compared, the others being Naïve Bayes and Logistic Regression (see Table 3). The best results were obtained with k = 3. Next, the kNN algorithm was tested using datasets whose features had been reduced, both with PCA and with Information Gain. The best performance of kNN was achieved when the dataset's features were reduced using Information Gain, with accuracy and F1-Score reaching 96.9%. Without feature selection, the best accuracy and F1-Score of the kNN algorithm reached 95.8% with k = 3 (see Table 5), which is the same as the values obtained when applying feature selection using PCA. By implementing feature selection (reducing the data to 32 features), the process can at least be completed in a shorter time than with 1085 features. With Information Gain-based feature selection, however, not only is processing speed improved, but accuracy and F1-Score also increase significantly, reaching 96.9%. This result also outperforms the Random Forest accuracy reported in previous research [13], [14].
Table 5. The test results of kNN without feature selection

        Accuracy  F1-Score
k = 3   95.8 %    95.8 %
k = 5   95.6 %    95.6 %
k = 7   95.2 %    95.2 %
Table 6. The test results of kNN with feature selection

        PCA (32 Features)      Information Gain (32 Features)
        Accuracy  F1-Score     Accuracy  F1-Score
k = 3   95.8 %    95.8 %       96.9 %    96.9 %
k = 5   95.6 %    95.6 %       96.2 %    96.2 %
k = 7   95.2 %    95.2 %       96.7 %    96.7 %
4. CONCLUSION
The diversity of malware types and their rapid spread pose a serious threat in today's digital world. Conventional fingerprint-based antivirus software increasingly struggles to deal with metamorphic and oligomorphic malware. To combat these types of malware, machine learning is needed to strengthen antivirus detection capabilities. This research focuses on improving the performance of the kNN classification algorithm through feature selection for malware detection. As a result, the performance of the kNN classification algorithm can be enhanced in two ways: by selecting the appropriate value of k (in this case, k = 3) and by using the right feature selection method (in this case, Information Gain). By combining these two choices, the accuracy and F1-Score of the kNN algorithm improve to 96.9%.
ACKNOWLEDGMENT
Thanks are extended to the Research and Community Service Institute (LPPM) and the Faculty of Computer Science, Universitas Dian Nuswantoro for their support in providing facilities and funding for this research.
REFERENCES
[1] N. A. Azeez, O. E. Odufuwa, S. Misra, J. Oluranti, and R. Damaševičius, “Windows PE Malware Detection Using Ensemble Learning,” Informatics, vol. 8, no. 1, p. 10, Feb. 2021, doi: 10.3390/informatics8010010.
[2] O. Aslan and R. Samet, “A Comprehensive Review on Malware Detection Approaches,” IEEE Access, vol. 8, pp. 6249–6271, 2020, doi: 10.1109/ACCESS.2019.2963724.
[3] F. A. Rafrastara and F. M. A., “Advanced Virus Monitoring and Analysis System,” IJCSIS, vol. 9, no. 1, 2011.
[4] C. S. Yadav and S. Gupta, “A Review on Malware Analysis for IoT and Android System,” SN COMPUT. SCI., vol. 4, no. 2, p. 118, Dec. 2022, doi: 10.1007/s42979-022-01543-w.
[5] A. Kamboj, P. Kumar, A. K. Bairwa, and S. Joshi, “Detection of malware in downloaded files using various machine learning models,” Egyptian Informatics Journal, vol. 24, no. 1, pp. 81–94, Mar. 2023, doi: 10.1016/j.eij.2022.12.002.
[6] A. Sharma and S. K. Sahay, “Evolution and Detection of Polymorphic and Metamorphic Malwares: A Survey,” IJCA, vol. 90, no. 2, pp. 7–11, Mar. 2014, doi: 10.5120/15544-4098.
[7] S. Aurangzeb, H. Anwar, M. A. Naeem, and M. Aleem, “BigRC-EML: big-data based ransomware classification using ensemble machine learning,” Cluster Comput, vol. 25, no. 5, pp. 3405–3422, Oct. 2022, doi: 10.1007/s10586-022-03569- 4.
[8] M. J. Hossain Faruk et al., “Malware Detection and Prevention using Artificial Intelligence Techniques,” in 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA: IEEE, Dec. 2021, pp. 5369–5377. doi:
10.1109/BigData52589.2021.9671434.
[9] B. Kundu, N. Gupta, and R. Seal, “Cyber Vulnerabilities in Smart Grid: A Review,” International Journal of Engineering Research, vol. 9, no. 11, 2021.
[10] F. A. Rafrastara, C. Supriyanto, C. Paramita, and Y. P. Astuti, “Deteksi Malware menggunakan Metode Stacking berbasis Ensemble,” JPIT, vol. 8, no. 1, pp. 11–16, 2023.
[11] S. Shakya and M. Dave, “Analysis, Detection, and Classification of Android Malware using System Calls,” 2022, [Online]. Available: https://arxiv.org/pdf/2208.06130.pdf
[12] P. Feng, J. Ma, C. Sun, X. Xu, and Y. Ma, “A Novel Dynamic Android Malware Detection System With Ensemble Learning,” IEEE Access, vol. 6, pp. 30996–31011, 2018, doi: 10.1109/ACCESS.2018.2844349.
[13] F. C. C. Garcia and F. P. M. Ii, “Random Forest for Malware Classification”, Accessed: Jun. 01, 2023. [Online].
Available: https://arxiv.org/ftp/arxiv/papers/1609/1609.07770.pdf
[14] I. Shhadat, B. Bataineh, A. Hayajneh, and Z. A. Al-Sharif, “The Use of Machine Learning Techniques to Advance the Detection and Classification of Unknown Malware,” Procedia Computer Science, vol. 170, pp. 917–922, 2020, doi:
10.1016/j.procs.2020.03.110.
[15] J. Hong, H. Kang, and T. Hong, “Oversampling-based prediction of environmental complaints related to construction projects with imbalanced empirical-data learning,” Renewable and Sustainable Energy Reviews, vol. 134, p. 110402, Dec. 2020, doi: 10.1016/j.rser.2020.110402.
[16] W. Chandra, B. Suprihatin, and Y. Resti, “Median-KNN Regressor-SMOTE-Tomek Links for Handling Missing and Imbalanced Data in Air Quality Prediction,” Symmetry, vol. 15, no. 4, p. 887, Apr. 2023, doi: 10.3390/sym15040887.
[17] F. A. Rafrastara, C. Supriyanto, C. Paramita, Y. P. Astuti, and F. Ahmed, “Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method,” JPIT, vol. 8, no. 2, pp. 113–118, 2023.
[18] Q. Fan, Z. Wang, D. Li, D. Gao, and H. Zha, “Entropy-based fuzzy support vector machine for imbalanced datasets,”
Knowledge-Based Systems, vol. 115, pp. 87–99, Jan. 2017, doi: 10.1016/j.knosys.2016.09.032.
[19] A. Pandey and A. Jain, “Comparative Analysis of KNN Algorithm using Various Normalization Techniques,” IJCNIS, vol. 9, no. 11, pp. 36–42, Nov. 2017, doi: 10.5815/ijcnis.2017.11.04.
[20] D. Singh and B. Singh, “Investigating the impact of data normalization on classification performance,” Applied Soft Computing, vol. 97, p. 105524, Dec. 2020, doi: 10.1016/j.asoc.2019.105524.
[21] Mihoub, A., S. Zidi, and L. Laouamer, “Investigating Best Approaches for Activity Classification in a Fully Instrumented Smarthome Environment,” IJMLC, vol. 10, no. 2, pp. 299–308, Feb. 2020, doi: 10.18178/ijmlc.2020.10.2.935.
[22] F. A. Rafrastara, R. A. Pramunendar, D. P. Prabowo, E. Kartikadarma, and U. Sudibyo, “Optimasi Algoritma Random Forest menggunakan Principal Component Analysis untuk Deteksi Malware,” JTEKSIS, vol. 5, no. 3, pp. 217–223, Jul.
2023, doi: 10.47233/jteksis.v5i3.854.
[23] Kurniabudi, D. Stiawan, Darmawijoyo, M. Y. Bin Idris, A. M. Bamhdi, and R. Budiarto, “CICIDS-2017 Dataset Feature Analysis With Information Gain for Anomaly Detection,” IEEE Access, vol. 8, pp. 132911–132921, 2020, doi:
10.1109/ACCESS.2020.3009843.
[24] J. P. Mueller and L. Massaron, Machine learning for dummies, 2nd edition. Indianapolis: John Wiley & Sons, 2021.
[25] D. L. De Vargas, J. T. Oliva, M. Teixeira, D. Casanova, and J. L. G. Rosa, “Feature extraction and selection from electroencephalogram signals for epileptic seizure diagnosis,” Neural Comput & Applic, vol. 35, no. 16, pp. 12195–
12219, Jun. 2023, doi: 10.1007/s00521-023-08350-1.
[26] G. Orrù, M. Monaro, C. Conversano, A. Gemignani, and G. Sartori, “Machine Learning in Psychometrics and Psychological Research,” Front. Psychol., vol. 10, p. 2970, Jan. 2020, doi: 10.3389/fpsyg.2019.02970.
[27] G. Battineni, G. G. Sagaro, C. Nalini, F. Amenta, and S. K. Tayebati, “Comparative Machine-Learning Approach: A Follow-Up Study on Type 2 Diabetes Predictions by Cross-Validation Methods,” Machines, vol. 7, no. 4, p. 74, Dec.
2019, doi: 10.3390/machines7040074.
[28] G. Gupta, A. Rai, and V. Jha, “Predicting the Bandwidth Requests in XG-PON System using Ensemble Learning,” in 2021 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Korea, Republic of: IEEE, Oct. 2021, pp. 936–941. doi: 10.1109/ICTC52510.2021.9620935.
[29] S. Dev, B. Kumar, D. C. Dobhal, and H. Singh Negi, “Performance Analysis and Prediction of Diabetes using Various Machine Learning Algorithms,” in 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India: IEEE, Dec. 2022, pp. 517–521. doi:
10.1109/ICAC3N56670.2022.10074117.