Performance Comparison of k-Nearest Neighbor Algorithm with Various k Values and Distance Metrics for Malware Detection
Fauzi Adi Rafrastara1,*, Catur Supriyanto1, Afinzaki Amiral1, Syafira Rosa Amalia1, Muhammad Daffa Al Fahreza1, Foez Ahmed2
1Faculty of Computer Science, Department of Informatics Engineering, Universitas Dian Nuswantoro, Semarang, Indonesia
2Faculty of Engineering, Department of Information and Communication Engineering, University of Rajshahi, Rajshahi, Bangladesh
Email: 1[email protected], 2[email protected], 3[email protected],
4[email protected], 5[email protected], 6[email protected]
*Corresponding Author Email: [email protected]
Abstract−Malware can evolve and spread very quickly. These capabilities make malware a threat to anyone who uses a computer, whether offline or online. Research on malware detection therefore remains a hot topic today, driven by the need to protect devices and systems from the dangers malware poses, such as data loss or damage, data theft, account hacking, and intrusions by hackers who can take control of an entire system. Malware has evolved from traditional (monomorphic) to modern forms (polymorphic, metamorphic, and oligomorphic). Conventional antivirus systems cannot detect modern viruses effectively because these viruses constantly change their fingerprints each time they replicate and propagate. With this evolution, machine learning-based malware detection is needed to replace signature-based detection. Machine learning-based antivirus or malware detection systems detect malware through dynamic analysis rather than the static analysis used by traditional systems. This research discusses malware detection using one of the classification algorithms in machine learning, namely k-Nearest Neighbor (kNN). To improve the performance of kNN, the number of features is reduced using the Information Gain feature selection method. The performance of kNN with Information Gain is then measured using the evaluation metrics Accuracy and F1-Score. To obtain the best score, several adjustments are made to the kNN algorithm: three distance measurement methods (Euclidean, Manhattan, and Chebyshev) are compared, along with four k values (3, 5, 7, and 9). The result is that kNN with the Manhattan distance measurement method, k = 3, and the Information Gain feature selection method (reducing the feature set to 32 features) achieves the highest Accuracy and F1-Score, both 97.0%.
Keywords: Malware Detection; K-Nearest Neighbor; Euclidean; Manhattan; Chebyshev; k Value
1. INTRODUCTION
Malware, a term derived from “malicious software,” refers to any software designed to cause harm or disruption to a computer system. It operates covertly, often without the user’s knowledge, and can have serious consequences [1], [2]. These harmful effects range from unauthorized data access and data alteration to spying on the user’s activities. Malware is a tool of choice for cybercriminals, who exploit vulnerabilities in systems to gain unauthorized access [3], [4]. The types of malware are diverse, each with its own mode of operation: viruses that replicate themselves, worms that spread across networks, trojan horses that disguise themselves as legitimate software, adware that displays unwanted ads, spyware that collects user data, rootkits that gain administrative control, bots that automate tasks, and ransomware that encrypts data and demands a ransom [2], [5]. Each type poses a unique threat to digital security.
The onslaught of malware grows more massive by the day. The rapid evolution of malware, coupled with its increasingly destructive potential, raises concerns for many people, not only the general public but also the IT teams that safeguard crucial data on personal computers and in companies. Malware attacks not only Windows-based PCs but also other devices, such as tablets and smartphones, which run different operating systems [2]. Therefore, effective malware detection is essential.
Although still employed by the majority of antivirus software today, the signature-based detection method is increasingly considered obsolete [4], [6]. This is due to the evolving sophistication of malware, which has moved away from the older monomorphic paradigm and adopted polymorphic, metamorphic, or even oligomorphic forms. Signature-based detection means that the antivirus identifies malware by its fingerprint. This method is only suitable for monomorphic malware because, even as such malware multiplies, every copy retains the same fingerprint; an antivirus therefore only needs to store one fingerprint in its database to detect it.
The challenge arises when detecting polymorphic, metamorphic, or oligomorphic malware [4]. These kinds of malware can produce offspring with fingerprints completely different from their parent's. If a piece of malware replicates itself a thousand times in one activity, those thousand replicas (commonly referred to as children or offspring) will carry a thousand truly unique fingerprints, each distinct from the others. Even more concerning, these offspring can themselves replicate indefinitely, again with unique fingerprints. If such offspring continue to spread and reproduce, they can reach a very large number even though they originate from a single parent; with many different parents producing offspring in this manner, the numbers become massive. This growth corresponds directly to the number of fingerprints the antivirus must store in its database. Consequently, the antivirus database is certain to grow rapidly and significantly, which undeniably impacts the performance of the antivirus: a larger database inflates the file size, and the scanning and detection process takes longer.
Given the aforementioned challenges, research on behavior-based malware detection methods cannot be overlooked. Sandboxes and machine learning algorithms are pivotal to behavior-based detection. Research on malware is not limited to Windows-based devices but has expanded to other platforms, including Android. In their publications, [7] and [8] applied machine learning algorithms to detect malware on Android-based operating systems. [7] compared the abilities of five machine learning algorithms (Logistic Regression, kNN, Support Vector Classifier, Decision Tree, and MLP) to classify a file as either malware or goodware; the highest F1 score, 85%, was achieved by kNN. Meanwhile, [8] compared a broader set of machine learning algorithms for Android malware, namely Linear SVM, Naïve Bayes, kNN, Decision Tree, Boosted Tree, Extra Trees, Random Forest, XGBoost, and Stacking. In tests on two different datasets, the Stacking algorithm outperformed the other eight with an F1 score of 95%.
Regarding malware detection on the Windows operating system, [9] used the Random Forest algorithm and achieved an accuracy score of 95.26%. [10], in their paper, also explored the use of machine learning for malware detection in the Windows OS environment. The highest accuracy score was again achieved by the Random Forest algorithm, with scores of 96.8% for binary classification and 95.69% for multiclass classification, while the highest F1 score, 92%, was achieved by Hard Voting. The algorithms compared included kNN, SVM, Bernoulli Naïve Bayes, Random Forest, Hard Voting, Logistic Regression, and Decision Tree.
In another study, Windows-based malware detection was also pursued by [5], focusing on the use of ensemble methods to classify a file as either goodware or malware, especially ransomware. They used five classifiers in their experiments: SVM, Random Forest, kNN, XGBoost, and Neural Network. Among these five algorithms, the Neural Network boasted the highest accuracy score of 98%. Subsequently, the researchers applied these five algorithms combined with an ensemble-based bagging method. Voting was used to determine the classification results. As a result, they achieved an accuracy score of 98.7%, which is better than the standalone Neural Network.
The wealth of literature reporting experiments with the kNN algorithm attests that kNN is a popular and competent classification algorithm. The research in [7] even demonstrated that kNN had the best performance in classifying a file as either goodware or malware. Hence, this study focuses on enhancing the performance of the kNN algorithm. Preceded by pre-processing, which involves class balancing, constant feature removal, feature scaling, and feature selection, the performance of the kNN algorithm is tested, and a comparison of distance measurement methods is conducted to obtain the best-performing kNN.
2. RESEARCH METHODOLOGY
This study follows a series of steps conducted sequentially, as depicted in Figure 1. The initial phase involves the collection and preparation of the dataset. Given that the collected data may have certain issues, it needs to be prepared prior to being forwarded to the pre-processing stage; a comprehensive discussion of this topic is available in Subchapter 2.1. Once the dataset is collected and prepared, it undergoes pre-processing. This vital step in data analysis involves cleaning and transforming raw data to ensure it is in the best format for further analysis or model building, and here includes class balancing, removal of constant features, feature scaling, and feature selection. Subchapter 2.2 provides a detailed explanation of this stage. After pre-processing, the dataset is ready for modeling, at which point the kNN algorithm is implemented. To find the best performance of kNN, its hyperparameters are tuned, as discussed in Subchapter 2.3. Before the performance of the kNN algorithm can be evaluated, validation is necessary; to mitigate the risk of overfitting, 10-fold cross-validation is utilized. Discussions about validation and evaluation are available in Subchapters 2.4 and 2.5, respectively.
2.1 Dataset Collection & Preparation
In this first stage of research, a zip file was downloaded from the UCI Machine Learning Repository website, with the dataset name: 'Malware static and dynamic features VxHeaven and Virus Total Data Set'. The detailed contents of this file can be seen in Table 1.
Table 1. Details of the dataset used in this research

Dataset Name        Malware static and dynamic features VxHeaven and Virus Total Data Set
Number of Files     3 (goodware files, malware from VirusTotal, and malware from VxHeaven)
Number of Rows      Goodware: 595; VirusTotal: 2955; VxHeaven: 2698
Number of Features  Goodware: 1085; VirusTotal: 1087; VxHeaven: 1087 (without label)
Missing Values      None
Inside this zip file, there are three CSV files: one containing goodware data and the other two containing malware data from VirusTotal and VxHeaven, respectively. In the goodware dataset, there are 595 data entries
that describe the behavior of benign files (non-malware). The total number of features successfully extracted is 1,085. At this stage, a label column was added and populated with the value '0', indicating that all data in this file fall under the goodware category.
Figure 1. Research Stages
Meanwhile, there are two *.csv files containing malware activities, specifically from VirusTotal and VxHeaven. The number of data entries in the VirusTotal malware dataset is 2,955, while the VxHeaven malware dataset contains 2,698 entries. Both have the same number of features, amounting to 1,087 features. At this stage, a label column was added to both of these dataset files, and populated with the value '1', indicating that all data in these two files fall under the malware category.
The three datasets could not be merged initially due to differing feature counts. To consolidate them, some features had to be removed. A feature named 'filename' from the goodware dataset was discarded since it was not deemed significant for classification activities. Conversely, in the two malware dataset files, three features needed to be removed: 'vbaVarIndexLoad', 'SafeArrayPtrOfIndex', and 'filename'. The first two mentioned features were not present in the goodware dataset; hence they were eliminated to standardize the feature count across all datasets.
Now, all three datasets have the same number of features, specifically 1,084 features and one label column.
Consequently, they can be merged to form a single dataset with a total of 6,248 entries.
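As an illustration, this preparation step can be reproduced with pandas. The following is a minimal sketch: the CSV file names are hypothetical (the actual names inside the UCI zip may differ), and only the column operations named in the text are performed.

```python
import pandas as pd

# Hypothetical file names; the actual CSVs in the UCI zip may be named differently.
goodware = pd.read_csv("goodware.csv")
virustotal = pd.read_csv("malware_virustotal.csv")
vxheaven = pd.read_csv("malware_vxheaven.csv")

# Labels as described in the text: 0 = goodware, 1 = malware.
goodware["label"] = 0
virustotal["label"] = 1
vxheaven["label"] = 1

# Align the feature counts before merging (columns named in the paper).
goodware = goodware.drop(columns=["filename"])
virustotal = virustotal.drop(columns=["vbaVarIndexLoad", "SafeArrayPtrOfIndex", "filename"])
vxheaven = vxheaven.drop(columns=["vbaVarIndexLoad", "SafeArrayPtrOfIndex", "filename"])

# Merge into a single dataset: 6,248 rows, 1,084 features plus one label column.
dataset = pd.concat([goodware, virustotal, vxheaven], ignore_index=True)
```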
2.2 Pre-Processing
In the pre-processing phase, there are four key tasks: class balancing, constant feature removal, feature scaling, and feature selection. Class balancing is a crucial step in preparing data for machine learning models. It ensures that each class within the dataset is represented equally, preventing the model from being biased towards the majority class. Techniques such as oversampling, undersampling, or the Synthetic Minority Over-sampling Technique (SMOTE) can be used to achieve class balance. Without class balancing, models may perform poorly on minority classes, leading to inaccurate predictions. Therefore, class balancing is an essential step in building robust and accurate machine learning models.
In this dataset, there are 5,653 entries for the malware class, while the goodware class has 595. The Imbalanced Ratio (IR) of these classes is 1:9.5. According to [11], an IR > 9 can be categorized as Medium Imbalance and therefore needs to be balanced. This disparity in data volume between the two classes can lead to bias, where the majority class tends to dominate [12], [13]. Therefore, it's essential to balance these two classes.
The class balancing method adopted in this study is Random Under Sampling, where data from the majority class (labeled as malware) is randomly eliminated until it matches the number of the minority class (labeled as goodware), which is 595 [14].
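A minimal sketch of this balancing step, assuming the merged dataset from Subchapter 2.1 is held in a pandas DataFrame named dataset, could use the RandomUnderSampler class from the imbalanced-learn library:

```python
from imblearn.under_sampling import RandomUnderSampler

X = dataset.drop(columns=["label"])
y = dataset["label"]

# Randomly discard malware (majority-class) rows until both classes
# contain 595 samples each; random_state is an arbitrary seed.
rus = RandomUnderSampler(random_state=42)
X_balanced, y_balanced = rus.fit_resample(X, y)
```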
The next step in pre-processing is constant feature removal, which eliminates features that have constant values, i.e., a variance of zero. Constant features take the same value in all samples and therefore provide no variance or information to the model [15]. Removing them reduces the dimensionality of the dataset, making the model more efficient and easier to interpret, so constant feature removal is a key step in optimizing machine learning models for better performance and interpretability. In this dataset, there are 936 features with constant values. As these features do not impact the classification outcome, they are removed, leaving 148 features for further processing.
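One way to perform this step, sketched here with scikit-learn's VarianceThreshold on the balanced data from the previous step, is:

```python
from sklearn.feature_selection import VarianceThreshold

# A threshold of 0.0 removes only zero-variance (constant) features,
# dropping the 936 constant columns and keeping the remaining 148.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_balanced)
```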
The third stage in the pre-processing phase is feature scaling, which standardizes the range of values across all features [16]. The method used in this study is MinMax Normalization with a lower bound of 0 and an upper bound of 1. MinMax normalization is a data pre-processing technique that scales numerical data [17], [18]: it transforms the values of a feature to a 0–1 scale while preserving the original distribution and the relationships between values. This is particularly useful for algorithms that are sensitive to the scale of the input features, such as k-Nearest Neighbors (kNN) and neural networks, because it ensures that all features contribute equally to the model regardless of their original scale. The formula for MinMax Normalization is given in Equation 1.
$v_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} (v_{\max} - v_{\min}) + v_{\min}$ (1)
In Equation 1 above, $v_i$ refers to the newly formed value of the i-th data point, and $x_i$ denotes the value to be normalized. $x_{\min}$ represents the smallest value of a feature, while $x_{\max}$ is its largest value. Furthermore, $v_{\max}$ is the new upper bound for the feature's values (which is 1), and $v_{\min}$ is the new lower bound (which is 0). Figure 2 presents a comparison of the dataset before and after normalization; all values are transformed to fall within the 0 to 1 range.
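In code, Equation 1 with $v_{\min} = 0$ and $v_{\max} = 1$ corresponds to scikit-learn's MinMaxScaler; a brief sketch, continuing from the reduced feature matrix above:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to [0, 1] (Equation 1 with v_min = 0, v_max = 1).
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_reduced)
```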
The fourth and final step in the pre-processing phase is feature selection. The remaining 148 features are further reduced to simplify computation without diminishing the Accuracy and F1-Score performance in classification. The method employed is Information Gain, one of the most commonly used and popular feature selection methods [19]. Information gain indicates which features are most useful when training a machine learning model such as kNN: it is a score that tells us how much a feature can help improve our predictions. The higher the information gain, the more helpful the feature, making it a handy tool when deciding which features to include in the model. The best features are determined by their entropy value (as shown in Equation 2).
$\text{Entropy}(S) = \sum_{i=1}^{c} -P_i \log_2 P_i$ (2)
After determining the Entropy values, the Information Gain score can be calculated using Equation 3.
$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$ (3)
The total number of features used is 32. This number was derived from previous research, in which 32 was found to be the optimal number of features when using the Principal Component Analysis method [20]. The top 32 features with the highest information gain scores are listed in Table 2. The performance of the kNN algorithm will also be compared between implementations with and without feature selection.
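As a sketch of this selection step: scikit-learn has no estimator literally named "information gain", but mutual_info_classif is its closest built-in analogue to the entropy-based score of Equations 2 and 3, so the 32-feature subset could be obtained roughly as follows:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the 32 features with the highest information-gain-style scores.
ig_selector = SelectKBest(score_func=mutual_info_classif, k=32)
X_selected = ig_selector.fit_transform(X_scaled, y_balanced)
```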
Figure 2. Before normalization (left) and after normalization (right)
2.3 Modelling
The Modelling phase involves the application of a classification algorithm to the prepared dataset. The algorithm employed in this research is k-Nearest Neighbor (kNN), a popular classification algorithm due to its straightforward operation and its ability to produce good performance [17]. kNN works by classifying new data points based on their similarity to known data points in the training set; the ‘k’ refers to the number of nearest neighbors the algorithm considers when making its prediction. Unlike many other machine learning algorithms, kNN does not build a model but uses the entire dataset in the prediction phase, which makes it a type of instance-based learning. Despite its simplicity, kNN can be remarkably effective and is particularly useful for classification and regression problems. At least two aspects must be considered when using the kNN algorithm: the value of k and the method of distance measurement. In this study, the performance of kNN is compared across four k values (k = 3, 5, 7, and 9) and three distance measurement methods (Euclidean, Manhattan, and Chebyshev).
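A minimal sketch of this model grid, assuming scikit-learn's KNeighborsClassifier (whose metric parameter accepts "euclidean", "manhattan", and "chebyshev"):

```python
from sklearn.neighbors import KNeighborsClassifier

# One classifier per (k, distance metric) combination tested in this study.
models = {
    (k, metric): KNeighborsClassifier(n_neighbors=k, metric=metric)
    for k in (3, 5, 7, 9)
    for metric in ("euclidean", "manhattan", "chebyshev")
}
```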
The Euclidean distance measures the straight-line distance between two points, like a bird flying directly from point A to point B. Its simplicity and intuitiveness make it widely applicable: in machine learning it helps identify the most similar data points, for instance in recommendation systems that find the most similar users or items based on their attributes, and in image processing and computer vision for tasks such as object detection and pattern recognition. In this research, we applied the Euclidean distance as one of the distance metrics in the kNN algorithm and compared it with other metrics. The formula for Euclidean distance is shown in Equation 4 [21].
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (4)
In this formula, d(x,y) represents the Euclidean distance between two points x and y, each having n dimensions. This formula is essentially the Pythagorean theorem.
The Manhattan distance calculates the total absolute difference between the coordinates of two points in a multi-dimensional space. It is named after the grid-like street layout of Manhattan, where the distance between two points is the sum of the vertical and horizontal distances. The Manhattan distance has several advantages: it is less sensitive to outliers than the Euclidean distance and has been found to work better with high-dimensional data. In machine learning, it is used in algorithms such as k-nearest neighbors and k-medoids to find the most similar data points based on their attributes. The formula for Manhattan distance is shown in Equation 5 [21].
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$ (5)
In this formula, d(x, y) represents the Manhattan distance between two points x and y, each having n dimensions. For each dimension i, the absolute difference between the corresponding coordinates of the two points, xi and yi, is calculated: |xi - yi|.
The Chebyshev distance is a metric that calculates the maximum absolute difference between the coordinates of two points in a multi-dimensional space. It is named after Pafnuty Chebyshev, a Russian mathematician. In machine learning, it is used in algorithms such as k-nearest neighbors and k-medoids. The formula for Chebyshev distance is shown in Equation 6 [22]:
$d(x, y) = \max_{i=1,\dots,n} |x_i - y_i|$ (6)
In this formula, xi and yi are the coordinates of the two points in the i-th dimension. The absolute differences between the corresponding coordinates are calculated, and the maximum of these differences is the Chebyshev distance.
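The three metrics of Equations 4–6 are easy to express directly; the following NumPy sketch (with made-up sample points) illustrates how they differ on the same pair of vectors:

```python
import numpy as np

def euclidean(x, y):
    # Straight-line distance (Equation 4).
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Sum of absolute coordinate differences (Equation 5).
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # Largest absolute coordinate difference (Equation 6).
    return np.max(np.abs(x - y))

x = np.array([1.0, 0.2, 0.5])
y = np.array([0.0, 0.9, 0.1])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
```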
Table 2. List of 32 features used in this research

No.  Feature                          IG Score
1.   Minor_image_version              0.761
2.   Minor_operating_system_version   0.716
3.   Major_operating_system_version   0.639
4.   Size_of_stack_reverse            0.626
5.   Compile_date                     0.593
6.   Minor_linker_version             0.566
7.   Major_image_version              0.555
8.   Major_subsystem_version          0.523
9.   Dll_characteristics              0.485
10.  Minor_subsystem_version          0.477
11.  CheckSum                         0.408
12.  Major_linker_version             0.395
13.  Characteristics                  0.389
14.  Number_of_IAT_entires            0.257
15.  Number_of_IAT_entires.1          0.257
16.  Pushf                            0.243
17.  Size_of_stack_commit             0.239
18.  Files_operations                 0.221
19.  .text:                           0.205
20.  Count_dll_loaded                 0.202
21.  SizeOfHeaders                    0.195
22.  Size_of_headers                  0.195
23.  SizeOfHeaders.1                  0.195
24.  Number_of_sections.1             0.194
25.  Not                              0.192
26.  Count_file_opened                0.180
27.  Bt                               0.131
28.  Nop                              0.129
29.  Count_file_read                  0.119
30.  Number_of_imports.1              0.113
31.  Int                              0.110
32.  Rol                              0.109
2.4 Validation
Validation is the stage where the dataset is divided into two parts, the training set and the testing set, and is a prerequisite before moving on to the evaluation stage. There are two popular validation methods: split validation and cross-validation. In this research, cross-validation with k = 10 (10-fold cross-validation) is used; here k refers to the number of folds rather than the kNN parameter. The dataset is divided into 10 groups of equal size, and testing is conducted over 10 iterations, with each group serving nine times as training data and once as testing data.
The cross-validation method is considered superior to split validation as it can minimize the occurrence of overfitting [23], [24].
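A sketch of this validation step with scikit-learn, reusing the selected feature matrix and labels from the pre-processing sketches above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 10-fold cross-validation: each fold serves once as the test set and
# nine times as part of the training set; the mean score is reported.
knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
scores = cross_val_score(knn, X_selected, y_balanced, cv=10, scoring="accuracy")
print(scores.mean())
```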
2.5 Evaluation
To measure the performance of a classification algorithm in malware detection cases, two main metrics will be used: accuracy and F1 Score [25], [26]. Accuracy serves to measure how well a model makes correct predictions. In other words, it’s the proportion of true results (both true positives and true negatives) among the total number of cases examined. On the other hand, the F1-score is another important measure used in machine learning, especially for tasks where both precision (how many selected items are relevant) and recall (how many relevant items are selected) are important. The F1-score is the harmonic mean of precision and recall, providing a balance between the two, where both false positives and false negatives have equally significant impacts.
$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$ (8)
To measure accuracy, the formula used is as depicted in Equation 8: the total correct predictions of a model divided by the total number of predictions, both correct and incorrect. TP (True Positive) represents the number of times a model correctly predicts the positive class, while TN (True Negative) is the number of times a model correctly predicts the negative class. Meanwhile, FP (False Positive) represents the number of times a model incorrectly predicts the positive class, and FN (False Negative) refers to the number of times a model incorrectly predicts the negative class. To obtain the F1 Score (Equation 9), it is necessary to first compute the Precision (Equation 10) and Recall (Equation 11) metrics.
$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (9)

$\text{Precision} = \frac{TP}{TP + FP}$ (10)

$\text{Recall} = \frac{TP}{TP + FN}$ (11)
Precision is defined as the ratio of true positive predictions (TP) compared to the total predictions that are positive (TP and FP). Meanwhile, Recall is defined as the ratio of true positive predictions (TP) compared to the total actual positive data (TP and FN).
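For illustration, Equations 8–11 map directly onto scikit-learn's metric functions; the labels below are toy values, not results from this study:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy labels: 1 = malware, 0 = goodware.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total      (Eq. 8)
print(precision_score(y_true, y_pred))  # TP / (TP + FP)         (Eq. 10)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)         (Eq. 11)
print(f1_score(y_true, y_pred))         # harmonic mean of both  (Eq. 9)
```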
3. RESULT AND DISCUSSION
The experiments in this study were conducted 24 times, with variations based on the inclusion or exclusion of feature selection, the value of k, and the distance measurement method. In determining the value of k, it is recommended to use odd numbers or prime numbers greater than 2 [27]. Therefore, the k values used in this experiment were 3, 5, 7, and 9, while the distance measurement methods employed were Euclidean, Manhattan, and Chebyshev. Each variation was executed on datasets with and without feature selection.
The feature selection method used was information gain, resulting in 32 features after reduction.
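A compact sketch of the full 24-run grid (4 k values × 3 metrics × 2 feature sets), reusing the matrices from the pre-processing sketches above; the authors' exact experimental code is not published, so this is only an approximation:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

results = {}
for feat_name, X_variant in (("all_148", X_scaled), ("ig_32", X_selected)):
    for k in (3, 5, 7, 9):
        for metric in ("euclidean", "manhattan", "chebyshev"):
            knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
            acc = cross_val_score(knn, X_variant, y_balanced,
                                  cv=10, scoring="accuracy").mean()
            f1 = cross_val_score(knn, X_variant, y_balanced,
                                 cv=10, scoring="f1").mean()
            results[(feat_name, k, metric)] = (acc, f1)
```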
Table 3. Performance of k-NN with k = 3

Distance Measurement   Without Feature Selection    With Feature Selection
Method                 Accuracy     F1-Score        Accuracy     F1-Score
Euclidean              95.5%        95.5%           96.6%        96.6%
Manhattan              96.7%        96.7%           97.0%        97.0%
Chebyshev              93.8%        93.8%           95.5%        95.5%
The results of the experiments conducted are presented in Tables 3 to 6. In Table 3, the experiment was based on a k value of 3. In this experiment, the highest value was obtained using feature selection and Manhattan as the distance measurement method, which was 97.0%, both for accuracy performance and F1-Score. Notably, the use of the kNN algorithm with information gain (32 features) as a feature selection method outperformed all experiments that used all features (without feature selection).
Table 4. Performance of k-NN with k = 5

Distance Measurement   Without Feature Selection    With Feature Selection
Method                 Accuracy     F1-Score        Accuracy     F1-Score
Euclidean              95.5%        95.5%           96.1%        96.1%
Manhattan              96.6%        96.6%           96.6%        96.6%
Chebyshev              92.9%        92.9%           95.4%        95.4%
In the subsequent experiment with k = 5 (refer to Table 4), the highest score was achieved by kNN using the Manhattan distance measurement method, at 96.6% for both accuracy and F1-Score. In this case, the use of feature selection did not influence the algorithm's performance; the score is nevertheless still superior to those of the other distance measurement methods, with or without feature selection.
Table 5. Performance of k-NN with k = 7

Distance Measurement   Without Feature Selection    With Feature Selection
Method                 Accuracy     F1-Score        Accuracy     F1-Score
Euclidean              95.5%        95.3%           96.6%        96.6%
Manhattan              96.6%        96.6%           96.8%        96.8%
Chebyshev              92.9%        92.9%           95.1%        95.1%
Table 6. Performance of k-NN with k = 9

Distance Measurement   Without Feature Selection    With Feature Selection
Method                 Accuracy     F1-Score        Accuracy     F1-Score
Euclidean              95.5%        95.5%           96.2%        96.2%
Manhattan              96.0%        96.0%           96.9%        96.9%
Chebyshev              92.7%        92.7%           95.1%        95.1%
In the experiment with k = 7 (Table 5), kNN with the Manhattan distance measurement method and feature selection obtained the highest score compared to other combinations, which was 96.8%. This score is also superior to the use of Manhattan without feature selection. Meanwhile, in the last experiment, for k = 9, the highest score was once again achieved by the kNN algorithm with a combination of Manhattan and feature selection. The score obtained was 96.9%, better than any other combination, including Manhattan without feature selection.
From the tests conducted, it can be concluded that the highest score, 97.0%, was achieved when kNN used k = 3, feature selection with information gain (32 features), and the Manhattan distance measurement method. Another conclusion is that, across all tests, the Manhattan method always produced a better classification performance score than the other two methods, Euclidean and Chebyshev, whether tested with or without feature selection. The lowest scores were always obtained by the Chebyshev method, for every value of k and with or without feature selection.
The graph shown in Figure 3 clearly illustrates that the performance produced by the Euclidean (blue) and Manhattan (yellow) methods surpasses that of Chebyshev (green). The graph also shows that the use of feature selection mostly yields a significant performance increase (light colour = without feature selection, dark colour = with feature selection). The WoFS label in the distance measurement method names stands for Without Feature Selection. The highest score was achieved by kNN with k = 3, the Manhattan distance measurement method, and feature selection (dark yellow).
Thus, for malware detection cases, it is recommended to use the kNN method with a value of k = 3, Manhattan as the distance measurement method, and Information Gain as the feature selection method.
Figure 3. Performance graph of the kNN algorithm with various k values, distance measurement methods, and use of feature selection
4. CONCLUSION
Malware possesses rapid evolution and dissemination capabilities. Given these abilities, it is unsurprising that malware is a menace to anyone using a computer, whether offline or online. The evolution of malware from traditional to modern forms has rendered signature-based detection ineffective, and machine learning-based detection has become the experts' choice for tackling modern malware. In this study, the algorithm used for machine learning-based malware detection is k-Nearest Neighbor. Various experimental scenarios were analyzed, covering the inclusion or exclusion of feature selection, the choice of the value of k, and the distance measurement method employed. The k values tested were 3, 5, 7, and 9; the distance measurement methods were Euclidean, Manhattan, and Chebyshev; and each combination was tested on datasets with and without feature selection, for a total of 24 experiments. The results showed that the highest performance score was obtained in the scenario with k = 3, the Manhattan distance measurement method, and feature selection (information gain, 32 features). The highest value achieved was 97.0%, for both accuracy and F1-Score. Therefore, the recommendation for using the kNN algorithm for malware detection is to use k = 3, the Manhattan method for distance measurement, and feature selection using information gain.
ACKNOWLEDGMENT
Special thanks are extended to the Institute of Research and Community Service (LPPM) and the Faculty of Computer Science, Universitas Dian Nuswantoro for their support in facilities and funding for this research.
REFERENCES
[1] N. A. Azeez, O. E. Odufuwa, S. Misra, J. Oluranti, and R. Damaševičius, “Windows PE Malware Detection Using Ensemble Learning,” Informatics, vol. 8, no. 1, p. 10, Feb. 2021, doi: 10.3390/informatics8010010.
[2] O. Aslan and R. Samet, “A Comprehensive Review on Malware Detection Approaches,” IEEE Access, vol. 8, pp. 6249–6271, 2020, doi: 10.1109/ACCESS.2019.2963724.
[3] A. Kamboj, P. Kumar, A. K. Bairwa, and S. Joshi, “Detection of malware in downloaded files using various machine learning models,” Egyptian Informatics Journal, vol. 24, no. 1, pp. 81–94, Mar. 2023, doi: 10.1016/j.eij.2022.12.002.
[4] A. Sharma and S. K. Sahay, “Evolution and Detection of Polymorphic and Metamorphic Malwares: A Survey,” IJCA, vol. 90, no. 2, pp. 7–11, Mar. 2014, doi: 10.5120/15544-4098.
[5] S. Aurangzeb, H. Anwar, M. A. Naeem, and M. Aleem, “BigRC-EML: big-data based ransomware classification using ensemble machine learning,” Cluster Comput, vol. 25, no. 5, pp. 3405–3422, Oct. 2022, doi: 10.1007/s10586-022-03569-4.
[6] F. A. Rafrastara and F. M. A., “Advanced Virus Monitoring and Analysis System,” IJCSIS, vol. 9, no. 1, 2011.
[7] S. Shakya and M. Dave, “Analysis, Detection, and Classification of Android Malware using System Calls,” 2022. [Online]. Available: https://arxiv.org/pdf/2208.06130.pdf
[8] P. Feng, J. Ma, C. Sun, X. Xu, and Y. Ma, “A Novel Dynamic Android Malware Detection System With Ensemble Learning,” IEEE Access, vol. 6, pp. 30996–31011, 2018, doi: 10.1109/ACCESS.2018.2844349.
[9] F. C. C. Garcia and F. P. M. Ii, “Random Forest for Malware Classification”.
[10] I. Shhadat, B. Bataineh, A. Hayajneh, and Z. A. Al-Sharif, “The Use of Machine Learning Techniques to Advance the Detection and Classification of Unknown Malware,” Procedia Computer Science, vol. 170, pp. 917–922, 2020, doi: 10.1016/j.procs.2020.03.110.
[11] Q. Fan, Z. Wang, D. Li, D. Gao, and H. Zha, “Entropy-based fuzzy support vector machine for imbalanced datasets,” Knowledge-Based Systems, vol. 115, pp. 87–99, Jan. 2017, doi: 10.1016/j.knosys.2016.09.032.
[12] J. Hong, H. Kang, and T. Hong, “Oversampling-based prediction of environmental complaints related to construction projects with imbalanced empirical-data learning,” Renewable and Sustainable Energy Reviews, vol. 134, p. 110402, Dec. 2020, doi: 10.1016/j.rser.2020.110402.
[13] W. Chandra, B. Suprihatin, and Y. Resti, “Median-KNN Regressor-SMOTE-Tomek Links for Handling Missing and Imbalanced Data in Air Quality Prediction,” Symmetry, vol. 15, no. 4, p. 887, Apr. 2023, doi: 10.3390/sym15040887.
[14] F. A. Rafrastara, C. Supriyanto, C. Paramita, Y. P. Astuti, and F. Ahmed, “Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method,” JPIT, vol. 8, no. 2, pp. 113–118, 2023.
[15] Y. Prihantono and K. Ramli, “Model-Based Feature Selection for Developing Network Attack Detection and Alerting System,” J. RESTI (Rekayasa Sist. Teknol. Inf.), vol. 6, no. 2, pp. 322–329, Apr. 2022, doi: 10.29207/resti.v6i2.3989.
[16] A. Mihoub, S. Zidi, and L. Laouamer, “Investigating Best Approaches for Activity Classification in a Fully Instrumented Smarthome Environment,” IJMLC, vol. 10, no. 2, pp. 299–308, Feb. 2020, doi: 10.18178/ijmlc.2020.10.2.935.
[17] A. Pandey and A. Jain, “Comparative Analysis of KNN Algorithm using Various Normalization Techniques,” IJCNIS, vol. 9, no. 11, pp. 36–42, Nov. 2017, doi: 10.5815/ijcnis.2017.11.04.
[18] D. Singh and B. Singh, “Investigating the impact of data normalization on classification performance,” Applied Soft Computing, vol. 97, p. 105524, Dec. 2020, doi: 10.1016/j.asoc.2019.105524.
[19] Kurniabudi, D. Stiawan, Darmawijoyo, M. Y. Bin Idris, A. M. Bamhdi, and R. Budiarto, “CICIDS-2017 Dataset Feature Analysis With Information Gain for Anomaly Detection,” IEEE Access, vol. 8, pp. 132911–132921, 2020, doi: 10.1109/ACCESS.2020.3009843.
[20] F. A. Rafrastara, R. A. Pramunendar, D. P. Prabowo, E. Kartikadarma, and U. Sudibyo, “Optimasi Algoritma Random Forest menggunakan Principal Component Analysis untuk Deteksi Malware,” JTEKSIS, vol. 5, no. 3, pp. 217–223, Jul. 2023, doi: 10.47233/jteksis.v5i3.854.
[21] D. M. Saputra, D. Saputra, and L. D. Oswari, “Effect of Distance Metrics in Determining K-Value in K-Means Clustering Using Elbow and Silhouette Method,” in Proceedings of the Sriwijaya International Conference on Information Technology and Its Applications (SICONIAN 2019), Palembang, Indonesia: Atlantis Press, 2020, doi: 10.2991/aisr.k.200424.051.
[22] O. A. Mohamed Jafar and R. Sivakumar, “Distance Based Hybrid Approach for Cluster Analysis Using Variants of K-means and Evolutionary Algorithm,” RJASET, vol. 11, no. 8, pp. 1355–1362, Sep. 2014, doi: 10.19026/rjaset.8.1107.
[23] G. Orrù, M. Monaro, C. Conversano, A. Gemignani, and G. Sartori, “Machine Learning in Psychometrics and Psychological Research,” Front. Psychol., vol. 10, p. 2970, Jan. 2020, doi: 10.3389/fpsyg.2019.02970.
[24] G. Battineni, G. G. Sagaro, C. Nalini, F. Amenta, and S. K. Tayebati, “Comparative Machine-Learning Approach: A Follow-Up Study on Type 2 Diabetes Predictions by Cross-Validation Methods,” Machines, vol. 7, no. 4, p. 74, Dec. 2019, doi: 10.3390/machines7040074.
[25] G. Gupta, A. Rai, and V. Jha, “Predicting the Bandwidth Requests in XG-PON System using Ensemble Learning,” in 2021 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Korea, Republic of: IEEE, Oct. 2021, pp. 936–941, doi: 10.1109/ICTC52510.2021.9620935.
[26] S. Dev, B. Kumar, D. C. Dobhal, and H. Singh Negi, “Performance Analysis and Prediction of Diabetes using Various Machine Learning Algorithms,” in 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India: IEEE, Dec. 2022, pp. 517–521, doi: 10.1109/ICAC3N56670.2022.10074117.
[27] D. L. De Vargas, J. T. Oliva, M. Teixeira, D. Casanova, and J. L. G. Rosa, “Feature extraction and selection from electroencephalogram signals for epileptic seizure diagnosis,” Neural Comput & Applic, vol. 35, no. 16, pp. 12195–12219, Jun. 2023, doi: 10.1007/s00521-023-08350-1.