Comparison of Support Vector Machine and Random Forest Method on Static Analysis Windows Portable Executable (PE) Malware
Detection
Hazim Ismail1, Rio Guntur Utomo2,*, Marastika Wicaksono Aji Bawono2
1Informatics Faculty, Informatics, Telkom University, Bandung, Indonesia
2Informatics Faculty, Information Technology, Telkom University, Bandung, Indonesia
Email: 1[email protected], 2,*[email protected]
Corresponding Author Email: [email protected]
Abstract−Malware has emerged as a significant concern for computer system security, as it spreads rapidly and adversely affects system performance. Detecting malware has become crucial, and one of the methods utilized is Machine Learning classification, which learns the characteristics of an application without executing it. In this study, the author evaluates the efficacy of malware detection in the static analysis of Windows Portable Executable (PE) files using the Support Vector Machine (SVM) and Random Forest algorithms. The author employs a dataset containing both malware-related PE files and safe applications to train the SVM and Random Forest models to classify PE files as either malware or safe. The objective is to determine the most effective machine learning algorithm for malware detection in PE files. The research compares the performance of both algorithms to identify the superior one for malware detection. The results indicate that the Random Forest algorithm achieves an impressive accuracy of 98.53%, while the SVM algorithm performs slightly lower with an accuracy of 97.14%.
Keywords: Malware Detection; Support Vector Machine; Random Forest; Machine Learning; Windows Portable Executable
1. INTRODUCTION
Malware is a malicious program designed to perform destructive functions. It can be used in several harmful ways, such as destroying resources on a computer, obtaining financial gain, stealing private or confidential data, and consuming computing resources to make services unavailable. Consequently, it is crucial to find effective measures for thwarting malware attacks. One approach is to identify and intercept malware files before they infiltrate users' devices [1]. As indicated by source [2], a considerable number of malware attacks utilize files in the Portable Executable (PE) format, mainly because the majority of executable files on Windows are in this format.
There are two main categories of malware detection methods: static malware detection and dynamic malware detection [3]. Static malware detection involves classifying samples as either malicious or benign without executing them. In contrast, dynamic malware detection identifies malware based on its behavior when it is executed [4]. While static malware detection does not run the malware itself, it proves more effective in analyzing large-scale data compared to dynamic malware detection methods [5], [6], [7].
In the process of static analysis, analysts can retrieve essential data from the PE file, including entry points, function names, library dependencies, and embedded resources. This valuable information aids in comprehending the malware's operations, recognizing behaviors or patterns linked to specific malware families or versions, and devising efficient strategies for detection and countermeasures.
In a study conducted by Ijaz [8] in 2019, the accuracy of static analysis for PE malware detection was compared across various methods, including Random Forest, Decision Tree, Linear Ridge Classifier, Bagging Classifier, AdaBoost Classifier, Tree Classifier, and Gradient Classifier. The research utilized a total of 92 datasets, and the results revealed that Random Forest achieved an accuracy of 97.36%. Similarly, Balram [9] conducted a static-analysis study of PE malware detection in 2019, using the SVM method on 1416 datasets; the findings indicated an accuracy of 92.9% for this approach. In a study led by Akhtar [10], SVM was utilized for static malware detection, employing a dataset consisting of 17,394 samples, and the results revealed an accuracy rate of 98%.
Given the high accuracy achieved in previous research using SVM and Random Forest algorithms for various tasks, including malware detection, it is crucial to highlight the absence of a direct comparison between these two methods, particularly for static analysis in Windows PE malware detection. To address this research gap, the author of this study aims to conduct a thorough investigation of Windows PE malware detection through static analysis using both Support Vector Machine and Random Forest methods. The primary goal is to ascertain which approach offers higher accuracy and proves more effective in detecting Windows PE malware using the same dataset.
The main goal of this research is to thoroughly assess the advantages and drawbacks associated with two widely used methodologies, the Support Vector Machine (SVM) and Random Forest algorithms, in the context of Windows PE malware detection. To accomplish this objective, the study employs both algorithms on a consistent dataset, enabling an impartial and direct evaluation of their performance metrics. Through methodical testing and careful analysis, the investigation seeks to unveil the unique strengths and weaknesses of these algorithms, providing insights into their effectiveness in addressing the particular complexities presented by Windows PE malware.
The results of this research are expected to make a significant contribution to the field of malware detection.
By offering valuable insights into the performance and suitability of the Support Vector Machine and Random Forest methods, particularly in the context of static analysis for Windows PE malware, this study will assist in identifying the most suitable algorithm for this specific task. Ultimately, these insights will lead to improved accuracy and effectiveness of Windows PE malware detection systems.
2. RESEARCH METHODOLOGY
Support Vector Machine and Random Forest are the algorithms used to develop the malware detection models. The following describes how each algorithm is incorporated into the methodology to achieve the aim of this research.
2.1 Method for achieving the research objectives
Figure 1. Methodology to achieve the objectives
The following is an explanation of each stage.
1. Dataset
The researchers used the PE dataset from the online repository Elastic Malware Benchmark for Empowering Researchers (EMBER) [11], which was collected in 2018. Prior to model construction, the dataset underwent preprocessing to extract relevant features. It consists of 50,000 records, and the following variables were chosen for analysis: numstrings, avlength, printables, entropy, paths, urls, registry, MZ, size, vsize, has_debug, exports_counts, imports_counts, has_relocations, has_resources, has_signature, has_tls, symbols, coff.timestamp, optional.major_image_version, optional.minor_image_version, optional.major_linker_version, optional.minor_linker_version, optional.major_operating_system_version, optional.minor_operating_system_version, optional.major_subsystem_version, optional.minor_subsystem_version, optional.sizeof_code, and optional.sizeof_headers.
Figure 2. Dataset Variables
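The tabular form of these features can be sketched as follows. This is a minimal illustration with two synthetic records and a subset of the columns listed above; the values are invented stand-ins, not actual EMBER measurements, and real feature extraction would use the EMBER tooling over the full 50,000 files.

```python
import pandas as pd

# Hypothetical records standing in for the EMBER-derived feature table.
# Column names follow the paper's variable list; values are illustrative only.
records = [
    {"numstrings": 1200, "avlength": 6.1, "printables": 8000, "entropy": 6.8,
     "paths": 3, "urls": 1, "registry": 0, "MZ": 1, "size": 512000,
     "vsize": 540672, "has_debug": 0, "has_signature": 0, "label": 1},
    {"numstrings": 4300, "avlength": 5.2, "printables": 23000, "entropy": 5.9,
     "paths": 12, "urls": 4, "registry": 2, "MZ": 1, "size": 1048576,
     "vsize": 1101824, "has_debug": 1, "has_signature": 1, "label": 0},
]
df = pd.DataFrame(records)  # label: 1 = malware, 0 = benign
print(df.shape)  # (2, 13)
```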
2. Preprocessing
In this phase, the researchers undertake a data cleaning process to optimize the performance of the constructed model. This involves several essential steps, including the removal of duplicate data, handling outliers, and eliminating unnecessary features [12], [13]. The initial step focuses on identifying and eliminating duplicate entries to prevent bias and ensure each data point contributes uniquely to the model's learning. Next, outliers, which are data points significantly deviating from the dataset's normal distribution, are addressed. Depending on the data's nature and the requirements of the specific problem, outliers can be removed, transformed, or replaced with appropriate values. Lastly, irrelevant features that do not significantly contribute to the model's performance are removed. These irrelevant features can introduce noise and complexity, potentially leading to overfitting. By streamlining the model's learning process and enhancing interpretability, computational requirements are reduced. These data cleaning steps ensure that the model is trained on high-quality, reliable data, devoid of redundancies, outliers, and unnecessary noise, laying the foundation for the development of a high-performing model. In this stage, an examination is also performed to identify the features that exert the greatest impact.
Figure 3. Feature importances
Based on Figure 3, it is apparent that the notable features in this dataset are: has_debug, avlength, optional.sizeof_headers, entropy, has_signature, optional.major_linker_version, optional.sizeof_code, coff.timestamp, paths, and printables. These features are highly influential in the detection process.
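The cleaning and feature-ranking steps above can be sketched as below. The data here is synthetic (a few column names borrowed from the paper's list, random values, and a label deliberately driven by entropy so that it dominates the ranking); the percentile-clipping choice for outliers is one of the handling options mentioned above, not the paper's confirmed setting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dataset; values are random illustrations.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "entropy": rng.normal(6.0, 1.0, n),
    "avlength": rng.normal(5.0, 1.0, n),
    "printables": rng.normal(9000, 2000, n),
    "has_debug": rng.integers(0, 2, n),
})
# The label depends mostly on entropy, so it should top the importances.
df["label"] = (df["entropy"] + rng.normal(0, 0.5, n) > 6.0).astype(int)

# Step 1: remove exact duplicate rows.
df = df.drop_duplicates()
# Step 2: tame outliers by clipping each column to its 1st-99th percentile.
num_cols = ["entropy", "avlength", "printables"]
df[num_cols] = df[num_cols].clip(df[num_cols].quantile(0.01),
                                 df[num_cols].quantile(0.99), axis=1)
# Step 3: rank features by importance; low-ranked ones are drop candidates.
X, y = df.drop(columns="label"), df["label"]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])
print(ranking[0][0])  # the most influential feature
```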
3. Dataset Splitting
After the data has undergone preprocessing, the next stage involves dividing the dataset into distinct subsets for the purpose of training and testing. The aim of this dataset splitting is to objectively measure and evaluate the model's performance. By segregating the data into separate subsets, the model can be trained on the training set and then tested on the testing set, which it has not encountered before. The Training Set is a subset exclusively used to train the machine learning model. It contains the majority of the data, typically around 70%
of the total dataset. During training, the model learns patterns and relationships within this subset, enabling it to comprehend and generalize these patterns for the purposes of prediction or classification. The Test Set, on the other hand, is a separate subset exclusively used to evaluate the final performance of the trained model.
The data in this subset is not employed during the training or model optimization processes. By using unseen data for evaluation, the test set assesses the model's ability to generalize and predict outcomes for new, previously unseen data. The results obtained from the testing subset indicate how effectively the model can be applied in real-world scenarios. To ensure an unbiased evaluation of the model's performance, dataset splitting should ideally be performed after preprocessing, and it should be done in a random or stratified manner. The recommended data split ratio is 70:30, where 70% of the data is used for training and 30% for testing. Research [14] has shown that this 70:30 ratio yields better performance for each algorithm compared to other ratios.
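The 70:30 split described above, performed randomly and stratified by class, can be sketched as follows. The feature matrix here is a synthetic placeholder; only the split ratio and stratification follow the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 stand-in samples with balanced labels; real features come from PE files.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 70:30 split, stratified so both classes keep their original proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 35 15
```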
4. Model For Detecting Malware
In this stage, after dividing the dataset, the process of malware detection takes place using the respective subsets of data. The training dataset plays a critical role as it serves to train the model, allowing it to grasp and comprehend the patterns and characteristics of malware [15]. By exposing the model to a diverse range of malware samples within the training dataset, it acquires the necessary knowledge to effectively identify and categorize malicious software. Following the training process, the model is then put to the test using the
dedicated testing dataset. This separate dataset, which has remained unseen during the training phase, acts as a reliable benchmark to evaluate the performance and effectiveness of the trained model. By subjecting the model to this independent evaluation, we can assess its ability to accurately detect and classify malware instances that it has not encountered before. By leveraging the training dataset to educate the model and the testing dataset to evaluate its performance, this stage ensures that the developed malware detection model has undergone comprehensive training and evaluation. This approach enhances the model's reliability, enabling it to effectively identify and combat malware in real-world scenarios.
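A minimal sketch of this train-then-test flow is shown below, with synthetic data in place of the EMBER features. The sample count, feature count, RBF kernel, and tree count here are illustrative assumptions rather than the paper's exact settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the PE feature table.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Train on the 70% split, then evaluate on the unseen 30% split.
svm = SVC(kernel="rbf").fit(X_tr, y_tr)  # RBF kernel is an assumption here
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(f"SVM accuracy: {svm.score(X_te, y_te):.3f}")
print(f"RF  accuracy: {rf.score(X_te, y_te):.3f}")
```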
5. Comparative Analysis
In this stage, the performance of the developed malware detection models is compared using effective performance metrics [16]. Evaluating and comparing these performance metrics provides valuable insights into the strengths and weaknesses of each algorithm, aiding in making informed decisions about their suitability for malware detection. Various performance metrics, such as accuracy, precision, recall, and F1 score, commonly used in classification tasks, are employed to measure the algorithms' effectiveness. These metrics offer a comprehensive understanding of how well the models classify data and identify the target outcomes.
By utilizing these performance metrics, an objective comparison of the malware detection models is conducted, enabling the determination of which algorithm exhibits superior performance in detecting malware. This comparison assists in selecting the most suitable algorithm that demonstrates optimal accuracy, precision, recall, and F1 score. Ultimately, this stage facilitates informed decisions about the effectiveness of the models, aiding in the choice of the most efficient algorithm for detecting malware, thereby yielding the best results.
2.2 Theory of Support Vector Machine
Support Vector Machine (SVM) is a versatile, non-parametric method suitable for classification and regression tasks, commonly employed in data classification and image processing. Originally designed for linear problem-solving, SVM has evolved to handle non-linear problems by identifying optimal hyperplanes [17]. The accuracy of this method depends on the selection of parameters and kernels; users can adjust the parameters, and each kernel parameter has a distinct influence on performance [18].
Figure 4. SVM Architecture
The following kernels are frequently used in Support Vector Machine modeling:
a) The linear kernel, designed for classifying linear data, can be computed using the following formula:
f(x) = w · x + b (1)
b) The Polynomial Kernel, employed for non-linearly separated data, can be computed using the following equation:
f(x) = (gamma · (x · x') + coef0)^d (2)
c) The Radial Basis Function (RBF) Kernel is applied to address classification challenges in data that defies linear separation. Its calculation involves the following equation:
f(x) = exp(−gamma · ||x − x'||²) (3)
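The practical difference between these kernels can be sketched on a toy problem. Concentric circles are a classic dataset that no linear boundary can separate, so the RBF kernel should clearly outperform the linear one; gamma, coef0, and degree keep scikit-learn's defaults in this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A toy non-linearly separable problem: two concentric circles.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # Each kernel corresponds to one of Eqs. (1)-(3); defaults for gamma,
    # coef0, and degree are used here.
    scores[kernel] = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:6s} kernel: {scores[kernel]:.2f}")
```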
2.3 Theory of Random Forest
Figure 5. Random Forest Architecture
Random Forest falls within the realm of supervised learning classification methods, relying on labeled data in the training process. It marks an evolution of the Classification and Regression Tree (CART) method through the integration of techniques such as bootstrap aggregating and random feature selection. In the construction of each tree, the Random Forest method employs information gain and the Gini index for its computations. As implied by its name, Random Forest constructs multiple trees, resembling the creation of a forest. Each tree is built using a randomly chosen subset of the training data. During the classification phase, each tree independently provides its optimal class prediction. Once the ensemble of trees is established, a majority voting scheme is implemented to determine the final decision, derived from the most frequently predicted class among the trees forming the ensemble. The potency of Random Forest lies in its capability to mitigate overfitting, enhance generalization, and bolster accuracy by amalgamating predictions from multiple trees. This ensemble methodology positions Random Forest as a potent and adaptable tool in the domain of machine learning, particularly for classification tasks [19], [20].
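The per-tree voting described above can be made visible by querying the individual trees of a fitted forest. Note one implementation detail: scikit-learn's RandomForestClassifier averages each tree's leaf class probabilities (soft voting), a close variant of the hard majority vote described in the text; the data here is a synthetic stand-in for the malware/benign labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A toy binary task standing in for the malware/benign classification.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# Each tree contributes its leaf class probabilities; the forest averages
# them and predicts the class with the highest mean probability.
sample = X[:1]
avg_proba = np.mean([t.predict_proba(sample) for t in rf.estimators_], axis=0)
print(int(avg_proba.argmax()) == int(rf.predict(sample)[0]))  # True
```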
2.4 Metrics
When appraising the performance of a classification system, it is crucial to adopt a measurement methodology. One commonly used approach is the confusion matrix, a valuable tool that yields essential performance metrics, including accuracy, precision, recall, and the F-measure. These metrics collectively provide a thorough evaluation by taking into account true positives, true negatives, false positives, and false negatives in the assessment process [21]. A 2x2 confusion matrix is organized as follows:
Table 1. Confusion Matrix

                    Predicted Positive     Predicted Negative
Actual Positive     TP (True Positive)     FN (False Negative)
Actual Negative     FP (False Positive)    TN (True Negative)
Based on the data presented in Table 1, the evaluation of the tested machine learning model's performance requires values from the confusion matrix, including True Negative (TN), False Positive (FP), False Negative (FN), and True Positive (TP). Utilizing these values, we can calculate the machine learning model's performance using various performance measurement metrics such as:
1. Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN) (4)

2. Precision

Precision = TP / (TP + FP) (5)

3. Recall

Recall = TP / (TP + FN) (6)

4. F-Measure

F-Measure = (2 × Precision × Recall) / (Precision + Recall) (7)
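Equations (4)-(7) can be applied directly to the SVM confusion-matrix counts reported later in this paper (Figure 6); the printed values agree with Table 2 up to rounding.

```python
# SVM confusion-matrix counts from Figure 6, plugged into Eqs. (4)-(7).
TP, TN, FP, FN = 7311, 7261, 221, 207

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F-Measure", f_measure)]:
    print(f"{name:9s}: {value:.4f}")
```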
3. RESULT AND DISCUSSION
In this research, a dataset comprising 50,000 malware and benign samples is utilized. The dataset is partitioned, allocating 70% for training data and the remaining 30% for testing purposes. Subsequently, the developed Support Vector Machine (SVM) algorithm is applied for testing, and the efficacy of the SVM algorithm in detecting malware is quantified through the construction of a confusion matrix. The success rate of the SVM algorithm demonstrates its effectiveness in classifying malware. Figure 6 displays the corresponding confusion matrix, providing a visual representation of the algorithm's performance in terms of true positives, true negatives, false positives, and false negatives. This achievement underscores the successful development of a malware classification algorithm using the SVM method.
Figure 6. Confusion Matrix of the Support Vector Machine
Derived from the results, it's clear that the SVM algorithm model performs exceptionally well. Registering a True Negative value of 7261 and only 207 False Negatives, the SVM algorithm showcases a strong proficiency in discerning non-malware instances, reducing the chances of mislabeling non-malware as malware. Moreover, the figures for True Positive (7311) and False Positive (221) underscore the SVM's remarkable effectiveness in identifying malware, with a minimal occurrence of misclassifications where malware is inaccurately identified as non-malware.
Once the SVM algorithm model has demonstrated effective performance, the dataset is subjected to additional testing using the random forest algorithm model. This sequential evaluation allows for a comprehensive comparison of the two algorithms on the same dataset, facilitating a nuanced understanding of their respective strengths and weaknesses.
In Figure 7, each quadrant of the confusion matrix is visually presented, providing a clear depiction of how well the Random Forest algorithm performed in distinguishing between different classes. This visual representation aids in the interpretation of specific metrics such as precision, recall, and accuracy, offering a more intuitive understanding of the algorithm's strengths and areas for improvement.
Figure 7. Confusion Matrix of Random Forest
Upon conducting tests using the identical dataset on the Random Forest model, notable improvements were observed. True Negative yielded a count of 7366, accompanied by only 104 False Negatives. This performance enhancement becomes evident when juxtaposed with the results of SVM testing. Examining True Positive, the model achieved a commendable 7414, while the count for False Positive was 116. The Random Forest model showcases enhanced proficiency in accurately detecting malware, exhibiting fewer errors compared to SVM.
Furthermore, the model demonstrates superior competence in identifying non-malware instances, with a minimized margin of error compared to SVM. This suggests that the Random Forest algorithm, when applied to
the given dataset, not only excels in malware detection but also excels in minimizing misclassifications, establishing its robustness in distinguishing between malicious and non-malicious entities.
Utilizing the insights garnered from the confusion matrix results presented in Figure 6 and Figure 7, a comprehensive set of performance metrics for each tested algorithm has been systematically organized in Table 2, outlined below:
Table 2. Comparison of the Support Vector Machine and Random Forest Algorithms

Algorithm                 Accuracy    Precision    Recall    F-Measure
Random Forest             98.53%      98.45%       98.61%    98.54%
Support Vector Machine    97.14%      97.06%       97.24%    97.14%
Table 2 furnishes a comprehensive performance comparison between the two algorithms, offering a detailed overview of their effectiveness in malware detection. The results are further visually depicted in Figure 8, providing a graphical representation that enhances the interpretability of the performance metrics.
Figure 8. Algorithm Comparison
Based on the results presented in Table 2 and Figure 8, the SVM algorithm achieves an Accuracy performance of 97.14%, whereas the Random Forest algorithm attains a higher performance of 98.53%. Regarding the Precision metric, the SVM algorithm exhibits a performance of 97.06%, whereas the Random Forest algorithm achieves a slightly better performance of 98.45%. For the Recall metric, the SVM algorithm achieves a performance of 97.24%, while the Random Forest algorithm demonstrates a higher performance of 98.61%. Lastly, concerning the F-Measure metric, the SVM algorithm obtains a performance of 97.14%, while the Random Forest algorithm performs better with a score of 98.54%.
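The accuracy figures above can be recomputed directly from the two confusion matrices (Figures 6 and 7); the small gaps versus the reported table values come from rounding.

```python
# Confusion-matrix counts taken from Figures 6 and 7.
counts = {
    "Support Vector Machine": {"TP": 7311, "TN": 7261, "FP": 221, "FN": 207},
    "Random Forest":          {"TP": 7414, "TN": 7366, "FP": 116, "FN": 104},
}
accuracy = {}
for name, c in counts.items():
    # Accuracy = (TP + TN) / (TP + TN + FP + FN), per Eq. (4).
    accuracy[name] = (c["TP"] + c["TN"]) / sum(c.values())
    print(f"{name:22s}: {accuracy[name]:.4f}")
```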
Based on Ijaz's research [8], the accuracy of static-analysis PE malware detection using the Random Forest algorithm was found to be 97.36%. In Balram's study [9], static-analysis PE malware detection with the SVM algorithm achieved an accuracy of 92.9%, and in Akhtar's study [10] the SVM algorithm achieved an accuracy of 98%. In contrast, the findings in Table 2 of this research indicate that the Random Forest algorithm exhibits better accuracy performance than the previous study, achieving an accuracy of 98.53%. However, the SVM algorithm in this research shows lower accuracy compared to the previous study, with an accuracy of 97.14%. This disparity can be attributed to the use of a larger dataset in this research. The Random Forest algorithm's superior performance on a larger dataset suggests that it is a more suitable choice for achieving a malware detection system with a higher level of accuracy. Consequently, the Random Forest algorithm has proven to outperform the SVM algorithm in detecting malware, making it a preferred option for accurate malware detection systems.
The system design and testing process undertaken by the author have revealed the pivotal development and testing phases essential for establishing the effectiveness and resilience of the malware detector. These critical stages encompass the creation of a prototype for the malware detector, the application of testing procedures utilizing the machine learning algorithms described above, and subjecting the prototype to comprehensive performance evaluations to gauge its accuracy and efficiency in detecting malware within PE files.
The initial phase requires the meticulous creation of a prototype for the malware detector, transforming the theoretical system design into a tangible software solution. This prototype serves as an early iteration of the detector, laying the foundation for subsequent refinements and enhancements.
Following the implementation of the prototype, an intensive testing regimen is employed, leveraging the machine learning algorithms described above. Known for their capacity to discern intricate patterns within data, these algorithms play a crucial role in establishing the malware detector's detection capabilities. This process involves training the models on a large dataset of labeled PE files, enabling them to recognize patterns indicative of malware presence.
Furthermore, the developed prototype undergoes meticulous performance evaluations. This phase involves subjecting the detector to diverse scenarios and datasets, facilitating a comprehensive assessment of performance
metrics such as accuracy, precision, recall, and false positive rate. The primary objective is to scrutinize the detector's ability to accurately identify malware in PE files while minimizing instances of both false positives and false negatives.
Through the strategic execution of these developmental and evaluative phases, the author aims to refine and optimize the malware detector's performance. The goal is to cultivate a resilient and dependable solution capable of effectively and accurately detecting malware within PE files, thereby making a significant contribution to the evolving landscape of cybersecurity.
4. CONCLUSION
The performance testing results of the machine learning models designed for malware detection reveal that the Random Forest algorithm consistently outperforms the SVM algorithm across all testing metrics. Notably, the Random Forest algorithm achieved an impressive accuracy rate of 98.53%, surpassing the SVM algorithm's accuracy of 97.14%. This notable difference in accuracy underscores the superiority of the Random Forest algorithm in the field of malware detection. As a result, the Random Forest algorithm stands out as the preferred option for malware detection in comparison to the SVM algorithm, which displays inferior performance on the same dataset. Choosing the Random Forest algorithm ensures a more dependable and efficient malware detection system, particularly when dealing with large datasets. Its capacity to handle extensive data without sacrificing performance further enhances its appeal and suitability for this task. A suggestion for further research is to combine malware detection using multiple machine learning models integrated into a single algorithm model.
REFERENCES
[1] J. Singh and J. Singh, “A survey on machine learning-based malware detection in executable files,” Journal of Systems Architecture, vol. 112. Elsevier B.V., Jan. 01, 2021. doi: 10.1016/j.sysarc.2020.101861.
[2] A. Kumar, K. S. Kuppusamy, and G. Aghila, “A learning model to detect maliciousness of portable executable using integrated feature set,” Journal of King Saud University - Computer and Information Sciences, vol. 31, no. 2, pp. 252–
265, Apr. 2019, doi: 10.1016/j.jksuci.2017.01.003.
[3] R. Chanajitt, B. Pfahringer, and H. M. Gomes, “Combining Static and Dynamic Analysis to Improve Machine Learning- based Malware Classification,” in 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, Oct. 2021, pp. 1–10. doi: 10.1109/DSAA53316.2021.9564144.
[4] A. G. Kakisim, M. Nar, N. Carkaci, and I. Sogukpinar, “Analysis and Evaluation of Dynamic Feature-Based Malware Detection Methods,” 2019, pp. 247–258. doi: 10.1007/978-3-030-12942-2_19.
[5] A. Shalaginov, S. Banin, A. Dehghantanha, and K. Franke, “Machine learning aided static malware analysis: A survey and tutorial,” in Advances in Information Security, vol. 70, Springer New York LLC, 2018, pp. 7–45. doi: 10.1007/978- 3-319-73951-9_2.
[6] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware,” in 2021 IEEE Security and Privacy Workshops (SPW), IEEE, May 2021, pp. 78–84. doi:
10.1109/SPW53761.2021.00020.
[7] R. Sihwail, K. Omar, and K. A. Z. Ariffin, “A survey on malware analysis techniques: Static, dynamic, hybrid and memory analysis,” Int J Adv Sci Eng Inf Technol, vol. 8, no. 4–2, pp. 1662–1671, 2018, doi: 10.18517/ijaseit.8.4-2.6827.
[8] M. Ijaz, M. H. Durad, and M. Ismail, “Static and Dynamic Malware Analysis Using Machine Learning,” in 2019 16th International Bhurban Conference on Applied Sciences and Technology (IBCAST), IEEE, Jan. 2019, pp. 687–691. doi:
10.1109/IBCAST.2019.8667136.
[9] N. Balram, G. Hsieh, and C. McFall, “Static Malware Analysis Using Machine Learning Algorithms on APT1 Dataset with String and PE Header Features,” in 2019 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, Dec. 2019, pp. 90–95. doi: 10.1109/CSCI49370.2019.00022.
[10] M. S. Akhtar and T. Feng, “Malware Analysis and Detection Using Machine Learning Algorithms,” Symmetry (Basel), vol. 14, no. 11, p. 2304, Nov. 2022, doi: 10.3390/sym14112304.
[11] H. S. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,”
Apr. 2018.
[12] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, Big Data Preprocessing. Cham: Springer International Publishing, 2020. doi: 10.1007/978-3-030-39105-8.
[13] P. Mishra, A. Biancolillo, J. M. Roger, F. Marini, and D. N. Rutledge, “New data preprocessing trends based on ensemble of multiple preprocessing techniques,” TrAC Trends in Analytical Chemistry, vol. 132, p. 116045, Nov. 2020, doi:
10.1016/j.trac.2020.116045.
[14] B. Vrigazova, “The proportion for splitting data into training and test set for the bootstrap in classification problems,”
Business Systems Research: International Journal of the Society for Advancing Innovation and Research in Economy, vol.
12, no. 1, pp. 228–242, 2021.
[15] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen, “Detection of Malicious Code Variants Based on Deep Learning,”
IEEE Trans Industr Inform, vol. 14, no. 7, pp. 3187–3196, Jul. 2018, doi: 10.1109/TII.2018.2822680.
[16] A. Chatzimparmpas, R. M. Martins, K. Kucher, and A. Kerren, “StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics,” IEEE Trans Vis Comput Graph, vol. 27, no. 2, pp.
1547–1557, Feb. 2021, doi: 10.1109/TVCG.2020.3030352.
[17] I. M. Mubaroq and E. B. Setiawan, “The Effect of Information Gain Feature Selection for Hoax Identification in Twitter Using Classification Method Support Vector Machine,” Indonesia Journal on Computing (Indo-JC), vol. 5, no. 2, pp.
107–118, 2020.
[18] D. Maulina and R. Sagara, “Klasifikasi artikel hoax menggunakan support vector machine linear dengan pembobotan term frequency–Inverse document frequency,” Jurnal Mantik Penusa, vol. 2, no. 1, 2018.
[19] C. Irawan, T. Mantoro, and M. A. Ayu, “Malware Detection and Classification Model Using Machine Learning Random Forest Approach,” in 2021 IEEE 7th International Conference on Computing, Engineering and Design (ICCED), IEEE, Aug. 2021, pp. 1–5. doi: 10.1109/ICCED53389.2021.9664858.
[20] D. Kuswanto, Husni, and M. R. Anjad, “Application of Improved Random Forest Method and C4.5 Algorithm as Classifier to Ransomware Detection Based on the Frequency Appearance of API Calls,” in 2021 IEEE 7th Information Technology International Seminar (ITIS), IEEE, Oct. 2021, pp. 1–6. doi: 10.1109/ITIS53497.2021.9791836.
[21] J. Xu, Y. Zhang, and D. Miao, “Three-way confusion matrix for classification: A measure driven view,” Inf Sci (N Y), vol.
507, pp. 772–794, Jan. 2020, doi: 10.1016/j.ins.2019.06.064.