Improving Multiclass Classification in Intrusion Detection Using Clustered Linear Separator Analytics

Azween Abdullah
School of Computing and IT, Taylors University, Subang Jaya, Malaysia
[email protected]

Ramachandran Ponnan
School of Communication, Taylors University, Subang Jaya, Malaysia
[email protected]

David Asirvatham
School of Computing and IT, Taylors University, Subang Jaya, Malaysia
[email protected]

Abstract - This research proposes a new ensemble classification technique called Clustered Linear Separator Analytic (CLSA) for intrusion detection systems. The approach makes use of both a linear separation technique and a cluster-based approach. Genetic Linear Discriminant Analysis (GLDA) is used for feature transformation and optimum subset selection, while CLSA is used for classification. CLSA is a much-improved combination of SVM and LDA, as it performs both binary and multi-class classification when the two are combined rather than employed independently. Comparative experiments were performed with SVM (Support Vector Machine) under its variant kernels and with CLSA to evaluate the robustness and classification accuracy of the proposed approach. Detection rate (DR) was used to evaluate the performance of the proposed system. The results indicate that the combination of these approaches produces an efficient IDS compared with state-of-the-art approaches.

Keywords - Clustered Linear Separator Analytics, Genetic Linear Discriminant Analysis, Multi-Class Classification, Intrusion Detection System

I. INTRODUCTION

The first glaring weakness is that intrusion detection systems (IDS) have several issues and weak points which make them vulnerable, such as the need for regular updating, low detection capability for unknown attacks, high rates of false alarms, extraordinary resource consumption and many others. The second issue is that feature selection and classification are designed and tested independently for accuracy. Some classification algorithms work better with certain feature selection algorithms and not others, and vice versa, and again only with a certain number of features. This leads to false claims of robustness, efficiency and accuracy. Feature selection and classification go hand in hand, and the proof of this hypothesis is one of the aims of this research. The third drawback of existing IDS approaches is the direct input of raw features to the classifier, which causes many issues such as high rates of false alarms and low rates of detection and accuracy. In some cases, features are transformed and a subset of features is given as input to the classifier. Issues with regard to optimum subset selection arise in such cases, causing problems such as the loss of important information. Hence, in this research we propose to use Genetic Linear Discriminant Analysis (GLDA), which has a low false alarm rate and a high detection rate while maintaining low cost and a shorter detection time. GLDA is an ensemble approach that uses both a genetic algorithm and linear discriminant analysis for the first-stage binary classification [1].

In the intrusion detection realm, there are two pertinent processes: the first is feature selection and the second is feature set classification. This study focuses on a new refined approach to classification that extracts pertinent properties of LDA and SVM; a new classification algorithm is proposed that refines classification compared with using the algorithms individually. In this algorithm, the idea of margin maximization using the above algorithms is explored to give a new large-margin classifier called LSA. A new algorithm called Clustered Linear Separator Analytic (CLSA) is then proposed that first separates data by maximizing the margin using LSA and then refines the data within the same margin by clustering it with a new clustering algorithm called clustered k-SVM.

Section II discusses some of the related works. Section III explains the Clustered k-means algorithm used in this research. The ensemble design is explained in section IV and section V explains the experimental set-up and results while section VI concludes the paper with future research.

II. RELATED WORKS

Typically, newly proposed and existing intrusion detection systems’ performances are evaluated using cyber security benchmark datasets. For example, genetic algorithms and support vector machines were hybridized to create intrusion detection systems. The hybrid of the genetic algorithm and support vector machine was evaluated using the KDD Cup 1999 dataset [3].

2018 8th International Conference on Intelligent Systems, Modelling and Simulation

(2)

[4] used neural networks to design an intrusion detection system which was evaluated using this benchmark dataset.

[5] reviewed the application of data mining techniques to create intrusion detection systems, finding that researchers relied heavily on the KDD and University of New Mexico (UNM) cyber security benchmark datasets. However, the KDD and UNM benchmark datasets are no longer relevant to the modern computer age because of significant advances in computer technology [2].

[6],[21],[22],[23],[24],[25] described how a one-class support vector machine was used on ADFA-LD to detect intrusions. It was found that the Adduser and Meterpreter attacks were relatively easier to detect than Hydra-FTP and Hydra-SSH, and it was concluded that the one-class support vector machine was not robust against all kinds of cyber attacks. [7] attempted to extract information from ADFA-LD for the development of a new host-based anomaly detection system. The features analysed in their study include the length, common patterns and frequencies of system call traces. An acceptable level of performance was found for some types of attack; however, the complex behaviour of the modern computer system was not fully understood. [6] used K-nearest neighbour and k-means clustering on ADFA-LD to explore the potential of reducing the dimensionality of frequency vectors, and the optimal distance function was identified.

[2],[27],[28] proposed the ADFA Linux (ADFA-LD) cyber security benchmark dataset for the evaluation of intrusion detection systems. The host operating system for the generation of ADFA-LD was Ubuntu Linux version 11.04. Its configuration offers various functions including file sharing, a database, remote access, as well as a web server. Ubuntu 11.04 runs Linux kernel 2.6.38, fully patched.

FTP, SSH and MySQL 14.14 are enabled on default ports. Apache 2.2.17 and PHP 5.3.5 are installed to provide web-based services. Furthermore, TikiWiki 8.1 was installed as a collaborative web tool. The data structure of ADFA-LD is as follows: the normal training data has 833 traces, the normal validation data has 4,373 traces, and the attack data has 10 attacks per vector. The ADFA-LD cyber security benchmark dataset has a closer resemblance between the attack and normal data, unlike the KDD cyber security dataset. In addition, ADFA-LD represents the updated and modern cyber attacks of current situations.

III. CLUSTERED K-SVM

A. K-means clustering

In unsupervised learning, clustering determines the possible feature subsets in the dataset which have similar behavior and can be considered as a group, also known as a cluster. The points in a single cluster are different from the points in nearby clusters and have matching labels or dimensions with the other points of the same cluster [1],[29],[30]. Clustering works well in identifying pattern labels when no predefined details are available from the dataset and the unsupervised system has to create possible groups by itself. Clustering is based on a data-driven approach, as compared to supervised approaches, which are based on model-driven concepts [8],[18],[19],[20].

To develop an unsupervised system by utilizing cluster analysis, the following prerequisites are required:

a) Data records with selected attributes in processed form with/without labels.

b) Defining appropriate distance measure depending upon the record synchronization.

c) Desired size number for cluster formation.

d) Identifying possible clusters and their data representation in different dimensions.

e) Stability and validation assessment.

B. Cluster Formation

In real-world problems, datasets do not carry all the details required to determine system accuracy, so certain parameters are assumed to assist the classification process. k-means clustering focuses on the internal correlation and overall variance measurement of the data records. Data patterns are identified by their behavior and are flexible enough to determine possible relevancy with other patterns for cluster formation. After the clusters are determined, a profile is built for every single cluster by evaluating related details such as the centroid, the number of patterns in the cluster and their labels [9]. The centroid is the center point of an individual cluster and is measured using specified distance measures. The list of parameters evaluated after cluster formation is as follows:

a) Number of clusters in the dataset.

b) Cluster with the most and least number of data records.

c) Optimization of the centroid and distance measures.

d) Disjoint or overlapping cluster information.

The clustering procedure aims to minimize the distance between the data within a cluster while increasing the distance between the data of different clusters. Such regulation produces disjoint clusters where every single data pattern of the dataset belongs to only one cluster, a process known as hard clustering. The cluster size of the input data can be specified depending upon the possible classes in the data, or it can be optimized by trying a certain range of sizes to evaluate the clustering effect. It can be modeled in iterations where the centroids are also configured to select the best possible data point from the cluster, which is a central point and represents the overall cluster behavior.

C. K-means algorithm

The step wise procedure of k-means algorithm is as follows:

a) Remove the data labels and normalize the data records for the selected attributes:

D = [nr, nc]

where nr is the total number of rows and nc is the total number of columns. After eliminating the last column, which holds the class labels, this becomes:

c = nc - 1; D = [nr, c]

b) Select a starting cluster size (default k = 2) and increase the cluster size k (k <= 5) to evaluate the possible groups available in the data.

c) Select a distance measure (default: squared Euclidean) for calculating the sum of distances used to determine the centroids of the clusters.

d) After computing the patterns and selected parameters, divide dissimilar patterns into different clusters.

e) Assign patterns with similar behavior to specific clusters iteratively, so as to achieve a minimum sum of distances.

f) Once the given set of clusters is formed, select the centroids by determining the pattern in each cluster which has the minimum distance from the other patterns of the same cluster.

g) Assign each cluster an index; each is identified by its centroid as a center, with the respective coordinates.
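The procedure above can be sketched as a minimal NumPy implementation. This is an illustrative sketch only, not the authors' code; the column normalization and the squared-Euclidean distance follow the defaults stated in the steps, and the function name is our own:

```python
import numpy as np

def kmeans(D, k=2, n_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, cluster index per row)."""
    rng = np.random.default_rng(seed)
    # Step a): normalise each attribute column to [0, 1] (labels already removed).
    D = (D - D.min(axis=0)) / np.maximum(np.ptp(D, axis=0), 1e-12)
    # Pick k distinct rows as the initial centroids.
    centroids = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        dist = ((D[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = dist.argmin(axis=1)  # assign each pattern to its nearest centroid
        new = np.array([D[idx == j].mean(axis=0) if (idx == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):  # minimum sum of distances reached
            break
        centroids = new
    return centroids, idx
```

Each returned cluster index plays the role described in step g): it identifies which centroid, and hence which cluster, every pattern belongs to.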

D. K-means significance for supervised classification

K-means is a popular clustering algorithm which can be ensembled with supervised methods to overcome the local minima of optimization approaches, as compared to a global method, which increases computation time and cost. k-means works efficiently on large unbalanced datasets, minimizing the variance of points within an individual cluster while maximizing the variance among different clusters overall. An appropriate cluster size can be found by trying possible sizes for the data within a specific range. k-means is fast enough to optimize computationally expensive approaches [10],[11],[12]. Distance methods commonly used for measuring cluster centroids are squared Euclidean (the default), city block, cosine and correlation.


IV. ENSEMBLE SYSTEM DESIGN

Figure 1 shows the ensemble design of the proposed Clustered k-SVM. This model modifies the traditional architecture of intrusion detection systems to develop a novel classification method which can predict the possible labels of processed data samples grouped in clusters and determine their performance measures.

Figure 1: System design of the Clustered k-SVM mechanism (NSL-KDD dataset → feature preprocessing → k-means clustering → Clustered k-SVM → SVM classification using 5x2 cross-validation → performance estimation)

To perform the proposed Clustered k-SVM classification, the clustered feature set is initialized in the 5x2 cross-validation process [4]. Each cluster has a unique index which identifies all the data patterns of the respective cluster. The procedure for classifying clustered data is as follows:

Fetch all the clusters of a certain cluster size for a specific feature set. Each cluster is trained and tested individually, identified by its index.


The cluster patterns are processed in the classification system for 10 iterations of training and testing. At the start of every iteration, the data patterns of the selected cluster are shuffled and divided into two sets: training and testing.

In the training phase, default parameter values are provided along with the training set, and the parameters are then tuned to their optimal values so that the system trains without local minima or classifier overhead. Grid search is used to optimize the cost C and gamma γ parameters by searching over a range of their exponential values. Two grid searches are performed: a loose grid and a fine grid.

Loose grid ranges: C = (2^-5, 2^-3, ..., 2^15) and γ = (2^-15, 2^-13, ..., 2^3)

This loose grid provides approximate values of C and γ, which are then used in a fine grid with a smaller range; a best-suited pair of C and γ for the selected feature subset and the kernel used is thus determined [26].
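A hedged sketch of this two-stage search follows. Here `evaluate` is a hypothetical stand-in for the cross-validated accuracy of an SVM trained at a given (C, γ) pair, and the fine-grid step sizes are illustrative assumptions, since the paper does not state them:

```python
import itertools

def two_stage_grid_search(evaluate):
    """Loose grid over exponential (C, gamma) ranges, then a fine grid
    around the best loose-grid pair. `evaluate(C, gamma)` -> score."""
    loose_C = [2.0 ** e for e in range(-5, 16, 2)]   # 2^-5, 2^-3, ..., 2^15
    loose_g = [2.0 ** e for e in range(-15, 4, 2)]   # 2^-15, 2^-13, ..., 2^3
    best_C, best_g = max(itertools.product(loose_C, loose_g),
                         key=lambda p: evaluate(*p))
    # Fine grid: smaller exponent steps around the loose-grid winner
    # (half-step exponents are an assumption for illustration).
    fine_C = [best_C * 2.0 ** e for e in (-1, -0.5, 0, 0.5, 1)]
    fine_g = [best_g * 2.0 ** e for e in (-1, -0.5, 0, 0.5, 1)]
    return max(itertools.product(fine_C, fine_g),
               key=lambda p: evaluate(*p))
```

In practice `evaluate` would wrap the 5x2 cross-validated training and testing of the SVM for the selected kernel.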

The selected parameter values, the training data and the training labels are given to train the model, a process also known as model selection. Once the system is trained with suitable values, the other half of the cluster's data patterns is tested using the selected parameter values.

In 5x2 cross-validation, 10 iterations of training and testing are computed and the mean of the performance measures for these iterations is calculated.

Performance evaluation for a clustered feature set of a certain size is done by taking the average of the prediction measures computed for all the included clusters.

All the selected kernel functions are used on the clusters of the selected feature set to determine the best possible results in terms of accuracy, sensitivity and specificity at specific parametric values.
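The 5x2 procedure described above (five shuffled 50/50 splits, each half used once for training and once for testing, and the mean taken over the ten estimates) can be sketched generically; `fit` and `score` are hypothetical stand-ins for training and testing the per-cluster SVM:

```python
import random

def five_by_two_cv(data, labels, fit, score, seed=0):
    """5x2 cross-validation: 5 shuffles, each yielding 2 train/test
    folds, for 10 performance estimates; returns their mean."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    estimates = []
    for _ in range(5):
        rng.shuffle(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:]
        for train, test in ((a, b), (b, a)):  # each half trains once, tests once
            model = fit([data[i] for i in train], [labels[i] for i in train])
            estimates.append(score(model, [data[i] for i in test],
                                   [labels[i] for i in test]))
    return sum(estimates) / len(estimates)
```

The per-feature-set result reported in the paper is then the average of this mean over all clusters of the feature set.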

V. EXPERIMENTAL SET-UP

Different experiments were conducted independently for different scenarios using GLDA with SVM kernels and CLSA. Three SVM kernels, namely polynomial, sigmoid and RBF, made up part of these experiments, as each kernel represented a different SVM classifier. The polynomial kernel was further divided into different sub-kernels based on its degree value. There were in total five cases, and in each scenario a different subset of NSL-KDD was given as input to check the classification performance of the different kernels. In the first scenario, the dataset was given as input without preprocessing steps to reinforce the importance of preprocessing. All the experiments were executed on a single machine to minimize performance fluctuation, since different machines have different configurations. SVM and CLSA were used for classification and their classification accuracies were investigated. The machine used for the experiments ran 64-bit Windows 10 with 6 GB of memory and an Intel® Core™ processor, [email protected].

It was found that the state-of-the-art cyber security benchmark datasets KDD and UNM are no longer reliable, because they cannot meet the expectations of current advances in computer technology. As a result, a new ADFA Linux (ADFA-LD) cyber security benchmark dataset for the evaluation of machine learning and data mining-based intrusion detection systems was proposed in 2013. ADFA-LD still requires improvement in terms of full descriptions of its attributes.

A. Benchmarking of results

Table 1 shows the performance comparison with other approaches. With a detection rate of 99.7%, GLDA-CLSA demonstrates better performance than the next closest result of [16],[17],[30], which uses SVM for classification and GA-PCA for feature selection. Both studies use 10 features, so it can be concluded that, even with the same 10 features, the choice of feature selection and classification algorithm has a significant effect on the detection rate.

Table 1: Performance comparison with other approaches

Approach(es)                                          Detection rate (%)
GLDA + CLSA                                           99.7
PCA + GA + SVM [Ahmad, Abdullah & Alghamdi, 2013b]    99.6
MLP + PCA [Ahmad, 2011]                               98.57
GA + SVM [Kim et al., 2015]                           98
SVM [Osareh & Shadgar, 2011]                          83.2
MLP [Osareh & Shadgar, 2017]                          82.5
PCA + NN [Lakhina et al., 2016]                       92.2
RBF/Elman [Tong, Wang & Haining, 2009]                93
ART1, ART2, SOM [Amini et al., 2016]                  97.42, 97.19, 95.74

Table 2 shows the different parametric values obtained using different numbers of features. Features-10 shows a significant improvement compared with the other feature sets.

We can conclude that the number of features does have an impact on the false alarm rate. What we may not be able to conclude is whether the same features will give the same result with other combinations of feature selection and classification algorithms.

From Figure 2, it can be seen that the DR value for number 11 is the highest for both CLSA and SVM, whereas number 38 is the lowest in both cases.


A chi-square test was done to determine the relationship between CA and the type of CLSA, and between CA and the type of SVM. Since the p-value (0.0000) is less than the significance level (0.05) in both cases, the null hypothesis is rejected for both. Thus there is a relationship between CA and the type of CLSA, and also between CA and the type of SVM.
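The mechanics of such a test can be sketched as follows; the contingency table used in the test below is hypothetical, since the paper does not publish the underlying counts:

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 contingency
    table [[a, b], [c, d]]; returns (statistic, p_value), df = 1."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    # Sum of (observed - expected)^2 / expected over the four cells.
    stat = sum((obs - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))
    # For df = 1, the survival function is P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

A p-value below 0.05 rejects the null hypothesis of independence, matching the decision rule used in the text.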

Figure 2: Comparison of detection rate with different classification algorithms

Table 2: GLDA-CLSA performance on different feature sets

Metric          Features-10  Features-12  Features-22  Features-38  Raw-38
False alarm     7            11           24           79           11,455
Epochs          1,000        1,000        1,000        1,000        1,000
Time            1:16:14      1:36:01      2:08:18      2:39:04      2:21:17
Feature size    564 KB       2.17 MB      5.15 MB      8.37 MB      8.37 MB
False positive  0            0            24           79           11,455
False negative  7            11           0            0            0
True positive   12,807       12,811       12,776       12,721       1,345
True negative   7,193        7,189        7,224        7,279        18,655
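Detection rate and false-alarm rate follow directly from confusion counts such as those in Table 2; a minimal sketch (reading off the Features-10 column is our own illustration, not a result claimed by the paper):

```python
def detection_metrics(tp, fn, fp, tn):
    """Detection rate = detected attacks / total attacks;
    false-alarm rate = flagged normal traffic / total normal traffic."""
    return tp / (tp + fn), fp / (fp + tn)

# Features-10 column of Table 2: TP=12,807, FN=7, FP=0, TN=7,193.
dr, far = detection_metrics(12807, 7, 0, 7193)
```

With zero false positives, the Features-10 configuration yields a false-alarm rate of zero, consistent with the table.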

B. Preliminary clustering results using Clustered k-SVM

To determine the clustering of the datasets, we used the KDD dataset, which has 38 features based on different network details. These features are further processed for dimensionality reduction and optimal subset selection. Prior to classification for identification of the possible classes (normal or intrusive), a clustering algorithm is applied to the selected dataset.

The significance of using clustering is to determine different groups of network records and then apply binary classification to the patterns of each cluster separately, which enhances overall system performance.
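This cluster-then-classify routing can be sketched generically. Here `assign_cluster`, `fit` and `predict` are hypothetical stand-ins for the k-means assignment and the per-cluster SVM; the sketch only illustrates the control flow, not the authors' implementation:

```python
def clusterwise_classify(train, test, assign_cluster, fit, predict):
    """Fit one binary classifier per cluster, then route each test
    pattern to the classifier of the cluster it falls in."""
    models = {}
    for x, _ in train:
        cid = assign_cluster(x)
        if cid not in models:
            # Train this cluster's classifier on its own patterns only.
            subset = [(xi, yi) for xi, yi in train if assign_cluster(xi) == cid]
            models[cid] = fit(subset)
    # Each test pattern is classified by its own cluster's model.
    return [predict(models[assign_cluster(x)], x) for x in test]
```

A test pattern falling into a cluster with no trained model would need a fallback (e.g. the nearest trained cluster), which is omitted here for brevity.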

To optimize intrusion detection, various cluster sizes were selected to restrict the grouping of patterns on the basis of their similarity and the minimum sum of distances to the centroid of each cluster, which represents the coordinates of the respective cluster. These centroids are used as indexes to perform classification cluster-wise.

As shown in Figure 3 below, the dataset is clustered into different sizes ranging from 2 to 5, as denoted by colors.

Once the patterns are divided, each cluster is presented individually as a subset to the classifier for training and testing. More experiments will need to be done to test the clustering accuracy for the different attack types.

Figure 3: Clustering of datasets into different clusters

VI. CONCLUSION AND FURTHER RESEARCH

In this research, we propose CLSA, which consists of a two-step process. First, LSA, a large-margin linear decision classifier that encompasses pertinent properties of both LDA and SVM, separates datasets into intrusive and non-intrusive classes. Second, Clustered k-SVM is proposed, which clusters the datasets within the intrusive class into clusters (classes) of intrusive data with varying properties. Global properties of the SVM classifier, such as the estimated mean and covariance of the class distributions, were utilized in the standard framework. The proposed model has several positive implications and extensions: CLSA can be viewed as a mechanism for combining a priori knowledge about class distributions with labelled data, and it can be extended to semi-supervised learning, where global characteristics of the data estimated using labelled samples are incorporated.

ACKNOWLEDGMENT

This research is funded by the Taylors University Flagship Program under the Fundamental Research Grant Scheme 2018.

REFERENCES

[1] Abdullah, A. B. & Cai, L. Z. (2015). Improving intrusion detection using genetic linear discriminant analysis. International Journal of Intelligent Systems and Applications in Engineering, 3(1), 34-39.

[2] Creech, G. & Hu, J. (2013). Generation of a new IDS test dataset: Time to retire the KDD collection. In 2013 IEEE Wireless Communications and Networking Conference (WCNC), pp. 4487-4492.

[3] Moradi, M. & Zulkernine, M. (2004). A neural network based system for intrusion detection and classification of attacks. In Proceedings of the 2004 IEEE International Conference on Advances in Intelligent Systems - Theory and Applications, pp. 142-152.

[4] Kim, D. S., Nguyen, H. N., & Park, J. (2005). Genetic algorithm to improve SVM based network intrusion detection system. In 19th International Conference on Advanced Information Networking and Applications (AINA 2005), vol. 2, pp. 155-158.

[5] Helali, R. G. M. (2010). Data mining based network intrusion detection system: A survey. In Novel Algorithms and Techniques in Telecommunications and Networking, pp. 501-505. J Inf Tech, 7, 47-57.

[6] Xie, M., Hu, J., & Slay, J. Evaluating host-based anomaly detection systems: Application of the one-class SVM algorithm to ADFA-LD. [Online]. Available: http://icnc-fskd.xmu.edu.cn/doc/invitedSession/FSKD_5_Miao.pdf. Accessed November 5, 2014.

[7] Xie, M. & Hu, J. (2013). Evaluating host-based anomaly detection systems: A preliminary analysis of ADFA-LD. In 2013 6th International Congress on Image and Signal Processing (CISP), pp. 1711-1716.

[8] Bagirov, A. M., Rubinov, A. M., Soukhoroukova, N. V., & Yearwood, J. (2003). Unsupervised and supervised data classification via nonsmooth and global optimization. Top, 11(1), 1-75.

[9] Ben-Hur, A. & Guyon, I. (2003). Detecting stable clusters using principal component analysis. In M. Brownstein & A. Khodursky (Eds.), Functional Genomics, Vol. 224, pp. 159-182. Humana Press.

[10] Guan, X., Guo, H., & Chen, L. (2010). Network intrusion detection method based on agent and SVM. In 2nd IEEE International Conference on Information Management and Engineering (ICIME), 2010.

[11] Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2002). Support vector clustering. Journal of Machine Learning Research, 2, 125-137.

[12] Chiang, M.-T. & Mirkin, B. (2010). Intelligent choice of the number of clusters in k-means clustering: An experimental study with different cluster spreads. Journal of Classification, 27(1), 3-40.

[13] Chen, W.-H., Hsu, S.-H., & Shen, H.-P. (2005). Application of SVM and ANN for intrusion detection. Computers & Operations Research, 32(10), 2617-2634.

[14] Marchette, D. (2001). Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint. New York: Springer.

[15] Hongle, D., Shaohua, T., Mei, Y., & Qingfang, Z. (2009). Intrusion detection system based on improved SVM incremental learning. In International Conference on Artificial Intelligence and Computational Intelligence (AICI '09).

[16] Khan, L., Awad, M., & Thuraisingham, B. (2007). A new intrusion detection system using support vector machines and hierarchical clustering. The International Journal on Very Large Data Bases, 16(4), 507-521.

[17] Ahmad, A. R., Khalid, M. B., Yusof, R., & Viard-Gaudin, C. (2003). Comparative study of SVM tools for data classification. In First Regional Malaysia-France Workshop on Image Processing in Vision Systems and Multimedia Communication (IMAGE2003), Sarawak, Malaysia.

[18] Mulay, S. A., Devale, P. R., & Garje, G. V. (2010). Intrusion detection system using support vector machine and decision tree. International Journal of Computer Applications, 3(3), 40-43.

[19] Fan, R.-E., Chen, P.-H., & Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, 1889-1918.

[20] Buciu, I., Kotropoulos, C., & Pitas, I. (2006). Demonstrating the stability of support vector machines for classification. Signal Processing, 86(9), 2364-2380.

[21] Huang, C.-L. & Wang, C.-J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.

[22] Menon, A. K. (2009). Large-Scale Support Vector Machines: Algorithms and Theory. UCSD.

[23] Wang, J. (2010). Consistent selection of the number of clusters via cross-validation. Biometrika, 97(4), 893-904.

[24] Chen, J. & Ye, J. (2008). Training SVM with indefinite kernels. In Proceedings of the 25th International Conference on Machine Learning.

[25] Chen, R. C. & Chen, S. P. (2008). Intrusion detection using a hybrid support vector machine based on entropy and TF-IDF. International Journal of Innovative Computing, Information and Control (IJICIC), 4(2), 413-424.

[26] Hsu, C., Chang, C., & Lin, C. (2010). A practical guide to support vector classification. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/.

[27] Gao, M., Tian, J., & Xia, M. (2009). Intrusion detection method based on classify support vector machine. In Proceedings of the 2009 Second International Conference on Intelligent Computation Technology and Automation, Vol. 2.

[28] Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.

[29] Kausar, N., Belhaouari Samir, B., Sulaiman, S. B., Ahmad, I., & Hussain, M. (2012). An approach towards intrusion detection using PCA feature subsets and SVM. In 2012 International Conference on Computer & Information Science (ICCIS).

[30] Ahmad, I., Abdullah, A., Alghamdi, A., & Hussain, M. (2011). Optimized intrusion detection mechanism using soft computing techniques. Telecommunication Systems. doi:10.1007/s11235-011-9541-1.
