PCA-ANN: Feature Selection based Hybrid Intrusion Detection System in Software Defined Network

Assistant Professor, Department of Computer Science and Engineering (CSE), United International University (UIU), Dhaka, Bangladesh. Professor and Head of the Department of Computer Science and Engineering, Dhaka International University, Bangladesh.

Motivation

However, deep learning (DL) methods require large datasets and significant computing power to train models, which can be expensive on the central SDN controller. Principal Component Analysis (PCA) [24] is a statistical method that reduces the dimensionality of large datasets while preserving most of the variance in the data.

Problem Statement

The InSDN dataset [17] has opened new avenues for researchers working on IDS in SDN due to its large number of features and instances, and it can be considered big data. PCA identifies the directions of greatest variance in the data, the principal components, and projects the data onto them to create a new, smaller set of features that retains most of the variance in the original dataset.
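As a rough illustration of this kind of reduction, the following NumPy sketch centers a data matrix, obtains its principal components through a singular value decomposition, and keeps the fewest components whose cumulative explained variance reaches a chosen threshold. The function name pca_fit_transform, the 0.95 threshold, and the synthetic data are assumptions made for illustration; they are not taken from the thesis.

```python
import numpy as np

def pca_fit_transform(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components reaching the target variance."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    # Smallest k whose cumulative explained-variance ratio reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(explained))
    return X_centered @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
# Hypothetical data: 6 observed features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 6))

Z, ratios = pca_fit_transform(X, variance_to_keep=0.95)
print("reduced shape:", Z.shape)
print("variance retained:", ratios.sum())
```

In practice the features would usually be scaled before this step, since PCA is sensitive to the units of each column.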

Objective

Thesis Contributions

To our knowledge, no previous work has explicitly identified the key features of this dataset. We have obtained state-of-the-art results on the dataset with our proposed model and have discussed the possible reasons for the reduced performance of the computationally heavier model.

Organization of the Thesis

This chapter presents a comprehensive review of existing knowledge, which serves as the basis of the study. Many promising works exist on intrusion detection in SDN; some of the most promising, which used different methodologies, are reviewed in the next section.

Software-Defined Networking (SDN)

Centralizing the control plane within SDN empowers network administrators to dynamically manage and configure the network through software, eliminating the dependency on individual device configurations. The programmability of the network enables the implementation of complex and computationally heavy algorithms, while also promoting dynamic network structures.

Attack vulnerability in SDN

South Interface Attacks: These attacks target the communication links from the controller in SDN.
North Interface Attacks: These attacks also target communication links from the controller in SDN.

Figure 2.2: Different attack points in SDN

Intrusion Detection Systems (IDS)

By distributing IDS components across the network, it becomes possible to capture and analyze network traffic in real time, enabling rapid detection and response to potential security threats [2]. SDN provides a flexible and programmable framework that enables efficient deployment and management of IDS components throughout the network infrastructure [27]. As a result, IDS components can jointly analyze network traffic and share insights, leading to improved accuracy and efficiency in detecting and mitigating security incidents.

IDS components can leverage the programmable capabilities of SDN switches and controllers to define and enforce security policies in real time. In this work, network intrusion detection systems (NIDS) based on AML and using an anomaly-based method are developed. Many promising works exist on SDN intrusion detection; some of the most promising, which used different machine and deep learning methodologies, are reviewed in this section.

Figure 2.4: Overview of IDS Deployment in SDN

RELATED WORK

Literature on Deep Learning Based Methodologies

[1] propose a hybrid Intrusion Detection System for SDN by combining a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. The model achieves an accuracy of 96.32% in detecting zero-day attacks on the InSDN dataset. [31] focused on the security issues of software-defined networks (SDN) and the challenge of detecting Distributed Denial of Service (DDoS) attacks in SDN environments.

Literature on Shallow ML Methodologies

For unencrypted traffic, payload features based on tri-gram frequency and TF-IDF are extracted, while for encrypted traffic, TLS cipher suites are used as salient features. However, their work showed that relying on deep packet inspection (DPI) as the sole means of extracting attack characteristics faces limitations, because malicious actors increasingly use encryption to avoid payload analysis.

[3] investigated the effectiveness of ML algorithms, including KNN, DT and SVM, for detecting DDoS attacks in SDN. [6] used an SVM-based IDS solution to detect various malware intrusions, selecting core features based on the entropy-based Information Gain (IG) method. [32] introduced an RF-based model to identify Man-in-the-Middle (MITM) attacks by selecting nodes based on context and predefined policies.

Gap Analysis

We deduced that applying a lightweight method such as PCA before feeding the data to the Neural Network (NN) model should reduce the load on the model and allow it to train efficiently while lowering complexity. We will address this gap by explicitly extracting key features from the dataset using PCA before feeding the data to the NN model for training.

This chapter presents the systematic approach used to investigate and address the research objectives. It provides a detailed description of the research design and analysis techniques used to obtain the results in a later section. By following a rigorous methodology, the study aims to ensure the accuracy, reliability and reproducibility of the research results. The methodology chapter serves as a blueprint for the readers, outlining the step-by-step process for conducting the research.

Figure 4.1: Workflow of the proposed model

Dataset

The dataset is divided into three groups: the first contains flow characteristics of normal traffic, the second contains flow characteristics of attacks on the switch, and the third contains data from attacks on the Metasploitable-2 server, including popular application services such as HTTPS, HTTP, DNS, email, FTP and SSH. The dataset is labeled with a 'Label' column that contains the traffic type and class, where normal traffic is labeled 'Normal' and attack traffic is labeled by attack type, such as 'DoS'. According to [17], the features in the dataset can be divided into network identifiers, packet-based attributes, byte-based attributes, inter-arrival time attributes, flow timer attributes, flag attributes, flow descriptor attributes, and sub-flow descriptor attributes.

Figure 4.3 shows the first few rows and columns of the dataset. The CSV files in the dataset are categorized into the three groups (normal, attack on switch, attack on Metasploitable-2 server). The dataset exploration started by reading the three CSV files into a Python script using the pandas library and combining them into a single DataFrame [30] object. Most of the features in the dataset are found to be highly skewed, which can lead to erroneous results if the model is not specified correctly.
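The loading step described above might look roughly like the following pandas sketch. The three in-memory CSV strings are tiny made-up stand-ins for the real files, and the column names other than 'Label' are hypothetical; only the overall pattern (read three files, concatenate, inspect class counts and skewness) follows the text.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the three CSV groups; the values are invented.
normal_csv = io.StringIO("Flow Duration,Tot Fwd Pkts,Label\n120,3,Normal\n95,2,Normal\n")
switch_csv = io.StringIO("Flow Duration,Tot Fwd Pkts,Label\n5,400,DoS\n7,380,DoS\n")
server_csv = io.StringIO("Flow Duration,Tot Fwd Pkts,Label\n30000,12,BFA\n")

# Read each group and stack them into one DataFrame with a fresh index.
frames = [pd.read_csv(f) for f in (normal_csv, switch_csv, server_csv)]
df = pd.concat(frames, ignore_index=True)

# Class distribution and per-feature skewness of the numeric columns.
class_counts = df["Label"].value_counts()
skewness = df.skew(numeric_only=True)

print(df.head())
print(class_counts)
print(skewness)
```

With the real files, the StringIO objects would be replaced by the paths to the three CSV files.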

Figure 4.2: Distribution of classes in the InSDN Dataset

Data Pre-Processing

Feature Selection

Classification Model

The dataset is first converted to a PyTorch tensor to be used in the model, and then split into training and validation sets using the train_test_split function from scikit-learn. The model consists of two fully connected layers, and the output of the first layer is passed through a sigmoid activation function. The CrossEntropyLoss function is used as the loss function, and the SGD optimizer is used to optimize the model parameters during training.

For each batch, the gradients of the loss function with respect to the model parameters are calculated using backpropagation, and the optimizer is used to update the model parameters. In the binary classification problem, the output layer has two neurons, while the multi-class classification model has seven neurons in the output layer, corresponding to the six attack classes and normal traffic. The binary classification model is trained for only 100 epochs, and the multi-class classification model is trained for 500 epochs.
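To make the training procedure concrete, the sketch below implements a comparable two-layer network with a sigmoid hidden activation, a cross-entropy loss, and mini-batch SGD. It is written in plain NumPy rather than PyTorch so that the backpropagation steps are visible, and it is not the thesis implementation. The synthetic data, hidden width, learning rate, batch size, and epoch count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for PCA-reduced features (hypothetical, not InSDN data).
n_samples, n_features, n_hidden, n_classes = 400, 4, 8, 2
y = rng.integers(0, n_classes, size=n_samples)
X = rng.normal(size=(n_samples, n_features)) + 2.0 * y[:, None]

# Hold out 20% of the rows for validation.
perm = rng.permutation(n_samples)
val_idx, train_idx = perm[:80], perm[80:]

# Two fully connected layers: n_features -> n_hidden -> n_classes.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(x):
    """Return hidden activations and class probabilities for a batch."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))  # sigmoid on the first layer output
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return h, probs

learning_rate, batch_size, n_epochs = 0.5, 32, 50
for epoch in range(n_epochs):
    order = rng.permutation(train_idx)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        h, probs = forward(xb)

        # Backpropagation of the mean cross-entropy loss.
        grad_logits = probs.copy()
        grad_logits[np.arange(len(idx)), yb] -= 1.0
        grad_logits /= len(idx)
        grad_W2 = h.T @ grad_logits
        grad_b2 = grad_logits.sum(axis=0)
        grad_h = grad_logits @ W2.T
        grad_pre = grad_h * h * (1.0 - h)  # derivative of the sigmoid
        grad_W1 = xb.T @ grad_pre
        grad_b1 = grad_pre.sum(axis=0)

        # SGD update of every parameter (in place).
        W1 -= learning_rate * grad_W1
        b1 -= learning_rate * grad_b1
        W2 -= learning_rate * grad_W2
        b2 -= learning_rate * grad_b2

_, val_probs = forward(X[val_idx])
val_accuracy = float((val_probs.argmax(axis=1) == y[val_idx]).mean())
print(f"validation accuracy: {val_accuracy:.3f}")
```

For the multi-class case, only n_classes would change (to seven, per the text); the loss and update rules stay the same.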

Figure 4.7: Steps in defining and training the neural network classifier model using Python

Evaluation Metrics

Evaluation Environment

Binary Classification Model

The confusion matrix for our feature selection-based hybrid binary classification model is shown in Figure 5.2. Of the 22614 normal traffic instances, 22558 were correctly classified as such, and of the 90870 attack instances, only 15 were misclassified. We compared our results with the CNN model, the LSTM model, the CNN+LSTM model, and the GAN-based hybrid model, combining GAN, DAE, and RF algorithms, developed by Ding et al. [15].

It can be observed that our model outperformed the other models in terms of accuracy, achieving a value of 0.999. The recall and precision values of our model were also high, at 0.999 and 0.997, respectively. Our model performs comparably to the hybrid model of Ding et al. [15], which has achieved the best results in the literature.
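As a worked example of how such metrics follow from a confusion matrix, the short sketch below derives accuracy, precision, recall, and F1 from the binary counts stated above. Treating "attack" as the positive class is an assumption made here; the thesis's reported precision and recall may follow a different positive-class or averaging convention, so those values need not match exactly.

```python
# Counts taken from the binary confusion matrix described in the text.
normal_total, normal_correct = 22614, 22558
attack_total, attack_missed = 90870, 15

# Assumption: "attack" is the positive class.
tp = attack_total - attack_missed   # attacks flagged as attacks
fn = attack_missed                  # attacks classified as normal
tn = normal_correct                 # normal traffic classified as normal
fp = normal_total - normal_correct  # normal traffic flagged as attack

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Under this convention the accuracy rounds to 0.999, in line with the reported value.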

Multi-class Classification Model

The values on the diagonal of the confusion matrix represent the number of traffic instances correctly classified by the multi-class classification model. Of the 22614 instances of normal traffic, 22576 were correctly classified as such. Across all attack types, only 18 instances in total were classified as normal traffic, while all others were correctly classified as attack traffic.

The amount of misclassification between different attack types is also very low, the largest being 92 BFA instances classified as probe traffic. The results were compared with the CNN model, the LSTM model, the CNN+LSTM model and the GAN-based hybrid model of Ding et al. [15] that combined GAN, DAE and RF algorithms. These results indicate that the proposed model outperformed existing models, including the state-of-the-art hybrid model, in classifying multi-class data.

Figure 5.3: Confusion Matrix for Multi-class Classification Model

Discussion

Conclusion

Limitation

We expect the results to be better in settings similar to those used in the literature the model was compared against.

Future Work

References

"An Intrusion Detection System in Software-Defined Networks Using Machine and Deep Learning Techniques - A Comprehensive Survey," TechRxiv, 2021, preprint.
Karaarslan, "Using Machine Learning Algorithms for a Flow-Based Anomaly Detection System in Software-Defined Networks".
Keller, "Machine Learning-Based Ransomware Detection Using SDN," Proceedings of the 2018 ACM International Workshop on Security in Software-Defined Networks and Network Functions Virtualization, 2018.
Rao, "Security and privacy attacks during data communication in software-defined mobile clouds," Computer Communications, vol.
Javaid, "A deep learning-based DDoS detection system in software-defined networks (SDN)," EAI Endorsed Transactions on Security and Safety, vol.
Goswami, "Machine Learning Based Threat Aware System in Software Defined Networks," 26th International Conference on Computer Communication and Networks (ICCCN), p.