Deep Generative Learning Models for Cloud Intrusion Detection Systems


Ly Vu¹, Quang Uy Nguyen¹, Diep N. Nguyen², Dinh Thai Hoang², and Eryk Dutkiewicz²

¹Le Quy Don Technical University, Hanoi, Vietnam

²School of Electrical and Data Engineering, University of Technology Sydney, Australia


Abstract: Intrusion detection (ID) in the cloud environment has received considerable interest over the last few years. Among the latest approaches, machine learning-based ID methods allow us to discover unknown attacks. However, due to the lack of malicious samples, constructing a cloud ID system (IDS) that is robust to a wide range of unknown attacks remains challenging. In this paper, we propose a novel solution to enable robust cloud IDSs using deep neural networks. Specifically, we develop two deep generative models to synthesize malicious samples on the cloud systems. The first model, Conditional Denoising Adversarial AutoEncoder (CDAAE), is used to generate specific types of malicious samples. The second model is a hybrid of CDAAE with the K-Nearest Neighbor algorithm (CDAAE-KNN) to generate samples that further help improve the accuracy of a cloud IDS. The synthesized samples are merged with the original samples to form the augmented datasets. Three machine learning algorithms are trained on the augmented datasets and their effectiveness is analyzed. The experiments conducted on six popular IDS datasets show that our proposed techniques significantly improve the accuracy of the cloud IDSs compared with the baseline technique and the state-of-the-art approaches.

Keywords: intrusion detection, deep learning, generative models, Conditional Denoising Adversarial AutoEncoder, cloud systems.

1 INTRODUCTION

Due to its huge economic benefits, the cloud computing market has experienced unprecedented development over the last five years. Today the global cloud computing market is worth $180B in revenue, with annual market growth of 24 percent [1]. Nevertheless, the wide adoption of cloud computing has also left cloud systems vulnerable to many types of cyber attacks. The security of the cloud environment is therefore an increasing concern of both service providers and end-users [2]. Among several approaches to protecting cloud systems, Intrusion Detection Systems (IDSs) play a crucial role in detecting and preventing security attacks at an early stage [3], [4].

There are two main techniques for identifying malicious actions in the cloud environment: non-machine learning approaches and machine learning approaches [5], [6]. Non-machine learning approaches, including virtual machine-based methods and signature-based methods [5], [6], rely on the characteristics of malicious cloud behaviours to identify attacks. The advantages of these methods are their short processing time and their ability to accurately detect previously known attacks. Their shortcoming is that they depend heavily on knowledge of attack signatures and are unable to detect new or unknown types of attacks. Recently, machine learning-based approaches have been developed to overcome this limitation of non-machine learning approaches [5].

However, using machine learning to build a trustworthy and robust IDS on the cloud remains practically challenging. One of the main reasons is the lack of labeled malicious samples. In cloud environments, the majority of collected traffic samples are normal and only a few samples are intrusions. Consequently, most intrusion datasets on the cloud are imbalanced. When trained on such skewed datasets, a predictive model developed using conventional machine learning algorithms could be biased and hence inaccurate. This is because machine learning algorithms are usually designed to improve accuracy by reducing the overall error. Thus, they do not take into account the class distribution or the balance of classes.

A possible solution to the above imbalance problem of cloud IDS datasets is to collect more malicious samples. However, in cloud environments, collecting attack samples is extremely difficult due to the privacy and security concerns of cloud users [7]. Cloud providers tend to avoid divulging data that could compromise the privacy of their clients or privileged information about their systems. Therefore, re-sampling (over-sampling and under-sampling) methods are often used to balance the skewed datasets [8]. Nevertheless, these techniques have some intrinsic drawbacks. While under-sampling can potentially lose useful information, over-sampling is prone to over-fitting when exact copies of the minority class are replicated randomly. Moreover, over-sampling does not really solve the fundamental “lack of data” problem.

In this paper, we propose a novel solution to address the “lack of malicious data” problem of cloud IDS datasets. Specifically, we propose two models based on the Adversarial AutoEncoder (AAE) [9] to create synthesized malicious samples. The first model is the Conditional Denoising Adversarial AutoEncoder (CDAAE), which can generate data samples of any specific attack. The second model is a hybrid between CDAAE and the K-Nearest Neighbor algorithm (CDAAE-KNN) to generate malicious samples that further improve the classification performance. The synthesized malicious data is merged with the original data to form the augmented training datasets. Three conventional machine learning algorithms are trained on the augmented datasets and their performance (or effectiveness) is tested on six cloud IDS datasets. The experimental results show that our methods significantly improve the accuracy of machine learning-based IDSs compared to the state-of-the-art methods [9]–[13].

The rest of the paper is organized as follows. In Section 2, we review typical techniques used to detect attacks on the cloud and the approaches for handling imbalanced datasets. Section 3 presents the fundamental background of our approach. Our proposed model is discussed in Section 4. In Section 5, we describe the datasets used to evaluate our proposed solutions. The performance of CDAAE and CDAAE-KNN is analyzed in Section 6. Finally, conclusions and future work are presented in Section 7.

2 RELATED WORK

This section presents a brief review of the related work on cloud IDSs and the techniques for handling the imbalance problem.

2.1 Intrusion Detection Systems on the Cloud Environment

The main purpose of cloud IDSs is to identify malicious behaviours on the cloud at an early stage. In the cloud environment, the Distributed Denial of Service (DDoS) attack is one of the most serious attacks, affecting budget, resource management, and service quality. Thus, many solutions have been proposed to detect and mitigate DDoS attacks. Sahi et al. [14] proposed a new approach for detecting and preventing DDoS TCP flood attacks. They used two packet attributes, the source IP address and the destination IP address, to classify whether the sources are associated with malicious users. Kira et al. [15] presented a two-step method to detect DDoS attacks. First, they built a filter based on DDoS signatures to identify abnormal packets. Second, a clustering algorithm is used to detect the abnormal packets that pass the filter in the first step.

Another common attack on the cloud is the cache-based side-channel attack (CSCA). Because expensive hardware resources are shared among many customers, these attacks are more feasible in the cloud environment than in a traditional network. To handle this, Chouhan et al. [16] developed a detection technique using a Bloom Filter, which keeps the performance overhead to a minimum. This technique was shown to be effective in predicting the cache behaviour caused by CSCA. Additionally, a method to detect Structured Query Language (SQL) injections was introduced to prevent unauthorized access to resources in the cloud environment [17]. This method analyzes the syntax rules of SQL statements in order to generate a rule tree and uses the generated tree to detect attacks.

The limitation of the existing intrusion detection methods in the cloud environment is that they cannot effectively detect various intrusions simultaneously. Most of the existing solutions aim to identify only one specific attack. To mitigate this limitation, machine learning-based IDSs for the cloud environment have recently received great attention [18]–[20]. Nguyen et al. [18] proposed an ensemble model of the Principal Component Analysis (PCA) technique and the Restricted Boltzmann Machine (RBM) to enhance the detection rate in identifying attacks in the mobile cloud computing environment. In this model, PCA is used for dimension reduction and RBM is applied for feature selection. However, using both dimension reduction and feature selection might increase the running cost of predicting a new data sample. Pandeeswari and Kumar [19] proposed an anomaly detection system for the hypervisor layer on the cloud. Their model is a hybrid of the Fuzzy C-Means clustering algorithm and an Artificial Neural Network (FCM-ANN), in which Fuzzy C-Means is used to divide the dataset into different clusters of similar samples and an ANN is trained on each of these clusters. The results from the ANN models are then aggregated to form the final prediction score. Dey et al. [20] proposed a machine learning-based intrusion detection scheme for mobile clouds based on data fusion. They designed a two-layered IDS including multi-layer traffic screening and decision-based Virtual Machine (VM) selection. The experimental results showed that their solution is effective in detecting DDoS and Man-in-the-Middle (MITM) attacks.

2.2 Handling the Imbalance Problem

Techniques for handling the imbalance problem can be grouped into two categories: cost-sensitive learning and data re-sampling. The first group operates at the algorithmic level by assigning higher misclassification costs to the minority class than to the majority class [21]–[24]. Zhang et al. [22] divided the data into smaller balanced subsets using an intelligent sampling technique. After that, cost-sensitive SVM learning is applied to deal with the imbalance problem. Li et al. [24] proposed an approach for dealing with imbalanced data by assigning a higher weight to the minority class while training an ensemble machine learning algorithm (AdaBoost). Recently, a number of new approaches have been proposed to address the imbalance problem in deep neural networks. Chung et al. [23] introduced a new regression loss function for multiclass deep learning to handle the imbalance problem. Wang et al. [25] and Raj et al. [26] proposed a loss function that calculates the error for each class independently and then averages the per-class errors. Khan et al. [21] proposed a cost-sensitive deep neural network which can automatically learn robust feature representations for both the majority and minority classes.

The second group aims to alter the training data distribution to balance the majority and minority classes, usually by randomly under-sampling the majority class or over-sampling the minority class [8], [27]. However, these techniques have some potential downsides. While under-sampling can lose useful information about the majority class data, over-sampling does not actually add any information since exact copies of the minority class are replicated randomly [27]. To overcome this limitation, the Synthetic Minority Over-sampling Technique (SMOTE) [10] generates synthesized minority samples by extrapolating and interpolating minority samples from their neighbors. This has the effect of creating clusters around each minority observation. An extension of SMOTE is SMOTE-SVM [28], which synthesizes samples located on the borderline between classes by only generating samples for the support vectors of an SVM model trained on the original dataset.
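The interpolation idea behind SMOTE can be illustrated with a minimal sketch. It is not the reference implementation of [10]: it only interpolates (no extrapolation), and the function name `smote_like_sample`, the toy points, and the neighbourhood size `k=2` are assumptions made for illustration.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours (SMOTE-style sketch)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical two-feature minority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
```

Because the synthetic point is a convex combination of two existing minority samples, it always lies on the segment between them, which is why such samples tend to cluster around the original observations.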

Ensemble methods (hybrid methods) combine a re-sampling method with a classifier to discover the distribution of the majority and minority classes. BalanceCascade (BC) [11] develops an ensemble of classifiers to systematically select which majority class samples to under-sample. A number of classifiers are applied to sub-datasets which contain majority and minority samples at a balanced rate. The majority samples are removed if they are classified correctly by the classifiers. EasyEnsemble [11] learns different aspects of the original majority class in an unsupervised manner. First, a number of balanced training sets are created using a random under-sampling technique. Next, a model is trained on each dataset and the predictions are combined as in the bagging technique [29]. Tomek Link [30] removes samples from the negative class that are close to the positive region in order to return a dataset that presents a better separation between the two classes.

Although SMOTE-SVM, BalanceCascade, and EasyEnsemble have been widely adopted and their effectiveness is well evidenced [8], [31], the shortcoming of these methods is that the distribution of the original data can be lost [32]. In other words, the generated data samples may have a very different distribution from the original data. Consequently, the performance of machine learning algorithms trained on the augmented dataset may be hurt. In this paper, we propose a novel method for generating synthesized malicious samples of various attacks in the cloud environment. In particular, the synthesized samples of our method are shown to be more strongly correlated with the original data distribution than those of the previous approaches. The detailed description of our method is presented in Section 4.

3 FUNDAMENTAL BACKGROUND

This section describes the three fundamental generative models underlying our proposed method: the Generative Adversarial Network, the Variational AutoEncoder, and the Adversarial AutoEncoder.

3.1 Generative Adversarial Network

A Generative Adversarial Network (GAN) [33] has two neural networks which are trained in an opposite way (Fig. 1(a)). The first neural network is a generator ($Ge$) and the second neural network is a discriminator ($Di$). The discriminator $Di$ is trained to maximize the difference between a fake sample $\tilde{x}$ (which comes from the generator) and a real sample $x$ (which comes from the original data). The generator $Ge$ takes a noise sample $z$ as input and outputs a fake sample $\tilde{x}$. It aims to fool the discriminator $Di$ by minimizing the difference between $\tilde{x}$ and $x$.

$$L_{GAN} = \mathbb{E}_x[\log Di(x)] + \mathbb{E}_z[\log(1 - Di(Ge(z)))]. \tag{1}$$

The loss function of GAN is presented in Eq. (1), in which $Di(x)$ is the probability that $Di$ predicts that a real data instance $x$ is real, $Ge(z)$ is the output of $Ge$ when given noise $z$, $Di(Ge(z))$ is the probability that $Di$ predicts that a fake instance $Ge(z)$ is real, and $\mathbb{E}_x$ and $\mathbb{E}_z$ are the expected values (average values) over all real and fake instances, respectively. $Di$ is trained to maximize this equation while $Ge$ tries to minimize its second term. After the training process, the generator ($Ge$) of GAN can be used to generate synthesized data samples for IDS datasets. However, since the two neural networks are trained in opposite ways, there is no guarantee that both networks converge simultaneously [34], [35]. As a result, GAN is often difficult to train.
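To make Eq. (1) concrete, the following sketch estimates the GAN objective from batches of discriminator outputs. The function name `gan_loss` and the probability values are hypothetical; in practice the outputs would come from a trained $Di$.

```python
import math

def gan_loss(d_real, d_fake):
    """Monte Carlo estimate of Eq. (1) from discriminator outputs in (0, 1).

    d_real: Di(x) for real samples; d_fake: Di(Ge(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well (hypothetical outputs).
confident = gan_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# A discriminator that is fully fooled outputs 0.5 everywhere.
fooled = gan_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The value is higher for the confident discriminator than for the fooled one, which matches the description above: $Di$ pushes the objective up, while $Ge$ pushes its second term down.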

An extension of GAN is the Auxiliary Classifier Generative Adversarial Network (ACGAN), which can generate samples for a specific label [12]. ACGAN generates samples taking into account external information, i.e., class labels. The class label information concatenated with the random noise $z$ is used to force $Ge$ to generate a particular type of data sample.

Fig. 1: Structure of generative models: (a) Generative Adversarial Network (GAN), (b) Variational AutoEncoder (VAE), and (c) Adversarial AutoEncoder (AAE).

3.2 Variational AutoEncoder

A Variational AutoEncoder (VAE) [36] also consists of two parts: an encoder and a decoder (Fig. 1(b)). The encoder takes a data point $x$ as input and outputs a latent representation, or bottleneck, $z$. The latent representation of the encoder, denoted as $q(z|x)$, is represented by a Gaussian probability density ($\mu$ is the mean and $\sigma$ the standard deviation). We can sample from this distribution to get noisy values of the representation $z$. The encoder aims to learn an efficient compression of the data into the lower-dimensional latent representation. The decoder, denoted by $p(\tilde{x}|z)$, takes the latent vector $z$ as input and outputs $\tilde{x}$. The VAE is trained to reconstruct the input at the output. In other words, $\tilde{x}$ is expected to be similar to $x$. The VAE loss function includes two terms, as shown in Eq. (2). The total loss of the VAE is the summation of the loss at each data point.

$$L_{VAE} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}(q(z|x)\,\|\,p(z)). \tag{2}$$

The first term in Eq. (2) is the reconstruction error (RE), or the expected negative log-likelihood of $x$. The expectation is estimated by sampling $z$ from the latent representation of the VAE given the input sample $x$. If the decoder's output does not reconstruct the original data well, it incurs a large cost in this loss term. The second term is a regularizer. This is the Kullback-Leibler (KL) divergence between the encoder's distribution $q(z|x)$ and the expected distribution $p(z)$. This divergence measures how close $q$ is to $p$ [37]. In our implementation, $p(z)$ is specified as a standard normal distribution with mean ($\mu$) zero and variance ($\sigma^2$) one, denoted by $p(z) = N(0,1)$. If the encoder outputs representations $z$ that differ from a standard normal distribution, it receives a penalty in the loss. An extension of VAE is the Conditional VAE (CVAE) [13], which uses the label information in the input of the encoder. The input of the encoder of CVAE is a concatenation of a class label with sample $x$ instead of only sample $x$. The advantage of CVAE is that it can be used to generate data samples for any specific class.
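When the encoder distribution is a diagonal Gaussian and the prior is $N(0,1)$, the KL regularizer in Eq. (2) has the well-known closed form $\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1)$. The sketch below evaluates it; the diagonal-covariance assumption, the function name `kl_to_standard_normal`, and the example values are illustrative and not taken from the paper's implementation.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian.

    mu and sigma are per-dimension means and standard deviations.
    """
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

# An encoder output that already matches the prior incurs no penalty.
no_penalty = kl_to_standard_normal(mu=[0.0, 0.0], sigma=[1.0, 1.0])
# Shifting one latent mean away from zero is penalized.
shifted = kl_to_standard_normal(mu=[2.0, 0.0], sigma=[1.0, 1.0])
```

This illustrates the penalty described above: the further the encoder's output distribution drifts from the standard normal prior, the larger the regularization term.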

3.3 Adversarial AutoEncoder

One drawback of VAE is that it uses the KL divergence to impose a prior, $p(z)$, on the latent space. This requires $p(z)$ to be a Gaussian distribution. In other words, we need to assume that the original data follows the Gaussian distribution. The Adversarial AutoEncoder (AAE) avoids using the KL divergence to impose the prior by using adversarial learning instead. This allows the latent space, $p(z)$, to be learned from any distribution [9].

Fig. 1(c) shows how AAEs work in detail. Training an AAE has two phases: reconstruction and regularization. In the reconstruction phase (RE), a latent sample $\tilde{z}$ is drawn from the generator $Ge$. Sample $\tilde{z}$ is then sent to the decoder (denoted by $p(x|\tilde{z})$), which generates $\tilde{x}$ from $\tilde{z}$. The reconstruction loss is computed as the error between $x$ and $\tilde{x}$ (Eq. (3)), and it is used to update the encoder and the decoder.

$$L_{RE} = -\mathbb{E}_x[\log p(x|\tilde{z})]. \tag{3}$$

In the regularization phase (RG), the discriminator receives $\tilde{z}$ from the generator $Ge$ and $z$ sampled from the true prior $p(z)$¹. The generator tries to make the fake sample, $\tilde{z}$, similar to the real sample, $z$, by minimizing the second term in Eq. (4). The discriminator then attempts to distinguish between $\tilde{z}$ and $z$ by maximizing this equation. An interesting note here is that the generator portion of the adversarial network is also the encoder portion of the AutoEncoder. Therefore, the training process in the regularization phase is the same as that of GAN.

$$L_{RG} = \mathbb{E}_z[\log Di(z)] + \mathbb{E}_x[\log(1 - Di(En(x)))]. \tag{4}$$

An extension of AAE is the Conditional Adversarial AutoEncoder (CAAE) [9], where the label information is concatenated with the latent representation ($z$) to form the input of $Di$. The class label information allows CAAE to generate data samples for a specific class. Another version of AAE is the Denoising AAE (DAAE) [38], which attempts to match the intermediate conditional probability distribution $q(\tilde{z}|x_{noise})$ with the prior $p(z)$, where $x_{noise}$ is the corrupted version of the input $x$. Because DAAE reconstructs the original input data from the corrupted version of the input data (input data with noise), its latent representation is often more robust than the representation in AAE [39].

1. Although AAE can learn from any distribution, in this paper, we implemented a simple version in which $p(z)$ is specified as a standard normal distribution, $N(0,1)$.

4 PROPOSED SOLUTION

This section proposes two models for generating malicious samples for the cloud IDSs: CDAAE and CDAAE-KNN.

4.1 Conditional Denoising Adversarial AutoEncoder

The DAAE model presented in Section 3 is trained only in the unsupervised learning mode, which does not consider the class labels of data samples. Therefore, this model cannot be used to generate data samples for a specific class. To address this issue, we propose a Conditional Denoising Adversarial AutoEncoder (CDAAE), which can generate data samples of any specific class. This model will be used to generate malicious samples for each type of attack in the cloud environment. Fig. 2 describes CDAAE, which is an ensemble of a Denoising AutoEncoder (DAE)² and an Adversarial Network. In CDAAE, $x_{noise}$ is generated from $x$ using the Gaussian corruption function [38] as follows:

$$f(x_{noise}|x) = N(x, \sigma^2 I), \tag{5}$$

where the mean and standard deviation of the Gaussian noise are $x$ and $\sigma$, respectively.
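As a minimal illustration of Eq. (5), the sketch below corrupts a feature vector with independent Gaussian noise centred on each feature. The helper name `gaussian_corrupt`, the example vector, and the value `sigma=0.1` are assumptions for illustration; the noise level used in the paper's experiments is not stated in this section.

```python
import random

def gaussian_corrupt(x, sigma, rng):
    """Draw x_noise ~ N(x, sigma^2 I): independent Gaussian noise per feature."""
    return [rng.gauss(v, sigma) for v in x]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
x = [0.2, 0.5, 0.9]              # hypothetical normalized traffic features
x_noise = gaussian_corrupt(x, sigma=0.1, rng=rng)
unchanged = gaussian_corrupt(x, sigma=0.0, rng=rng)  # no noise: x is returned
```

The encoder then receives `x_noise` rather than `x`, while the reconstruction target remains the clean `x`.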

Fig. 2: Structure of CDAAE.

Two networks in CDAAE are jointly trained in two phases, i.e., the reconstruction phase and the regularization phase. In the reconstruction phase, the encoder of CDAAE ($En$) receives two inputs, i.e., a noisy data sample $x_{noise}$ and a class label $y$, and generates the latent representation, or bottleneck, $\tilde{z}$. The latent representation $\tilde{z}$ is used to guide the decoder of CDAAE to reconstruct the input at its output from a desired distribution, i.e., the standard normal distribution. Let $W_{En}$, $W_{De}$, $b_{En}$, and $b_{De}$ be the weight matrices and the bias vectors of $En$ and $De$, respectively, and $x = \{x^1, x^2, \ldots, x^n\}$ be a training dataset, where $n$ is the number of training data samples. The computations of $En$ and $De$ based on each training sample are presented in Eq. (6) and Eq. (7), respectively.

$$\tilde{z}^i = f_{En}(W_{En}(x^i_{noise}|y^t) + b_{En}), \tag{6}$$

$$\tilde{x}^i = f_{De}(W_{De}(z^i|y^t) + b_{De}), \tag{7}$$

where $|$ is the concatenation operator and $f_{En}$ and $f_{De}$ are the activation functions of the encoder and the decoder, respectively. $y^t$ is the label of the data sample $x^i$, and $x^i_{noise}$ is generated from $x^i$ by Eq. (5). The reconstruction phase aims to reconstruct the original data $x$ from the corrupted version $x_{noise}$ by minimizing the reconstruction error ($R_{loss}$) in Eq. (8).

2. Using a Denoising AutoEncoder instead of an AutoEncoder allows CDAAE to learn more robust features from the original data [39].

$$R_{loss} = \frac{1}{n}\sum_{i=1}^{n}\left(x^i - \tilde{x}^i\right)^2. \tag{8}$$
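The conditioning and the reconstruction error can be sketched in a few lines. This is an illustration under stated assumptions: the label is encoded as a one-hot vector (the encoding is not specified in this section), the helper names are hypothetical, and the squared error of a vector is taken as the sum of squared per-feature differences.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list (assumed encoding)."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]

def concat_with_label(vector, label, num_classes):
    """The `|` operator in Eqs. (6) and (7): append the label encoding."""
    return list(vector) + one_hot(label, num_classes)

def reconstruction_loss(originals, reconstructions):
    """Eq. (8): mean over samples of the squared error between each original
    sample and its reconstruction."""
    per_sample = [sum((a - b) ** 2 for a, b in zip(x, x_rec))
                  for x, x_rec in zip(originals, reconstructions)]
    return sum(per_sample) / len(per_sample)

# Hypothetical noisy sample with two features, belonging to class 1 of 3.
encoder_input = concat_with_label([0.25, 0.75], label=1, num_classes=3)
loss = reconstruction_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [0.0, 2.0]])
```

Here `encoder_input` has five entries (two features followed by three label slots), and the two toy reconstructions have squared errors of 1 and 4, so their mean is 2.5.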

In the regularization phase, an adversarial network is used to regularize the hidden representation $\tilde{z}$ of CDAAE. The generator ($Ge$) of the adversarial network, which is also the encoder of the Denoising AutoEncoder ($En$), tries to generate the latent variable $\tilde{z}$ that is similar to sample $z$ drawn from a standard normal distribution, $p(z) = N(z|0,1)$. We define $d_{real}$ and $d_{fake}$ as the outputs of $Di$ with inputs $z$ and $\tilde{z}$, respectively. For each data point $x^i$, $d^i_{real}$ and $d^i_{fake}$ are calculated as follows:

$$d^i_{real} = f_{Di}(W_{Di} z^i + b_{Di}), \tag{9}$$

$$d^i_{fake} = f_{Di}(W_{Di} \tilde{z}^i + b_{Di}), \tag{10}$$

where $f_{Di}$ is the activation function of $Di$, and $W_{Di}$ and $b_{Di}$ are the weight matrix and the bias vector of $Di$, respectively.

The discriminator ($Di$) attempts to distinguish the true distribution $z \sim p(z)$ from the latent variable $\tilde{z}$ by minimizing Eq. (12), whereas the generator $Ge$ (or $En$) tries to generate $\tilde{z}$ to fool the discriminator $Di$ by maximizing Eq. (11).

$$G_{loss} = \frac{1}{n}\sum_{i=1}^{n}\log(d^i_{fake}), \tag{11}$$

$$D_{loss} = -\frac{1}{n}\sum_{i=1}^{n}\left(\log(d^i_{real}) + \log(1 - d^i_{fake})\right). \tag{12}$$

The total loss function of CDAAE is the combination of the three losses above, as in Eq. (13), and CDAAE is trained to minimize this loss function.

$$L_{CDAAE} = R_{loss} + D_{loss} - G_{loss}. \tag{13}$$
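The way the three terms combine can be checked numerically with a short sketch. The discriminator outputs and the reconstruction loss value below are hypothetical numbers, and the function names are chosen for illustration only.

```python
import math

def generator_loss(d_fake):
    """Eq. (11): mean log-probability that Di labels generated codes as real."""
    return sum(math.log(p) for p in d_fake) / len(d_fake)

def discriminator_loss(d_real, d_fake):
    """Eq. (12): negative mean log-likelihood of Di's real/fake decisions."""
    terms = [math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)]
    return -sum(terms) / len(terms)

def cdaae_loss(r_loss, d_real, d_fake):
    """Eq. (13): R_loss + D_loss - G_loss."""
    return r_loss + discriminator_loss(d_real, d_fake) - generator_loss(d_fake)

# Hypothetical batch in which Di cannot tell real from generated codes.
d_real, d_fake = [0.5, 0.5], [0.5, 0.5]
total = cdaae_loss(r_loss=0.25, d_real=d_real, d_fake=d_fake)
```

Subtracting $G_{loss}$ in Eq. (13) turns the generator's maximization of Eq. (11) into a minimization, so a single minimized objective covers all three terms.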

4.2 Borderline Sampling with CDAAE-KNN

In machine learning, data samples around the borderline between classes are often prone to being misclassified. Therefore, these samples are more important for estimating the optimal decision boundary. Generating synthesized samples in this area is potentially more beneficial than generating them across the whole minority class [40]. Nguyen et al. [28] used Support Vector Machines (SVMs) to learn the boundary between classes and over-sampled the minority samples around this boundary. However, when dealing with imbalanced datasets, the class boundary learned by SVMs may be skewed toward the minority class, thereby increasing the misclassification rate of the minority class [41].

In this paper, we determine the borderline samples using the K-nearest neighbor algorithm (KNN). Specifically, we propose a hybrid model between CDAAE and KNN (abbreviated as CDAAE-KNN) for generating malicious data for cloud IDSs. In this model, KNN is used to select the samples generated by CDAAE that are close to the borderline.

Algorithm 1: CDAAE-KNN algorithm.

INPUT:
  X: original training set
  X̃: generated data samples
  k: number of nearest neighbours
  α₁, α₂: thresholds on the numbers of majority and minority neighbours
  d(x, y): Euclidean distance between vector x and vector y
OUTPUT:
  X_new: new sampling set
BEGIN:
  1. Train CDAAE on X to obtain the trained decoder network (De)
  2. Set X̃ to contain the minority samples generated by De
  3. Run the KNN algorithm on the set X ∪ X̃
  4. Set X_new = X
  for each x_i ∈ X̃ do
    - Set m = number of majority class samples in the k nearest neighbours
    - Set n = number of minority class samples in the k nearest neighbours
    if m ≥ α₁ and n ≥ α₂ then
      X_new = X_new ∪ {x_i}
    end if
  end for
  return X_new
END.

Algorithm 1 describes the details of CDAAE-KNN. The algorithm is divided into two phases: generation and selection. In the generation phase (Steps 1 and 2), we train a CDAAE model on the original dataset $X$. Then, we use the decoder network $De$ of CDAAE to generate new samples for the minority classes. The generated dataset is called $\tilde{X}$. In the selection phase, the KNN algorithm is executed on the sample set $X \cup \tilde{X}$ to find the $k$ nearest neighbours of the samples $x_i \in \tilde{X}$. For each $x_i$, we calculate the number of nearest samples which belong to minority classes, $n$, and the number of nearest samples which belong to majority classes, $m$. If $m$ and $n$ are at least the thresholds $\alpha_1$ and $\alpha_2$, respectively, the sample $x_i$ is considered a borderline sample and is added to the output set $X_{new}$. In our paper, parameters $\alpha_1$, $\alpha_2$, and $k$ are chosen using the grid search technique, in which the validation set is formed by randomly drawing 10% of the samples from the training set. The range of values searched is presented in Table 1. Using the grid search technique, the best parameter values for $\alpha_1$, $\alpha_2$, and $k$ are 5, 5, and 50, respectively.

Parameters   Values
α₁           2, 5, 10, 15, 20
α₂           2, 5, 10, 15, 20
k            20, 30, 40, 50, 60

TABLE 1: Parameter ranges of α₁, α₂, and k for the grid search technique.
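The selection phase of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative brute-force version, not the released implementation: the function name `select_borderline`, the label encoding (0 for the majority class, 1 for the minority class), the exclusion of each sample from its own neighbour list, and the toy data are all assumptions.

```python
import math

def select_borderline(original, generated, k, alpha1, alpha2):
    """Keep generated minority samples whose k nearest neighbours contain at
    least alpha1 majority-class and alpha2 minority-class samples.

    `original` and `generated` are lists of (features, label) pairs with
    label 0 = majority class and 1 = minority class (a hypothetical encoding).
    """
    pool = original + generated
    selected = []
    for idx, (x, _) in enumerate(generated):
        self_pos = len(original) + idx  # position of this sample inside pool
        neighbours = sorted(
            (math.dist(x, other), label)
            for pos, (other, label) in enumerate(pool) if pos != self_pos
        )[:k]
        m = sum(1 for _, label in neighbours if label == 0)  # majority count
        n = sum(1 for _, label in neighbours if label == 1)  # minority count
        if m >= alpha1 and n >= alpha2:
            selected.append((x, 1))
    return original + selected

# Toy data: two majority points, two minority points, two generated samples.
original = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([1.0, 1.0], 1), ([1.1, 1.0], 1)]
generated = [([0.5, 0.5], 1), ([5.0, 5.0], 1)]
x_new = select_borderline(original, generated, k=4, alpha1=2, alpha2=2)
```

In this toy setting, the generated sample lying between the two classes is kept, while the one far from the majority class is discarded because too few of its neighbours are majority samples. For realistic dataset sizes, a spatial index or a library KNN implementation would replace the brute-force sort.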

A potential problem with CDAAE-KNN is that this approach may generate misclassified samples, i.e., the generated samples for the minority class may be closer to the majority class than to the minority class. We conducted an experiment to examine whether this happens by counting the neighbor samples of the borderline samples in CDAAE-KNN. According to the results reported in Table 11, most of the neighbors (about 85%) of the borderline samples belong to the same class as the borderline samples. Therefore, the probability that CDAAE-KNN generates misclassified samples is relatively low.

5 EXPERIMENTAL SETTINGS AND EVALUATION METHODS

This section describes the datasets and the experimental settings used in our paper.

5.1 Datasets

To evaluate the effectiveness of the proposed methods, we performed experiments on six well-known IDS datasets, including one cloud IDS dataset, two network IDS datasets (NSL-KDD and UNSW-NB15), and three malware datasets from the CTU-13 dataset system. These datasets were used in previous research on network IDSs and cloud IDSs [18]. The number of samples in each class of these datasets is presented in Table 2, Table 3, and Table 4.

TABLE 2: Number of data samples of the cloud IDS dataset.

Classes   Slowloris   TCP Land   Ping of Death
Normal    3077        3072       3070
Attack    293         34         48

5.1.1 Cloud IDS Dataset

This dataset was published by Kumar et al. [42] as a part of developing their DoS attack on the cloud environment. They simulated an Infrastructure as a Service cloud environment using an open-source cloud platform, Eucalyptus. They launched 15 normal virtual machines (VMs), 1 target VM, and 3 malicious VMs. The traffic between each VM was captured and 4 IP-header fields were recorded (i.e., source, destination, transferred bytes, and protocol type). The payload information is ignored. The extracted dataset contains 24 time-based traffic flow features. In our experiments, we selected three types of DoS attacks from this dataset: Ping-of-Death, TCP Land, and Slowloris.

TABLE 3: Number of data samples of the network IDS datasets.

NSL-KDD                  UNSW-NB15
Classes      Number      Classes            Number
Normal       67373       Normal             37000
Attack       58630       Attack             45332
  DoS        45927         Generic          18871
  U2R        52            Exploits         11132
  R2L        995           Fuzzers          6062
  Probing    11656         DoS              4089
                           Reconnaissance   3496
                           Analysis         677
                           Backdoor         583
                           Shellcode        378
                           Worms            44

TABLE 4: Number of data samples of the malware datasets.

Classes   Menti    NSIS.ay   Virus
Normal    518904   292485    37000
Malware   230      1420      37000

5.1.2 NSL-KDD

NSL-KDD is an IDS dataset [43] which was created to solve some intrinsic problems of the KDD’99 dataset. The NSL-KDD dataset contains 148517 records in total, divided into a training set (125973 data samples) and a testing set (22544 data samples). Each sample has 41 features and is labeled as either a type of attack or normal. The training set contains 24 attack types, and the testing set includes 14 additional types of attacks. The simulated attack samples belong to one of the following four categories: DoS, R2L, U2R, and Probing.

5.1.3 UNSW-NB15

UNSW-NB15 was created using the IXIA PerfectStorm tool as a synthetic environment in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) [44]. The training set and the testing set contain 175,341 and 82,332 records, respectively. There are nine categories of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Each data sample has 49 features, which were generated using the Argus and Bro-IDS tools and twelve algorithms that analyze the characteristics of network packets.

5.1.4 CTU-13

The CTU-13 dataset is a publicly available malware dataset that was captured at the Czech Technical University (CTU), Czech Republic, in 2011 [45]. The data includes normal traffic combined with many kinds of real-world botnets. There are thirteen botnet scenarios, and each of them involves a specific type of malware with several protocols and different actions. We chose three scenarios in the dataset that correspond to three kinds of malware: Menti, NSIS.ay, and Virus. Each scenario forms a tested dataset with 14 features.

5.2 Experimental Settings

The source code of the tested methods in this paper is available for download³. The generative models were implemented using a popular deep learning framework, TensorFlow. The same network structure is used for all generative models. In each model, the decoder and the discriminator have two layers, and the encoder has three layers including the latent layer. The size of the latent layer is set based on the number of features of the datasets. Let $n$ be the dimension of the input data; then the number of neurons in the latent layer is $\lceil 1 + \sqrt{n} \rceil$ [7]. All networks were trained using Adam optimization [36] with a learning rate of $10^{-3}$. After training, the decoders of CVAE, CAAE, and CDAAE and the generator of ACGAN are used to generate synthesized malicious samples for the tested datasets.

3. https://github.com/lyvt/cdaae knn.

The synthesized samples are then merged with the original data to form the augmented datasets. Three popular classification algorithms, i.e., SVM, decision tree (DT), and random forest (RF), are trained on the augmented datasets. We use the implementations of these algorithms in Scikit-learn [46], a popular machine learning package in Python. To minimize the impact of experimental parameters on the performance of the classification algorithms, we apply a grid search to each classifier. The range of values for each parameter tuned by the grid search is shown in Table 5. In our experiments, we used the default settings of the Scikit-learn library, in which 5-fold cross-validation is used to perform the grid search.

TABLE 5: Values of grid search for classifiers.

| Classifiers | Parameters |
|---|---|
| SVM | kernel = rbf; gamma = 0.001, 0.01, 0.1, 1.0 |
| DT | max_depth = 5, 6, 7, 8, 9, 10, 50, 100 |
| RF | n_estimators = 5, 10, 20, 40, 80, 150 |
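The grid search over the values in Table 5 could be set up roughly as follows. This is a minimal sketch, not the authors' released code: the helper names `build_searches` and `count_candidates` are hypothetical, the estimators keep their Scikit-learn defaults apart from the tuned parameters, and the scoring function is left at the library default because the text does not specify one.

```python
# Parameter grids copied from Table 5; everything else here is an assumption.
PARAM_GRIDS = {
    "SVM": {"kernel": ["rbf"], "gamma": [0.001, 0.01, 0.1, 1.0]},
    "DT": {"max_depth": [5, 6, 7, 8, 9, 10, 50, 100]},
    "RF": {"n_estimators": [5, 10, 20, 40, 80, 150]},
}


def build_searches(cv=5):
    """Return one GridSearchCV object per classifier (hypothetical helper)."""
    # Imported lazily so the grids above can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    estimators = {
        "SVM": SVC(),
        "DT": DecisionTreeClassifier(),
        "RF": RandomForestClassifier(),
    }
    return {
        name: GridSearchCV(estimators[name], PARAM_GRIDS[name], cv=cv)
        for name in estimators
    }


def count_candidates(grid):
    """Number of parameter combinations a grid search would evaluate."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total
```

With an augmented training set `X_train`, `y_train` (placeholder names), `build_searches()["RF"].fit(X_train, y_train)` would run the 5-fold search for the random forest, and `best_params_` would then hold the selected setting.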

Except for NSL-KDD and UNSW-NB15, which are already divided into training and testing sets, the other tested datasets are divided as follows: training set (70%) and testing set (30%). The number of samples of each attack type is shown in Tables 2 to 4. These tables show that the datasets are highly imbalanced: there are very few attack samples compared to normal samples. The two network IDS datasets (NSL-KDD and UNSW-NB15) are processed to form multi-class classification problems where each attack type is one class. The other datasets are treated as binary classification problems.

To find appropriate values for the hyper-parameters of our proposed techniques, we also applied a grid search in which 10% of the training dataset is used for validation. Since CAAE, CDAAE and CDAAE-KNN have the same network architecture, we only applied the grid search to the CAAE model and then transferred the best values of CAAE to CDAAE and CDAAE-KNN. In CAAE, we tuned two main hyper-parameters, i.e., the number of hidden layers of each part (i.e., En, De, and Di) and the number of neurons in the hidden layers. Moreover, En and De have a symmetric architecture, thus only the hyper-parameters of En need to be selected. The range of the values tuned by the grid search in CAAE is shown in Table 6. The number of hidden layers and the number of neurons at each hidden layer of CAAE found by the grid search are presented in Table 7, and these values are used in both CDAAE and CDAAE-KNN.

TABLE 6: Hyper-parameter ranges for the grid search technique for each dataset.

| Datasets | Hyper-Parameters | Values |
|---|---|---|
| Clouds | Layers-En | 1, 2, 3, 4 |
| Clouds | Layers-Di | 1, 2, 3, 4 |
| Clouds | Neuron-En | 20, 15, 12, 8 |
| Clouds | Neuron-Di | 10, 20, 40, 60 |
| NSL-KDD | Layers-En | 1, 2, 3, 4 |
| NSL-KDD | Layers-Di | 1, 2, 3, 4 |
| NSL-KDD | Neuron-En | 35, 25, 20, 15 |
| NSL-KDD | Neuron-Di | 10, 20, 40, 60 |
| UNSW-NB15 | Layers-En | 1, 2, 3, 4 |
| UNSW-NB15 | Layers-Di | 1, 2, 3, 4 |
| UNSW-NB15 | Neuron-En | 35, 25, 20, 15 |
| UNSW-NB15 | Neuron-Di | 10, 20, 40, 60 |
| CTU13s | Layers-En | 1, 2, 3, 4 |
| CTU13s | Layers-Di | 1, 2, 3, 4 |
| CTU13s | Neuron-En | 12, 10, 8, 6 |
| CTU13s | Neuron-Di | 10, 20, 40, 60 |

TABLE 7: The best hyper-parameters found by the grid search for CAAE.

| Datasets | Dimension | En | De | Di |
|---|---|---|---|---|
| Clouds | 24 | 24-12-6 | 12-24 | 6-20-40 |
| NSL-KDD | 41 | 41-15-8 | 15-41 | 8-20-40 |
| UNSW-NB15 | 49 | 49-25-8 | 25-49 | 8-20-40 |
| CTU13s | 14 | 14-10-5 | 10-14 | 5-20-40 |
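The layer shapes in Table 7 are consistent with two rules stated in the text: the latent layer has $\lceil 1 + \sqrt{n} \rceil$ neurons, and the decoder mirrors the encoder. The short sketch below checks both rules against the table; the function names `latent_size` and `mirror_decoder` are illustrative and do not come from the authors' code.

```python
import math


def latent_size(n_features):
    """Latent-layer width ceil(1 + sqrt(n)) as stated in Section 5.2."""
    return math.ceil(1 + math.sqrt(n_features))


def mirror_decoder(encoder_sizes):
    """Decoder widths obtained by reversing the encoder without its latent layer."""
    return list(reversed(encoder_sizes[:-1]))


# Encoder shapes as listed in Table 7 (input - hidden - latent).
TABLE7_ENCODERS = {
    "Clouds": [24, 12, 6],
    "NSL-KDD": [41, 15, 8],
    "UNSW-NB15": [49, 25, 8],
    "CTU13s": [14, 10, 5],
}

for name, enc in TABLE7_ENCODERS.items():
    # The last encoder entry should equal the latent size computed from the input.
    print(name, "latent:", latent_size(enc[0]), "decoder:", mirror_decoder(enc))
```

Only the hidden-layer width of the encoder is then left for the grid search, which matches the single Neuron-En range per dataset in Table 6.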

We divided our experiments into three sets. The first set evaluates the impact of using generative models to balance the skewed datasets. For this case, we used CDAAE to generate synthesized samples for the minor class to balance it with the major class at various ratios. The performance of three classifiers trained on datasets with different ratios between the minor and the major class is measured and analyzed. The second set compares the effectiveness of our proposed models with the previous approaches. The performance of the classification algorithms trained on the augmented datasets generated by CDAAE and CDAAE-KNN is compared with the version trained on the original data (namely ORIGINAL) and five other approaches for addressing imbalanced data, i.e., SMOTE-SVM [10], BalanceCascade [11], ACGAN [12], CVAE [13], and CAAE [9].

The third set assesses the quality of the samples generated by CDAAE and compares it with that of the other generative models. For each generative model, we calculate the log-likelihood under a Gaussian Parzen window [33]. This metric indicates the degree to which the generated samples of the model follow the original data distribution. A higher score implies that the generated samples are more likely to follow the original data distribution. The results of each experimental set are presented in Section 6.

5.3 Performance metrics

We use three performance metrics that are popular in imbalanced learning to measure the effectiveness of the proposed models. The reported metrics are the F1 score, the AUC score, and the GEO score. These metrics are calculated based on Precision, Recall, False Positive Rate (FPR) and Specificity.

The Precision (or Confidence) score measures the proportion of samples predicted as positive that are actually positive (Eq. (14)). The Recall (or Sensitivity) score measures how many actual positive observations are predicted correctly (Eq. (15)). FPR is the proportion of real negative cases that are incorrectly predicted (Eq. (16)), and Specificity is the proportion of real negative cases that are correctly predicted (Eq. (17)).

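$$\text{Precision} = \text{Confidence} = \frac{TP}{TP + FP}, \tag{14}$$

$$\text{Recall} = \text{Sensitivity} = \text{TPR} = \frac{TP}{TP + FN}, \tag{15}$$

$$\text{FPR} = \frac{FP}{TN + FP}, \tag{16}$$

$$\text{Specificity} = \frac{TN}{TN + FP}, \tag{17}$$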

where TP and FP are the numbers of correctly and incorrectly predicted samples for the positive class, respectively, and TN and FN are the numbers of correctly and incorrectly predicted samples for the negative class. The advantage of these metrics is that they are very intuitive and easy to implement. However, taken individually, they do not distinguish between classes, which is sometimes insufficient for evaluating a classifier, especially on imbalanced datasets [47].
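As a small illustration of Eqs. (14) to (17), the four rates can be computed directly from the confusion-matrix counts. The sketch below uses made-up counts for an imbalanced test set; the function name `confusion_rates` and the convention of returning 0.0 for an undefined ratio are assumptions made for this example.

```python
def confusion_rates(tp, fp, tn, fn):
    """Precision, recall, FPR and specificity from confusion-matrix counts."""
    def ratio(num, den):
        # Convention assumed here: an undefined ratio (0/0) is reported as 0.0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),    # Eq. (14)
        "recall": ratio(tp, tp + fn),       # Eq. (15)
        "fpr": ratio(fp, tn + fp),          # Eq. (16)
        "specificity": ratio(tn, tn + fp),  # Eq. (17)
    }


# Hypothetical imbalanced test set: 1000 normal samples and 10 attacks,
# with attacks taken as the positive class.
rates = confusion_rates(tp=2, fp=5, tn=995, fn=8)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

In this example the specificity is very high although most attacks are missed, which is why a single rate is not sufficient on imbalanced data.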

5.3.1 F1-score

The F1 score overcomes the disadvantage of the Precision and Recall scores. It is the harmonic mean [47] of Precision and Recall, where the harmonic mean is an appropriate way to average ratios. Precisely, the F1 score is computed as in Eq. (18). It is often considered a reliable metric for evaluating the performance of classification algorithms on imbalanced datasets.

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{18}$$

5.3.2 AUC score

AUC stands for Area Under the Receiver Operating Characteristics (ROC) Curve [47]. The ROC curve is created by plotting the True Positive Rate (TPR), or Sensitivity, against the False Positive Rate (FPR) at various threshold settings. A perfect classifier will score in the top left-hand corner (FPR = 0, TPR = 100%). A worst-case classifier will score in the bottom right-hand corner (FPR = 100%, TPR = 0). The area under the ROC curve is the AUC score. It measures the average quality of a classification model at different thresholds. A random classifier has an AUC value of 0.5 and a perfect classifier has an AUC value of 1.0. Therefore, most classifiers have AUC scores between 0.5 and 1.0.

5.3.3 Geometric mean score

The geometric mean (GEO) is the root of the product of the class-wise Sensitivity [48]. This metric tries to maximize the accuracy on each class while keeping this accuracy balanced. For binary classification, GEO (Eq. (19)) is the square root of the product of the Sensitivity or Recall (Eq. (15)) and the Specificity (Eq. (17)). The best value is 1 and the worst value is 0. If at least one class is unrecognized by the classifier, the GEO score will go to zero.

$$\text{GEO} = \sqrt{\text{Sensitivity} \times \text{Specificity}}. \tag{19}$$
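The three reported scores can be sketched in a few lines of Python. This is an illustrative implementation rather than the code used in the experiments: the function names are chosen for this example, and the AUC is computed with the pairwise-ranking formulation, which gives the same value as the area under the ROC curve. The worked example assumes that the majority (normal) class is treated as the positive class and that the classifier labels every sample as normal.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, Eq. (18)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def geo_score(sensitivity, specificity):
    """Binary geometric mean, Eq. (19)."""
    return math.sqrt(sensitivity * specificity)


def auc_score(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a random positive is scored above a random
    negative, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical case: 990 normal and 10 attack samples, all predicted as normal.
# With "normal" as the positive class: TP = 990, FP = 10, TN = 0, FN = 0.
f1 = f1_score(precision=990 / 1000, recall=1.0)
geo = geo_score(sensitivity=1.0, specificity=0.0)
print(f"F1 = {f1:.3f}, GEO = {geo:.3f}")
```

Under these assumptions the F1 score stays close to 1 while GEO drops to 0, because one class is never recognized.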

Fig. 3: Effect of using CDAAE to generate minor samples.

6 RESULTS AND DISCUSSION

This section presents the results of three experiments and the discussions on these results.

6.1 Effect of Balancing Data

We investigate whether adding synthesized samples to the minor classes can improve the performance of classifiers by conducting an experiment on the Slowloris attack dataset of the cloud IDS dataset. The ratio between the minor class (attack data) and the major class (normal data) in the original data is 0.095. We trained CDAAE on the original dataset and used the trained model to generate minor-class samples. We tested five scenarios in which the ratio between the attack data and the normal data increases from 0.195 to 0.975 (as shown in Fig. 3).

Fig. 3 presents the AUC scores of SVM, DT and RF on the tested data. It can be seen that the accuracy of the three classifiers consistently improves as the number of minor-class samples increases. For example, the AUC score of SVM improves from 0.906 to 0.956 when the balancing ratio is increased from 0.095 to 0.975. More impressively, the AUC scores of DT and RF improve from 0.920 to 0.983 and from 0.813 to 0.981, respectively, for the same change in the balancing ratio. This result shows that using CDAAE to balance the skewed dataset in the cloud IDS can significantly improve the accuracy of the detection algorithms.
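To make the balancing ratios concrete, the number of synthesized samples needed for a target ratio follows from simple arithmetic. The sketch below uses hypothetical class sizes chosen only to reproduce the original ratio of 0.095; the function name `samples_to_generate` is likewise an assumption for this example.

```python
import math


def samples_to_generate(n_minor, n_major, target_ratio):
    """Synthetic minor-class samples needed so that minor/major >= target_ratio."""
    # Rounding guards against floating-point noise before taking the ceiling.
    required_minor = math.ceil(round(target_ratio * n_major, 6))
    return max(required_minor - n_minor, 0)


# Hypothetical class sizes chosen to match the original ratio of 0.095.
n_normal, n_attack = 10_000, 950
for target in (0.195, 0.975):  # first and last scenarios reported above
    extra = samples_to_generate(n_attack, n_normal, target)
    print(target, extra, (n_attack + extra) / n_normal)
```

The generated samples would come from the trained CDAAE decoder conditioned on the attack label and would be appended to the training set only.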

6.2 Performance Comparison

This subsection compares the effectiveness of our proposed models for generating synthesized attacks with that of the previous approaches. The compared techniques include SMOTE-SVM, BalanceCascade, ACGAN, CVAE, and CAAE. Table 8 presents the results of three classifiers when they are trained on the original version and on the augmented versions of the cloud IDS dataset. It can be seen that most generative models can enhance the accuracy of SVM, DT and RF. Moreover, the results of our proposed models are usually better than those of the other tested models. For example, the AUC score of SVM trained on the original version of the PingOfDeath dataset is 0.972. This value improves when SVM is trained on the augmented datasets of the generative models and reaches 0.999 with our proposed models, i.e., CDAAE and CDAAE-KNN. Between the two new models, the result of CDAAE-KNN is often better than that of CDAAE.

Table 8 also shows that machine learning algorithms often perform well on the DoS attack datasets. The reason is that DoS attacks usually have a huge volume of traffic compared to normal traffic. Thus, statistical features such as the number of occurrences of an incoming IP and the bytes received per unique IP are very distinguishable between the normal traffic and the attack traffic. Moreover, using generative models to generate synthesized attacks can enhance the accuracy of classifiers on this problem.

Table 9 presents the results of SVM, RF, and DT on the two network IDS datasets (NSL-KDD and UNSW-NB15). This table provides evidence of the impressive performance of the classifiers when they are trained on the augmented datasets of CDAAE and CDAAE-KNN. First, we can observe that the classifiers perform well on NSL-KDD, but they seem ineffective on UNSW-NB15. The reason could be that the UNSW-NB15 dataset is more complicated than NSL-KDD. In fact, UNSW-NB15 has more classes (ten, counting normal traffic) and each attack has fewer samples compared to NSL-KDD. Thus, when trained on the original dataset of UNSW-NB15, all three classifiers achieve a GEO score of 0. For NSL-KDD, this only happens with SVM. A GEO score of 0 indicates that the machine learning algorithms completely misclassify some attacks. For example, on the NSL-KDD dataset, the classifiers usually cannot predict R2L and U2R attacks. On UNSW-NB15, the classifiers trained on the original version often cannot predict Analysis, Backdoor, Shellcode, and Worms attacks.

Second, by using generative models to generate synthesized samples for the minor classes, the accuracy of the classifiers is improved considerably. On the NSL-KDD dataset, the GEO scores of SVM, DT, and RF are improved from 0 to 0.760, 0.672, and 0.715, respectively, when they are trained on the datasets augmented by CDAAE-KNN. On the UNSW-NB15 dataset, those values are increased from 0 to approximately 0.514, 0.533, and 0.633, respectively.

Third, the table also shows that the F1, AUC, and GEO scores of classifiers based on the generative models are usually higher than those of the traditional techniques (SMOTE-SVM and BalanceCascade). For example, comparing CDAAE with SMOTE-SVM on the NSL-KDD dataset, the F1 score is increased from 0.788, 0.824, 0.749 to 0.834, 0.844, 0.791 for SVM, DT, and RF, respectively. On the UNSW-NB15 dataset, these values increase from 0.542, 0.461, 0.709 to 0.692, 0.533, 0.722, respectively. Among all techniques for synthesizing data, it can be seen that CDAAE-KNN often achieves the best results.

The last performance evaluation in this subsection is the comparison of the tested models on the malware datasets. This result is presented in Table 10. It can be seen that the F1 scores of the three classifiers are very high when they are trained on the original datasets. Therefore, these values are hardly improved by using generative models. However, the AUC and GEO scores of the classifiers are not as high as the F1 score. The reason is that the F1 score does not emphasize the false predictions of the minority classes as much as the other two scores do. In particular, SVM fails to correctly identify any malware samples, leading to a GEO score of 0 on all three malware datasets.

Table 10 also shows that the GEO scores are greatly increased when applying sampling techniques. Furthermore, our proposed models (CDAAE and CDAAE-KNN) are usually better than the other tested models. For example, for the SVM classifier on the CTU13.6-Menti dataset, CDAAE-KNN improves the GEO score from 0.941 (CAAE) to 0.971. On CTU13.12-NSIS.ay and CTU13.13-Virut, these values increase from 0.598 to 0.687 and from 0.956 to 0.972, respectively. The better performance of the CDAAE-based techniques (CDAAE, CDAAE-KNN) compared to CAAE suggests that adding noise to the input of an AE helps to learn a better representation of the original data. The other classifiers (DT and RF) also perform better on the malware datasets when the datasets are augmented by CDAAE and CDAAE-KNN.

Overall, the results in this subsection show that the generative models can be used to generate meaningful samples for the minor classes on the cloud IDS. Moreover, our proposed models often achieve better results compared to the previous models. The reason for the better performance of the generative models compared to the traditional approaches (SMOTE-SVM and BalanceCascade) is that the traditional approaches create samples that do not follow the original data distribution, while this problem is mitigated by using the generative models. Additionally, the methods based on CDAAE (CDAAE, CDAAE-KNN) can lessen some limitations of the methods based on GAN (e.g., ACGAN) and the methods based on VAE (e.g., CVAE). Moreover, CDAAE-KNN also yields better results than CDAAE thanks to the use of KNN to select important samples that contribute more to the classifiers.

6.3 Generative Models Analysis

This subsection evaluates the quality of the generative models by measuring whether their generated data converges to the true data distribution. We used the Parzen window method [33] to estimate the likelihood of the synthesized samples following the distribution of the original data.

For each generative model, we fit a Gaussian Parzen window to the generated samples and report the log-likelihood of each generative model under the true distribution [33]. The window size parameter of the Gaussian is obtained by cross-validation on the validation set. Experimental results are reported in Table 11⁴, where a higher value indicates a better generative model.
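The Parzen window estimate described above can be sketched in plain Python. This is a simplified illustration that assumes isotropic Gaussian kernels with a single window size `sigma`; the function names `parzen_log_likelihood` and `select_sigma` are hypothetical, and the window size is chosen here by maximizing the validation log-likelihood over a list of candidates.

```python
import math


def parzen_log_likelihood(real_samples, generated_samples, sigma):
    """Mean log-likelihood of real samples under a Gaussian Parzen window
    fitted to the generated samples (isotropic kernels of width sigma)."""
    n = len(generated_samples)
    d = len(generated_samples[0])
    log_norm = -math.log(n) - 0.5 * d * math.log(2 * math.pi * sigma ** 2)
    total = 0.0
    for x in real_samples:
        # Exponent of each Gaussian kernel centred on a generated sample.
        exps = [
            -sum((xi - gi) ** 2 for xi, gi in zip(x, g)) / (2 * sigma ** 2)
            for g in generated_samples
        ]
        # Log-sum-exp keeps the sum stable when the exponents are very negative.
        m = max(exps)
        total += m + math.log(sum(math.exp(e - m) for e in exps)) + log_norm
    return total / len(real_samples)


def select_sigma(validation_samples, generated_samples, candidates):
    """Pick the window size with the highest validation log-likelihood."""
    return max(
        candidates,
        key=lambda s: parzen_log_likelihood(validation_samples, generated_samples, s),
    )


# Toy two-dimensional data: generated samples near the real ones score higher.
real = [[0.0, 0.0], [0.1, -0.1]]
close = [[0.05, 0.0], [0.0, 0.05]]
far = [[5.0, 5.0], [6.0, 6.0]]
print(parzen_log_likelihood(real, close, 0.5) > parzen_log_likelihood(real, far, 0.5))
```

A model whose samples lie close to held-out real data receives a higher mean log-likelihood, which is the sense in which higher values are better.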

It can be seen that the log-likelihood of all generative models is always greater than that of SMOTE-SVM on all tested datasets. The reason is that the generative models are trained to learn the true distribution of the original data while SMOTE-SVM is not. Moreover, the log-likelihood of CDAAE is always the greatest among all generative models. This indicates that the quality of the data generated by CDAAE is always better than that of the other models. This result explains why machine learning algorithms trained on the augmented datasets of CDAAE often achieve better performance than those trained on the augmented datasets of the other generative models.

4. For each model, this table presents mean and standard deviation of the log-likelihood of the generated samples.


TABLE 8: Results of SVM, DT and RF on the cloud IDS dataset

| Algorithm | Augmented datasets | Slowloris F1 | Slowloris AUC | Slowloris GEO | TCP land F1 | TCP land AUC | TCP land GEO | PingOfDeath F1 | PingOfDeath AUC | PingOfDeath GEO |
|---|---|---|---|---|---|---|---|---|---|---|
| SVM | ORIGINAL | 0.980 | 0.907 | 0.989 | 0.989 | 0.990 | 0.991 | 0.993 | 0.972 | 0.972 |
| SVM | SMOTE-SVM [10] | 0.981 | 0.922 | 0.921 | 1.000 | 1.000 | 1.000 | 0.999 | 0.973 | 0.973 |
| SVM | BalanceCascade [11] | 0.989 | 0.921 | 0.929 | 1.000 | 1.000 | 1.000 | 0.999 | 0.981 | 0.980 |
| SVM | ACGAN [12] | 0.986 | 0.912 | 0.910 | 1.000 | 1.000 | 1.000 | 0.999 | 0.980 | 0.979 |
| SVM | CVAE [13] | 0.985 | 0.912 | 0.909 | 1.000 | 1.000 | 1.000 | 0.999 | 0.979 | 0.979 |
| SVM | CAAE [9] | 0.985 | 0.912 | 0.908 | 1.000 | 1.000 | 1.000 | 0.998 | 0.980 | 0.978 |
| SVM | CDAAE | 0.987 | 0.955 | 0.955 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.999 |
| SVM | CDAAE-KNN | 0.999 | 0.999 | 0.999 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.999 |
| DT | ORIGINAL | 0.981 | 0.915 | 0.915 | 0.989 | 0.950 | 0.951 | 0.988 | 0.982 | 0.982 |
| DT | SMOTE-SVM [10] | 0.976 | 0.960 | 0.957 | 0.999 | 0.999 | 0.999 | 0.999 | 0.973 | 0.973 |
| DT | BalanceCascade [11] | 0.993 | 0.957 | 0.957 | 0.999 | 0.999 | 0.999 | 0.999 | 0.980 | 0.980 |
| DT | ACGAN [12] | 0.990 | 0.956 | 0.956 | 0.999 | 0.959 | 0.957 | 0.999 | 0.999 | 0.999 |
| DT | CVAE [13] | 0.992 | 0.957 | 0.958 | 0.999 | 0.960 | 0.958 | 0.999 | 0.999 | 0.999 |
| DT | CAAE [9] | 0.982 | 0.956 | 0.958 | 0.998 | 0.962 | 0.960 | 0.998 | 0.999 | 0.998 |
| DT | CDAAE | 0.992 | 0.982 | 0.983 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| DT | CDAAE-KNN | 0.999 | 0.998 | 0.999 | 0.998 | 0.999 | 0.999 | 0.999 | 0.998 | 0.999 |
| RF | ORIGINAL | 0.984 | 0.911 | 0.966 | 0.983 | 0.928 | 0.927 | 0.987 | 0.970 | 0.969 |
| RF | SMOTE-SVM [10] | 0.992 | 0.976 | 0.976 | 0.999 | 0.999 | 0.999 | 0.999 | 0.973 | 0.973 |
| RF | BalanceCascade [11] | 0.995 | 0.978 | 0.978 | 0.999 | 0.999 | 0.999 | 0.999 | 0.980 | 0.980 |
| RF | ACGAN [12] | 0.994 | 0.977 | 0.977 | 0.999 | 0.959 | 0.959 | 0.999 | 0.980 | 0.981 |
| RF | CVAE [13] | 0.994 | 0.977 | 0.978 | 0.999 | 0.960 | 0.960 | 0.999 | 0.981 | 0.981 |
| RF | CAAE [9] | 0.994 | 0.977 | 0.977 | 0.998 | 0.960 | 0.959 | 0.998 | 0.983 | 0.982 |
| RF | CDAAE | 0.994 | 0.981 | 0.980 | 0.998 | 0.999 | 0.998 | 0.998 | 0.999 | 0.998 |
| RF | CDAAE-KNN | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.998 | 0.999 | 0.998 |

TABLE 9: Results of SVM, DT and RF on the network IDS datasets

| Algorithm | Augmented datasets | NSL-KDD F1 | NSL-KDD AUC | NSL-KDD GEO | UNSW-NB15 F1 | UNSW-NB15 AUC | UNSW-NB15 GEO |
|---|---|---|---|---|---|---|---|
| SVM | ORIGINAL | 0.772 | 0.570 | 0 | 0.413 | 0.129 | 0 |
| SVM | SMOTE-SVM [10] | 0.788 | 0.688 | 0.585 | 0.542 | 0.218 | 0.148 |
| SVM | BalanceCascade [11] | 0.797 | 0.620 | 0.615 | 0.590 | 0.304 | 0.220 |
| SVM | ACGAN [12] | 0.821 | 0.712 | 0.657 | 0.602 | 0.200 | 0.236 |
| SVM | CVAE [13] | 0.822 | 0.736 | 0.690 | 0.634 | 0.325 | 0.298 |
| SVM | CAAE [9] | 0.828 | 0.738 | 0.700 | 0.642 | 0.345 | 0.300 |
| SVM | CDAAE | 0.834 | 0.741 | 0.748 | 0.692 | 0.416 | 0.406 |
| SVM | CDAAE-KNN | 0.842 | 0.753 | 0.760 | 0.722 | 0.441 | 0.514 |
| DT | ORIGINAL | 0.821 | 0.430 | 0 | 0.426 | 0.221 | 0 |
| DT | SMOTE-SVM [10] | 0.824 | 0.446 | 0.361 | 0.461 | 0.348 | 0.228 |
| DT | BalanceCascade [11] | 0.798 | 0.522 | 0.476 | 0.497 | 0.486 | 0.319 |
| DT | ACGAN [12] | 0.828 | 0.523 | 0.588 | 0.439 | 0.506 | 0.420 |
| DT | CVAE [13] | 0.835 | 0.612 | 0.656 | 0.462 | 0.538 | 0.460 |
| DT | CAAE [9] | 0.832 | 0.629 | 0.652 | 0.500 | 0.542 | 0.470 |
| DT | CDAAE | 0.844 | 0.650 | 0.669 | 0.533 | 0.522 | 0.503 |
| DT | CDAAE-KNN | 0.853 | 0.660 | 0.672 | 0.542 | 0.533 | 0.533 |
| RF | ORIGINAL | 0.742 | 0.760 | 0.442 | 0.413 | 0.357 | 0 |
| RF | SMOTE-SVM [10] | 0.749 | 0.780 | 0.564 | 0.709 | 0.436 | 0.407 |
| RF | BalanceCascade [11] | 0.752 | 0.793 | 0.589 | 0.702 | 0.439 | 0.415 |
| RF | ACGAN [12] | 0.754 | 0.804 | 0.628 | 0.672 | 0.448 | 0.519 |
| RF | CVAE [13] | 0.762 | 0.824 | 0.692 | 0.694 | 0.572 | 0.594 |
| RF | CAAE [9] | 0.764 | 0.823 | 0.708 | 0.692 | 0.571 | 0.602 |
| RF | CDAAE | 0.791 | 0.835 | 0.700 | 0.722 | 0.602 | 0.619 |
| RF | CDAAE-KNN | 0.886 | 0.842 | 0.715 | 0.732 | 0.623 | 0.633 |

6.4 Processing Time Analysis

This subsection compares the processing time of all tested approaches. We measured the computational time for training and for generating synthesized samples on two datasets, PingOfDeath and TCPLand⁵. All experiments were conducted on the same computing platform: Ubuntu 18.04.2 LTS (64-bit), an Intel(R) Core(TM) i5-5200U CPU, and 4 GB of RAM. The results are presented in Table 12. It can be seen that all approaches based on deep neural networks require significant time to train the models, whereas the traditional approaches, i.e., SMOTE-SVM and BalanceCascade, do not require training a generative model. However, for the generation time, i.e., the time to generate synthesized samples, the table shows that the difference between the traditional approaches and the deep networks is negligible. Among the tested deep network models, CDAAE-KNN often requires a longer time to generate data than the others. However, the overhead is not significant. Moreover, both the training and the generating processes are executed offline. Therefore, using our proposed models to enhance the accuracy of the classifiers will not affect the detection time (i.e., the time to detect attacks) of the classification algorithms.

5. The results on the other datasets are similar and are not shown in this paper due to space limitations.