INTRODUCTION
- DDoS Attack
- Types of DDoS Attacks
- Architecture of a DDoS Attack
- DDoS Attack and Cloud Computing
- Cloud Features
- DDoS Attack Scenario in Cloud
- DDoS Impact Characterization in Cloud
- Challenges in Dealing with DDoS Attack
- Problem Definition
- Objective of the Thesis
- Contributions
- Organization of the Thesis
The perpetrator of a DDoS attack (the attacker) aims to disrupt the services running on the target machine. One of the most important consequences of a DDoS attack in the cloud is economic loss.
LITERATURE REVIEW
Review of Recent Research Work for DDoS Attack Detection
Most researchers in cloud computing and network monitoring rely on predefined thresholds and entropy techniques to detect network traffic anomalies. Cloud Traceback (CTB) is deployed on edge routers so that it sits close to the source end of the cloud network, and it is placed in front of the web server so that a Cloud Traceback Mark (CTM) tag can be applied in the CTB header.
However, the CTB model takes relatively longer to identify the source of the attack [52]. An EDoS attack carried out by an attacker with a fake IP address belonging to the whitelist of the verification node would remain undetected. In [67], the authors show that the strength of a DDoS attack can be determined by an ANN based on variations in the entropy of network traffic.
In [68], the authors show that the number of zombies behind a DDoS attack can be predicted by an ANN from variations in the entropy of the network traffic.
Summary
The results show that the ANN achieves the highest accuracy at 84.3%, which is still far from adequate. Moreover, detection accuracy deteriorates when two different types of attack occur simultaneously in the same cloud environment.
METHODOLOGY
Machine Learning Basics
- Emergence of Machine Learning
- Applications and Usefulness of Machine Learning
- Types of Machine Learning
While many machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical calculations to large data—over and over, faster and faster—is a recent development. Resurgent interest in machine learning is due to the same factors that have made data mining and Bayesian analysis more popular than ever. Most industries that work with large amounts of data have recognized the value of machine learning technology.
Some of the fields that use machine learning applications are shown in Figure 3.2 below. Through methods such as classification, regression, prediction, and gradient boosting, supervised learning uses models to predict label values for additional unlabeled data. Unsupervised learning is used on data that has no historical labels; Figure 3.4 shows the block diagram of unsupervised learning.
Semi-supervised learning is useful when the cost associated with labeling is too high to allow a fully labeled training process.
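The distinction between the learning types above can be illustrated with a minimal scikit-learn sketch (a toy dataset and illustrative models, not the thesis setup): the supervised model is fitted against known labels, while the unsupervised model sees only the features.

```python
# Illustrative contrast of supervised vs. unsupervised learning on a
# toy two-class dataset (assumed example, not the thesis data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised: the model is fitted against the known labels y.
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))      # training accuracy against the labels

# Unsupervised: the model sees only X, no labels at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])      # cluster assignments, not class labels
```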
Experimental Setup
- Attack Generation
- Dataset Collection
hping3 options allow users to specify the target server, the number of packets to be sent, the target port, a spoofed attack source, a random source or random destination, flood mode (sending requests to the target as quickly as possible), protocol types such as TCP, UDP, and ICMP, and many more options. Mausezahn allows an arbitrary sequence of bytes to be sent directly from the network interface card. The attack tools are parameterized to produce low-volume, medium-volume, and high-volume DDoS attacks.
We then generated two datasets from this by varying only the amount of legitimate traffic, calling them Dataset A and Dataset B, respectively. To evaluate the performance of our model under conditions where legitimate traffic exceeds attack traffic, we used Dataset B. Dataset B contains 6,012,298 instances, of which 5,010,314 are legitimate traffic. First, we eliminated a few of the least important features, guided by comparison with other recent, state-of-the-art related work.
To enable selection of the most important features in the next step, we first perform one-hot encoding of the "flag" attribute of the dataset, replacing it with columns for the extracted individual flags.
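The encoding step described above can be sketched with pandas (a hypothetical illustration; the flag values and column names are assumptions, not the thesis code):

```python
# Sketch of one-hot encoding the categorical "flag" field of a packet
# record: each TCP flag value becomes its own binary column.
import pandas as pd

df = pd.DataFrame({"flag": ["SYN", "ACK", "SYN", "FIN"],
                   "pkt_len": [60, 52, 60, 40]})

# get_dummies expands "flag" into flag_ACK / flag_FIN / flag_SYN columns
# and drops the original categorical column.
df = pd.get_dummies(df, columns=["flag"])
print(df.columns.tolist())
```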
Feature Engineering and MLA Selection
- Recursive Feature Elimination (RFE)
- k-Fold Cross-Validation
- Classification
- Decision Tree
- Random Forest
- Naïve Bayes
- Logistic Regression
- Stochastic Gradient Descent
Calculate the cumulative importance of the variables from the trained model, then rank the variables by the mean of the calculated importance. In step 1 of the algorithm, the model is fitted to all features, 25 in our case. The model then ranks each feature's importance for predicting the value of the class variable.
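The ranking step can be sketched as follows (an assumed setup with a synthetic dataset and a Random Forest; not the exact algorithm implementation from this study):

```python
# Sketch: fit a model on all features, rank features by importance,
# and accumulate importance over the ranked order. Low-contribution
# features at the tail are candidates for elimination.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Feature indices from most to least important.
order = np.argsort(rf.feature_importances_)[::-1]

# Cumulative importance over the ranked features (sums to 1.0 overall).
cumulative = np.cumsum(rf.feature_importances_[order])
print(order[:5])
```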
In the second iteration, the second fold is used for testing while the remaining folds are used to train the model, and so on. Cross-validation generally yields a less biased estimate of model performance than other techniques and is therefore widely used [91]. Repeat steps 1 and 2 until a leaf node is found for each branch of the tree.
This process is repeated until the leaf node is reached with the predicted value of the class variable.
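The two ideas above, a decision-tree classifier evaluated fold by fold, can be combined in a short scikit-learn sketch (toy data; the fold count of 5 is an assumption for illustration):

```python
# Sketch of k-fold cross-validation of a decision tree: in fold i,
# fold i is held out for testing and the remaining k-1 folds train
# the model; the scores are then averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())  # average accuracy over the 5 folds
```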
Evaluation Metrics
In stochastic gradient descent, a few samples are randomly selected for each iteration instead of the entire dataset. Precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive. Recall is the ratio of correctly predicted positive samples to the total number of actual positive samples.
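These two metrics can be computed directly with scikit-learn; the labels below are a toy illustration, not results from the thesis dataset:

```python
# Illustrative precision/recall computation on toy labels.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # TP=3, FP=1 -> 0.75
print(recall_score(y_true, y_pred))     # TP=3, FN=1 -> 0.75
```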
The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false positive rate is also called the fallout or false alarm probability [95], and can be calculated as (1 − specificity). In general, if the probability distributions for both detection and false alarms are known, the ROC curve can be generated by plotting the cumulative distribution function of the detection probability (i.e., the area under the probability distribution from −∞ to the discrimination threshold) on the y-axis versus the cumulative distribution function of the false alarm probability on the x-axis.
To avoid the accuracy paradox, the ROC curve is plotted between the true positive and false positive rates, and the AUC metric summarizes the model's accuracy across varying thresholds.
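In practice the TPR/FPR pairs and the area under the curve can be obtained from a classifier's scores with scikit-learn (a sketch on toy scores, not the thesis models):

```python
# Sketch of ROC/AUC computation: TPR vs. FPR across all thresholds,
# summarized by the area under the curve.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # AUC, here 8/9 of the 9 pos/neg pairs
```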
Summary
ROC and AUC curves: A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. Accuracy is measured by the Area Under the ROC Curve, called the AUC.
It helps avoid the accuracy paradox, a term referring to the fact that the confusion matrix provides true and false classification counts only at a single operating point [96]. This leads to a paradox where the given accuracy value may not hold at other operating points or under changes in the model's performance. To prevent high-valued features from dominating low-valued ones, the dataset was normalized using min-max normalization.
Key feature selection was performed using recursive feature elimination with cross-validation, with feature importance calculated using the algorithm proposed in this study.
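The normalization and selection pipeline described above can be sketched with scikit-learn (synthetic data and a placeholder estimator; the actual estimator and fold count in the study may differ):

```python
# Sketch: min-max normalization followed by recursive feature
# elimination with cross-validation (RFECV).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=0)

# Rescale every feature to [0, 1] so high-valued features cannot
# dominate low-valued ones.
X_norm = MinMaxScaler().fit_transform(X)

# RFECV recursively drops the weakest features, keeping the feature
# count that maximizes cross-validated score.
selector = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X_norm, y)
print(selector.n_features_)  # number of features retained
```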
EXPERIMENTAL RESULTS
Feature Selection
Although RF uses more variables than the other classifiers, a low false alarm rate is a primary condition in DDoS detection systems.
Confusion Matrix and Performance for Dataset A (Legitimate traffic 79649)
In this case, RF turned out to be the best algorithm for the DDoS attack detection system. From the confusion matrices, we calculated the precision, accuracy, recall, and F1-score of the models and listed them in Table 4.8. However, considering accuracy alone is not sufficient, especially when the data is imbalanced, as in our case.
Confusion Matrix and Performance for Dataset B (Legitimate traffic 5010314)
From the confusion matrices, we calculated the accuracy, precision, recall, and F1-score of the models for Dataset B and listed them in Table 4.14. Comparing the detection performance for Dataset A and Dataset B, we observe a slight degradation in the performance metrics for Dataset B.
Comparative Analysis on Precision, Recall and F1-score
Of the five algorithms used, Random Forest shows the best results in terms of accuracy, precision, recall, and F1-score, closely followed by Logistic Regression.
ROC-AUC Curve
From the ROC curves shown in Figures 4.6 to 4.11, we can observe that the AUC attains its ideal value. For Logistic Regression, we obtain an AUC of 0.99 for legitimate traffic and for the TCP-SYN flood attack, as shown in Figures 4.13 and 4.14, respectively. For the remaining ROC curves for Logistic Regression, depicted in Figures 4.15 to 4.18, we obtain an AUC score of 1.
For Naïve Bayes, we see some deviation in the ROC curve for legitimate traffic and very little deviation for the TCP-SYN flood attack, as shown in Figures 4.20 and 4.21, respectively. For the remaining ROC curves for Naïve Bayes, depicted in Figures 4.22 to 4.25, we obtain an AUC score of 1. For SGD, the AUC score is clearly lower for the HTTP flood attack than for the other attacks, as shown in Figure 4.33.
Figure 4.34 clearly shows that Decision Tree has a lower AUC score for detecting legitimate traffic for our dataset, similar to Naive Bayes.
Comparison with the State-of-the-art Methods
Threats to Validity of the Model
To eliminate this problem, we performed cross-validation on our data during training. iii) Another issue is that, in our research, we captured network packets for data generation instead of network flows.
Summary
CONCLUSION
Summary of the Work
Future Work
[19] Idziorek, J., and Tannian, M., "Exploiting Cloud Utility Models for Profit and Loss," in Proceedings of the 4th IEEE International Conference on Cloud Computing, p.
L., "Using Firefly and Genetic Metaheuristics for Anomaly Detection Based on Network Flows," in Proceedings of the 11th Advanced International Conference on Telecommunications (AICT), 2015.
C., "Defending against flooding-based distributed denial-of-service attacks: a tutorial," in IEEE Communications Magazine, vol.
K., "Securing cloud computing environment against DDoS attacks," in IEEE International Conference on Computer Communications and Informatics (ICCCI), January 2012, pp.
A., "Feature Selection to Detect Botnets Using Machine Learning Algorithms," in International Conference on Electronics, Communications and Computing (CONIELECOMP), pp.
S., "SYN flood attack detection in cloud environment based on TCP/IP master statistics characteristics," in Proceedings of the 8th International Conference on Information Technology (ICIT), pp.
A., "Detection of DDoS attacks with feature engineering and machine learning: the framework and performance evaluation," in International Journal of Information Security, vol. 18, pp.