CHAPTER 3 METHODOLOGY
3.3 Feature Engineering and MLA Selection
3.3.3 Classification
A classification algorithm is needed that maximizes the accuracy of DDoS attack detection in the cloud. For this purpose, a number of classification algorithms must be tested on this dataset. In this work, we investigated and tested five different classifiers, using the dataset collected and described in the previous section, for predicting a DDoS attack in the network. The models are Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR), and Stochastic Gradient Descent (SGD). Their descriptions and working procedures are discussed in this section.
This selection is based on the classifiers' performance in DDoS attack detection reported in research by other authors. Their performance on our dataset will be presented in Chapter 4 (Results). Finally, the best classifier for this dataset will be selected for the proposed method.
3.3.3.1 Decision Tree
To add diversity to the models used to test the sorted feature lists returned by the feature selection methods in the previous step, we included a Decision Tree classifier, one of the most widely used algorithms for classification and regression problems. A decision tree creates a tree-like structure in which each node represents a feature, each link represents a decision (rule), and each leaf represents a possible outcome. The root node contains all the observations, and each internal node holds a feature test [93]. Decisions are made in a top-down, recursive manner, and the category of the reached leaf node is returned as the result, as shown in Figure 3.10. Compared to other classifiers, a Decision Tree is the easiest to understand and interpret. The basic steps in implementing a Decision Tree are [97]:
a. Identify the best attribute and position it at the root of the tree.
b. Split the training dataset into subsets in a way that each subset only contains data with the same value from an attribute.
c. Repeat steps a and b until a leaf node is reached for each branch of the tree.
Now, to predict the value of a class variable using a Decision Tree, we start from the root of the tree and compare the values stored in the root attribute with the values of the instance’s attribute. Based on the results of this comparison, we choose a branch to follow and move on to the next node in the tree. This process is repeated until the leaf node is reached with the predicted value of the class variable.
Figure 3.10: Decision Tree structure [93].
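To make this procedure concrete, the following is a minimal sketch (not the exact pipeline or parameter settings used in this work) of training and querying a Decision Tree with scikit-learn; the synthetic dataset stands in for the traffic features and binary labels described earlier.

# Illustrative sketch: a synthetic dataset stands in for the traffic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the selected features (X) and binary labels (y: 0 = normal, 1 = DDoS).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)       # grows the tree top-down from the root node
y_pred = clf.predict(X_test)    # follows a root-to-leaf path for each test instance
print("Accuracy:", accuracy_score(y_test, y_pred))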
3.3.3.2 Random Forest
A Random Forest is essentially an ensemble of many decision trees working together, trained using the bagging method. The bagging method is based on the underlying idea that combining multiple models increases the effectiveness of the resulting model [98]. Random Forest can be used for both classification and regression problems, but in our study we use it for classification since we are dealing with a binary class variable.
Figure 3.11 shows a Random Forest with two decision trees. The results from the two trees are combined by the ensemble approach employed by the Random Forest to produce the final prediction.
A Random Forest has almost the same hyperparameters as a Decision Tree, in addition to the hyperparameters of a bagging classifier, which control the ensemble of trees. Rather than looking for the best attribute when splitting a node, a Random Forest searches for the most significant attribute among a random subset of attributes. This process creates wide diversity among the trees, which generally results in a better model [94].
Figure 3.11: Random Forest with two decision trees [94].
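As an illustration of the hyperparameters mentioned above, the sketch below (with arbitrary, untuned settings) shows a Random Forest trained with scikit-learn; n_estimators controls the number of trees in the ensemble and max_features the size of the random attribute subset considered at each split.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the traffic features and binary labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100,      # number of trees in the ensemble
                            max_features="sqrt",   # random attribute subset per split
                            criterion="gini",
                            random_state=42)
rf.fit(X_train, y_train)        # each tree is trained on a bootstrap sample (bagging)
y_pred = rf.predict(X_test)     # majority vote over the individual trees
print("Accuracy:", accuracy_score(y_test, y_pred))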
3.3.3.3 Naïve Bayes
Naïve Bayes is a common classifier used in many Machine Learning problems. It is based on Bayes' theorem, which defines the probability of an event based on prior knowledge of conditions associated with that event. Naïve Bayes classifiers work on the assumption that the features are independent of each other, which is why they are called "naïve" classifiers [6]. Bayes' theorem is stated mathematically as equation 3.2:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (3.2)$$
Here, A and B are events and P(B) is not 0. P(A|B) is the conditional probability of A given that event B is true; similarly, P(B|A) is the conditional probability of B given that A is true. P(A) and P(B) are the marginal probabilities of A and B, observed independently of each other.
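As a purely illustrative example with hypothetical numbers (not taken from our dataset), suppose 2% of all flows are attack flows, $P(A) = 0.02$; a particular feature pattern $B$ occurs in 90% of attack flows, $P(B \mid A) = 0.9$; and $B$ occurs in 5% of all flows, $P(B) = 0.05$. Equation 3.2 then gives
$$P(A \mid B) = \frac{0.9 \times 0.02}{0.05} = 0.36,$$
so observing $B$ raises the estimated probability that the flow is an attack from 2% to 36%.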
From a machine learning perspective, given a problem instance to be classified, represented by a vector $X = (x_1, \dots, x_n)$ of $n$ independent features, Naïve Bayes assigns to this instance the probability in equation 3.3 for each of the $K$ possible outcomes or classes $C_k$:

$$P(C_k \mid x_1, \dots, x_n) \qquad (3.3)$$

Applying Bayes' theorem from equation 3.2, this conditional probability can be written as equation 3.4:

$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)} \qquad (3.4)$$
After transforming equation 3.4 based on the assumption that each feature $x_i$ is conditionally independent of every other feature $x_j$, we get equation 3.5.
$$P(C_k \mid x_1, \dots, x_n) \propto P(C_k, x_1, \dots, x_n) = P(C_k)\, P(x_1 \mid C_k)\, P(x_2 \mid C_k)\, P(x_3 \mid C_k) \cdots = P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \qquad (3.5)$$
where $\propto$ denotes proportionality. Some of the attributes in our dataset have continuous values, so we use the Gaussian Naïve Bayes classifier to handle them. Gaussian Naïve Bayes assumes that the values of each attribute are distributed according to a Gaussian distribution and uses the mean and variance of that attribute in each class, as shown in equation 3.6.
$$P(x = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(v - \mu_k)^2}{2\sigma_k^2}} \qquad (3.6)$$

where $v$ is the observed attribute value and $\mu_k$ and $\sigma_k^2$ are the mean and variance of that attribute computed from the training samples of class $C_k$.
Scikit-learn's GaussianNB implementation is used for Gaussian Naïve Bayes. The sorted list of features (from most important to least important) returned by the feature selection techniques is used as input for the Naïve Bayes classifier.
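A minimal sketch of this setup is shown below, assuming scikit-learn's GaussianNB; the synthetic data and the choice of keeping the first k = 10 columns are illustrative stand-ins for the actual dataset and the ranked feature list, not the configuration used in this work.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Stand-in for the sorted feature list: keep the first k columns as the "top-ranked" features.
k = 10
X_topk = X[:, :k]

X_train, X_test, y_train, y_test = train_test_split(X_topk, y, test_size=0.3, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)       # estimates the per-class mean and variance of each feature
y_pred = gnb.predict(X_test)    # scores each class using equations (3.5) and (3.6)
print("Accuracy:", accuracy_score(y_test, y_pred))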
3.3.3.4 Logistic Regression
Logistic regression, used here as a classifier, classifies binary datasets with one or more attributes predicting the target. Logistic regression analyzes the relationship between the attribute variables $x_1, x_2, \dots, x_d$ and the response variable $y$ [93]. When classifying normal and anomalous traffic, the response variable is $y = 0$ and $y = 1$ respectively. The response variable follows a Bernoulli distribution, and its probability function is given in equation 3.7.
In matrix notation, the response vector $y$, the design matrix $X$, and the parameter vector $\beta$ are:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}, \qquad
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{i1} & x_{i2} & \cdots & x_{id} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix}
= \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_i^T \\ \vdots \\ x_n^T \end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \end{bmatrix}$$

$$f(y_i) = P_i^{\,y_i}\, (1 - P_i)^{1 - y_i} \qquad (3.7)$$
The generic logistic model is in equation 3.8.
$$P_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id})} = \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} = \frac{1}{1 + e^{-x_i^T \beta}} \qquad (3.8)$$
The maximum likelihood method is the most commonly used parameter estimation method in logistic regression; its likelihood function is given in equation 3.9.
$$L(\beta) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} P_i^{\,y_i}\, (1 - P_i)^{1 - y_i} \qquad (3.9)$$
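As an illustrative sketch (not the exact configuration used in this work), scikit-learn's LogisticRegression fits the coefficients $\beta$ of equation 3.8 by maximizing the likelihood of equation 3.9 (with L2 regularization applied by default):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the traffic features and binary labels (0 = normal, 1 = anomaly).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)                     # (regularized) maximum-likelihood estimate of beta
print("Intercept (beta_0):", lr.intercept_)
print("Coefficients (beta_1..beta_d):", lr.coef_)

p_attack = lr.predict_proba(X_test)[:, 1]    # P_i from equation (3.8) for each test flow
y_pred = (p_attack >= 0.5).astype(int)       # label 1 (anomaly) when P_i >= 0.5
print("Accuracy:", accuracy_score(y_test, y_pred))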
3.3.3.5 Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to the discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression. Although SGD has been around in the machine learning community for a long time, it has only recently received considerable attention in the context of large-scale learning. In Stochastic Gradient Descent, a few randomly selected samples are used for each iteration instead of the whole dataset; the data is randomly shuffled, and samples are drawn from it to perform each update.
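A minimal sketch of this idea with scikit-learn's SGDClassifier is given below (the hyperparameters are illustrative, not the tuned values from this work); with the hinge loss it trains a linear SVM, and shuffle=True gives the random sample order described above. The features are standardized first because SGD is sensitive to feature scale.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the traffic features and binary labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

sgd = make_pipeline(
    StandardScaler(),                            # scale features before gradient updates
    SGDClassifier(loss="hinge",                  # convex loss: linear SVM objective
                  shuffle=True,                  # reshuffle the training data each epoch
                  max_iter=1000,
                  random_state=42))
sgd.fit(X_train, y_train)                        # weights updated from randomly drawn samples
print("Accuracy:", accuracy_score(y_test, sgd.predict(X_test)))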