CHAPTER 3 METHODOLOGY
3.3 Feature Engineering and MLA Selection
3.3.3 Classification
A classification algorithm is needed that maximizes the accuracy of DDoS attack detection in the cloud. For this purpose, a number of classification algorithms must be tested on this dataset. In this work, we investigated and tested five different classifiers, using the dataset collected and described in the previous section, for predicting a DDoS attack in the network. The models are Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR), and Stochastic Gradient Descent (SGD). Their descriptions and working procedures are discussed in this section.
This selection is based on the classifiers' performance in DDoS attack detection reported in research by other authors. Their performance on our dataset will be presented in Chapter 4 (Results). Finally, the best classifier for this dataset will be selected for the proposed method.
3.3.3.1 Decision Tree
To add diversity to the models used to test the sorted feature lists returned by the feature selection methods in the previous step, we included a Decision Tree classifier, one of the most widely used algorithms for classification and regression problems. A decision tree creates a tree-like structure in which each node represents a feature, each link represents a decision (rule), and each leaf represents a possible outcome. The root node contains all the observations, and each internal node holds a feature test [93]. Decisions are made in a top-down, recursive manner, and the category of the reached leaf node is returned as the result, as shown in Figure 3.10. Compared to other classifiers, a Decision Tree is the easiest to understand and interpret. The basic steps in implementing a Decision Tree are [97]:
a. Identify the best attribute and position it at the root of the tree.
b. Split the training dataset into subsets in a way that each subset only contains data with the same value from an attribute.
c. Repeat steps a and b until a leaf node is reached for each branch of the tree.
Now, to predict the value of a class variable using a Decision Tree, we start from the root of the tree and compare the values stored in the root attribute with the values of the instance’s attribute. Based on the results of this comparison, we choose a branch to follow and move on to the next node in the tree. This process is repeated until the leaf node is reached with the predicted value of the class variable.
Figure 3.10: Decision Tree structure [93].
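To make this procedure concrete, the following is a minimal sketch (not the exact pipeline or parameter settings used in this work) of training and querying a Decision Tree with scikit-learn; the synthetic dataset stands in for the traffic features and binary labels described earlier.

# Illustrative sketch: a synthetic dataset stands in for the traffic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the selected features (X) and binary labels (y: 0 = normal, 1 = DDoS).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)       # grows the tree top-down from the root node
y_pred = clf.predict(X_test)    # follows a root-to-leaf path for each test instance
print("Accuracy:", accuracy_score(y_test, y_pred))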
3.3.3.2 Random Forest
A Random Forest is essentially an ensemble of many decision trees working together, trained using the bagging method. The bagging method is based on the underlying idea that combining multiple models increases the effectiveness of the resulting model [98]. Random Forest can be used for both classification and regression problems, but in our study we use it for classification since we are dealing with a binary class variable.
Figure 3.11 shows a Random Forest with two decision trees. The results from the two trees are combined by the ensemble approach employed by the Random Forest to produce the final prediction.
A Random Forest has almost the same hyperparameters as a Decision Tree, in addition to the hyperparameters of a bagging classifier, which control the ensemble of trees. Rather than looking for the best attribute when splitting a node, a Random Forest searches for the most significant attribute among a random subset of attributes. This process creates wide diversity among the trees, which generally results in a better model [94].
Figure 3.11: Random Forest with two decision trees [94].
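As an illustration of the hyperparameters mentioned above, the sketch below (with arbitrary, untuned settings) shows a Random Forest trained with scikit-learn; n_estimators controls the number of trees in the ensemble and max_features the size of the random attribute subset considered at each split.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the traffic features and binary labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100,      # number of trees in the ensemble
                            max_features="sqrt",   # random attribute subset per split
                            criterion="gini",
                            random_state=42)
rf.fit(X_train, y_train)        # each tree is trained on a bootstrap sample (bagging)
y_pred = rf.predict(X_test)     # majority vote over the individual trees
print("Accuracy:", accuracy_score(y_test, y_pred))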
3.3.3.3 Naïve Bayes
Naïve Bayes is a common classifier used in many Machine Learning problems. It is based on Bayes' theorem, which defines the probability of an event based on prior knowledge of conditions associated with that event. Naïve Bayes classifiers work on the assumption that the features are independent of each other, which is why they are called "naïve" classifiers [6]. Bayes' theorem is stated mathematically as equation 3.2:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (3.2)$$
Here, A and B are events and P(B) is not 0. P(A|B) is the conditional probability of A given that event B is true; similarly, P(B|A) is the conditional probability of B given that A is true. P(A) and P(B) are the marginal probabilities of A and B, observed independently of each other.
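As a purely illustrative example with hypothetical numbers (not taken from our dataset), suppose 2% of all flows are attack flows, $P(A) = 0.02$; a particular feature pattern $B$ occurs in 90% of attack flows, $P(B \mid A) = 0.9$; and $B$ occurs in 5% of all flows, $P(B) = 0.05$. Equation 3.2 then gives
$$P(A \mid B) = \frac{0.9 \times 0.02}{0.05} = 0.36,$$
so observing $B$ raises the estimated probability that the flow is an attack from 2% to 36%.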
From a machine learning perspective, given a problem instance to be classified, represented by a vector $X = (x_1, \dots, x_n)$ of $n$ independent features, Naïve Bayes assigns to this instance the probability in equation 3.3 for each of the $K$ possible outcomes or classes $C_k$:

$$P(C_k \mid x_1, \dots, x_n) \qquad (3.3)$$

Applying Bayes' theorem from equation 3.2, this conditional probability can be written as equation 3.4:

$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)} \qquad (3.4)$$
After transforming equation 3.4 based on the assumption that each feature $x_i$ is conditionally independent of every other feature $x_j$, we get equation 3.5.
$$P(C_k \mid x_1, \dots, x_n) \propto P(C_k, x_1, \dots, x_n) = P(C_k)\, P(x_1 \mid C_k)\, P(x_2 \mid C_k)\, P(x_3 \mid C_k) \cdots = P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \qquad (3.5)$$
where $\propto$ denotes proportionality. Some of the attributes in our dataset have continuous values, so we use the Gaussian Naïve Bayes classifier to handle them. Gaussian Naïve Bayes assumes that the values of each attribute are distributed according to a Gaussian distribution and uses the mean and variance of that attribute in each class, as shown in equation 3.6.
$$P(x = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(v - \mu_k)^2}{2\sigma_k^2}} \qquad (3.6)$$

where $v$ is the observed attribute value and $\mu_k$ and $\sigma_k^2$ are the mean and variance of that attribute computed from the training samples of class $C_k$.
Scikit-learn's GaussianNB implementation is used for Gaussian Naïve Bayes. The sorted list of features (from most important to least important) returned by the feature selection techniques is used as input for the Naïve Bayes classifier.
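A minimal sketch of this setup is shown below, assuming scikit-learn's GaussianNB; the synthetic data and the choice of keeping the first k = 10 columns are illustrative stand-ins for the actual dataset and the ranked feature list, not the configuration used in this work.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Stand-in for the sorted feature list: keep the first k columns as the "top-ranked" features.
k = 10
X_topk = X[:, :k]

X_train, X_test, y_train, y_test = train_test_split(X_topk, y, test_size=0.3, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)       # estimates the per-class mean and variance of each feature
y_pred = gnb.predict(X_test)    # scores each class using equations (3.5) and (3.6)
print("Accuracy:", accuracy_score(y_test, y_pred))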
3.3.3.4 Logistic Regression
Logistic regression, used here as a classifier, classifies binary datasets with one or more attributes predicting the target. Logistic regression analyzes the relationship between the attribute variables $x_1, x_2, \dots, x_d$ and the response variable $y$ [93]. When classifying normal and anomalous traffic, the response variable is $y = 0$ and $y = 1$ respectively. The response variable follows a Bernoulli distribution, and its probability function is given in equation 3.7.
In matrix notation, the response vector $y$, the design matrix $X$, and the parameter vector $\beta$ are:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}, \qquad
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{i1} & x_{i2} & \cdots & x_{id} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix}
= \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_i^T \\ \vdots \\ x_n^T \end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \end{bmatrix}$$

$$f(y_i) = P_i^{\,y_i}\, (1 - P_i)^{1 - y_i} \qquad (3.7)$$
The generic logistic model is in equation 3.8.
$$P_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id})} = \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} = \frac{1}{1 + e^{-x_i^T \beta}} \qquad (3.8)$$
The maximum likelihood method is the most commonly used parameter estimation method in logistic regression; its likelihood function is given in equation 3.9.
$$L(\beta) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} P_i^{\,y_i}\, (1 - P_i)^{1 - y_i} \qquad (3.9)$$
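As an illustrative sketch (not the exact configuration used in this work), scikit-learn's LogisticRegression fits the coefficients $\beta$ of equation 3.8 by maximizing the likelihood of equation 3.9 (with L2 regularization applied by default):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the traffic features and binary labels (0 = normal, 1 = anomaly).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)                     # (regularized) maximum-likelihood estimate of beta
print("Intercept (beta_0):", lr.intercept_)
print("Coefficients (beta_1..beta_d):", lr.coef_)

p_attack = lr.predict_proba(X_test)[:, 1]    # P_i from equation (3.8) for each test flow
y_pred = (p_attack >= 0.5).astype(int)       # label 1 (anomaly) when P_i >= 0.5
print("Accuracy:", accuracy_score(y_test, y_pred))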
3.3.3.5 Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to the discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression. Although SGD has been around in the machine learning community for a long time, it has only recently received considerable attention in the context of large-scale learning. In Stochastic Gradient Descent, a few randomly selected samples are used for each iteration instead of the whole dataset; the data is randomly shuffled, and samples are drawn from it to perform each update.
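A minimal sketch of this idea with scikit-learn's SGDClassifier is given below (the hyperparameters are illustrative, not the tuned values from this work); with the hinge loss it trains a linear SVM, and shuffle=True gives the random sample order described above. The features are standardized first because SGD is sensitive to feature scale.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the traffic features and binary labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

sgd = make_pipeline(
    StandardScaler(),                            # scale features before gradient updates
    SGDClassifier(loss="hinge",                  # convex loss: linear SVM objective
                  shuffle=True,                  # reshuffle the training data each epoch
                  max_iter=1000,
                  random_state=42))
sgd.fit(X_train, y_train)                        # weights updated from randomly drawn samples
print("Accuracy:", accuracy_score(y_test, sgd.predict(X_test)))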