
3 Proposed Framework

The proposed framework is shown in Figure 1. The framework has four layers, and the work done to find the outliers at each layer is explained below.

Layer 1: Read the training tuples, each described by n attributes; each tuple represents a point in n-dimensional space. Pre-process the dataset by taking samples and selecting a subset of features.

Layer 2: For each selected feature, calculate the z-score and set a threshold value to separate the objects into normal objects and outliers. We do this for each attribute.

We take a majority vote to decide whether an object is an outlier or normal. For example, suppose there are seven attributes, with z-scores calculated and threshold values set for each.

If, for a record, four results indicate that it is an outlier and three indicate that it is normal, the record is considered an outlier (intruder) by majority vote. At this layer, all objects are separated into normal objects and outliers.
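As an illustration, the following is a minimal sketch of this majority-vote step, assuming a NumPy feature matrix and a uniform |z| threshold of 3 for every attribute (the paper sets a threshold per attribute, so the single threshold here is a simplifying assumption):

```python
import numpy as np

def separate_outliers(X, threshold=3.0):
    """Flag a record as an outlier when a majority of its attributes
    have an absolute z-score above the threshold."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                  # guard constant attributes against division by zero
    z = np.abs((X - mean) / std)         # per-attribute z-scores
    votes = (z > threshold).sum(axis=1)  # attributes voting "outlier"
    return votes > X.shape[1] / 2        # majority vote; e.g. 4 of 7 votes flags the record

# Example: with seven attributes, a record with four or more votes is an outlier.
# is_outlier = separate_outliers(X_train)
```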

Layer 3: Use a Bayesian network classifier to classify the outliers into different classes.

A data set may contain two or more classes. If there are multiple classes, they are classified using the Bayesian network classifier. In this paper we use the KDDCUP dataset, which has five classes; the four types of attacks are described in the dataset description.
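The paper does not specify the structure of the Bayesian network, so as an illustrative stand-in the sketch below uses scikit-learn's Gaussian naive Bayes classifier, the simplest Bayesian network classifier (all features conditionally independent given the class); the toy data and labels are hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-ins for the records flagged as outliers in Layer 2 and
# their attack labels (in the paper these come from KDD Cup 99).
X_out = np.array([[0.9, 12.0], [0.8, 11.5], [0.1, 2.0], [0.2, 1.8]])
y_out = np.array(["dos", "dos", "probe", "probe"])

clf = GaussianNB().fit(X_out, y_out)

# Assign a new outlier record to one of the attack classes.
print(clf.predict(np.array([[0.85, 11.0]])))  # -> ['dos']
```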

Layer 4: Here we read the test tuples one by one. We calculate the distance between each test tuple and the previously separated class objects, and we determine its type based on its closeness to those objects. We use the K-nearest neighbor distance-based approach to determine the type of a given object. All of this work is explained in detail below.
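A sketch of this layer using scikit-learn's k-nearest-neighbor classifier follows; the paper states neither k nor the distance metric, so k = 3 and Euclidean distance are assumptions, and the data are toy placeholders:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training side: the class-separated objects from Layers 2 and 3
# (normal objects plus the classified outliers). Toy data here.
X_train = np.array([[0.1, 1.0], [0.2, 1.1], [5.0, 9.0], [5.2, 9.1]])
y_train = np.array(["normal", "normal", "dos", "dos"])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)

# Test side: read test tuples one by one and assign each the type of
# its nearest training objects.
for test_tuple in np.array([[0.15, 1.05], [5.1, 9.05]]):
    print(knn.predict(test_tuple.reshape(1, -1)))
```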

3.1 Sampling

Sampling can be used as a data reduction technique [8] because it allows a large data set to be represented by a much smaller random sample, or subset, of the data. Suppose that a large dataset, D, contains N tuples. We use simple random sampling without replacement (SRSWOR).
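A minimal sketch of SRSWOR, assuming the dataset fits in memory as a Python list of tuples and that the sample size n is a free parameter:

```python
import random

def srswor(dataset, n):
    """Simple random sample without replacement: each of the N tuples
    is equally likely to be drawn, and no tuple is drawn twice."""
    return random.sample(dataset, n)

# Example (hypothetical): draw 1,000 tuples from the full dataset D.
# sample = srswor(D, 1000)
```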

Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand. We used the entropy method for feature selection. KDDCUP'99 has 41 features; using the entropy method, 15 features were selected out of the 41 features (attributes/dimensions), and the selected attributes are listed below.

3.2 Feature Selection Using Entropy

This is a greedy feature selection method based on conventional information gain [9], which is commonly used in feature selection for classification models. Moreover, our feature selection method sometimes improves conventional machine learning algorithms to the point where they outperform support vector machines, which are known to give the best classification accuracy. It is defined as

$\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_{2} p(j \mid t)$   (1)

The KDDCUP'99 dataset has 41 features. Applying the above formula to the KDDCUP'99 dataset, we selected only the top 15 features by rank; the selected features are shown in Table 2, based on the entropy (and resulting rank) of each feature shown in Table 1.
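The paper does not spell out whether the Entropy column in Table 1 is the marginal entropy of each feature or a class-conditional entropy as in Eq. (1), nor how continuous features were discretized, so the sketch below uses the simpler marginal form with an assumed equal-width binning, ranking the features and keeping the top k = 15:

```python
import numpy as np

def feature_entropy(values, bins=10):
    """Shannon entropy (base 2) of a feature's value distribution;
    equal-width binning with 10 bins is an assumption."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def rank_features(X, feature_names, k=15, bins=10):
    """Rank features by entropy and keep the top k (k = 15 in the paper)."""
    scores = [feature_entropy(X[:, i], bins) for i in range(X.shape[1])]
    order = np.argsort(scores)[::-1]
    return [(feature_names[i], scores[i]) for i in order[:k]]
```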


Table 1. The features with their ranks using the entropy method

Rank  Feature Name                  Entropy    Rank  Feature Name                Entropy
 1    src_bytes                     1.73443     22   dst_host_srv_rerror_rate    0.29025
 2    service                       1.60489     23   rerror_rate                 0.27679
 3    count                         1.53779     24   srv_diff_host_rate          0.26933
 4    srv_count                     1.17949     25   srv_rerror_rate             0.1916
 5    dst_host_same_src_port_rate   1.10496     26   wrong_fragment              0.13855
 6    protocol_type                 0.98378     27   hot                         0.13141
 7    dst_host_diff_srv_rate        0.96876     28   num_compromised             0.08992
 8    dst_host_srv_count            0.96519     29   duration                    0.07216
 9    dst_host_same_srv_rate        0.89407     30   num_failed_logins           0.02657
10    diff_srv_rate                 0.86691     31   land                        0.01799
11    dst_bytes                     0.85391     32   root_shell                  0.01208
12    same_srv_rate                 0.83267     33   is_guest_login              0.00911
13    flag                          0.82963     34   num_file_creations          0.00858
14    dst_host_serror_rate          0.64387     35   num_access_files            0.00734
15    logged_in                     0.63707     36   num_root                    0.00637
16    serror_rate                   0.62279     37   num_outbound_cmds           0
17    dst_host_srv_serror_rate      0.51385     38   is_host_login               0
18    dst_host_count                0.5064      39   urgent                      0
19    srv_serror_rate               0.46685     40   num_shells                  0
20    dst_host_srv_diff_host_rate   0.45478     41   su_attempted                0
21    dst_host_rerror_rate          0.35237

Table 2. Selected features subset using the entropy method

S.No  Selected Feature              S.No  Selected Feature
 1    src_bytes                      9    dst_host_same_srv_rate
 2    service                       10    diff_srv_rate
 3    count                         11    dst_bytes
 4    srv_count                     12    same_srv_rate
 5    dst_host_same_src_port_rate   13    flag
 6    protocol_type                 14    dst_host_serror_rate
 7    dst_host_diff_srv_rate        15    logged_in
 8    dst_host_srv_count

3.3 Z-Score Method

There are many methods for classifying data objects. The z-score method is simple and easy to implement for finding outliers. In the z-score method, we first calculate the mean and standard deviation of each attribute, Ai. Using these values, z-scores can be calculated for each object [10]. The z-score is defined as

$z = \dfrac{v - \bar{A}}{\sigma_{A}}$

where $v$ is a value of attribute $A$, $\bar{A}$ is the mean of $A$, and $\sigma_{A}$ is its standard deviation.
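A direct implementation of this formula, assuming a NumPy array per attribute (the threshold mentioned in the comment is illustrative, not taken from the paper):

```python
import numpy as np

def z_scores(column):
    """z-score of every value v of an attribute A: z = (v - mean(A)) / std(A)."""
    return (column - column.mean()) / column.std()

# Values whose |z| exceeds the chosen threshold (e.g. 3) are treated as outliers.
```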