Intrusion Detection System using Random Forest


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/332396674

Preprint · April 2019


All content following this page was uploaded by Saurabh Kumar on 19 April 2019.


Saurabh Kumar¹, Joy Pal², Abhijeet Giri³, Aditya Raj⁴, Ritu Raj⁵, Mangayarkarasi R⁶

School of Information Technology and Engineering, Vellore Institute of Technology, Vellore

¹[email protected], ²[email protected], ³[email protected], ⁴[email protected], ⁵[email protected], ⁶[email protected]

ABSTRACT:

Security breaches are hard to prevent with current technologies, which makes intrusion detection an important problem in network security and computer forensics. Many intrusion detection systems (IDSs) have been developed to keep the communication of information safe, but they have several limitations in detecting intrusions. Encoding intrusion rules is very time-consuming and depends on knowledge of known intrusions. Probing, Denial of Service (DoS), Remote to User (R2L) and User to Root (U2R) attacks affect a large number of computers around the world every day, and detecting these attacks and protecting computers against them is a major research topic. In this paper we implement and evaluate intrusion detection for these four attack types, whose threats differ and are potentially devastating. Researchers have proposed various methodologies for intrusion detection systems, including machine learning, DNA sequence analysis, pattern matching and data mining, as techniques for learning an attack and its different types and for recognizing similar attacks when they are encountered in the future. Here, we use the machine learning algorithm Random Forest along with a data mining classification method, the Decision Tree. We classify the problem based on features selected as parameters for evaluation. We evaluate the models using accuracy, precision, recall and F-score, together with the confusion matrix, a tool for evaluating classifier performance. As a result, we report the accuracy of our method for each type of attack.

Keywords: Intrusion Detection, Probing, DoS, R2L, U2R, DNA Sequence, Pattern Matching, Data Mining, Machine Learning, Random Forest, Decision Tree

1. INTRODUCTION

With the tremendous growth of network-based services and of sensitive information on networks, network security is more important than ever. Although a wide range of security technologies such as information encryption, access control and intrusion prevention are used to protect network-based systems, many intrusions still go undetected. Intrusion detection systems (IDSs), which automatically monitor network activities to detect attacks, therefore play a vital role in network security. An IDS monitors network traffic for suspicious activity and issues alerts when such activity is discovered. While anomaly detection and reporting is the primary function, some IDSs can also take action when malicious activity or anomalous traffic is detected, including blocking traffic sent from suspicious IP addresses. Although IDSs monitor networks for potentially malicious activity, they are also prone to false alarms (false positives). Consequently, organizations need to fine-tune their IDS products when they first install them, which means configuring them to recognize what normal traffic on their network looks like compared with potentially malicious activity.

An intrusion prevention system (IPS) also monitors network packets for potentially damaging network traffic. But where an IDS responds to potentially malicious traffic by logging it and issuing warning notifications, an IPS responds by rejecting the potentially malicious packets.

Currently, many IDSs are rule-based systems whose performance relies heavily on the rules identified by security experts. Since the amount of network traffic is huge, the process of encoding rules is expensive and slow. Moreover, security staff have to modify the rules or deploy new rules manually using a specific rule-driven language. To overcome the limitations of rule-based systems, a number of IDSs employ data mining techniques. Data mining is the analysis of large data sets to discover understandable patterns or models. Data mining can efficiently extract patterns of intrusions for misuse detection, identify profiles of normal network activities for anomaly detection, and build classifiers to detect attacks. Data-mining-based systems are more flexible and deployable.

Random Forest:

This paper applies a data mining algorithm called random forests to intrusion detection, following frameworks for misuse, anomaly and hybrid detection. The random forests algorithm is an ensemble classification and regression approach and is one of the most effective data mining techniques. It has been used extensively in different applications, for instance prediction and probability estimation, whereas its use in automatic intrusion detection has been comparatively limited. In our proposed system, the misuse component uses the random forests algorithm for classification in intrusion detection, while the anomaly component is based on the outlier detection mechanism of the algorithm. A random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
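To make the idea of "merging" trees concrete, the following is a minimal sketch using scikit-learn, which is an assumption here (this section does not name a library). The synthetic two-class data stands in for "normal" versus "attack" records and is purely illustrative. The sketch checks that averaging the class probabilities of the individual trees reproduces the prediction of the forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for "normal" vs "attack" records (assumption).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)

# An odd number of trees avoids tied votes on this two-class problem.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree produces class probabilities for every record ...
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# ... the forest averages them over all trees ...
mean_proba = tree_probas.mean(axis=0)

# ... and predicts the class with the highest averaged probability.
merged_prediction = forest.classes_[mean_proba.argmax(axis=1)]
agreement = bool(np.array_equal(merged_prediction, forest.predict(X)))
print("manual merge matches forest.predict:", agreement)
```

Averaging over many trees, each grown on a different bootstrap sample, is what gives the ensemble a more stable prediction than any single tree.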

Figure 1: Showing Random Forest in simplified way

Figure 2: Workflow diagram of Random Forest


Advantage:

• It can be used for both classification and regression problems.

• Random forests are flexible and typically achieve high accuracy.

• Random forests have less variance than a single decision tree, which means that they work correctly for a larger range of data items than a single decision tree does.

• They require little preparation of the input data; for example, the data do not have to be scaled.

• They can maintain accuracy even when a large proportion of the data is missing.

Disadvantage:

• The main disadvantage of random forests is their complexity. They are much harder and more time-consuming to construct than single decision trees.

• They also require more computational resources and are less intuitive. With a large collection of decision trees it is hard to have an intuitive grasp of the relationships existing in the input data.

• In addition, prediction with a random forest is more time-consuming than with many other algorithms.

Decision Tree:

The Decision Tree algorithm belongs to the family of supervised learning algorithms. It can be used for solving both regression and classification problems.

The general motive for using a Decision Tree is to create a training model that can be used to predict the class or value of a target variable by learning decision rules inferred from prior data (training data). Decision Trees are easy to understand compared with other classification algorithms. The algorithm solves the problem using a tree representation: each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label. A Decision Tree works on the Sum of Products (SOP) form, which is also known as Disjunctive Normal Form. The major challenge in a Decision Tree is identifying the attribute to place at the root node of each level.

This process is known as attribute selection. There are two popular attribute selection measures: Information Gain and Gini Index.
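The two attribute selection measures can be written down in a few lines. The sketch below uses only the Python standard library; the function names and the toy labels are illustrative assumptions, not part of the original work. Information gain is the entropy of the parent node minus the size-weighted entropy of the subsets produced by a split, and the Gini index measures the impurity of a set of labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Toy labels: four attack records and four normal records (made-up data).
labels = ["attack"] * 4 + ["normal"] * 4
perfect_split = [["attack"] * 4, ["normal"] * 4]
useless_split = [["attack", "attack", "normal", "normal"]] * 2

print(entropy(labels), gini(labels))
print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

An attribute that separates the classes completely yields the highest possible gain, while an attribute whose subsets have the same class mix as the parent yields no gain, which is why the former is preferred at the root.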

Figure 3: Workflow diagram of Decision Tree

Decision Tree Algorithm Pseudocode:

1. Place the best attribute of the dataset at the root of the tree.

2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.

3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
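The three steps above can be sketched as a short recursive function. This is a simplified illustration for categorical attributes that uses the Gini index to pick the "best" attribute; the helper names and the toy connection records are hypothetical, and a production implementation (for example in scikit-learn) handles numeric attributes, pruning and unseen values, which this sketch does not.

```python
from collections import Counter

def gini(rows, target):
    """Gini impurity of the target labels in a list of dict rows."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split(rows, attr):
    """Step 2: group rows so that each subset shares one value of attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups

def best_attribute(rows, attrs, target):
    """Step 1: choose the attribute with the lowest size-weighted Gini impurity."""
    def weighted_gini(attr):
        return sum(len(g) / len(rows) * gini(g, target)
                   for g in split(rows, attr).values())
    return min(attrs, key=weighted_gini)

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Leaf: the subset is pure or no attributes remain; return the majority label.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attrs, target)
    rest = [a for a in attrs if a != attr]
    # Step 3: repeat on every subset until all branches end in leaves.
    return {attr: {value: build_tree(subset, rest, target)
                   for value, subset in split(rows, attr).items()}}

def predict(tree, row):
    """Follow the branches matching the row's values (raises KeyError on unseen values)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Made-up connection records for illustration only.
rows = [
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "tcp", "flag": "SF", "label": "normal"},
    {"protocol": "udp", "flag": "SF", "label": "normal"},
    {"protocol": "icmp", "flag": "SF", "label": "normal"},
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "udp", "flag": "S0", "label": "attack"},
]
tree = build_tree(rows, ["protocol", "flag"], "label")
print(tree)
print(predict(tree, {"protocol": "udp", "flag": "S0"}))
```

In this toy data the flag attribute separates the two classes completely, so it is placed at the root and both branches end in leaves immediately.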

2. Dataset Description

This dataset is useful for testing and working with the system classifier. The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks. The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

• DOS: denial-of-service, e.g. SYN flood;

• R2L: unauthorized access from a remote machine, e.g. guessing password;

• U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;

• Probing: surveillance and other probing, e.g. port scanning.

2.1 Denial of Service Attack (DOS):

A denial-of-service (DoS) attack is any type of attack in which the attackers (hackers) attempt to prevent legitimate users from accessing a service. A DoS attack is meant to shut down a machine or network, making it inaccessible to its intended users. DoS attacks accomplish this by flooding the target with traffic, or by sending it information that triggers a crash.

In both instances, the DoS attack deprives legitimate users of the service or resource they expected. There are two general methods of DoS attack: flooding services and crashing services. Flood attacks occur when the system receives too much traffic for the server to buffer, causing it to slow down and eventually stop. Popular flood attacks include buffer overflow attacks, ICMP flood and SYN flood. DoS attacks can cause the following problems: ineffective services, inaccessible services, interruption of network traffic and connection interference.

2.2 Probing Attack:

Probing is an attack in which the hacker scans a machine or a networking device in order to determine weaknesses or vulnerabilities that may later be exploited to compromise the system. Common examples include saint, portsweep, mscan and nmap. In the context of network security, a probe is an attempt to gain access to a computer and its files through a known or probable weak point in the computer system.

2.3 User to Root Attack (U2R):

These attacks are exploitations in which the hacker starts off on the system with a normal user account and attempts to abuse vulnerabilities in the system in order to gain superuser privileges, e.g. perl, xterm. In a U2R attack, an attacker who already has an account on a computer system misuses his or her privileges by exploiting weaknesses in its security mechanisms, such as a bug in the operating system or in software installed on the system.

2.4 Remote to Local Attack (R2L):

A Remote to Local attack is a type of attack in which an intruder gains access to the system from a remote machine through the network, either as a user of the system or as root. In the majority of R2L attacks, the attacker breaks into the computer system via the Internet.

TABLE 1: Details of Attacks

Type of Attack | Attack Pattern
Denial of Service | back, land, neptune, pod, smurf, teardrop
Probe | ipsweep, nmap, portsweep, satan
Remote to Local | ftp_write, guess_passwd, imap, multihop, phf, spy, warezclient, warezmaster
User to Root | buffer_overflow, loadmodule, perl, rootkit

3. RELATED WORKS

Before comparing the most closely related data-mining-based IDSs with the proposed intrusion detection framework, we discuss some representative misuse, anomaly and hybrid IDSs.

In paper [1], the authors outline three data-mining-based frameworks for network intrusion detection, applying the random forests algorithm in misuse, anomaly and hybrid detection. To address the problems of rule-based systems, they employ the random forests algorithm to build patterns of intrusions. By learning over training data, the random forests algorithm can build the patterns automatically instead of requiring rules to be coded manually. The approaches are implemented in Java using the WEKA environment and in Fortran 77. The authors evaluated the implementations over different datasets obtained from the KDD’99 datasets, and their experimental results show that the performance of their approaches is better than the best KDD’99 results. In their misuse detection framework, patterns of intrusions are built in the offline phase, and the system can automatically detect intrusions in real time using the built patterns. To improve the accuracy of the system, they use a feature selection algorithm and optimize the parameters of the random forests algorithm. They also use sampling techniques to increase the detection rate of minority intrusions. Since misuse detection cannot detect novel intrusions, they also propose a new approach to unsupervised anomaly detection.

In paper [2], the authors observe that Artificial Intelligence methods are currently gaining the most attention because of their ability to learn and evolve, which makes them more precise and efficient in facing the huge number of unpredictable attacks. They therefore propose a methodology based on a Genetic Algorithm for the detection of Probing, Denial of Service and Remote to User attacks. The approach aims at maximum detection of Probing, R2L and DoS attacks with a minimum false positive rate. Out of the total intrusions in the testing dataset, the approach is expected to detect more than 97%. It is intended to be useful for attack detection as attack methodologies change. If the rules are updated dynamically with the firewall's log data, the method should be very effective against new attacks.

In paper [3], the authors use the Random Forest (RF) algorithm to detect four types of attack: DOS, probe, U2R and R2L. They adopt 10-fold cross-validation for classification. Feature selection is applied to the dataset to reduce dimensionality and to remove redundant and irrelevant features. The authors apply the symmetrical uncertainty of attributes, which overcomes the problems of information gain. The approach is evaluated using the NSL-KDD dataset.

The authors also compare random forest modelling with the J48 classifier in terms of accuracy, detection rate (DR), false alarm rate (FAR) and Matthews correlation coefficient (MCC). Their experimental results show that accuracy, DR and MCC for the four types of attack are increased by the proposed method. As future work, they plan to apply evolutionary computation as a feature selection measure to further improve the accuracy of the classifier.

In paper [4], the authors present a survey of the various data mining techniques that have been proposed to enhance anomaly intrusion detection systems. They also apply classification methods to classify the attacks (intrusions) in the DARPA dataset. The results show that Random Forest performs better than the other classifiers, but takes more time than they do.

In paper [5], the authors present a Random Forest model for Intrusion Detection Systems (IDS) with a focus on improving intrusion detection performance by reducing the input features. In real-world applications, a smaller number of features is advantageous in terms of both data management and processing time. The results indicate that RF classification with reduced features (25 features) produces more accurate results than RF classification with all features (41 features). Moreover, the time required to process 25 features with RF is smaller than the processing time of RF with 41 features. Intrusion detection and feature selection using the RF approach remains an active research area because of its good performance. The findings of that paper are useful for research on feature selection and classification, and can also be applied to use RF in a more meaningful way in order to maximize the performance rate and minimize the false positive rate.

In paper [6], the authors explain the Random Forest algorithm and note that the number of trees in the forest and the quality of the results are directly related: in general, the more trees, the more accurate the result. It is important to note that decision-making using the information gain or Gini index approach is not the same as creating a random forest. The paper presents an overview of the random forest algorithm and summarizes a survey of various techniques proposed by several researchers.

4. Proposed Methodology

An Intrusion Detection System (IDS) is a system that detects malicious attacks. The primary aim here is to detect intrusions caused by attacks such as DOS, U2R, R2L and probe on the basis of feature selection measures. We first classify the features according to attack type. Then we select the best features for each attack and apply the Random Forest and Decision Tree methodology to detect intrusions. In the final step we measure accuracy, precision, recall and F-measure to examine the results achieved.

Figure 4: Workflow of the proposed methodology

4.1 Data Preprocessing: In this step, one-hot encoding is used to transform the categorical features into a matrix of binary values, with the categories of the service feature distributed over separate columns. The features are transformed using LabelEncoder, which maps every category to a number. After that, we split the dataset into four subsets, one for each attack type. Each attack type is given a numeric label: 0 = normal, 1 = DOS, 2 = R2L, 3 = U2R, 4 = probe. The features are then split into X, a dataframe of features, and Y, a series of output variables, and feature scaling is applied.
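A minimal sketch of this preprocessing step is given below, assuming pandas and scikit-learn. The column names follow the usual KDD'99 conventions, but the five records, the use of `pd.get_dummies` for one-hot encoding and the partial `attack_category` mapping are illustrative assumptions rather than the authors' actual code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy connection records; values are made up for illustration.
df = pd.DataFrame({
    "duration": [0, 2, 0, 5, 1],
    "protocol_type": ["tcp", "tcp", "icmp", "tcp", "udp"],
    "service": ["http", "private", "eco_i", "telnet", "domain_u"],
    "label": ["normal", "neptune", "ipsweep", "guess_passwd", "normal"],
})

# LabelEncoder maps every category to an integer (classes are sorted alphabetically).
encoder = LabelEncoder()
protocol_codes = encoder.fit_transform(df["protocol_type"])

# One-hot encoding expands each categorical column into binary indicator columns.
features = pd.get_dummies(df[["duration", "protocol_type", "service"]],
                          columns=["protocol_type", "service"], dtype=int)

# Map specific attack names onto the numeric categories used in the text
# (0 = normal, 1 = DOS, 2 = R2L, 3 = U2R, 4 = probe); only a few names are listed here.
attack_category = {"normal": 0, "neptune": 1, "guess_passwd": 2, "ipsweep": 4}
y = df["label"].map(attack_category)

# One subset per attack type: normal records plus the records of that type.
dos_mask = y.isin([0, 1])
X_dos, y_dos = features[dos_mask], y[dos_mask]

print(list(protocol_codes))
print(features.columns.tolist())
print(X_dos.shape, list(y_dos))
```

The same masking step can be repeated with the codes 2, 3 and 4 to obtain the R2L, U2R and probe subsets.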

4.2 Feature Selection: Feature selection is closely related to the data preprocessing step: irrelevant features are removed, which increases accuracy. Feature selection refers to identifying the features that are strongly correlated with the problem and are therefore useful in predicting the class. In this project we used univariate feature selection with the ANOVA F-test. Recursive feature elimination is then applied to the selected features to retain only the most relevant ones. Finally, the selected features are ranked so that the highest-priority features are used for better accuracy.

Summary of features selected by univariate feature selection:

Features selected based on ranking using Recursive Feature Elimination:

Recursive Feature Elimination selects 13 features:
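The two selection stages can be sketched with scikit-learn as follows. The synthetic 41-feature data, the choice of `SelectKBest` for the univariate stage and the random forest used as the RFE estimator are assumptions made for illustration; the section only states that an ANOVA F-test and recursive feature elimination were used and that 13 features were selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

# Synthetic stand-in for a 41-feature attack subset (assumption).
X, y = make_classification(n_samples=300, n_features=41, n_informative=8,
                           random_state=0)

# Univariate selection: each feature is scored independently with the ANOVA F-test.
univariate = SelectKBest(score_func=f_classif, k=13).fit(X, y)
univariate_idx = univariate.get_support(indices=True)

# Recursive feature elimination: repeatedly fit the estimator and drop the
# least important feature until 13 remain.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=10, random_state=0),
          n_features_to_select=13, step=1).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

# ranking_ is 1 for selected features; larger values were eliminated earlier.
X_reduced = rfe.transform(X)
print("univariate:", univariate_idx)
print("RFE:", rfe_idx)
print("reduced shape:", X_reduced.shape)
```

The two methods need not agree: the F-test looks at one feature at a time, whereas RFE judges features by their contribution to a fitted model.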

4.3 Building Model: A highly accurate classifier for real-time data requires a high-performing model. In this project we have built a model based on random forest (RF). The classifiers are trained on all features using the training dataset. To build the model we used the RandomForestClassifier class provided by the scikit-learn library for Python. A random forest classification model is built for each type of attack.
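Building one model per attack type might look like the sketch below. The synthetic data, the number of trees and the `attack_names` mapping are assumptions; only the use of `RandomForestClassifier` and the numeric labels 0 to 4 come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
attack_names = {1: "DOS", 2: "R2L", 3: "U2R", 4: "probe"}

# Synthetic stand-in for the preprocessed data: 13 features, labels 0..4 (assumption).
X = rng.normal(size=(500, 13))
y = rng.integers(0, 5, size=500)
# Shift one feature per class so that the classes are learnable.
X[np.arange(500), y] += 3.0

models = {}
for code, name in attack_names.items():
    # Each model separates normal traffic (0) from a single attack type.
    mask = np.isin(y, [0, code])
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    models[name] = clf.fit(X[mask], y[mask])

for name, clf in models.items():
    print(name, list(clf.classes_))
```

Training a separate binary model per attack type keeps each confusion matrix two-by-two, which matches the per-category evaluation reported in the results.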

4.4 Experimental Results: All experiments were carried out in Python using the Anaconda distribution. We used the KDD’99 dataset for analysis, which consists of 42 attributes, the last of which holds the class label. To evaluate the performance of the classifier we construct a confusion matrix and compute accuracy, precision, recall and F-measure. We also used cross-validation for classification.
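Cross-validated versions of these four metrics can be obtained in one call with scikit-learn's `cross_validate`. The sketch below uses synthetic data and 10 folds, both of which are assumptions; the section does not state the number of folds used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic two-class stand-in for one attack subset (assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1"]

# Each fold trains on nine tenths of the data and scores on the remaining tenth.
scores = cross_validate(clf, X, y, cv=10, scoring=metrics)

# Average each metric over the folds.
summary = {m: scores[f"test_{m}"].mean() for m in metrics}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reporting the mean over folds gives a less optimistic estimate than scoring on the training data, because every record is evaluated by a model that never saw it during training.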

Confusion Matrix:

A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by class; this is the key to the confusion matrix. The confusion matrix shows the ways in which a classification model is confused when it makes predictions. It gives insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made.

Definition of the Terms:

• True Positive (TP) : Observation is positive, and is predicted to be positive.

• False Negative (FN) : Observation is positive, but is predicted negative.

• True Negative (TN) : Observation is negative, and is predicted to be negative.

• False Positive (FP) : Observation is negative, but is predicted positive.
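The four terms can be read directly off a confusion matrix. The following sketch uses scikit-learn's `confusion_matrix` on eight made-up labels, where 1 denotes the positive class (attack) and 0 the negative class (normal); the label vectors are purely illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions: 1 = attack (positive), 0 = normal (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1] the rows are the true classes and the columns the predicted
# classes, so flattening the 2x2 matrix yields TN, FP, FN, TP in that order.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in matrix.ravel())

print(matrix)
print("TP:", tp, "FN:", fn, "TN:", tn, "FP:", fp)
```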


Classification Rate/Accuracy:

The accuracy is the proportion of the total number of predictions that were correct. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.

It is determined using the equation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Recall:

Recall is the ratio of the number of correctly classified positive examples to the total number of positive examples. High recall indicates that the class is correctly recognized (a small number of FN).

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision:

Precision is obtained by dividing the number of correctly classified positive examples by the total number of predicted positive examples. High precision indicates that an example labeled as positive is indeed positive (a small number of FP).

Precision is given by the relation:

$$\text{Precision} = \frac{TP}{TP + FP}$$

F-measure:

Since we have two measures (precision and recall), it helps to have a single measurement that represents both of them. We calculate an F-measure, which uses the harmonic mean in place of the arithmetic mean because it punishes extreme values more:

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F-measure will always be nearer to the smaller value of precision or recall.
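As a worked example of the four measures, the sketch below computes them from raw confusion-matrix counts. The function name and the counts (90 true positives, 10 false negatives, 30 false positives, 870 true negatives) are made up for illustration and are not results from this paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F-measure from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    otherwise the divisions below are undefined.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Hypothetical counts for one attack category.
accuracy, precision, recall, f_measure = classification_metrics(
    tp=90, fn=10, fp=30, tn=870)

arithmetic_mean = (precision + recall) / 2
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F={f_measure:.3f} mean={arithmetic_mean:.3f}")
```

Here precision is 0.75 and recall is 0.90; the F-measure of about 0.818 lies below their arithmetic mean of 0.825 and closer to the smaller of the two values.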

Experimental results after analysis of the KDD’99 dataset in Python (Anaconda):

Confusion matrix using all features for each category (DOS, Probe, R2L, U2R):

Cross-validation using all features (accuracy, precision, recall, F-measure):

Confusion matrix using 13 features for each category (DOS, Probe, R2L, U2R):

Cross-validation using 13 features (accuracy, precision, recall, F-measure):

Illustration of cross-validation score in comparison to number of features:

5. Conclusion

This paper presents an Intrusion Detection System (IDS) that uses the Random Forest (RF) algorithm with a reduced feature set to detect four types of attack: DOS, Probe, R2L and U2R. We adopted cross-validation to check the correctness of the classification algorithm. The main purpose of our feature selection process on the KDD’99 dataset is to reduce dimensionality and to remove redundant and irrelevant features. The ANOVA F-test is applied for univariate feature selection. In real-world settings, a smaller number of features is advantageous in terms of both processing time and data management. The results show that Random Forest classification with reduced features is more accurate than classification with all features. Our experimental results show that the accuracy, precision, recall and F-measure for these four attack types are increased by the proposed methodology. In future, if a better classification algorithm emerges that can further increase the accuracy of the classifiers, we will apply it to extend this project.

References:

[1] Jiong Zhang, Mohammad Zulkernine, and Anwar Haque, “Random-Forests-Based Network Intrusion Detection Systems”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 38, No. 5, September 2008.

[2] Swati Paliwal, Ravindra Gupta, “Denial-of-Service, Probing & Remote to User (R2L) Attack Detection using Genetic Algorithm”, International Journal of Computer Applications (0975 – 8887), Volume 60, No. 19, December 2012.

[3] Nabila Farnaaz, M. A. Jabbar, “Random Forest Modeling for Network Intrusion Detection System”, Twelfth International Multi-Conference on Information Processing 2016 (IMCIP-2016).

[4] Phyu Thi Htun, Kyaw Thet Khaing, “Anomaly Intrusion Detection System using Random Forests and k-Nearest Neighbor”, International Journal of P2P Network Trends and Technology (IJPTT), Volume 3, Issue 1, January to February 2013.

[5] Md. Al Mehedi Hasan, Mohammed Nasser, Shamim Ahmad, Khademul Islam Molla, “Feature Selection for Intrusion Detection Using Random Forest”, Journal of Information Security, 2016, 7, 129-140.

[6] Kritika Singh, Bharti Nagpal, “Random Forest Algorithm in Intrusion Detection System: A Survey”, International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2018, ISSN: 2456-3307.

[7] S. A. Joshi, Varsha S. Pimprale, “Network Intrusion Detection System (NIDS) based on Data Mining”, International Journal of Engineering Science and Innovative Technology, Vol. 2, No. 1, January 2013, ISSN 2319-5967.

[8] Parveen Sadotra, Dr. Chandrakant Sharma, “A Review On Integrated Intrusion Detection System In Cyber Security”, International Journal of Computer Science and Mobile Computing, Vol. 5, No. 9, September 2016, pp. 23-28.

[9] C. Science and K. Mangalore, “A Two-tier Network based Intrusion Detection System Architecture using Machine Learning Approach”, International Journal on Recent and Innovation Trends in Computing and Communication, Vol. 7, No. 6, pp. 42–47, 2016.

[10] A. R. Jakhale, G. A. Patil, “Anomaly Detection System by Mining Frequent Pattern using Data Mining Algorithm from Network Flow”, International Journal of Engineering Research and Technology, Vol. 3, No. 1, January 2014, ISSN 2278-0181.

[11] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection”, IEEE Communications Surveys and Tutorials, Vol. 18, No. 2, pp. 1153–1176, 2016.

[12] K.-C. Khor, C.-Y. Ting, and S. Phon-Amnuaisuk, “A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection”, Applied Intelligence, Vol. 36, No. 2, pp. 320–329, 2012.

[13] A. A. Aburomman and M. B. Ibne Reaz, “A novel SVM-kNN-PSO ensemble method for intrusion detection system”, Applied Soft Computing Journal, Vol. 38, pp. 360–372, 2016.

[14] Q. S. Qassim, A. M. Zin, and M. J. Ab Aziz, “Anomalies classification approach for network-based intrusion detection system”, International Journal of Network Security, pp. 1159–1171, 2016.

[15] Yogita B. Bhavasar, Kalyani C. Waghmare, “Intrusion Detection System Using Data Mining Technique: Support Vector Machine”, International Journal of Emerging Technology and Advance Engineering, Vol. 3, No. 3, March 2013.
