
CANCER CLASSIFICATION USING HYBRID PARTICLE SWARM OPTIMIZATION GENETIC ALGORITHM WITH IMPROVED TRANSDUCTIVE SUPPORT VECTOR MACHINE FOR GENE EXPRESSION DATA

MS. N. KANCHANA, MCA, M.Phil., (Ph.D.), Assistant Professor and Part-Time Ph.D. Research Scholar, Department of Computer Science, Dr. G. R. Damodaran College of Science, Coimbatore-641014, Tamil Nadu, India. Email: kanchnat2005@yahoo.com

DR. N. MUTHUMANI, M.Sc.(CC), M.Phil., Ph.D., Associate Professor, Department of Computer Applications, Sri Ramakrishna College of Arts and Science, Coimbatore-641006, Tamil Nadu, India. Email: muthummani_77@yahoo.com

Abstract: Microarray data analysis has been applied effectively in a number of investigations over a wide range of biological disciplines. It comprises cancer classification by class detection and prediction, recognition of the unknown effects of a specific therapy, recognition of genes relevant to a certain diagnosis or therapy, and cancer diagnosis. In the previous work, the Enhanced Particle Swarm Optimization FireFly Algorithm with Modified Artificial Neural Network (EPSOFFA-MANN) approach was introduced for efficient cancer classification, but it suffers from low classification accuracy, which degrades the overall classification performance on the given datasets. To overcome these issues, the proposed system introduces the Hybrid Particle Swarm Optimization Genetic Algorithm with Improved Transductive Support Vector Machine (HPSOGA-ITSVM). The system has three main modules: preprocessing, gene selection, and classification. The preprocessing step is performed using an improved kNN (IKNN), which handles the missing values and redundant values in the given gene dataset. The reduced, preprocessed dataset is passed to the Feature Selection (FS) step, in which the important and relevant features are selected optimally by HPSOGA; this selects the most informative genes from the given gene array dataset. These gene features are then classified with the hybrid classification method, which utilizes the ITSVM classifier to provide more accurate classification results from significant training features. The results prove that the proposed HPSOGA-ITSVM method performs better in terms of higher classification accuracy, precision, recall, and f-measure, and lower time complexity and error rate.

Key words: Gene selection, microarray data, HPSOGA, ITSVM, cancer classification.

1. INTRODUCTION

Gene expression technology using DNA microarrays allows for the monitoring of the expression levels of thousands of genes at once. As a direct result of recent advances in DNA microarray technology, it is now feasible to obtain gene expression profiles of tissue samples at relatively low cost.

Gene expression profiles provide important insights into, and further understanding of, biological processes.

Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. An important emerging medical application domain for microarray gene expression profiling is clinical decision support, in the form of disease diagnosis as well as prediction of clinical outcomes in response to treatment. Currently, cancer diagnosis depends heavily on a variety of histological observations, including immunohistochemical assays, which detect cancer biomarker molecules [1].

Most gene expression datasets have a fairly small sample size compared to the number of genes investigated. This data structure creates an unprecedented challenge for some classification methodologies. Only a few methods have been successfully applied to the cancer diagnosis problem in previous studies, including the support vector machine (SVM), k-nearest neighbors (KNN), back propagation neural networks (NN), and probabilistic neural networks (PNN).

Applying data mining techniques has proved to be an effective approach towards knowledge discovery based on probe-level data sets consisting of millions of variables and often only several or dozens of observations (i.e., the samples). However, since practitioners and researchers engaged in this domain come from various backgrounds, they cannot be expected to be extremely familiar with the techniques and algorithms of data mining. This situation is natural, and the trend of more people becoming interested in this interdisciplinary area is even encouraging, because everyone perceives and processes information from his or her own academic or practical background, which actually fosters the development of genetic engineering [2].

The microarray technique has enabled the concurrent measurement of the expression levels of thousands of messenger RNAs (mRNAs). By mining these data, it is possible to recognize the dynamics of a gene expression time series. Researchers have decreased the dimensionality of such data sets by employing Principal Component Analysis (PCA). An examination of the components provides an approach to the underlying factors measured in the experiments; the PCA results demonstrated that all rhythmic content of the data can be reduced to three main components [3].

Preprocessing of microarray data has rapidly become indispensable among researchers, not only to remove redundant and irrelevant features, but also to help biologists identify the underlying mechanisms that relate gene expression to diseases. The set of techniques used prior to the application of a data mining method is known as data preprocessing for data mining, and it is one of the most meaningful issues within the famous knowledge discovery from data process [4]. Since data will likely be imperfect, containing inconsistencies and redundancies, it is not directly applicable for starting a data mining process. Data generation rates and data sizes are growing fast in business, industrial, academic, and scientific applications, and the larger amounts of data collected require more sophisticated mechanisms to analyze them. Data preprocessing is able to adapt the data to the requirements posed by each data mining algorithm, enabling the processing of data that would otherwise be unfeasible.

In [5], a genetic algorithm is used for feature selection to reduce the modeling complexity and training time of classification algorithms in a text classification task. The work uses a genetic-algorithm-based meta-heuristic optimization to improve the F1 score of the classifier hypothesis: (i) a new objective function to maximize is developed; (ii) candidate features for the classification algorithm are chosen; and (iii) support vector machine (SVM), maximum entropy (MaxEnt), and stochastic gradient descent (SGD) classification algorithms are used to find classification models on publicly available datasets.

Abeel et al. [6] are concerned with the analysis of the robustness of biomarker selection techniques. They proposed a general experimental setup for stability analysis that can easily be included in any biomarker identification pipeline. In addition, they presented a set of ensemble feature selection methods that improve biomarker stability and classification performance on four microarray datasets.

2. RELATED WORK

Martinez et al. [7] (2001) give a theoretical analysis of how well a two-stage algorithm approximates the exact LDA in the sense of maximizing the LDA objective function. The theoretical analysis motivates a new two-stage LDA algorithm that outperforms PCA+LDA while having similar scalability. Furthermore, they provide an implementation of this algorithm on a distributed system to handle large-scale problems.

Alba et al. [8] (2007) compared the use of Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA), both augmented with Support Vector Machines (SVM), for the classification of high-dimensional microarray data. Both algorithms are used for finding small sets of informative genes amongst thousands of them. A first contribution is to show that PSO-SVM is able to find interesting genes and to provide competitive classification performance. A second important contribution consists in the actual discovery of new and challenging results on six public datasets, identifying genes significant in the development of a variety of cancers (leukemia, breast, colon, ovarian, prostate, and lung).

Banerjee et al. [9] (2007) presented an evolutionary rough feature selection algorithm to preprocess microarray data and obtain a reduced set of features. The proposed algorithm uses redundancy reduction for effective handling of gene expression data, enabling faster convergence. The reducts are generated using rough set theory; they are a minimal set of non-redundant features capable of distinguishing between all objects, in a multi-objective framework. The algorithm was evaluated on three different cancer samples, and experiments using KNN classifiers showed that it improved the performance of the classifier on the test set.

In [10], Subanya et al. (2014) designed an effective algorithm that can remove irrelevant dimensions from large data and predict the presence of disease more accurately. Artificial Bee Colony (ABC) based feature selection is incorporated, and a wrapper classifier is used for classification. A Binary ABC (BABC) algorithm is used to find the best features for disease identification, with the fitness of BABC evaluated using the Naive Bayesian method. Results are validated using the Cleveland heart disease dataset taken from the UCI machine learning repository and indicate that BABC-Naive Bayesian outperforms the other methods.

In [11], Xu et al. (2013) used a New Artificial Bee Colony (NABC) algorithm, which modifies the search pattern of both employed and onlooker bees. A solution pool is constructed by storing some of the best solutions of the current swarm, and new candidate solutions are generated by searching the neighborhood of solutions randomly chosen from the pool. Experiments were conducted on a set of twelve benchmark functions. Simulation results show that this approach is significantly better than, or at least comparable to, the original ABC and seven other stochastic algorithms.

Chiş et al. [12] presented clustering, a popular method that attempts to separate data into disjoint groups such that data points in the same group are similar in their characteristics with respect to a reference point, whereas data points of different groups differ in their characteristics. The groups so described are called clusters; each cluster comprises several similar data points or objects relative to a reference point. Clustering is one of the most important methods in the disciplines of engineering and science, including data compression and statistical data analysis.

Duval et al. [13] presented a memetic algorithm, called MAGS, to deal with gene selection for supervised classification of microarray data. MAGS is based on an embedded approach to attribute selection, in which a classifier interacts tightly with the selection process. The strength of MAGS relies on the synergy created by combining a problem-specific crossover operator and a dedicated local search procedure, both guided by relevant information from an SVM classifier. Computational experiments on 8 well-known microarray datasets show that the memetic algorithm is very competitive compared with some recently published studies.

3. PROPOSED METHODOLOGY

The Hybrid Particle Swarm Optimization Genetic Algorithm with Improved Transductive Support Vector Machine (HPSOGA-ITSVM) is proposed. The system has three main modules: preprocessing, gene selection, and classification. The preprocessing step is performed using an improved kNN (IKNN), which handles the missing values and redundant values in the given gene dataset. The reduced, preprocessed dataset is passed to the Feature Selection (FS) step, in which the important and relevant features are selected optimally by HPSOGA; this selects the most informative genes from the given gene array dataset. These gene features are classified with the hybrid classification method, which utilizes the ITSVM classifier to provide more accurate classification results from significant training features.

3.1. Input data

The datasets are real gene expression data and gene samples generated using microarray technology. The results of both implementations are compared to the output of the classification algorithm. This gene expression data has been used to build cancer classifiers. A microarray experiment monitors the expression levels of genes. Patterns can be derived from analyzing changes in the expression of the genes, and new insights can be gained into the underlying biology. In this section, basic terminologies, representations of microarray data, and the various methods by which expression data can be analyzed are introduced. Microarray measurements are carried out as differential hybridizations to minimize errors originating from DNA. The overall block diagram of the proposed system is shown in Fig 1.

3.2. Preprocessing using IKNN

In this research, the k-NN algorithm is introduced to perform data preprocessing. The underlying idea of the k-NN algorithm has served as inspiration for tackling data imperfection. It distinguishes the kinds of imperfection that need to be addressed, such as noisy data, redundancy, and incomplete data. Preprocessing is an important step in the analysis of microarray data. The distance-based similarity idea of the kNN has been widely applied to detect and remove class noise. IKNN is used to eliminate all potential noisy examples, and it may change the class label of clearly erroneous examples.

Fig 1 Overall block diagram of the proposed system: the gene expression dataset is preprocessed using IKNN; genes are selected using HPSOGA, with fitness values calculated using the best chromosomes in PSO to generate optimal genes; and the ITSVM classifier performs training and testing on the selected genes to provide accurate classification results

The given datasets may also contain missing values (MVs) in their attribute values. Intuitively, a MV is simply a value for an attribute that was not recorded; human or equipment errors are some of the reasons for their existence. Once again, this imperfection in the data influences the mining process and its outcome. The simplest way of dealing with MVs is to discard the attributes that contain them. The imputation of MVs is a procedure that aims to fill in the MVs by estimating them. Data reduction aims to obtain a smaller representative set of attributes from raw data without losing important information. This process alleviates data storage requirements as well as improving the later data mining process, and it may result in the elimination of noisy information as well as redundant or irrelevant data.

Algorithm 1: IKNN

Step 1: Begin. Input: D = {(x1, c1), (x2, c2), ..., (xn, cn)}
Step 2: For every labeled instance (xi, ci)
Step 3: Calculate d(xi, x)
Step 4: Order d(xi, x) from lowest to highest, (i = 1, ..., N)
Step 5: Compute the missing values as

$$\hat{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad (1)$$

and fill each missing value. Compute the redundancy as

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} \qquad (2)$$

and filter the repeated values.
Step 6: Select the K nearest instances to x: $D_x^K$
Step 7: Assign to x the most frequent class in $D_x^K$
Step 8: End

This algorithm provides a more accurate gene dataset, which increases the gene classification accuracy relative to the previous system.
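To make the IKNN step concrete, the following Python sketch shows one plausible realization: missing entries are imputed with the distance-weighted neighbor mean of Eq. (1), and redundant genes are filtered with the Pearson correlation of Eq. (2). The paper does not give implementation details, so the function names, the weighting scheme w_i = 1/d_i, and the 0.95 correlation threshold are illustrative assumptions.

```python
import numpy as np

def iknn_impute(X, k=5):
    """Fill missing values (NaN) with the distance-weighted mean of the
    k nearest complete samples, following Eq. (1)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]           # samples with no missing entries
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # distances are computed on the observed attributes only
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + 1e-8)                     # assumed weights w_i = 1/d_i
        filled[i, miss] = (w @ complete[nn][:, miss]) / w.sum()
    return filled

def drop_redundant(X, threshold=0.95):
    """Remove genes whose pairwise Pearson correlation r_AB (Eq. (2))
    exceeds an assumed threshold, keeping the first occurrence."""
    r = np.corrcoef(X, rowvar=False)                 # genes are the columns
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, m]) < threshold for m in keep):
            keep.append(j)
    return X[:, keep], keep
```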

3.3. Feature selection using HPSOGA

In this research, gene selection is performed using HPSOGA. It is focused on retaining the relevant gene data and reducing the irrelevant genes in the given gene dataset. The PSO is a computational approach that optimizes a problem in continuous, multidimensional search spaces. PSO starts with a swarm of random particles, each associated with a velocity. Particles' velocities are adjusted according to the historical behavior of each particle and its neighbors as they fly through the search space. Thus, the particles have a tendency to move towards better regions of the search space. The version of the PSO algorithm utilized here is described mathematically by the following equations, where each particle updates its own position and velocity according to formulas (3) and (4) in every iteration:

$$v_{id}^{k+1} = \omega v_{id}^{k} + c_1 \gamma_1 \left(p_{id}^{k} - x_{id}^{k}\right) + c_2 \gamma_2 \left(p_{gd}^{k} - x_{id}^{k}\right) + \alpha \left(\mathrm{rand} - \tfrac{1}{2}\right) \qquad (3)$$

$$x_{id}^{k+1} = \begin{cases} 1 & \text{if } s\left(v_{id}^{k+1}\right) > \mathrm{rand}(0,1) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $s(v_{id}^{k+1})$ is the sigmoid function $S(v_{id}) = 1/(1 + \exp(-v_{id}))$; i = 1, 2, 3, ..., m, with m the number of particles in the swarm; $v_{id}^{k}$ and $x_{id}^{k}$ stand for the velocity and position of the ith particle at the kth iteration, respectively; $p_{id}^{k}$ denotes the previous best position of particle i and $p_{gd}^{k}$ the global best position of the swarm; ω is the inertia weight; c1 and c2 are acceleration constants (generally taking values in the interval [0, 2]); and γ1 and γ2 are random numbers in the range [0, 1].
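As a minimal sketch of these update rules, the Python fragment below performs one binary-PSO iteration following Eqs. (3) and (4); the parameter values (ω = 0.7, c1 = c2 = 2, α = 0.1, Vmax = 4) are assumptions for illustration, not values reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def bpso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, alpha=0.1, vmax=4.0):
    """One binary-PSO iteration implementing Eqs. (3) and (4)."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = (w * v
         + c1 * r1 * (pbest - x)                  # pull toward the personal best
         + c2 * r2 * (gbest - x)                  # pull toward the global best
         + alpha * (rng.random(x.shape) - 0.5))   # random perturbation term
    v = np.clip(v, -vmax, vmax)                   # bound velocities by Vmax
    s = 1.0 / (1.0 + np.exp(-v))                  # sigmoid S(v)
    x = (s > rng.random(x.shape)).astype(int)     # Eq. (4): set bit to 1 when S(v) > rand
    return x, v
```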

Each feature subset can be considered as a point in feature space. The optimal point is the subset with the least length and the highest classification accuracy. The initial swarm is distributed randomly over the search space, with each particle taking one position. The goal of the particles is to fly to the best position. As time passes, their positions are changed by communicating with each other, and they search around the local best and global best positions. Finally, they should converge on good, possibly optimal, positions, since their exploration ability equips them to perform FS and discover optimal subsets.

The velocity of each particle is displayed as a positive integer, and particle velocities are bounded by a maximum velocity Vmax. The velocity shows how many features should be changed to match the global best point; in other words, it measures how fast the particle moves toward the best position. The number of different features (bits) between two particles relates to the difference between their positions. After updating the velocity, a particle's position is updated using the new velocity. Suppose that the new velocity is V: in this case, V bits of the particle that differ from those of Pg are randomly changed. The particles then fly toward the global best while still exploring the search area, instead of simply becoming identical to Pg. Vmax is used as a constraint to control the global exploration ability of particles: a larger Vmax favors global exploration, while a smaller Vmax increases local exploitation. When Vmax is low, particles have difficulty escaping locally optimal regions; if Vmax is too high, the swarm might fly past good solutions. The objective function is computed as follows:

$$F(X_i) = \phi \cdot \gamma_{S_i(t)} + \varphi \cdot \left(n - |S_i(t)|\right) \qquad (5)$$

where $S_i(t)$ is the feature subset found by particle i at iteration t, and $|S_i(t)|$ is its length. Fitness is computed from both the measure of classifier performance, $\gamma_{S_i(t)}$, and the feature subset length. ϕ and φ are two parameters that control the relative weight of classifier performance and feature subset length, with ϕ ∈ [0, 1] and φ = 1 − ϕ. This formula denotes that the classifier performance and the feature subset length have different effects on gene selection.
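A direct transcription of Eq. (5) could look as follows; classifier_score is a hypothetical callback that returns $\gamma_{S_i(t)}$ (e.g., cross-validated accuracy of the classifier on the genes selected by the mask), and ϕ = 0.8 is an assumed weight.

```python
def fitness(mask, classifier_score, phi=0.8):
    """Eq. (5): weigh classifier performance against subset length."""
    n = mask.size                       # total number of genes
    gamma = classifier_score(mask)      # classifier performance on subset S_i(t)
    return phi * gamma + (1 - phi) * (n - mask.sum())
```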

To enhance the optimal solutions in gene selection, the PSO is hybridized with GA. PSO has a lower convergence rate as the number of genes increases; hence time complexity becomes an issue, and the accuracy of gene classification is lowered due to misclassification of important genes. To overcome these issues, the HPSOGA is proposed in this research.

The genetic algorithm is an evolutionary algorithm that mimics the natural selection, crossover, and mutation processes [14]. GA is a stochastic optimization method based on metaheuristic search procedures. GA starts with a matrix forming a population of solutions; each row of this matrix is an individual generated randomly, and each individual represents a solution of an objective function. In GA, every solution is encoded as a gene string called an individual, and the fitness of individuals is computed according to an objective function. The population is improved by combining genetic information from different members of the population, a process called crossover. Another population improvement method is mutation: some individuals of the population are mutated according to the mutation rate of the population.

$$\text{Fitness function} = \frac{f(x_i)}{n} \qquad (6)$$

Algorithm 2: HPSOGA

Step 1: Initialize a population with N individuals
Step 2: Initialize the position and velocity of each particle (gene) in the swarm
Step 3: While the maximum number of iterations is not reached do
Step 4: Set the algorithm factors: objective function f(x), where x = (x1, ..., xd)^T
Step 5: Generate the primary population of particles xi (i = 1, 2, ..., n)
Step 6: Evaluate the objective value f(xi) of each particle xi
Step 7: Initialize the genetic population (gp)
Step 8: Compute the fitness function using (6)
Step 9: Determine the population size, crossover and mutation rates
Step 10: Evaluate the fitness of each member of the generation
Step 11: With the crossover rate, generate offspring, using the ranking mechanism for selection of chromosomes
Step 12: With the mutation rate, generate offspring
Step 13: Keep the fitness chromosomes giving maximum accuracy and lower time complexity
Step 14: If the fitness value is better than its personal best (pBest)
Step 15: Set the current value as the new pBest
Step 16: End if
Step 17: Choose the particle with the best fitness value of all as gBest
Step 18: For each particle
Step 19: Calculate the particle velocity and update the particle position according to Equations (3) and (4)
Step 20: Randomly select a gBest for particle i from the highest ranked solutions
Step 21: Update the velocity and position of the particle based on the best chromosomes, using (3) and (4)
Step 22: Return the most informative gene features
Step 23: Update the pBest and gBest
Step 24: Return the positions of genes
Step 25: End
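The following sketch ties Algorithm 2 together under stated assumptions: a binary PSO swarm (reusing bpso_step from the earlier sketch) whose particles are additionally recombined with one-point crossover and bit-flip mutation each generation. The swarm size, iteration count, and crossover and mutation rates are illustrative choices, and fitness_fn is an Eq. (5)-style function such as the one sketched above.

```python
import numpy as np

rng = np.random.default_rng(1)

def hpsoga_select(fitness_fn, n_genes, swarm=30, iters=50,
                  crossover_rate=0.8, mutation_rate=0.01):
    """Hedged sketch of the HPSOGA loop of Algorithm 2: PSO moves plus
    GA crossover/mutation, returning the best 0/1 gene mask found."""
    X = rng.integers(0, 2, size=(swarm, n_genes))         # random initial swarm
    V = np.zeros((swarm, n_genes))
    fit = np.array([fitness_fn(x) for x in X])
    pbest, pfit = X.copy(), fit.copy()                    # personal bests
    gbest, gfit = X[fit.argmax()].copy(), fit.max()       # global best
    for _ in range(iters):
        for i in range(swarm):                            # PSO step (Eqs. (3)-(4))
            X[i], V[i] = bpso_step(X[i], V[i], pbest[i], gbest)
        order = np.argsort(-fit)                          # rank-based pairing
        for a, b in zip(order[0::2], order[1::2]):
            if rng.random() < crossover_rate:             # one-point crossover
                cut = rng.integers(1, n_genes)
                X[a, cut:], X[b, cut:] = X[b, cut:].copy(), X[a, cut:].copy()
        flip = rng.random(X.shape) < mutation_rate        # bit-flip mutation
        X[flip] = 1 - X[flip]
        fit = np.array([fitness_fn(x) for x in X])
        improved = fit > pfit                             # update pBest
        pbest[improved], pfit[improved] = X[improved], fit[improved]
        if pfit.max() > gfit:                             # update gBest
            gfit, gbest = pfit.max(), pbest[pfit.argmax()].copy()
    return gbest
```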

The efficiency of the gene classification model based on the HPSOGA algorithm relies on the learning acquired from the gene dataset. An appropriately prepared dataset helps the hybrid model attain the desired gene classification performance. The HPSOGA approach finds and ranks the most informative gene features in the given microarray gene dataset to provide an optimal classification result. Thus it produces an effective training model, which is used to maximize the overall classification accuracy.

3.4. Classification algorithm using Improved TSVM (ITSVM)

TSVMs are basically iterative algorithms that gradually search for the optimal separating hyperplane in the feature space through a transductive process that incorporates unlabeled samples in the training phase. This procedure improves the generalization capability of the classifier. Gradually, the separating hyperplane moves to a finer position in subsequent iterations. This can be explained by arguing that reducing the misclassification of transductive samples can lead to the identification of a more reliable discriminant function. The convergence of the learning depends on the similarity between the problems represented by the training points and the unlabeled points. The proposed ITSVM involves both a training phase and a testing phase.

The SVM learner aims to build a decision function $f_L: x \rightarrow \{-1, +1\}$ based on the training set $S_{train}$:

$$f_L = L(S_{train}) \qquad (7)$$

where $S_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.

The TSVM learning additionally includes knowledge of the test set $S_{tst}$ in the training procedure [15]; thus the learning function of the inductive SVM in Eq. (7) can be reformulated as

$$f_L = L(S_{train}, S_{tst}) \qquad (8)$$

Therefore, in the linearly separable case, to find a labeling $y_1^*, y_2^*, \ldots, y_n^*$ of the test data, the hyperplane $\langle \mathbf{w}, b \rangle$ should separate both training and test data with maximum margin:

$$\min_{y_1^*, \ldots, y_n^*, \mathbf{w}, b} \; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} \qquad (9)$$

The computation of $f_L$ can be traced back to the classical Structural Risk Minimization (SRM) approach, which determines the classification decision function by minimizing the empirical risk:

$$R = \frac{1}{N}\sum_{i=1}^{N} \left|f(x_i) - y_i\right| \qquad (10)$$

where N and f represent the number of examples and the classification decision function, respectively. For TSVM, the primary concern is determining an optimal separating hyperplane that gives a low generalization error. This optimal separating hyperplane is determined by giving the largest margin of separation between different classes; it bisects the shortest line between the convex hulls of the two classes and is required to satisfy the constrained minimization of Eq. (9). A TSVM model can be generated for the genes evaluated in the selected data set, from a local problem space, for a particular new sample. The TSVM allows for individual model generation and provides more accurate classification results for the given dataset.

Algorithm 3: ITSVM

Input: Labeled points S = [(xj, yj)], j = 1, 2, ..., l, and unlabeled points V = [(xj)], j = l + 1, ..., n.
Output: TSVM classifier trained with the original training set and the transductive set.

Begin
1. Initialize the working set W(0) = S and the previous transductive set A(0)t = ∅, and specify the regularization parameters C and C*
2. Train the SVM classifier with the working set W(0)
3. Obtain the label vector of the unlabeled set V
for i = 1 to T // T is the number of iterations
4. Select N+ positive transductive samples from the upper side of the margin and N− negative transductive samples from the lower side, respectively
5. Select the positive candidate set B+ containing N+ positive transductive samples and the negative candidate set B− containing N− negative transductive samples, respectively
6. Update the training set W(i) with the selected transductive samples
7. Train the TSVM classifier with the updated training set W(i)
8. Obtain the label vector of the unlabeled set V
End for
End
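The transductive loop of Algorithm 3 can be approximated by a self-training scheme around a standard SVM, as in the Python sketch below (using scikit-learn's SVC). This is a simplification, not the exact constrained optimization of Eq. (9): the counts N+ and N−, the linear kernel, and the iteration budget are assumptions, and labels are taken to be in {−1, +1}.

```python
import numpy as np
from sklearn.svm import SVC

def itsvm_fit(X_lab, y_lab, X_unlab, iters=10, n_pos=5, n_neg=5, C=1.0):
    """Self-training sketch of Algorithm 3: train an SVM, then repeatedly
    move the most confidently classified unlabeled samples (N+ positive,
    N- negative, ranked by decision value) into the training set."""
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    X_tr, y_tr = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()                        # working pool of unlabeled points
    for _ in range(iters):
        if len(pool) == 0:
            break
        scores = clf.decision_function(pool)
        pos = np.argsort(-scores)[:n_pos]        # most confident positives
        neg = np.argsort(scores)[:n_neg]         # most confident negatives
        chosen = np.unique(np.concatenate([pos, neg]))
        X_tr = np.vstack([X_tr, pool[chosen]])   # update the working set W(i)
        y_tr = np.concatenate([y_tr, np.where(scores[chosen] > 0, 1, -1)])
        pool = np.delete(pool, chosen, axis=0)
        clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)   # retrain on W(i)
    return clf
```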

4. EXPERIMENTAL RESULTS

In this section, the overall performance of the gene selection methods is evaluated using six popular binary and multiclass microarray cancer datasets, which were downloaded from http://www.gems-system.org/. These datasets have been widely used to benchmark the performance of gene selection methods in the bioinformatics field. The binary-class microarray datasets are colon [16], leukemia [16, 17], and lung [18], while the multiclass microarray datasets are SRBCT [19], lymphoma [20], and leukemia [21]. Table 1 gives a detailed description of these six benchmark microarray gene expression datasets with respect to the number of classes, the number of samples, and the number of genes.

Table 1 Gene Datasets

Dataset          No. of classes   No. of samples   No. of genes
Colon [16]             2                62              2000
Leukemia [17]          2                72              7129
Lung [18]              2                96              7129
SRBCT [19]             4                83              2308
Lymphoma [20]          3                62              4026
Leukemia [21]          3                72              7129

In this study, the performance of the proposed HPSOGA-ITSVM algorithm is tested by comparing it with other standard bio-inspired algorithms, including ImRMR-HCSO, ImRMR-GSO, mRMR-ABC, EPSOFFA-MANN, and HPSOGA-MANN.

The performance of each gene selection approach is compared using parameters such as classification accuracy, error rate, precision, recall, time complexity, and the number of predictive genes used for cancer classification. Classification accuracy is the overall correctness of the classifier and is calculated as the number of correct cancer classifications divided by the total number of classifications.

Performance metrics

The performance metrics used to measure the prediction results are described as follows.

4.1 Accuracy

The overall accuracy of the system is measured as follows:

$$\text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \times 100 \qquad (11)$$

where Tp is the number of correct predictions that an instance is positive, Tn is the number of correct predictions that an instance is negative, Fp is the number of incorrect predictions that an instance is positive, and Fn is the number of incorrect predictions that an instance is negative.

4.2 Precision

Precision is defined as the ratio of true positives (Tp) to the sum of true positives (Tp) and false positives (Fp):

$$\text{Precision}(P) = \frac{T_p}{T_p + F_p} \qquad (12)$$

4.3 Recall

The recall value is calculated from the true positives (Tp) and false negatives (Fn):

$$\text{Recall}(R) = \frac{T_p}{T_p + F_n} \qquad (13)$$

4.4 F-measure

It is a measure of a test's accuracy, considering both the precision (P) and the recall (R) of the test:

$$F\text{-measure} = \frac{2 \cdot P \cdot R}{P + R} \qquad (14)$$

4.5 Time complexity

An algorithm is superior when it has lower time complexity for the given dataset.

4.6 Error rate

A system is better when the algorithm yields a lower error rate.
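For reference, the four metrics of Eqs. (11)-(14) can be computed directly from confusion-matrix counts; the sketch below assumes a binary confusion matrix has already been tallied.

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure per Eqs. (11)-(14)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100            # Eq. (11), in percent
    precision = tp / (tp + fp)                                  # Eq. (12)
    recall = tp / (tp + fn)                                     # Eq. (13)
    f_measure = 2 * precision * recall / (precision + recall)   # Eq. (14)
    return accuracy, precision, recall, f_measure

# Example: evaluate(50, 40, 5, 5) -> (90.0, 0.909..., 0.909..., 0.909...)
```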

The comparison results for the binary-class microarray datasets (colon, leukemia1, and lung) are shown in Tables 2, 3, and 4, respectively, while Tables 5, 6, and 7 present the comparison results for the multiclass microarray datasets (SRBCT, lymphoma, and leukemia2). From these tables, it is clear that the proposed HPSOGA-ITSVM algorithm performs better than ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, mRMR-ABC, EPSOFFA-MANN, and HPSOGA-MANN in every single case (i.e., on all datasets using different numbers of selected genes).

Table 2 Comparison of classification accuracy (%) between HPSOGA-ITSVM and HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, and mRMR-ABC for the colon dataset

Number    mRMR-   ImRMR-GSO    ImRMR-HCSO   EPSOFFA-   HPSOGA-   HPSOGA-
of genes  ABC     with RFSVM   with TSVM    MANN       MANN      ITSVM
3         87.50   88           89.99        92.01      92.03     93.03
4         88.27   89.9         91.21        93.33      94        95
5         89.5    90           92.56        94.77      95        96
6         90.12   90.80        93.83        95.99      96        97
7         91.64   92           94.41        96.32      97        97
8         91.8    92.2         95.79        97.45      98        99
9         92.11   92.75        96.10        98.22      99        99
10        92.74   93.1         96.77        98.99      99        99.1
15        93.6    94           97.55        99.01      99.05     99.07
20        94.17   94.8         97.89        99.65      99.67     99.99

Table 3 Comparison of classification accuracy (%) between HPSOGA-ITSVM and HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, and mRMR-ABC for the leukemia1 dataset

Number    mRMR-   ImRMR-GSO    ImRMR-HCSO   EPSOFFA-   HPSOGA-   HPSOGA-
of genes  ABC     with RFSVM   with TSVM    MANN       MANN      ITSVM
2         89.63   90           91.34        93.5       94        95
3         90.37   91           92.76        94         95        96
4         91.29   92           93.99        95         96        97
5         92.82   93           94           96         97        98
6         92.82   93           94.41        97         98        99
7         93.10   93.50        95.82        97.66      98        99.8
10        94.44   95           96           98         99        99.83
13        94.93   95           96.33        98.3       99        99.88
14        95.83   96           97           98.99      99        99.95

Table 4 Comparison of classification accuracy (%) between HPSOGA-ITSVM and HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, and mRMR-ABC for the lung dataset

Number    mRMR-   ImRMR-GSO    ImRMR-HCSO   EPSOFFA-   HPSOGA-   HPSOGA-
of genes  ABC     with RFSVM   with TSVM    MANN       MANN      ITSVM
2         95.83   96           97           98         99        99.6
3         96.31   97           98.2         98.55      99        99.7
4         97.91   98           98.7         98.99      99        99.8
5         97.98   99           98.99        99.34      99.79     99.9
6         98.27   98.6         98.99        99.78      99.89     99.97
7         98.53   98.85        98.99        99.79      99.90     99.98
8         98.95   99           99.2         99.8       99.91     99.99

Table 5 Comparison of classification accuracy (%) between HPSOGA-ITSVM and HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, and mRMR-ABC for the SRBCT dataset

Number    mRMR-   ImRMR-GSO    ImRMR-HCSO   EPSOFFA-   HPSOGA-   HPSOGA-
of genes  ABC     with RFSVM   with TSVM    MANN       MANN      ITSVM
2         71.08   71.6         82           85         86        87
3         79.51   80           83           87         88        90
4         84.33   84.9         85           88         89        92
5         86.74   87           88           90         91        94
6         91.56   92           93           95         96        97
7         94.05   94.5         95           97         98        99
8         96.3    96.9         97           98         99        99.88

Table 6 Comparison of classification accuracy (%) between HPSOGA-ITSVM and HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, and mRMR-ABC for the lymphoma dataset

Number    mRMR-   ImRMR-GSO    ImRMR-HCSO   EPSOFFA-   HPSOGA-   HPSOGA-
of genes  ABC     with RFSVM   with TSVM    MANN       MANN      ITSVM
2         86.36   86.9         88           90         91        94
3         90.90   91.2         92           94         95        97
4         92.42   92.8         94           96         97        98
5         96.96   97.1         97.99        98.55      99        99.98

Table 7 Comparison of classification accuracy (%) between HPSOGA-ITSVM and HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO with RFSVM, and mRMR-ABC for the leukemia2 dataset

Number    mRMR-   ImRMR-GSO    ImRMR-HCSO   EPSOFFA-   HPSOGA-   HPSOGA-
of genes  ABC     with RFSVM   with TSVM    MANN       MANN      ITSVM
2         84.72   85.03        86           88         90        92
3         86.11   86.5         87           90         91        94
4         87.5    87.9         88           91         93        95
5         88.88   89           89.5         92         94        96
6         90.27   90.65        91           93         95        97
7         89.49   89.9         92           94.5       95        97
8         91.66   92.05        93           95.69      96        98
9         92.38   92.7         94           97.8       98        99
10        91.66   92.1         95           98         99        99.2
15        94.44   94.85        96           98.33      99        99.5
18        95.67   96           97           98.77      99        99.6
20        96.12   96.5         97.7         99         99.34     99.8

Fig 2 Feature selection results comparison for colon dataset

The comparison results for the binary-class microarray datasets (colon, leukemia1, and lung) are shown in Figs 2, 3, and 4, respectively, while Figs 5, 6, and 7 present the comparison results for the multiclass microarray datasets (SRBCT, lymphoma, and leukemia2). From these figures, it is clear that the proposed HPSOGA-ITSVM algorithm performs better than the HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO, and mRMR-ABC algorithms in every single case (i.e., on all datasets using different numbers of selected genes).

Fig 3 Feature selection results comparison for Leukemia1 dataset


Fig 4 Feature selection results comparison for Lung dataset

Fig 5 Feature selection results comparison for SRBCT dataset

Fig 6 Feature selection results comparison for Lymphoma dataset

Fig 7 Feature selection results comparison for Leukemia2 dataset

Thus, the proposed approach is a promising method for identifying the relevant genes and omitting the redundant and noisy genes. It can be concluded that the proposed HPSOGA-ITSVM algorithm produces accurate classification performance with a minimum number of selected genes when tested on all datasets, as compared to the HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, mRMR-ABC, and ImRMR-GSO algorithms under the same cross-validation approach. Therefore, the HPSOGA-ITSVM algorithm is a promising approach for solving gene selection and cancer classification problems.

Fig 8 Time complexity results comparison for all given datasets

Fig 8 shows the time complexity comparison for the given datasets: the x-axis shows the datasets and the y-axis shows the time complexity value. The experimental result proves that the proposed HPSOGA-ITSVM algorithm has lower time complexity than the existing HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO, and mRMR-ABC methods. Thus, the results show that the proposed HPSOGA-ITSVM algorithm is superior to the existing systems in terms of classification.

Fig 9 Precision results comparison for all given datasets

Fig 9 shows the precision comparison for the given datasets: the x-axis shows the datasets and the y-axis shows the precision value. The experimental result proves that the proposed HPSOGA-ITSVM algorithm provides a higher precision value than the existing HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO, and mRMR-ABC methods. Thus, the results show that the proposed HPSOGA-ITSVM algorithm is superior to the existing systems in terms of classification.

Fig 10 Recall results comparison for all given datasets

Fig 10 shows the recall comparison for the given datasets: the x-axis shows the datasets and the y-axis shows the recall value. The experimental result proves that the proposed HPSOGA-ITSVM algorithm provides a higher recall value than the existing HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO, and mRMR-ABC methods. Thus, the results show that the proposed HPSOGA-ITSVM algorithm is superior to the existing systems in terms of classification.

Fig 11 F-measure results comparison for all given datasets

Fig 11 shows the f-measure comparison for the given datasets: the x-axis shows the datasets and the y-axis shows the f-measure value. The experimental result proves that the proposed HPSOGA-ITSVM algorithm provides a higher f-measure value than the existing HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO, and mRMR-ABC methods. Thus, the results show that the proposed HPSOGA-ITSVM algorithm is superior to the existing systems in terms of classification.


Fig 12 Average error rate

Fig 12 shows the error rate comparison for the given datasets: the x-axis shows the datasets and the y-axis shows the error rate. The experimental result proves that the proposed HPSOGA-ITSVM algorithm provides a lower error rate than the existing HPSOGA-MANN, EPSOFFA-MANN, ImRMR-HCSO with TSVM, ImRMR-GSO, and mRMR-ABC methods. Thus, the results show that the proposed HPSOGA-ITSVM algorithm is superior to the existing systems in terms of classification.

5. CONCLUSION AND FUTURE WORK

Microarray data can be used in the discovery and prediction of cancer classes, and various approaches are used for efficient gene selection to produce cancer classification results. In this research work, HPSOGA-ITSVM is proposed to improve the overall system performance. The research has three modules: preprocessing, feature selection, and classification. The preprocessing removes noisy data from the dataset; it is done using the IKNN algorithm, which fills the missing values and removes the redundant values effectively. It provides a reduced dataset, which is passed to the feature selection process. The important and relevant genes are selected using HPSOGA; this optimization algorithm generates the best fitness function values and provides optimal features. The features are then passed to the classification phase, where the ITSVM classifier is trained and tested using the selected genes and returns the classification accuracy. The results prove that the proposed system achieves better classification: higher accuracy, precision, recall, and f-measure values, indicating better cancer classification for the specified microarray database, while reducing the time complexity and error rates significantly. Thus the results conclude that the proposed system is better than the existing system.

REFERENCES

1. Ntzani, Evangelia E., and John P. A. Ioannidis. "Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment." The Lancet 362.9394 (2003): 1439-1444.
2. Tseng and C.-P. Kao. "Efficiently mining gene expression data via a novel parameterless clustering method." IEEE/ACM Transactions on Computational Biology and Bioinformatics 2.4 (2005): 355-365.
3. Layana, C., and L. Diambra. "Dynamical analysis of circadian gene expression." International Journal of Biological and Life Sciences 8.3 (2007): 101-105.
4. García, S., J. Luengo, and F. Herrera. Data Preprocessing in Data Mining. Berlin: Springer, 2015.
5. Catak, Ferhat Ozgur. "Genetic Algorithm based Feature Selection in High Dimensional Text Dataset Classification."
6. Abeel, T., T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys. "Robust biomarker identification for cancer diagnosis with ensemble feature selection methods." Bioinformatics 26.3 (2010): 392-398.
7. Martinez, A. M., and A. C. Kak. "PCA versus LDA." IEEE Transactions on Pattern Analysis and Machine Intelligence 23.2 (2001): 228-233.
8. Alba, Enrique, et al. "Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms." IEEE Congress on Evolutionary Computation (CEC 2007). IEEE, 2007.
9. Banerjee, M., S. Mitra, and H. Banka. "Evolutionary Rough Feature Selection in Gene Expression Data." IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 37.4 (2007): 622-632.
10. Subanya, B., and R. R. Rajalaxmi. "Artificial bee colony based feature selection for effective cardiovascular disease diagnosis." International Journal of Scientific & Engineering Research 5.5 (2014).
11. Xu, Yunfeng, Ping Fan, and Ling Yuan. "A simple and efficient artificial bee colony algorithm." Mathematical Problems in Engineering 2013 (2013).
12. Chiş, M. "A new evolutionary hierarchical clustering technique." Babeş-Bolyai University Research Seminars, Seminar on Computer Science (2000): 13-20.
13. Duval, Béatrice, Jin-Kao Hao, and Jose Crispin Hernandez Hernandez. "A memetic algorithm for gene selection and molecular classification of cancer." Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. ACM, 2009.
14. Wu, Fang-Xiang, W. J. Zhang, and Anthony J. Kusalik. "A Genetic K-means Clustering Algorithm Applied to Gene Expression Data." Lecture Notes in Computer Science 2671 (2003): 520-526.
15. Wang, Junhui, Xiaotong Shen, and Wei Pan. "On transductive support vector machines." Contemporary Mathematics 443 (2007): 7-20.
16. Alon, U., N. Barkai, D. A. Notterman, et al. "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays." Proceedings of the National Academy of Sciences of the United States of America 96.12 (1999): 6745-6750.
17. Golub, T. R., D. K. Slonim, P. Tamayo, et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286.5439 (1999): 531-537.
18. Beer, D. G., S. L. R. Kardia, C.-C. Huang, et al. "Gene-expression profiles predict survival of patients with lung adenocarcinoma." Nature Medicine 8.8 (2002): 816-824.
19. Khan, J., J. S. Wei, M. Ringnér, et al. "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks." Nature Medicine 7.6 (2001): 673-679.
20. Alizadeh, A. A., M. B. Eisen, R. E. Davis, et al. "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling." Nature 403.6769 (2000): 503-511.
21. Armstrong, S. A., J. E. Staunton, L. B. Silverman, et al. "MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia." Nature Genetics 30.1 (2002): 41-47.
