
Supervised learning approaches and feature selection – a case study in diabetes

Yugowati Praharsi

Department of Industrial and System Engineering, Chung Yuan Christian University, Chung Li, 32023, Taiwan
and
Department of Information Technology, Satya Wacana Christian University, Salatiga, 50711, Indonesia

E-mail: [email protected]

Shaou-Gang Miaou

Department of Electronic Engineering, Chung Yuan Christian University, Chung Li, 32023, Taiwan

E-mail: [email protected]

Hui-Ming Wee*

Department of Industrial and System Engineering, Chung Yuan Christian University,

No. 200, Chung Pei Rd., Chungli, 32023, Taiwan
Fax: +886-3-2654499
E-mail: [email protected]
*Corresponding author

Abstract: Data description and classification are important tasks in supervised learning. In this study, three supervised learning methods, namely k-nearest neighbour (k-NN), support vector data description (SVDD) and support vector machine (SVM), are considered because they do not suffer from the problem of introducing a new class. The dataset chosen is the Pima Indians diabetes dataset. The results show that feature selection based on mean information gain and a standard deviation threshold can be considered as a substitute for forward selection. This indicates that data variation, captured through information gain, is an important factor to consider when selecting a feature subset. Finally, among the eight candidate features, glucose level is the most prominent feature for diabetes detection across all classifiers and feature selection methods under consideration. Relevancy measurement with information gain can rank features from the most important to the least significant. This can be very useful in medical applications such as defining feature prioritisation for symptom recognition.


Reference to this paper should be made as follows: Praharsi, Y., Miaou, S-G. and Wee, H-M. (xxxx) ‘Supervised learning approaches and feature selection – a case study in diabetes’, Int. J. Data Analysis Techniques and Strategies, Vol. X, No. Y, pp.000–000.

Biographical notes: Yugowati Praharsi is a PhD student in the Department of Industrial and Systems Engineering at Chung Yuan Christian University in Taiwan. She received her BSc in Mathematics from Satya Wacana Christian University, Indonesia and her MSc in Electronic Engineering from Chung Yuan Christian University, Taiwan. Her research interests are in the fields of mathematical modelling, operations research, and supply chain management.

Shaou-Gang Miaou is a Professor in the Department of Electronic Engineering at Chung Yuan Christian University in Taiwan. He received his BS in Electronic Engineering from Chung Yuan Christian University, Taiwan, and his MS and PhD in Electrical Engineering from the University of Florida, USA. His research interests are in the fields of image processing, biomedical signal processing and pattern recognition. His publications include four patents, ten books and over 120 journal and conference papers. He is a senior member of the IEEE.

Hui-Ming Wee is a Professor in the Department of Industrial and Systems Engineering at Chung Yuan Christian University in Taiwan. He received his BSc (Hons.) in Electrical and Electronic Engineering from Strathclyde University, UK, his MEng in Industrial Engineering and Management from the Asian Institute of Technology (AIT) and his PhD in Industrial Engineering from Cleveland State University, Ohio, USA. His research interests are in the fields of production/inventory control, optimisation and supply chain management. His publications include four books and over 200 refereed journal papers.

1 Introduction

The rapid development of technology leads to increasing data accumulation, which yields valuable collections of facts and information. As a result, data storage has become a typical method of preserving information and important facts. However, in order to obtain valuable information from these data, an effective learning approach to explore the data must be employed.

The learning approaches used in data exploration can be categorised into supervised and unsupervised. Supervised learning classifies data based on the labels of the input data, whereas classifying data without labels is called unsupervised learning. Several methods have been proposed to solve such learning problems, for example Naive Bayes, k-means, principal component analysis (PCA), k-nearest neighbour (k-NN), support vector data description (SVDD), support vector machine (SVM), and artificial neural networks. These methods involve class labels and features of the input data (Duda et al., 2000; Ji et al., 2008; Smith, 2009).


When a new class is introduced, these three classifiers do not require complete retraining, or need only a small-scale retraining. These three classifiers are also extended by implementing feature selection methods for higher classification accuracy.

Features used to describe data are not equally important for a given problem. Two main properties of features that influence the performance of classifiers are redundancy and relevancy. Minimising the inter-correlation among features avoids redundancy, while measuring the correlation between each feature and the class yields the relevancy of each feature to the class. In this study, a feature subset is selected using a forward selection method and a correlation method, which are based on the wrapper approach and the filter approach of feature selection theory, respectively. As introduced by Kittler (1978) and mentioned by Hall (1998), the former begins with an empty set, and features are then added one by one until no added feature can produce higher accuracy. The latter measures feature-feature intercorrelations and feature-class correlations. The correlation between features and the class label is measured by entropy and information gain, while the feature-feature intercorrelation uses the Pearson correlation.

The remaining parts of this paper are organised as follows. SVM, SVDD, and NN are given in Sections 2, 3, and 4, respectively. Feature selection is described in Section 5 and diabetes is presented in Section 6. Experimental designs are given in Section 7. Finally, experimental results and some conclusions are summarised in Sections 8 and 9, respectively.

2 Support vector machine

The basic idea of an SVM is to construct a hyperplane such that the margin of separation, $\rho = 2/\sqrt{w^T w}$, is maximised when classifying the data into positive and negative classes. SVM looks for the optimal separating hyperplane (OSH) so that it can classify the data correctly. The construction of the optimal hyperplane is preceded by a non-linear mapping into a high-dimensional feature space so that the mapped data can be separated linearly. In addition, the relation between the mapped data and the input data is non-linear (Cortes and Vapnik, 1995; Haykin, 1999; Huang and Wang, 2006).

For an unseen/test data point z, its class can be obtained by the decision function:

$$D(z) = \operatorname{sign}\left(w^T \varphi(z) + b\right) \qquad (1)$$

and the decision rules are as follows:

• if D(z) > 0, φ(z) belongs to the positive class
• if D(z) < 0, φ(z) belongs to the negative class
• if D(z) = 0, φ(z) lies on the separating hyperplane and cannot be assigned to either class.

where:

w  weight vector
b  bias of the separating hyperplane
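For illustration only (the paper's experiments were run in MATLAB), the decision function of equation (1) can be evaluated in its kernel (dual) form without computing φ(z) explicitly, since the weight vector expands over the support vectors. The names support_vectors, alphas and the RBF width s below are hypothetical placeholders for quantities produced by SVM training.

```python
import numpy as np

def rbf_kernel(x, z, s=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * s^2)), the RBF kernel with width s
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * s ** 2))

def svm_decision(z, support_vectors, alphas, labels, b, s=1.0):
    # D(z) = sign(w^T phi(z) + b), with w = sum_i alpha_i * y_i * phi(x_i),
    # so w^T phi(z) = sum_i alpha_i * y_i * K(x_i, z)
    wz = sum(a * y * rbf_kernel(x, z, s)
             for a, y, x in zip(alphas, labels, support_vectors))
    return np.sign(wz + b)
```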


3 Support vector data description

The basic idea of SVDD is to create a description, i.e., a closed boundary that contains the training data, and then to detect whether new data have the same nature as the training data. The purpose of data description is to provide a compact closed boundary around the training dataset, which is called a hyper-sphere. The sphere has a centre a and a radius R > 0. The main idea is to minimise the volume of the sphere by minimising R².

For a test data point z, it is accepted if:

$$\|z - a\|^2 \le R^2$$
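A minimal sketch of this acceptance test follows, assuming the centre a and radius R have already been obtained by SVDD training (in the kernelised case the distance would instead be computed through kernel evaluations).

```python
import numpy as np

def svdd_accept(z, a, R):
    # Accept z if it falls inside the hyper-sphere: ||z - a||^2 <= R^2
    return np.sum((np.asarray(z) - np.asarray(a)) ** 2) <= R ** 2
```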

4 Nearest neighbour

The nearest neighbour method classifies data based on the class of their nearest neighbours. In k-NN, a data point is classified by a majority vote of its k nearest neighbours. k-NN is a modest and basic classification method and can be used as a first step in learning classification when there is no prior knowledge about the data distribution. The k-NN classifier is based on the Euclidean distance between the test data and the training data (Cunningham and Delany, 2007; Peterson, 2009).
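As an illustrative sketch of the rule just described (Euclidean distance plus majority vote), where X_train and y_train are hypothetical NumPy arrays of training features and labels:

```python
import numpy as np
from collections import Counter

def knn_predict(z, X_train, y_train, k=5):
    # Euclidean distances from the test point z to every training point
    dists = np.linalg.norm(X_train - z, axis=1)
    # Indices of the k nearest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```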

5 Feature selection

Feature selection is the process of selecting a subset of a pre-existing set of features. The selected features should carry good generalisation capabilities for designing the classifier (Liu and Yu, 2005; Theodoridis and Koutroumbas, 1999). Feature selection is an optimisation technique that evaluates which features are relevant to the class, so as to improve accuracy, and reduces the feature dimensionality by removing features that have a high mutual correlation (Chen and Cheng, 2009). There are several evaluation criteria for feature selection methods, such as information gain (InfoGain), gain ratio, and the correlation-based feature selector (Cfs).

5.1 Correlation-based feature selection

Correlation-based feature selection (Cfs) evaluates a subset S containing k features by its merit (Ghiselli, 1964; Hall, 1998):

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{fc}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (3)$$

where $\overline{r_{fc}}$ is the mean class-feature correlation (intracorrelation) and $\overline{r_{ff}}$ is the mean feature-feature intercorrelation.
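A small sketch of the merit computation in equation (3), assuming the feature-class correlations (information gain in this study) and the pairwise feature-feature Pearson correlations of a candidate subset have already been computed:

```python
import numpy as np

def cfs_merit(r_fc, r_ff_pairs):
    # r_fc: feature-class correlations of the k features in the subset
    # r_ff_pairs: Pearson correlations of all feature-feature pairs in the subset
    k = len(r_fc)
    rfc_bar = float(np.mean(r_fc))
    rff_bar = float(np.mean(r_ff_pairs)) if len(r_ff_pairs) else 0.0
    return k * rfc_bar / np.sqrt(k + k * (k - 1) * rff_bar)
```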

5.2 Entropy

Uncertainty in a system due to randomness is often measured by entropy. The concept of entropy in information theory is introduced by Quinlan (1993), as mentioned in Hall (1998). The entropy of a feature Y is defined as:

$$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) \qquad (4)$$

The entropy of Y after partitioning on feature X is given by:

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \qquad (5)$$

The difference between the entropy of Y before and after partitioning is called the information gain, formulated as:

$$\mathrm{Information\ Gain} = H(Y) - H(Y \mid X) \qquad (6)$$
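A minimal sketch of equations (4)-(6) for discrete-valued features (the continuous Pima attributes would first need to be discretised, e.g., by binning; that step is not shown here):

```python
import numpy as np

def entropy(y):
    # H(Y) = -sum_y p(y) log2 p(y), equation (4)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, x):
    # IG = H(Y) - H(Y|X), equations (5) and (6), for a discrete feature x
    h_cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond
```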

6 Diabetes

A blood sugar/glucose level that is too high can lead to diabetes. The body needs glucose to produce energy, and the blood provides it. Glucose comes from the food we eat and is also produced by the liver and muscle. The blood carries glucose to the body cells, and insulin helps the glucose to be absorbed into the cells. Insulin is a hormone produced by the pancreas. If the body is unable to produce enough insulin, or the insulin does not work properly, the glucose cannot be absorbed into the body cells. Consequently, the level of blood glucose increases. If the glucose level exceeds the normal limit, it causes diabetes.

There are two types of diabetes, namely type one and type two. Type one usually affects children, adolescents, or young adults. It is characterised by the destruction of beta cells due to an autoimmune process, so that the pancreas can no longer produce insulin. Type two can affect people at any age. It is characterised by insulin disorder, obesity, and a liver that does not use insulin properly (Alberti and Zimmet, 1998; What Diabetes Is).


Women with a history of gestational diabetes also carry a high risk for type 2 diabetes later in life (Ben-Haroush et al., 2004; Cheung and Byth, 2003; Lauenborg et al., 2004).

7 Experiment design

The dataset used in the study is the Pima Indians dataset from the UCI Machine Learning Database. The Pima Indians dataset involves only females at least 21 years old. There are 768 instances, of which 268 are diabetes patients and 500 are not. Each instance contains eight attributes/features and one class attribute. The information relating to the dataset is summarised in Table 1.

Table 1 Dataset of Pima Indians

Num. of classes: 2
Num. of features: 8
Num. of data in each class: 268 (diabetes), 500 (normal)
Num. of total data: 768

The attributes/features are number of times pregnant, oral glucose tolerance test (OGTT), diastolic blood pressure, triceps skin fold (TSF) thickness, 2-hour serum insulin, body mass index (BMI), diabetes pedigree function/heredity, and age. For class attribute, class value 1 is interpreted as diabetes patients and class value 0 is interpreted as non-diabetes patients (Sigillito, 2008).

7.1 Performance evaluation measure

The performance of each classifier is evaluated using the confusion matrix given in Figure 1. The following evaluation criteria are used in this study:

Figure 1 Confusion matrix

                      Predicted class
                      P        N
Actual class    P     TP       FN
                N     FP       TN

$$TP\ \mathrm{Rate} = \frac{TP}{TP + FN} \qquad (7)$$

$$TN\ \mathrm{Rate} = \frac{TN}{TN + FP} \qquad (8)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (10)$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (11)$$

$$F\mathrm{-score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \qquad (12)$$

$$\mathrm{Kappa\ Value} = \frac{\mathrm{Observed\ Agreement} - \mathrm{Chance\ Agreement}}{1 - \mathrm{Chance\ Agreement}} \qquad (13)$$

$$G\mathrm{-mean} = \sqrt{TN_{rate} \cdot TP_{rate}} \qquad (14)$$

All of the above evaluation criteria are used because each has its own strengths and weaknesses; no single criterion works best on all data. The criteria are defined as follows. TP rate, or recall, is the proportion of positive data that are correctly identified. TN rate is the proportion of negative data that are classified correctly. Precision is the proportion of predicted positive data that are correct. Accuracy is the proportion of the total number of predictions that are correct. F-score measures the balance between precision and recall. The kappa value is an index that compares the observed agreement against the agreement that would be expected to occur by chance. The geometric mean (g-mean) is a type of mean that indicates the central tendency of a set of numbers. In Kubat and Matwin (1997), the geometric mean of two quantities (the TP rate and the TN rate) is used as an extra criterion. Here, g-mean is used as an extra criterion alongside accuracy for imbalanced training data, because it considers the number of correct predictions for both positive and negative data.
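For reference, equations (7)-(12) and (14) translate directly into code from the confusion-matrix counts; the kappa value of equation (13) additionally requires the chance agreement and is omitted from this sketch.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    tp_rate = tp / (tp + fn)                      # eq. (7), also recall, eq. (10)
    tn_rate = tn / (tn + fp)                      # eq. (8)
    precision = tp / (tp + fp)                    # eq. (9)
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # eq. (11)
    f_score = 2 * tp_rate * precision / (tp_rate + precision)  # eq. (12)
    g_mean = math.sqrt(tn_rate * tp_rate)         # eq. (14)
    return {"TP rate": tp_rate, "TN rate": tn_rate, "precision": precision,
            "accuracy": accuracy, "F-score": f_score, "g-mean": g_mean}
```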

In this study, evaluation is done using 10-fold cross-validation. In 10-fold cross-validation, the training set is divided into 10 subsets of equal size. Each subset is held out in turn for testing while the classifier is trained on the remaining nine subsets, until all ten subsets have been used. Therefore, every data point in the entire training set is predicted once, and the cross-validation accuracy is the percentage of correctly classified data. A grid search on C (the penalty weight of error) and s (the RBF kernel parameter) is carried out using cross-validation. This work used MATLAB 7.0 to run the programme.
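The paper implemented this search in MATLAB 7.0; purely as an illustration of the same idea, a scikit-learn sketch of the grid search over C and the RBF parameter might look like the following, where X and y are hypothetical arrays holding the 768 instances and their class labels (scikit-learn parameterises the RBF kernel with gamma rather than s, with gamma = 1/(2 s^2)).

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 0.3, 0.6, 1.0, 10.0],       # candidate penalty weights
    "gamma": [0.001, 0.01, 0.1, 1.0],      # candidate RBF kernel parameters
}
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid,
                      cv=StratifiedKFold(n_splits=10),
                      scoring="accuracy")
# search.fit(X, y)                         # X: 768 x 8 features, y: 0/1 labels
# print(search.best_params_, search.best_score_)
```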

7.2 Classifiers

Three classifiers are compared in this research: nearest neighbour (NN with k = 1, 3, 5, 7, 9, 11), SVDD and SVM. All the performance results of classifiers were obtained through 10-fold cross validation to minimise the impacts of data dependency and prevent the over-fitting problem. The grid algorithm for all classifiers is given in Figure 2.


Figure 2 A flow chart showing the grid algorithm for SVM, NN, and SVDD classifiers

The grid algorithm initialises C and s, trains the SVM, NN, and SVDD classifiers on the training set using 10-fold cross-validation, and repeats the grid search until the termination criterion is met; the optimised C and s are then applied to the testing set.

7.3 Feature selection methods

In order to improve the performance of the classifiers, feature selection methods are applied. In this study, two feature selection methods are used: forward selection search and a correlation approach. The purpose of feature selection is to obtain a feature subset whose features are strongly correlated with the class and uncorrelated with each other.

The feature correlation measures used here are the Pearson correlation and entropy. The Pearson correlation is used to measure the feature-feature correlation because it indicates a linear relationship; two features are similar if they have a strong relationship. Entropy is used to measure the feature-class correlation because it performs well for data with nominal class values. The flow chart for all classifiers with feature selection is given in Figure 3.

The best subset is obtained using several methods (a combined sketch of these selection strategies appears after this list):

1 Forward selection heuristic search

   This uses a greedy method to obtain the best subset: beginning with the empty set, features are added one by one until no added feature can produce higher accuracy. The total number of possible feature subsets that could be formed from n features is:

   $$\mathrm{number\ of\ subsets} = \sum_{i=1}^{n} C_i^n = 2^n - 1$$


2 The best subset is obtained based on the best merit. The merit is calculated using equation (3), where $\overline{r_{ff}}$ is derived from the Pearson correlation and $\overline{r_{fc}}$ is derived from the information gain.

3 Using a threshold to select subsets: select subsets consisting of the features whose information gain is above the threshold. The following two thresholds are considered:

   a threshold = mean information gain
   b threshold = mean information gain – 0.5 * standard deviation.

   Two thresholds are tested in order to see which one is better. The base threshold is the mean information gain; the first threshold considers only this base, while the second also takes the variation of the data (the standard deviation) into consideration.
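A combined sketch of methods 1 and 3 above (illustrative only): here `evaluate` stands for any scoring routine, e.g., the 10-fold cross-validated accuracy of a chosen classifier on the candidate subset, and the information gains are assumed to have been computed beforehand.

```python
import numpy as np

def forward_selection(features, evaluate):
    # Method 1: greedy forward selection starting from the empty set
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break                      # no remaining feature improves accuracy
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score

def select_by_info_gain(feature_names, info_gains, use_stdev=False):
    # Method 3: keep features whose information gain exceeds the threshold
    ig = np.asarray(info_gains, dtype=float)
    threshold = ig.mean() - (0.5 * ig.std() if use_stdev else 0.0)
    return [name for name, g in zip(feature_names, ig) if g > threshold]
```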

Figure 3 A flow chart of the main programme for SVDD, SVM, and NN classifiers with feature selection

The main programme starts from the positive and negative training data and the parameters C and s, selects the best feature subset (by forward selection, merit, or thresholding on mean information gain with or without the standard deviation), runs 10-fold cross-validation of the SVDD, SVM, and NN classifiers on the target and testing data, and outputs the accuracy, error, precision, recall, F-score, TP rate, TN rate, FP rate, kappa value, and g-means.


8 Results and discussion

8.1 Performances of supervised learning approaches without feature selection

The performance evaluation measures without feature selection are summarised in Table 2. According to Table 2, the TN rate (the proportion of non-diabetes patients that are classified correctly) and the TP rate (the proportion of diabetes patients that are correctly identified) are not balanced. This is due to the imbalanced training datasets (Liu et al., 2006). In this study, the accuracy of the SVM classifier (78.3%) outperforms that of the other two classifiers and the result reported by Bacauskiene et al. (2009) (76.9%). The optimal parameters are C = 0.6 and s = 1 for SVM and C = 0.3 and s = 750 for SVDD. After trying k = 1, 3, 5, 7, 9, 11, the optimal k for k-NN is k = 5. The performance of 1-NN, or simply NN, is included here for comparison due to its widespread use and simplicity. The table shows that SVDD has the lowest accuracy among the classifiers. Consequently, SVDD cannot be applied effectively to the two classes in the Pima Indians diabetes database.

Table 2 Performance evaluation measure for supervised learning without feature selection

Supervised methods    TP rate    TN rate    Accuracy    Precision    Recall    F-score    Kappa value    g-means

Table 3 provides the computational time for the training and testing stages of each classifier without feature selection. The number of total data used in this study is 768. The times were generated using a personal computer (PC) with an Intel Pentium 1.6 GHz CPU and 512 MB of RAM. This work used MATLAB Version 7.0 Release 14 to run the programme. As expected, all classifiers need more time in the training phase than in the testing phase. Moreover, SVM consumes much more time than the other classifiers.

Table 3 Computational time for the classifier without feature selection


8.2 Performances of supervised learning approaches with feature selection


The best accuracy obtained with feature selection is also higher than that of the genetic algorithm (GA) driven SVM proposed by Bacauskiene et al. (2009) (77.6%). However, the improvement in classification accuracy obtained for the data presented in Table 4 is rather marginal. Other performance evaluation measures are given in Table 5 to Table 8. In addition, using many evaluation criteria increases the confidence in the results.

Table 4 Accuracy performance for supervised learning with feature selection

Methods                                                1-NN     5-NN     SVDD     SVM
Without feature selection                              70.1%    75.9%    49.5%    78.3%
Feature selection:
  Forward selection                                    71.1%    76.2%    67.5%    78.3%
  Correlation (merit)                                  62.5%    68.4%    32.1%    74.3%
  Thresholding on mean information gain                69.9%    74.1%    26.3%    75.9%
  Thresholding on mean InfoGain – 0.5 * standardDev    71.2%    74.3%    57.1%    77%

Table 5 SVM performance with feature selection

Method    TP rate    TN rate    Accuracy    Precision    Recall    F-score    Kappa value    g-means

Table 6 SVDD performance with feature selection

Method               TP rate   TN rate   Accuracy   Precision   Recall   F-score   Kappa value   g-means
Forward              0.55      0.74      0.675      0.524       0.55     0.534     0.285         0.634
Merit                0.588     0.182     0.321      0.281       0.588    0.377     –0.171        0.294
Mean info            0.573     0.102     0.263      0.249       0.573    0.347     –0.248        0.236
Mean info and stdev  0.592     0.56      0.571      0.414       0.592    0.483     0.138         0.566

Table 7 1-NN performance with feature selection

Method    TP rate    TN rate    Accuracy    Precision    Recall    F-score    Kappa value    g-means


Table 8 5-NN performance with feature selection

Method    TP rate    TN rate    Accuracy    Precision    Recall    F-score    Kappa value    g-means

Table 9 Computational time (in seconds) with feature selection

Methods                                                1-NN        5-NN        SVDD        SVM
Without feature selection                              52.14       57.35       47.82       961.22
Feature selection:
  Forward selection                                    14,167.12   17,632.11   11,463.77   377,638.1
  Correlation (merit)                                  50.52       56.39       45.57       999.95
  Thresholding on mean information gain                50.75       58.10       47.41       943.19
  Thresholding on mean InfoGain – 0.5 * standardDev    51.36       57.63       46.61       952.14

Table 10 The best feature subsets selected

Methods: NN, SVDD, SVM (binary codes follow the feature order: pregnant, glucose, diastolic, TSF, insulin, BMI, pedigree, age)

• Forward selection: SVDD – age; SVM – pregnant, glucose, diastolic, TSF, insulin, BMI, pedigree, age
• Correlation (merit), for 1-NN and 5-NN, SVDD and SVM: glucose (01000000)
• Thresholding on mean info gain, for 1-NN and 5-NN, SVDD and SVM: glucose, BMI (01000100)
• Thresholding on mean InfoGain – 0.5 * standardDev, for 1-NN and 5-NN, SVDD and SVM: pregnant, glucose, BMI, age (11000101)


Table 9 provides the total computational time involving both the training and testing stages. The computer specification is the same as that used for Table 3, except that the results for SVM and 5-NN forward selection were generated using a PC with an Intel Pentium 2.81 GHz CPU and 1 GB of RAM. Observing the columns, SVDD is the fastest; this happens because SVDD is a one-class classifier. Observing the rows, the mean information gain with standard deviation threshold is much more efficient than forward selection.

The best feature subset (Table 10) is chosen according to the highest accuracy obtained by each feature selection method for each classifier. According to the results given in Table 5 to Table 8, forward selection is the best feature selection method for all classifiers, except for the 1-NN case, where the mean information gain and standard deviation threshold performs best. 1-NN thus uses four relevant features, namely pregnant, glucose, BMI and age, while 5-NN uses seven features, i.e., pregnant, glucose, diastolic, insulin, BMI, pedigree, and age. SVDD uses just a single feature (age) that is most relevant for describing its structure, while SVM uses all eight features.

Table 11 Ranking in relevance of each feature to class

Feature                     Information gain
Glucose                     0.1686
BMI                         0.0822
Age                         0.0488
Pregnant                    0.0469
Heredity                    0.0225
Serum insulin               0.0197
Diastolic blood pressure    0.0160
Triceps skin fold           0.0094

According to Table 11, glucose has the highest information gain. It means that glucose has the highest relevance to class. Note also that glucose feature is adopted in all columns of Table 10, except SVDD with forward selection.

9 Conclusions and future work


SVDD cannot be applied to this dataset. Relevancy measurement using information gain can be used to rank features from the most important to the least important. This can be useful in medical applications such as defining feature prioritisation. In the future, this work can be extended by applying imbalanced SVM (ISVM) to address the imbalanced training datasets.

References

Alberti, K.G.M.M. and Zimmet, P.Z. (1998) ‘Definition, diagnosis and classification of diabetes mellitus and its complications part 1: diagnosis and classification of diabetes mellitus provisional report of WHO consultation’, Diabetic Medicine, Vol. 15, No. 7, pp.539–553.

Bacauskiene, M., Verikas, A., Gelzinis, A. and Valincius, D. (2009) ‘A feature selection technique for generation of classification committees and its application to categorization of laryngeal images’, Pattern Recognition, Vol. 42, No. 5, pp.645–654.

Ben-Haroush, A., Yogev, Y. and Hod, M. (2004) ‘Epidemology of gestational diabetes mellitus and its association with type 2 diabetes’, Diabetic Medicine, Vol. 21, No. 2, pp.103–113.

Chen, Y-S. and Cheng, C-H. (2009) ‘Evaluating industry performance using extracted RGR rules based on feature selection and rough sets classifiers’, Expert Systems with Applications, Vol. 36, No. 5, pp.9448–9456.

Cheung, N.W. and Byth, K. (2003) ‘Population health significance of gestational diabetes’,

Diabetes Care, Vol. 26, No. 7, pp.2005–2009.

Cortes, C. and Vapnik, V. (1995) ‘Support vector networks’, Machine Learning, Vol. 20, No. 3, pp.273–297.

Cunningham, P. and Delany, S.J. (2007) k-Nearest Neighbour Classifiers, Artificial Intelligence Group, Department of Computer Science, Trinity College, Dublin.

Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification, pp.526–527, John Wiley & Sons, Inc.

Ghiselli, E.E. (1964) Theory of Psychological Measurement, McGraw Hill.

Hall, M.A. (1998) Correlation-based Feature Selection for Machine Learning, The University of Waikato, Hamilton, New Zealand.

Haykin, S. (1999) ‘Support vector machine’, in Neural Network: A Comprehensive Foundation, pp.318–350, Prentice-Hall, New Jersey.

Huang, C.L. and Wang, C.J. (2006) ‘A GA-based feature selection and parameters optimization for support vector machines’, in Expert Systems with Applications Journal, Vol. 31, No. 2, pp.231–240.

Ji, R., Liu, D., Wu, M. and Liu, J. (2008) ‘The application of SVDD in gene expression data clustering’, Paper presented at the Proc. of the 2nd Int. Conf. on Bioinformatics and Biomedical Engineering.

Kittler, J. (1978) ‘Feature set search algorithms’, in C.H. Chen (Ed.): Pattern Recognition and Signal Processing, the Netherlands.

Kubat, M. and Matwin, S. (1997) ‘Addressing the curse of imbalanced training sets: one-sided selection’, Paper presented at the Proc. of the 14th Int. Conf. on Machine Learning.

Lauenborg, J., Hansen, T., Jensen, D.M., Vestergaard, H., Molsted-Pedersen, L., Hornnes, P. et al. (2004) ‘Increasing incidence of diabetes after gestational diabetes’, Diabetes Care, Vol. 27, No. 5, pp.1194–1199.

Lee, K.Y., Kim, D.W., Lee, K.H. and Lee, D. (2007) ‘Density-induced support vector data description’, IEEE Trans. on Neural Networks, Vol. 18, No. 1, pp.284–289.


Liu, Y-H., Chen, Y-T. and Lu, S-S. (2006) ‘Face detection using kernel PCA and imbalanced SVM’, in Advances in Natural Computation, Vol. 4221, pp.351–360, Springer, Berlin Heidelberg New York.

Peterson, L.E. (2009) k-Nearest Neighbor, available at http://www.scholarpedia.org/article/K-nearest_neighbor.

Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kauffman.

Sigillito, V. (2008) Pima Indians Diabetes Data Set, available at http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes (accessed on 12 December 2009).

Smith, L.I. (2009) Tutorial on Principal Component Analysis, 27 June, available at http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf (accessed on 26 February 2002).

Tax, D.M.J. and Duin, R.P.W. (1999) ‘Support vector domain description’, Pattern Recognition Letter, Vol. 20, Nos. 11–13, pp.1191–1199.

Tax, D.M.J. and Duin, R.P.W. (2004) ‘Support vector data description’, Machine Learning Journal, Vol. 54, No. 1, pp.45–66.
