
Article in Talent Development and Excellence · June 2020



Data Mining and Exploration: A Comparison Study among Data Mining Techniques on Iris Data Set

Taher M. Ghazal, Mohammed A. M. Afifi* and Deepak Kalra

*Correspondence:

[email protected]

1School of Information Technology,

Skyline University College, Sharjah, United Arab Emirates.

Full list of authors is available at the end of the article.

Abstract

This work investigates the efficiency of diverse classification methods on the well-known Iris data set using the WEKA software. Receiver Operating Characteristic (ROC) curves were adopted to assess the performance of the classification algorithms. The techniques compared in this work are neural networks, naïve Bayes and decision trees. The Iris data set used in our investigation is one of the oldest and most widely used data sets in data mining. A comparison of the ROC curves for the three classification techniques indicates that the Neural Network (NN) is the most appropriate of the methods investigated in this work. The classical classification procedures of the other two methods, the Bayes network classifier and decision trees, leave room for significant improvement.

Keywords: Data Mining, Iris data, Decision Trees, Naïve Bayes, Neural Networks, ROC Curve

Introduction

Assessing model performance is a central question in data mining. One way of doing this is the confusion matrix, a tabular form presenting both the actual and the predicted classifications. Another way to assess the performance of data mining techniques, which goes hand in hand with the confusion matrix, is the use of Receiver Operating Characteristic (ROC) curves. ROC analysis is both a visual method and a tool for assessing the performance of different classification techniques. This paper presents the application of different classification algorithms and the philosophy behind ROC analysis. All of this is investigated using the well-known Iris data set and several popular classification techniques.

Researchers and analysts develop important classification models, including neural networks, k-nearest neighbor, decision trees and naïve Bayes. Each of these techniques makes its own predictions, which may or may not agree depending on the nature of the data. Researchers also give attention to evaluating these models; developing a model is only half the problem, and evaluating it is the other half.

Related work

The aim of this work is to compare different classification techniques using the Weka software. Many authors have applied different data mining techniques to the Iris data; one of the first works to use the Iris data set is [1].

In his paper, Fisher introduced discriminant analysis and applied the method to the Iris data. Many authors still use the famous Iris data set today; one of the latest publications on this topic is Kumar and Sirohi [2].

In their paper, Kumar and Sirohi [2] compared a suggested method with the fuzzy c-means clustering method. They implemented the two techniques in Matlab and found that their suggested method obtains results faster than the fuzzy c-means method. In our work we use three different data mining methods and the Weka toolkit instead of Matlab.

[Sakthi and Thanamani, 2011] investigated the accuracy of classification techniques. They compared their proposed kernel principal component analysis (KPCA) method with the classical k-means method, conducting many experiments on the Iris data set; the accuracy was computed for each experiment and averaged over all experiments. Their suggested method gave better initialization of the centroids for clustering than the k-means method. In our research we use Receiver Operating Characteristic (ROC) analysis, proposed originally by (Metz, 1978), as a method for assessing classification techniques on the Iris data set. The ROC curve is a plot of true positive rate (sensitivity) versus false positive rate (100 − specificity) pairs for different cutoff (threshold) values.

Basic concepts of ROC curves

To understand the basic concepts of the ROC curves, we will use an example from medical diagnostics.

Assume that for a particular disease, N patients are tested, with four possible outcomes: true positive (TP), true negative (TN), false negative (FN) and false positive (FP). The ROC curve can then be constructed on the basis of the proportions in the following equations:

SST = N_TP / (N_TP + N_FN)   and   SPT = N_TN / (N_TN + N_FP)

where SST and SPT indicate the sensitivity and specificity respectively, and the false positive rate equals 1 − SPT. N_TP and N_FN represent the number of people correctly diagnosed with the disease, and those with the disease who were cleared by the test, respectively. Conversely, N_TN and N_FP denote the number of those correctly tested negative (without the disease) and those tested positive without having the disease, respectively. This is illustrated in Figure 1.

Figure 1. A hypothetical chart of the distribution of true positive and true negative.

As presented above, given that a model produces the four possible outcomes, the ROC accuracy (ACCR) and the corresponding error (ERR) are defined as in the following equations:

ACCR = (N_TP + N_TN) / (N_TP + N_TN + N_FP + N_FN)   and   ERR = 1 − ACCR

The main goal of predictive modeling is then to maximize ACCR or, equivalently, to minimize ERR.
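The proportions above can be sketched in a few lines of code. This is a minimal illustration with made-up outcome counts, not taken from the paper; the function name `roc_metrics` is ours.

```python
# Sketch: computing sensitivity (SST), specificity (SPT) and ROC accuracy
# (ACCR) from hypothetical outcome counts, following the equations above.
def roc_metrics(n_tp, n_tn, n_fp, n_fn):
    sst = n_tp / (n_tp + n_fn)    # sensitivity (true positive rate)
    spt = n_tn / (n_tn + n_fp)    # specificity (true negative rate)
    accr = (n_tp + n_tn) / (n_tp + n_tn + n_fp + n_fn)
    return sst, spt, accr

# Illustrative counts for 100 tested patients.
sst, spt, accr = roc_metrics(n_tp=45, n_tn=40, n_fp=10, n_fn=5)
print(sst, spt, accr)   # 0.9 0.8 0.85
```

Note that ERR is simply 1 − ACCR, so it needs no separate computation.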


Interpreting ROC curves using Weka

The ROC curve is a graph plotted for different threshold values, with the false positive rate on the x-axis and the true positive rate on the y-axis. In the data mining field, the true positive rate is called the sensitivity, and one minus the false positive rate is the specificity. The ROC curve is used to assess the performance of any classification technique: the best discrimination, meaning the least overlap between the classes, is reached by the curve closest to the upper left corner of the graph, and this leads to the best accuracy of the classification technique (Zweig & Campbell, 1993).

One can plot the sensitivity versus 100 − specificity, where each point on the plot corresponds to a specific threshold. This produces the ROC curve, with 100 − specificity on the horizontal axis and the sensitivity on the vertical axis, as shown in Figure 2.
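The threshold-by-threshold construction can be traced with scikit-learn's `roc_curve`. The labels and scores below are made-up illustrations, not data from the paper:

```python
# Sketch (assumed toy data): tracing an ROC curve for binary labels and
# classifier scores; each (fpr, tpr) pair corresponds to one threshold.
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 0, 0, 1, 1, 1, 1]                  # actual classes
scores = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.9, 0.4]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)   # area under the curve summarizes the plot
print(roc_auc)
```

The closer the resulting curve hugs the upper left corner (AUC near 1), the better the discrimination between the classes.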

Figure 2. A hypothetical chart for the ROC curve of the specificity versus the sensitivity.

Most of the computational work in this paper was done in Weka, which supports many statistical techniques such as regression and correlation analysis, analysis of variance, classification and clustering. One of the main objectives of data mining in general is to extract important information from data, and Weka helps us do this in a professional and easy way. Weka also provides techniques that help in assessing classifiers by visualizing their performance; one of these visualization techniques is the ROC curve mentioned above. These and other reasons encouraged us to use Weka in this research to apply the different data mining techniques to the data set proposed in our survey.

The description of Iris data

In this work we used the Iris data [Fisher, 1936], one of the earliest and most widely used data sets in data mining. The data set contains 150 samples (instances) from three different Iris flower types (species): Iris-Virginica, Iris-Versicolour and Iris-Setosa. For each sample, four different measures (attributes) were recorded in centimeters: the length and width of the sepal, and the length and width of the petal. The means and standard deviations of the attributes are summarized in Table 1 hereunder.

Table 1: The Iris data attributes' means and standard deviations.

Attribute        Mean   Standard deviation
sepal's width    3.05   0.43
sepal's length   5.84   0.82
petal's width    1.19   0.76
petal's length   3.76   1.76
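The summaries in Table 1 can be reproduced from scikit-learn's bundled copy of Fisher's Iris data. This is a sketch under the assumption that scikit-learn is available; values may differ from the table in the last decimal place depending on rounding:

```python
# Sketch: computing the per-attribute means and sample standard deviations
# of the Iris data, as summarized in Table 1.
from sklearn.datasets import load_iris

iris = load_iris()
means = iris.data.mean(axis=0)           # attribute order: sepal len, sepal wid,
stds = iris.data.std(axis=0, ddof=1)     # petal len, petal wid

for name, m, s in zip(iris.feature_names, means, stds):
    print(f"{name}: mean={m:.2f}, std={s:.2f}")
```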


The table above shows clear differences among the attributes of the Iris data, especially for petal width and petal length: the variation between instances in these two attributes is noticeable, and the standard deviation of petal length is more than double that of petal width (1.76 versus 0.76). This indicates that petal width and petal length may be distinguishing features of the Iris data. Figure 3 presents box-plots of the Iris data, which show large differences in the means and standard deviations of the four attributes.

Figure 3. Box-plots represent the different attributes and the classes of Iris data.

The classification of Iris data using Decision Trees (DT)

The goal of the decision trees (DT) method is to create a decision tree that classifies unknown Iris samples, deciding the type of Iris flower (Setosa, Versicolor, or Virginica) from the four attributes described above. With DT we cannot be absolutely certain of the classification of an unknown sample; we can only determine the probability that it belongs to a particular class.

Figure 4. Histograms representing the different attributes and the classes of Iris data.

The histograms for the Iris data (Figure 4) show clear overlap among the three classes in sepal width and sepal length: the three types of Iris flowers can take the same values of these two attributes, which makes them difficult to use for classifying Iris species. We therefore use the other two attributes, i.e. petal length and petal width, in building our decision tree, as given in Figure 5. This DT was produced using Weka; 70% of the data set was selected randomly by the system to train the decision tree, and the remaining samples were used to test it.

Figure 5. Decision Trees (DT) for classifying Iris data.
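The same procedure can be sketched outside Weka. Here scikit-learn's `DecisionTreeClassifier` stands in for Weka's tree learner (an assumption, not the paper's setup), using only the two petal attributes and a 70/30 random split as described above:

```python
# Sketch: a decision tree on the two petal attributes with a 70/30 split,
# mirroring the Weka experiment described in the text.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:4]    # petal length and petal width only
y = iris.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

The exact accuracy depends on the random split, so it will not match the paper's figure digit for digit.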

The results of classifying the Iris data with the DT algorithm are presented in the following confusion matrix, Table 2. They show that only three flowers from the test group were classified wrongly: one of the Iris-Versicolor flowers was classified as Iris-Virginica, and two Iris-Virginica flowers were misclassified as Iris-Versicolor. The percentage of correctly classified instances is 95.556%, as indicated in Table 2 hereunder.

Table 2: The confusion matrix of the classification of Iris data using the DT algorithm.

Classification category   Iris-Setosa (a)   Iris-Versicolor (b)   Iris-Virginica (c)
a                         14                0                     0
b                         0                 16                    1
c                         0                 2                     13

Using Neural Networks (NNs) for Iris Data Classification

In this part of our research we use a multi-layer perceptron (MLP) technique for classifying the Iris data. An MLP is an acyclic feed-forward network in which the neurons are divided into disjoint layers such that the output of each neuron in one layer is connected to the inputs of every neuron in the following layer. It is trained by back propagation, starting from a network with no information and ending with a fully trained (learned) one [Stastny, et al., 2011].

The Weka software produced the following neural network, Figure 6, and classifier outputs.

Figure 6. Multi-Layer Perceptron network (MLP) using Iris data.
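An analogous network can be sketched with scikit-learn's `MLPClassifier` in place of Weka's multilayer perceptron (an assumption on our part; the hidden-layer size and split below are illustrative, not the paper's configuration):

```python
# Sketch: a small feed-forward (MLP) network trained on the Iris data with
# a 70/30 split, as a stand-in for the Weka experiment.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, train_size=0.7, random_state=0)

# Standardizing the inputs helps back-propagation converge.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0))
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```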


Table 3: The confusion matrix of the classification of Iris data using the NNs algorithm.

Classification category   Iris-Setosa (a)   Iris-Versicolor (b)   Iris-Virginica (c)
a                         50                0                     0
b                         0                 49                    1
c                         0                 1                     49

The confusion matrix in Table 3 shows that only two flowers were classified wrongly: one Iris-Versicolor flower was classified as Iris-Virginica, and one Iris-Virginica flower was misclassified as Iris-Versicolor. The percentage of correctly classified instances is therefore 98.6%, as shown in Table 5.

Applying Naïve Bayes (NB) classifier to Iris data

A Naïve Bayes (NB) classifier is a probabilistic approach for classifying data into non-overlapping classes. It is based on collecting frequency counts of events, which are then grouped into classes. The philosophy behind the method is that, to classify a new event, one checks it against the previously established classes and assigns it to the class with the highest frequency count for that event. The goal of classification with this probabilistic method is to predict the value of the class correctly, given a list of attributes, with the assistance of Bayes' rule.

Table 4: The confusion matrix of the classification of Iris data using the Naïve Bayes classifier algorithm.

Classification category   Iris-Setosa (a)   Iris-Versicolor (b)   Iris-Virginica (c)
a                         15                0                     0
b                         0                 18                    1
c                         0                 2                     15

From the confusion matrix in Table 4 we notice that only three samples were classified as the wrong Iris type: two Iris-Virginica flowers were classified as Iris-Versicolor, and one Iris-Versicolor flower was misclassified as Iris-Virginica.

Results and discussion

Here, the results of this paper are analyzed. Even though all three algorithms worked well, the Neural Networks (NN) outperformed the other two techniques, i.e., decision trees (DT) and Naïve Bayes (NB), with a Kappa statistic of 0.98 as shown in Table 5. Moreover, the accuracy of NB is almost the same as that of DT, with classification rates of 94.1% and 95.5% respectively.

Table 5: Comparison of classification algorithms using different methods of evaluation.

Algorithm                Correctly Classified Instances   Incorrectly Classified Instances   Kappa statistic
Decision trees           95.5556 %                        4.4444 %                           0.9331
Neural Networks          98.6667 %                        1.3333 %                           0.98
Naïve Bayes classifier   94.1176 %                        5.8824 %                           0.9113

The commonly used indicators for the accuracy of classification techniques are the mean absolute error, the root mean squared error and the relative errors. All of these indicators were calculated using Weka and are presented in Table 6.

Table 6: Errors of classifying Iris data using different classification algorithms.

Algorithm                Mean absolute error   Root mean squared error   Relative absolute error %   Root relative squared error %
Neural Networks          0.0248                0.0911                    5.5779 %                    19.3291 %
Decision trees           0.0416                0.1682                    9.3466 %                    35.6559 %
Naïve Bayes classifier   0.0447                0.1722                    10.0365 %                   36.4196 %

We found that the highest errors occur with the NB and DT methods, whose root relative squared errors average around 0.36; in contrast, the NN algorithm has a root relative squared error of about 0.19. The algorithm with the lowest error rate is preferred, as it is the most potent classification technique. Alternatively, the Kappa statistic (Cohen, 1960) is used to assess the classification method. It reflects the difference between the actual agreement and the agreement expected by chance; for example, a Kappa of 0.98 connotes 98% better agreement than by chance alone. By the Kappa statistic criterion, the accuracy of all three classification methods used in this study is substantial, since the Kappa statistic for each of them is more than 0.91, as shown in Table 5.
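The Kappa value for the neural network can be checked directly from its confusion matrix (Table 3), using scikit-learn's `cohen_kappa_score` as a sketch:

```python
# Sketch: recovering the NN Kappa statistic in Table 5 from the confusion
# matrix in Table 3 (rows = actual class, columns = predicted class).
from sklearn.metrics import cohen_kappa_score

matrix = [[50, 0, 0],
          [0, 49, 1],
          [0, 1, 49]]

# Expand the counts into per-instance label lists.
y_true, y_pred = [], []
for actual, row in enumerate(matrix):
    for predicted, count in enumerate(row):
        y_true += [actual] * count
        y_pred += [predicted] * count

kappa = cohen_kappa_score(y_true, y_pred)
print(round(kappa, 2))   # 0.98, matching Table 5
```

Here the observed agreement is 148/150 and the chance agreement is 1/3 (equal class sizes), giving Kappa = (0.9867 − 0.3333) / (1 − 0.3333) ≈ 0.98.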

The ROC curve has been used in this work as an evaluation method for assessing the model performance of the three classification algorithms. A comparison of the ROC curves for the three classification techniques is depicted in Figure 7. The graph indicates that the NN is the most appropriate of the methods investigated in this work.
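A comparison in the spirit of Figure 7 can be sketched by scoring each classifier with a one-vs-rest ROC AUC. The scikit-learn models below stand in for the Weka classifiers (an assumption; the split and parameters are illustrative):

```python
# Sketch: comparing the three techniques by macro-averaged one-vs-rest
# ROC AUC on a held-out split of the Iris data.
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, train_size=0.7, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Neural network": MLPClassifier(max_iter=2000, random_state=0),
}
aucs = {}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)
    aucs[name] = roc_auc_score(y_te, proba, multi_class="ovr")
    print(name, round(aucs[name], 3))
```

An AUC near 1 corresponds to a curve hugging the upper left corner of the ROC plot.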

Figure 7: ROC curves for the different algorithms.

Conclusion

ROC analysis provides graphical tools for assessing the performance of classification techniques, but it should not be used alone for this purpose; it is vital to include other comparison measures in the evaluation process.

Even with ROC curves for a given data set, more research is needed to determine which classification method is best and what its optimal decision threshold is. More empirical data sets should also be used, rather than the Iris data set alone. This would benefit the solution of many real-life problems.

We applied three different classification algorithms in Weka. NN, with an accuracy of 98.6667%, is the most suitable algorithm for classifying the Iris data; it also has the lowest mean absolute error, 0.0248, compared with the two other methods. The other two classification methods gave satisfactory results. These results suggest that, of the data mining techniques tested in this work, the Bayes network classifier and decision tree methods have the potential to improve classification for use in different areas.

[Figure 7 plot: sensitivity versus 100-specificity for the Decision Trees, Naïve Bayes classifier, Neural Networks and Random curves.]


ISSN 1869-0459 (print)/ISSN 1869-2885 (online)
© 2020 International Research Association for Talent Development and Excellence

Abbreviations

ROC: Receiver Operating Characteristic; NN: Neural Network; KPCA: Kernel Principal Components Analysis; TP: True Positive; TN: True Negative; FN: False Negative; FP: False Positive; ACCR: Accuracy of ROC; ERR: Corresponding Error of ACCR; DT: Decision Tree; MLP: Multi-Layer Perceptron network; NB: Naïve Bayes.

Acknowledgements

Thanks to all the participants who helped in this work.

Authors’ contributions

TG and DK prepared the draft and the idea. TG and MA wrote the manuscript. MA, TG and DK collaboratively carried out the analysis and result optimization; MA and DK prepared the tables and references and checked the English. All authors read and approved the final manuscript.

Funding

The research received no external funding.

Availability of data and materials

Not applicable.

Ethics approval and consent to participate

The authors gave ethics approval and consent to participate.

Consent for publication

The authors consent to publication.

Competing interests

The authors declare that they have no competing interests.

Authors' details

Taher Ghazal1, Mohammed A. M. Afifi2, Deepak Kalra3

1,2,3 School of Information Technology, Skyline University College, Sharjah, United Arab Emirates.

1[email protected], 2[email protected], 3[email protected]

References

1. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188; also in Contributions to Mathematical Statistics (John Wiley, NY, 1950).
2. Kumar, P. and Sirohi, D. (2010). Comparative Analysis of FCM and HCM Algorithm on Iris Data Set. International Journal of Computer Applications, 5(2), August 2010.
3. Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press. ISBN-13 978-0122328503.
4. Zweig, M. H. and Campbell, G. (1993). Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, 561-577.
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).
6. Stastny, J., Turcinek, P. and Motycka, A. (2011). Using Neural Networks for Marketing Research Data Classification. Mathematical Methods and Techniques in Engineering and Environmental Science. ISBN 978-1-61804-046-6.
7. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
