Several metrics are used in the literature to evaluate the performance of ML algorithms. In this section, we discuss these metrics and relevant concepts.

2.7.1 Confusion Matrix

A confusion matrix summarizes the actual and predicted classifications produced by a classification algorithm/model and is represented by four items: (1) True Positives (TP), (2) True Negatives (TN), (3) False Positives (FP), and (4) False Negatives (FN) [156–158]. For a database of a disease, TP are the cases in which the model predicts yes (they have the disease) and they actually do have the disease. TN are the cases in which the model predicts no and they actually do not have the disease. FP are the cases in which the model predicts yes but they do not actually have the disease, and FN are the cases in which the model predicts no but they do have the disease. Therefore, by definition, TP and TN represent correct predictions while FP and FN represent wrong predictions by the model.
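As a minimal Python sketch (assuming scikit-learn, which the definitions above do not require), the four items can be extracted from hypothetical binary labels, where 1 stands for yes (has the disease) and 0 for no:

```python
# Hypothetical true and predicted labels for the disease example:
# 1 = "yes" (has the disease), 0 = "no".
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels, ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1
```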

2.7.2 Overall Accuracy

Overall accuracy [156, 159] is a standard measure used to examine the performance of a classifier. Equation 2.2 describes this measurement. The value of accuracy ranges from 0.0 to 1.0, with a higher value indicating better accuracy.

\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2.2)
\]
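As a minimal sketch of Equation 2.2, using hypothetical confusion-matrix counts that are not drawn from any experiment in this thesis:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10
accuracy = (TP + TN) / (TP + FP + TN + FN)  # Equation 2.2
print(accuracy)  # 0.85
```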

2.7.3 Sensitivity or Recall

Sensitivity is also known as true positive rate (TPR) or recall. It measures the proportion of positives that are correctly identified. It is defined in Equation 2.3. The value of sensitivity ranges from 0.0 to 1.0 with a higher value indicating a better sensitivity.

\[
\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (2.3)
\]
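Using the same hypothetical counts as above, Equation 2.3 can be evaluated as:

```python
# Hypothetical counts: TP = 40, FN = 10.
TP, FN = 40, 10
sensitivity = TP / (TP + FN)  # Equation 2.3
print(sensitivity)  # 0.8
```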

2.7.4 Specificity

Specificity is also known as the true negative rate (TNR). It measures the proportion of negatives that are correctly identified. It is defined in Equation 2.4. The value of specificity ranges from 0.0 to 1.0, with a higher value indicating a better specificity.

\[
\text{Specificity} = \frac{TN}{TN + FP} \qquad (2.4)
\]
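With the same hypothetical counts, a minimal sketch of Equation 2.4 is:

```python
# Hypothetical counts: TN = 45, FP = 5.
TN, FP = 45, 5
specificity = TN / (TN + FP)  # Equation 2.4
print(specificity)  # 0.9
```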

2.7.5 Precision

Precision identifies how many of the instances predicted by the classifier as positives are indeed positive. Equation 2.5 describes this calculation. The value of precision ranges from 0.0 to 1.0 with a higher value indicating a better precision.

\[
\text{Precision} = \frac{TP}{TP + FP} \qquad (2.5)
\]
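Again with the same hypothetical counts, Equation 2.5 can be evaluated as:

```python
# Hypothetical counts: TP = 40, FP = 5.
TP, FP = 40, 5
precision = TP / (TP + FP)  # Equation 2.5
print(precision)  # ~0.889
```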

2.7.6 F-score

F-score [157, 158] is also a valuable measure that evaluates the combined performance on the two classes. It is sometimes known as the F1-score. It is the harmonic mean of Precision and Sensitivity/Recall; an F-score reaches its best value at 1 and its worst at 0. Equation 2.6 defines it; the value of the F-score ranges from 0.0 to 1.0, with a higher value indicating a better F-score.

\[
\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (2.6)
\]
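As a minimal sketch of Equation 2.6, using the precision and recall obtained from the same hypothetical counts (TP = 40, FP = 5, FN = 10):

```python
# Precision and recall computed from the hypothetical counts above.
precision, recall = 40 / 45, 40 / 50
f_score = 2 * precision * recall / (precision + recall)  # Equation 2.6
print(f_score)  # ~0.842
```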

2.7.7 MCC

Matthews correlation coefficient (MCC) is a correlation coefficient between the actual and predicted binary classifications. It generates a value between −1 and +1. A coefficient of +1 indicates a perfect prediction, 0 an average random prediction, and −1 an inverse prediction. It is calculated directly from the confusion matrix using the formula defined in Equation 2.7.

\[
\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \qquad (2.7)
\]
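A minimal sketch of Equation 2.7, again with the hypothetical confusion-matrix counts used above:

```python
# Hypothetical confusion-matrix counts.
import math

TP, TN, FP, FN = 40, 45, 5, 10
mcc = ((TP * TN) - (FP * FN)) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)  # Equation 2.7
print(mcc)  # ~0.70
```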

2.7.8 Receiver operating characteristics (ROC) Area

The receiver operating characteristics (ROC) curve is a useful method for visualizing a classifier’s discrimination capability [197]. It is used to evaluate the strength of a model.

The ROC curve is a graphical representation in which the true positive rate (TPR, i.e., Sensitivity) is plotted as a function of the false positive rate (FPR, i.e., 1 − Specificity). Suppose a binary classification model predicts the probability of each observation belonging to the classes YES and NO, and a probability threshold (e.g., 0.5) is chosen. Observations with probability ≥ 0.5 are assigned to the class YES, and those with probability < 0.5 to the class NO. The purpose of the classification model is to assign higher probabilities to the observations that belong to the class YES and lower probabilities to the observations that belong to the class NO. If there is a substantial distinction between the probabilities assigned to the observations of the two classes, it is possible to define a threshold and predict the two classes with high confidence.

Figure 2.4 shows an example of the ROC space of a classifier. The ROC curve of a good classifier lies closer to the top left of the graph, because a good classifier, using an optimal threshold, can achieve a high TPR at a low FPR. The area under the ROC curve ranges from 0.0 to 1.0, with a higher value indicating a stronger model.

Figure 2.4: An example of a ROC curve. This figure shows the ROC space for a “better” and “worse” classifier. Figure borrowed from [220].
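As a minimal sketch of how the ROC curve is obtained (assuming scikit-learn; the labels and probabilities below are hypothetical):

```python
# Hypothetical true labels and predicted probabilities of the YES class.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.60, 0.30]

# roc_curve sweeps the probability threshold and returns one (FPR, TPR) pair
# per threshold; the area under that curve summarizes the ROC space.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # 0.9375
```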

2.7.9 Kappa statistic

Cohen’s Kappa statistic, κ, is a measure of agreement between two categorical variables X and Y. It is used to measure inter-rater (i.e., inter-observer) reliability for qualitative (categorical) items. Equation 2.8 is used to calculate Cohen’s kappa for two raters.

\[
\kappa = \frac{P_o - P_e}{1 - P_e} = 1 - \frac{1 - P_o}{1 - P_e} \qquad (2.8)
\]

Here, Po is the relative observed agreement among raters and Pe is the hypothetical probability of chance agreement. Kappa is always less than or equal to 1. A value of 1 indicates perfect agreement and values less than 1 imply less than perfect agreement.
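As a minimal sketch of Equation 2.8, with hypothetical agreement figures:

```python
# Hypothetical figures: two raters label 100 items and agree on 70 (Po = 0.70),
# while the marginal label frequencies give an expected chance agreement Pe = 0.50.
Po, Pe = 0.70, 0.50
kappa = (Po - Pe) / (1 - Pe)  # Equation 2.8
print(kappa)  # 0.4
```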

2.7.10 Mean absolute error

The mean absolute error (MAE) calculates the average magnitude of the errors in a set of predictions and is used to assess accuracy for continuous variables. It is the average, over all instances, of the absolute differences between the predictions and the corresponding actual values. It is a linear score, which means that all individual differences are weighted equally in the average. MAE is used to measure the performance of a regressor (Equation 2.9).

\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - x| \qquad (2.9)
\]

Here, n is the number of samples, x_i is the predicted value for the i-th sample, x is the corresponding actual value, and |x_i − x| is the absolute error for the i-th sample.
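A minimal sketch of Equation 2.9, using hypothetical actual and predicted values:

```python
# Hypothetical actual and predicted values for a regression task.
actual    = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

n = len(actual)
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n  # Equation 2.9
print(mae)  # 0.5
```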

2.7.11 Mean absolute percentage error

Mean absolute percentage error (MAPE) or MAE (%), also known as mean absolute percentage deviation (MAPD), is the mean of the absolute percentage errors of a forecasting method in statistics. It is also used as a loss function for regression problems in machine learning. Equation 2.10 is used to measure MAE (%). The value of MAPE is expressed as a percentage, with a lower value indicating a better model.

\[
\text{MAE}(\%) = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - x|}{x_i} \right) \times 100 \qquad (2.10)
\]

Here, n is the number of samples, x_i is the predicted value for the i-th sample, x is the corresponding actual value, and |x_i − x| is the absolute error for the i-th sample.
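A minimal sketch with hypothetical values is shown below; in this sketch each absolute error is divided by the corresponding actual value, which is the usual MAPE convention:

```python
# Hypothetical actual and predicted values; each absolute error is divided
# by the corresponding actual value (the usual MAPE convention).
actual    = [100.0, 200.0, 50.0, 400.0]
predicted = [110.0, 190.0, 60.0, 360.0]

n = len(actual)
mape = 100.0 * sum(abs(p - a) / a for p, a in zip(predicted, actual)) / n
print(mape)  # 11.25
```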

2.7.12 Root mean squared error

Root mean squared error (RMSE) is a quadratic scoring rule that calculates the average magnitude of the error. First, the differences between the predictions and the corresponding actual values are each squared and then averaged over the instances. Finally, the square root of the average is taken. RMSE gives a relatively high weight to large errors because the errors are squared before they are averaged. This means that RMSE is most useful when large errors are particularly undesirable (Equation 2.11).

\[
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x)^2}{n}} \qquad (2.11)
\]

Here, n is the number of samples, x_i is the predicted value for the i-th sample, x is the corresponding actual value, and x_i − x is the error for the i-th sample.
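A minimal sketch of Equation 2.11, reusing the hypothetical values from the MAE sketch:

```python
# Hypothetical actual and predicted values (same as the MAE sketch).
import math

actual    = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

n = len(actual)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)  # Equation 2.11
print(rmse)  # ~0.612
```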

2.7.13 R-squared

In statistics, the coefficient of determination is denoted by R² or r² and pronounced “R-squared”. It is a statistical measure that represents the proportion of the variance of the dependent variable that can be explained or predicted by the independent variables in a regression task. It is defined by Equation 2.12 below.

\[
R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} \qquad (2.12)
\]

Here, the unexplained variation is the sum of the squared errors (i.e., the squared differences between the predicted values and the actual values), and the total variation is the sum of the squared differences between each actual value and the mean of the actual values.

For a given data set of n values x_1, ..., x_n and their mean x̄, the total sum of squares (SS_tot, proportional to the variance of the data) can be defined by Equation 2.13, and the sum of squares of residuals (SS_res), also called the residual sum of squares, by Equation 2.14, where y_i is the fitted (or modeled, or predicted) value of the i-th sample.

\[
SS_{tot} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (2.13)
\]

\[
SS_{res} = \sum_{i=1}^{n} (x_i - y_i)^2 \qquad (2.14)
\]

Finally, following Equation 2.12, the coefficient of determination, R², is calculated using Equation 2.15. The value of R² ranges from 0% to 100%. What counts as a good R² value depends on the study area; for example, studies that predict human behavior tend to have R² values less than 50%.

\[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \qquad (2.15)
\]
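A minimal sketch of Equations 2.13–2.15, reusing the hypothetical values from the MAE and RMSE sketches:

```python
# Hypothetical actual and predicted values (same as the MAE/RMSE sketches).
actual    = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

mean_actual = sum(actual) / len(actual)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # Equation 2.13
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # Equation 2.14
r2 = 1 - ss_res / ss_tot                                       # Equation 2.15
print(r2)  # ~0.949
```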

2.7.14 Pearson correlation coefficient

The Pearson Correlation Coefficient (PCC) is also known as Pearson’s r [115]. It is a measure of linear correlation between two sets of data. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship. Generally, it is used when there are two quantitative (continuous) variables and there is an interest in whether there is a linear relationship between those variables. PCC is defined by Equation 2.16 below.

\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \qquad (2.16)
\]

Here, r is the PCC, x_i are the values of the x-variable in a sample, x̄ is the mean of the values of the x-variable, y_i are the values of the y-variable in another sample, and ȳ is the mean of the values of the y-variable.
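A minimal sketch of Equation 2.16, using hypothetical paired samples:

```python
# Hypothetical paired samples of two continuous variables.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mx, my = sum(x) / len(x), sum(y) / len(y)
num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
r = num / den  # Equation 2.16
print(r)  # ~0.775
```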