2.6 MODEL EVALUATION
2.6.2 GOODNESS-OF-FIT TESTS
John von Neumann, one of the foremost mathematicians of the 20th century, said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk” (Dyson, 2004). This quote sums up the main problem with goodness-of-fit measures for model evaluation: they usually do not take into account model complexity via the number of free parameters (Schunn et al., 2005). It is therefore recommended that, in conjunction with goodness-of-fit measures, researchers always divulge the number of free parameters in the underlying model together with the definition of a free parameter that was used.
Numerical goodness-of-fit measures can be divided into two types: measures of deviation from exact data location (also referred to as absolute error) and measures of trend relative magnitudes (also referred to as correlation) (Schunn et al., 2005). The most frequently used measures of deviation from exact data location and of trend relative magnitudes are the root mean squared deviation (RMSD) and the coefficient of determination (r2) respectively (Schunn et al., 2005). Legates and McCabe (1999) suggest that correlation measures should not be used on their own to assess goodness of fit but should be coupled with summary statistics and absolute error measures. The calculation methods follow; variables, where not specified, take on the same meaning as defined above in the structure characterization section.
\[
\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}
\]
Equation 34: Root mean squared deviation

Where $y_{i}$ is the i-th observed value, $\hat{y}_{i}$ the corresponding model prediction and $n$ the number of observations.
As can be seen in the formulation of the root mean squared deviation (Schunn et al., 2005), the index gives a measure of the typical deviation of the model predictions about the observed data. Via the squaring operation, the index penalizes poorly fitted data points more severely than closely fitted data points and also ensures that there is no error compensation caused by the summing of positive (over-prediction) and negative (under-prediction) errors.
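As an illustration, Equation 34 could be computed with a short sketch such as the following (Python/NumPy; the function and array names are illustrative, not part of the model code):

```python
import numpy as np

def rmsd(observed, predicted):
    """Root mean squared deviation (Equation 34): squaring the residuals
    penalizes large errors and prevents over- and under-predictions from
    cancelling each other out."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((observed - predicted) ** 2))
```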
The coefficient of determination (r2) is more commonly applied to linear models. Evidence of its application to nonlinear models exists but for such cases the use of a pseudo-r2 coefficient is recommended (Schunn et al., 2005). A general definition of the coefficient of determination is that it is the proportion of variability in the dataset that is accounted for by the model. Mathematically this translates into the definition which follows.
\[
r^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}
\]
Equation 35: Coefficient of determination

Where:

\[
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{i}
\]

is the mean of the observed values and $\hat{y}_{i}$ is the model prediction (as was previously defined).
AD models are highly non-linear, which presents complications in applying the coefficient of determination to evaluate model/data correlation.
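A minimal sketch of Equation 35 follows (Python/NumPy, illustrative names); for non-linear models such as AD models this quantity behaves as a pseudo-r2 and can fall below zero:

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination (Equation 35): proportion of the
    variability in the observed data accounted for by the model."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```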
Hauduc et al. (2011) suggest two other ways of classifying goodness-of-fit criteria. The first classification scheme characterizes the criteria according to their underlying mathematical structures into six classes: single event statistics, absolute criteria from residuals, residuals relative to observed values, total residuals relative to total observed values, agreement between distributional statistics of observed and modelled data, and comparison of residuals with reference values and with other models. The second classification scheme is based on the characteristic of the model that the criterion exposes, namely mean error, bias, large errors, small errors, peak magnitude and chronology of events.
The abovementioned RMSD is an example of an absolute criterion from residuals. Absolute criteria give a measure of the deviation in the model outputs as compared to the observed data in the units of the observed/modelled data. Relative error criteria on the other hand give a dimensionless measure of errors in the model outputs with reference to the observed data. Relative criteria are often preferred since they can be compared across various target constituents being modelled in the dataset (Hauduc et al., 2011).
Selection of relevant goodness-of-fit criteria should take into account the modelling objectives. This research deals with the development of an exploratory model (AD-FTRW2) with the primary objective of improving pH prediction relative to its predecessor (AD-FTRW1). This improvement in the prediction accuracy of one process variable should not come at the expense of the prediction accuracy of the other measured process variables. Due to the general and comparative nature of this performance evaluation (AD-FTRW1 vs AD-FTRW2), a handful of goodness-of-fit criteria are selected from Hauduc et al.’s (2011) classification schemes on the basis that they give a good indication of overall and comparative model performance; these are reviewed here. The selected criteria encompass measures of mean error, measures of bias error and measures for comparison of residuals with other models.
Two possible criteria for characterizing mean error are the Root Mean Square Error/Deviation (RMSE/RMSD) and the Mean Square Relative Error (MSRE). These criteria are similar in that both eliminate error compensation via their squaring operations, but they differ in that RMSE is an absolute criterion (refer to Equation 34 above) while MSRE is a relative criterion. A further difference is that MSRE emphasizes larger relative errors.
\[
\mathrm{MSRE} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right]^{2}
\]
Equation 36: Mean Square Relative Error

Where $y_{i}$, $\hat{y}_{i}$ and $n$ are as defined above.
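A corresponding sketch of Equation 36 (Python/NumPy, illustrative names; assumes no observed value is zero, since each residual is scaled by its observation):

```python
import numpy as np

def msre(observed, predicted):
    """Mean square relative error (Equation 36): a dimensionless (relative)
    criterion that emphasizes larger relative errors."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(((observed - predicted) / observed) ** 2)
```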
Two possible criteria for quantifying bias error are the Mean Error (ME) and the Percent Bias (PBIAS). In both of these measures error compensation can occur, and thus they do not give a good indication of the magnitude of the errors. They do, however, give a fair indication of systematic bias, whereby a model consistently over- or under-predicts. The calculation of these two criteria is outlined below (Hauduc et al., 2011). It can be seen that Mean Error is an absolute criterion and Percent Bias is a relative criterion.
\[
\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)
\]
Equation 37: Mean Error
\[
\mathrm{PBIAS} = 100 \cdot \frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)}{\sum_{i=1}^{n} y_{i}}
\]
Equation 38: Percent Bias
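The two bias criteria could be computed as in the following sketch (Python/NumPy, illustrative names):

```python
import numpy as np

def mean_error(observed, predicted):
    """Mean error (Equation 37): positive and negative residuals can cancel,
    so the value indicates systematic bias rather than error magnitude."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(observed - predicted)

def pbias(observed, predicted):
    """Percent bias (Equation 38): relative counterpart of the mean error,
    expressed as a percentage of the sum of the observed values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.sum(observed - predicted) / np.sum(observed)
```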
A generalized form of a criterion for comparison of residuals with other models is suggested by Hauduc et al. (2011) as:

\[
E_{j} = 1 - \frac{\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|^{j}}{\sum_{i=1}^{n}\left|y_{i}-\tilde{y}_{i}\right|^{j}}
\]
Equation 39: Coefficient of Efficiency

Where:

$j \geq 1$ and $\tilde{y}_{i}$ is the prediction of the reference model for the i-th observation.
An example of one of these criteria is the Nash-Sutcliffe Coefficient of Efficiency, where $j = 2$ and the reference model is defined by the mean of the observed values ($\tilde{y}_{i} = \bar{y}$); this reference is referred to as the “no knowledge” model. These criteria quantify the improvement gained by using a certain model when compared to a simpler one. The measures range from $-\infty$ to 1, with negative outcomes indicating that the new model leads to worse predictions than the reference model, zero indicating no improvement in predictions from the reference model to the new model, and 1 indicating that the new model describes the observed variable perfectly. A downfall of such criteria is that they do not give an indication of the difference in model complexity between the two models being compared.
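For completeness, a sketch of Equation 39 follows (Python/NumPy, illustrative names). When no reference predictions are supplied it defaults to the “no knowledge” model (the mean of the observed values), so with $j = 2$ it reduces to the Nash-Sutcliffe Coefficient of Efficiency:

```python
import numpy as np

def coefficient_of_efficiency(observed, predicted, reference=None, j=2):
    """Generalized coefficient of efficiency (Equation 39)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if reference is None:
        # "No knowledge" reference model: the mean of the observed values.
        reference = np.full_like(observed, observed.mean())
    num = np.sum(np.abs(observed - predicted) ** j)
    den = np.sum(np.abs(observed - reference) ** j)
    return 1.0 - num / den
```

A value of 1 indicates a perfect fit, 0 indicates no improvement over the reference model, and negative values indicate predictions worse than the reference model.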