Agricultural Systems, Vol. 64, Issue 1, April 2000


Statistical methods for evaluating a crop nitrogen simulation model, N_ABLE

J. Yang (a, *), D.J. Greenwood (b), D.L. Rowell (a), G.A. Wadsworth (a), I.G. Burns (b)

(a) Department of Soil Science, The University of Reading, Whiteknights, PO Box 233, Reading RG6 6DW, UK
(b) Department of Soil and Environment Sciences, Horticulture Research International, Wellesbourne, Warwick, CV35 9EF, UK

Abstract

Modelling nitrogen (N) dynamics is a valuable tool to predict the uptake, mobility and leaching of mineral N in soil profiles. Many crop/soil N simulation models have been developed in the last 20 years for this purpose. However, methods for the operational evaluation of simulation models are not well established. Standard test statistics such as the F- and t-tests are being questioned because they may give different conclusions. Difference measures are being used as alternatives but they have not been thoroughly investigated. This paper reviews statistical methods that may be helpful in comparing simulations with measured variables. They have been used to analyse comparisons of simulations produced by a nitrogen response model, N_ABLE, with measurements made in a field experiment with lettuce. The techniques include: (1) data transformation, (2) regression analysis and (3) analysis of difference. They show that accuracy of prediction for different variables varied in different growth periods. Mean absolute errors (MAE) were within 0.5% and 35 kg ha⁻¹ of the measured values of percent-N and soil mineral N, respectively, over the whole growth period. They also demonstrate systematic errors when crop weights exceeded 2 t ha⁻¹, with 0-0.38 t ha⁻¹ under-estimation of dry weight and 3-16 kg ha⁻¹ under-estimation of N uptake when the harvest date was set at values between 42 and 61 days. Use of regression analysis showed that the data sets all violated normality and equal variance assumptions and that choosing the right transformation before analysis was crucially important. When difference measures were used, it was found that there was strong correlation between the outcome of some, but not others. Overall, the results suggest that the tests can be grouped according to the degree of correlation between individual tests and that only one test needs to be made from each correlated group. We suggest that two sets of four statistics can be used, each statistic explaining a special property of the data. Each set leads to useful conclusions. The first set is mean of error (E), root mean square error, modified forecasting efficiency and paired t-statistic, and the second set is E, MAE, forecasting coefficient and F-ratio of lack of fit over experimental error ($F_{LF}(Y=X)$). Either


Keywords: Statistical evaluation; Test statistics; Difference measures; Nitrogen simulation model, N_ABLE; Nitrogen uptake; Soil mineral N

1. Introduction

Operational evaluation is the assessment of accuracy and precision of a simulation model and methods vary from inspection and demonstration to analytical test (Knepell and Arangno, 1993). In general, what we most require to know is what proportion of the treatment variation, i.e. that excluding experimental error, can be accounted for by the model. We also need to know about biases the model has over the entire response surface, i.e. in this case yield against N fertiliser rate and time. At a more detailed level it may be helpful for us to know: (1) In what regions of the response surface are the discrepancies between simulated and measured variables greatest and over what regions are they smallest? (2) Does the model generally give the right shape of response surface? Do the simulated values simply differ from the measured values by a fixed amount or are they a fixed proportion of the measured values? (3) Does the model give transformed or un-transformed values that are linearly related to the measured values?

The purpose of this paper is to discuss a range of statistical techniques for answering these questions. Two types of statistical methods have commonly been used in model evaluation: test statistics and difference measures. Test statistics (both parametric and non-parametric) have been reviewed by Reckhow et al. (1990) and O'Leary and Connor (1996), and comprehensive discussions on difference measures have been given by Willmott et al. (1985), Loague and Green (1991) and Kabat et al. (1995). It seems that there is no robust statistic which can be used for all models because of the complexity of the data structure of a model, e.g. continuous or discrete, error or error-free and one- or two-dimensional. In this paper, we try to select some available statistical methods both from test statistics and from difference measures for the evaluation of the model, N_ABLE (Greenwood and Draycott, 1989a, b; Greenwood et al., 1996) using four measured data sets from a nitrogen experiment with lettuce.

2. Data transformation

2.1. Experiment and data structure


mineral N on a total of seven dates on Days 12, 19, 29, 42, 50, 57 and 61 in each plot during the growth period. Four state variables were measured regularly during the experiment in each of six N treatments with lettuce (Yang et al., 1999). The state variables are: weight of dry matter excluding fibrous roots (WDM, t ha⁻¹), soil mineral N in the 0-30 cm layer (NS, kg ha⁻¹), N uptake excluding fibrous roots (NU, kg ha⁻¹) and percent N in dry matter (PN, %).

The above data were simulated with the model N_ABLE. The model calculates the daily changes in the distributions of mineral-N and water down the soil profile together with the increments in plant dry weight and N-uptake. The major inputs to the model are level and timing of fertiliser application, the maximum potential yield of dry matter, and the daily rainfall and potential evaporation together with the initial soil characteristics. Detailed values of the input parameters were given by Yang et al. (1999). It was found that output is sensitive to the harvest time parameter ($T_h$), i.e. the date on which the simulation is stopped. So the effect of changing $T_h$ was examined in the simulation by setting $T_h$ to the sample Days 42, 50, 57 or 61, producing four data sets for each of the four variables.

The following notation is used in the data analysis later. A measured variable is represented by Y and a simulated variable by X, while measured data are denoted by $y_{ij}$ ($i = 1, 2, \ldots, n$ represents the total measurements on each data set (see later example) and $j = 1, 2, \ldots, r$ represents replicates) and simulated data by $x_i$, because the model cannot simulate random experimental errors for each replicate. Then the difference between simulation and measurement can be defined as:

$$D = Y - X, \qquad d_{ij} = y_{ij} - x_i \quad (i = 1, 2, \ldots, n;\; j = 1, 2, \ldots, r) \tag{1}$$
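As a minimal sketch of Eq. (1), the differences can be formed by subtracting the single simulated value at each measurement point from every replicate measured there. The function name and the numbers below are hypothetical and serve only as illustration; they are not data from the experiment.

```python
def differences(measured, simulated):
    """Return d_ij = y_ij - x_i for replicated measurements.

    measured:  list of n lists, each holding the r replicates y_i1..y_ir
    simulated: list of n simulated values x_i (one per measurement point)
    """
    if len(measured) != len(simulated):
        raise ValueError("need one simulated value per measurement point")
    return [[y - x for y in reps] for reps, x in zip(measured, simulated)]


# Hypothetical values for illustration only.
y = [[1.0, 1.2, 0.9], [2.1, 2.4, 2.0]]   # n = 2 points, r = 3 replicates
x = [1.1, 2.0]                           # one simulated value per point
d = differences(y, x)
```

Keeping the replicates nested per measurement point preserves the information needed later to separate experimental error from lack of fit.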


yield $W_{max}$ = 4.3 at the final harvest $T_h$ = 61 days. Thus there is an overlap in the measured but not the simulated values for the different times. The magnitude of variables can be influenced by N levels and by the duration of growth. Graphs of simulated NU and PN and measured data against time of growth are shown in Figs. 1 and 2. Graphs of measured WDM and NS were shown in a previous paper (Yang et al., 1999).

2.2. Test of normality and homoskedasticity

Regression methods are frequently used in model testing and validation processes (Sutherland et al., 1986; Reckhow et al., 1990; O'Leary and Connor, 1996) and provided one of the methods used in this paper (see later for details). In the linear regression model:

$$Y_i = \alpha + \beta X_i + \varepsilon_i \tag{2}$$

the t-test can be used to test $H_0$: $\alpha = 0$, $\beta = 1$, since $t_a = (a - \alpha)/S_a \sim t(N - 2)$ and $t_b = (b - \beta)/S_b \sim t(N - 2)$, where a and b are least squares estimates of $\alpha$ and $\beta$ (Aigner, 1971). However, for the classical least squares estimation in Eq. (2), the significance test of the regression parameters assumes that the individual error terms $\varepsilon_i$ are normally distributed, are of equal variance, are independent of each other, and are not correlated with the independent variable, $X_i$. In our research, transformations are generally required because the range in response variables is large. For example, the largest values of WDM, NU and NS are all 10 or 100 times more than the smallest values. Diagnostic plots for WDM, NU, PN and NS produced evidence that these data sets seriously violated the assumption of equal variance, and slightly violated the assumption of normality (Yang, 1999). Violations of these two assumptions may cause biased estimates of the parameter variances in the regression analysis. Shapiro and Wilk's W-test is considered to be the best multi-functional test of normality (Shapiro and Wilk, 1965; Royston, 1982), whereas White's $\chi^2$-test was thought to be a good statistic to check heteroskedasticity of $\varepsilon_i$ since it was derived from a null hypothesis that not only are the errors homoskedastic, but they are also independent of the regressors (White, 1980).

2.3. Transformation for normality and homoskedasticity


standard linear model shown in Eq. (2), there is a need to stabilise error variances. Two frequently used transformation methods were chosen for use in this paper based on the methods discussed by Pindyck (1981) and Bowker (1993). They are:

1. Model $Y_i = \alpha + \beta X_i + \varepsilon_i$ was transformed by $1/S_i$, and the model becomes:

$$\frac{Y_i}{S_i} = \alpha \frac{1}{S_i} + \beta \frac{X_i}{S_i} + \frac{\varepsilon_i}{S_i} \tag{3}$$

where $S_i$ is the sample standard deviation, i.e. $S_i = \sqrt{\sum_j (y_{ij} - \bar{y}_i)^2 / (r - 1)}$, and $\bar{y}_i$ is the mean value of the ith measured data. Eq. (3) shows that $S_i$ will be measured with experimental errors, meaning that the greater the number of the replicates, the more stable the standard errors.

[...] from diagnostic plots that residual error $\varepsilon_i$ is proportionally related to a power of the independent variable $X_i$. Eqs. (3) and (4) were fitted by the weighted regression method using SAS software (SAS Institute, 1989, 1990a, b).

To select the best transformation, W and $\chi^2$ statistics of random errors in Eqs. (3) and (4) were first calculated for each data set at $T_h$ = 42, 50, 57 and 61 days individually, and then mean values of W and $\chi^2$ for different $T_h$ values were compared. The ideal transformation is one where both the minimum $\chi^2$ and the maximum W occurred simultaneously, but this condition was only roughly met by the NS data set (Table 1). When this condition did not hold, we identified the minimum value of $\chi^2$ as our selected transformation. Following this rule, we concluded that the $1/X_i$ transformation gave the best correction for equal variance in WDM, NU and PN data sets with slight loss of normality, and $1/X_i^{0.5}$ gave the best correction for both normality and equal variance in NS data sets (Table 1).
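The scaling in Eq. (3) and the selection rule just described can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the W and $\chi^2$ values are invented, and the statistics themselves are taken as already computed elsewhere (for example by a statistics package).

```python
import math


def sample_sd(reps):
    """Sample standard deviation S_i of the r replicates at one point."""
    mean = sum(reps) / len(reps)
    return math.sqrt(sum((y - mean) ** 2 for y in reps) / (len(reps) - 1))


def scale_by_sd(measured, simulated):
    """Rows (y_ij/S_i, 1/S_i, x_i/S_i): the three terms of Eq. (3).

    Fails with ZeroDivisionError if all replicates at a point are equal.
    """
    rows = []
    for reps, x in zip(measured, simulated):
        s = sample_sd(reps)
        rows.extend((y / s, 1.0 / s, x / s) for y in reps)
    return rows


def pick_transformation(candidates):
    """candidates maps a label to (mean W, mean White chi-square).

    Returns (label, ideal): label has the smallest chi-square, and ideal
    is True only when that label also has the largest W.
    """
    by_chi2 = min(candidates, key=lambda k: candidates[k][1])
    by_w = max(candidates, key=lambda k: candidates[k][0])
    return by_chi2, by_chi2 == by_w


# Hypothetical statistics, for illustration only.
stats = {"none": (0.95, 12.0), "1/S": (0.96, 8.0), "1/X": (0.94, 5.0)}
choice, ideal = pick_transformation(stats)
rows = scale_by_sd([[1.0, 3.0]], [2.0])
```

In this invented case the smallest $\chi^2$ and the largest W point to different transformations, so the rule falls back on the smallest $\chi^2$.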

2.4. Transformation of normality of data set D (Y − X)


3. Regression analysis

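3.1. Testing of the null hypothesis $H_0$: $\alpha = 0$; $\beta = 1$

Linear regression was therefore carried out using Eq. (2) with the weighted variable $1/X_i$ for WDM, NU and PN data sets, and with the weighted variable $1/X_i^{0.5}$ for the NS data set, and the t-test used to test $H_0$: $\alpha = 0$, $\beta = 1$ in the model. In our data, N = nr, and a and b are weighted least squares estimates of $\alpha$ and $\beta$. Calculated t-statistics are:

$$t_a = \frac{a - 0}{s_a}, \qquad s_a = \sqrt{\frac{\mathrm{MSE}}{nr} + \bar{x}^2 s_b^2} \tag{5}$$

$$t_b = \frac{b - 1}{s_b}, \qquad s_b = \sqrt{\frac{\mathrm{MSE}}{\sum (x_i - \bar{x})^2}} \tag{6}$$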

Table 1
Comparison of different transformations of data for testing normality and homoskedasticity using mean values ($T_h$ = 42, 50, 57 and 61) of W and White-$\chi^2$ statistics for each state variable (a)

State variable   Transformation   W               White chi-square
WDM              No-transfer      0.9118 (0.00)   21.25 (0.00)
                 1/Si             0.8838 (0.00)   10.06 (0.03)
PN               No-transfer      0.9711 (0.26)   11.49 (0.01)
                 1/Si             0.9756 (0.40)   6.36 (0.17)
NS               No-transfer      0.9571 (0.13)   9.78 (0.01)

where MSE is the mean square of experimental error, calculated as shown later. Significance of the values of coefficients a and b were tested by comparing $t_a$ and $t_b$ with the significance level of $t_{0.05}(nr - 2)$ or $t_{0.01}(nr - 2)$.
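The calculation in Eqs. (5) and (6) can be sketched in plain Python as below. This is a hedged illustration, not the SAS procedure used in the paper: the function name is hypothetical, the weighted forms of the means and sums of squares are the usual weighted least squares ones, and when no experimental-error mean square is supplied the sketch falls back on the weighted residual mean square, which is an assumption of this example.

```python
import math


def weighted_fit_t(x, y, w=None, mse=None):
    """Weighted least squares fit of y = a + b*x with t-statistics for
    H0: alpha = 0 and H0: beta = 1.

    x, y : paired simulated and measured values (one pair per observation)
    w    : weights; dividing the model by X_i**k corresponds to
           w_i = 1 / X_i**(2k). Defaults to equal weights.
    mse  : error mean square. If None, the weighted residual mean square
           with N - 2 degrees of freedom is used (assumption of this sketch).
    """
    n_obs = len(x)
    w = [1.0] * n_obs if w is None else w
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    if mse is None:
        ssr = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
        mse = ssr / (n_obs - 2)
    s_b = math.sqrt(mse / sxx)                        # Eq. (6)
    s_a = math.sqrt(mse / sw + xbar ** 2 * s_b ** 2)  # Eq. (5); sw = nr if w = 1
    return {"a": a, "b": b, "t_a": (a - 0.0) / s_a, "t_b": (b - 1.0) / s_b}


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.1, 3.9]
fit = weighted_fit_t(x, y)                                    # equal weights
fit_w = weighted_fit_t(x, y, w=[1.0 / xi ** 2 for xi in x])   # 1/X_i scaling
```

The resulting t-values would then be compared with tabulated critical values for nr − 2 degrees of freedom, as described above.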

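3.2. Lack of fit test

In regression analysis with replicates of dependent variables, the lack of fit test was carried out by the F ratio of MSLF over MSE. Values of both mean squares were obtained as follows. The sums of squares of residuals (SSR) were partitioned into two parts, the sums of squares of the randomised error (SSE) and the sums of squares of the lack of fit (SSLF). The degrees of freedom of residuals (DFR) were then partitioned into the degree of freedom of randomised error (DFE) and the degree of freedom of lack of fit (DFLF), where:

$$
\begin{aligned}
\mathrm{SSR} &= \mathrm{SSE} + \mathrm{SSLF}, & \mathrm{DFR} &= \mathrm{DFE} + \mathrm{DFLF} \\
\mathrm{SSR} &= \sum_i \sum_j (y_{ij} - a - b x_i)^2, & \mathrm{DFR} &= nr - 2 \\
\mathrm{SSE} &= \sum_i \sum_j (y_{ij} - \bar{y}_i)^2, & \mathrm{DFE} &= n(r - 1) \\
\mathrm{SSLF} &= \mathrm{SSR} - \mathrm{SSE}, & \mathrm{DFLF} &= n - 2
\end{aligned}
\tag{7}
$$

$$\mathrm{MSLF} = \mathrm{SSLF}/\mathrm{DFLF}, \qquad \mathrm{MSE} = \mathrm{SSE}/\mathrm{DFE} \tag{8}$$

and where $\bar{y}_i$ is the measured mean of the ith measurement. The variance ratio is:

$$F_{LF} = \mathrm{MSLF}/\mathrm{MSE} \tag{9}$$

Statistical inferences from the above analysis of variance of residuals were drawn by comparing $F_{LF}$ with the significance level of $F_{0.05}(\mathrm{DFLF}, \mathrm{DFE})$ or $F_{0.01}(\mathrm{DFLF}, \mathrm{DFE})$.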

3.3. Goodness of fit test

$R^2$ is a commonly used statistic for testing the goodness of fit and is defined as:

$$R^2 = \mathrm{SSU}/\mathrm{SSY} = 1 - (\mathrm{SSR}/\mathrm{SSY}) \tag{10}$$

Following the earlier method, the results of regression with a lack of fit test on simulated and measured data from the four state variables are listed in Table 2. Graphical displays of Y against X and regression lines for NU and PN are shown in Figs. 3 and 4.
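The partition in Eqs. (7)-(9) and the $R^2$ of Eq. (10) can be sketched as below. The function name and the data are hypothetical; the coefficients a and b are passed in rather than estimated, SSY is taken as the corrected total sum of squares (an assumption, since its definition is not spelled out above), and $R^2$ is computed as $1 - \mathrm{SSR}/\mathrm{SSY}$, which equals SSU/SSY only when a and b are the least squares estimates.

```python
def lack_of_fit(measured, simulated, a, b):
    """Partition the residuals of y = a + b*x into experimental error and
    lack of fit, and return the variance ratio F_LF together with R^2.

    measured:  list of n lists of replicates y_ij
    simulated: list of n simulated values x_i
    """
    n = len(measured)
    all_y = [y for reps in measured for y in reps]
    grand_mean = sum(all_y) / len(all_y)
    ssr = sum((y - a - b * x) ** 2
              for reps, x in zip(measured, simulated) for y in reps)
    sse = sum((y - sum(reps) / len(reps)) ** 2
              for reps in measured for y in reps)
    ssy = sum((y - grand_mean) ** 2 for y in all_y)
    dfe = sum(len(reps) - 1 for reps in measured)   # n(r - 1) when balanced
    dflf = n - 2
    mslf = (ssr - sse) / dflf
    mse = sse / dfe
    return {"F_LF": mslf / mse, "DFLF": dflf, "DFE": dfe,
            "R2": 1.0 - ssr / ssy}


# Hypothetical data: n = 3 measurement points, r = 2 replicates.
result = lack_of_fit([[1.0, 3.0], [2.0, 4.0], [5.0, 7.0]],
                     [1.0, 2.0, 3.0], a=0.0, b=2.0)
```

The returned degrees of freedom are the ones needed to look up the critical F value.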

4. Analysis of difference

al., 1985). In model testing, however, our interest is in the model's accuracy, i.e. the extent to which simulated values approach the measured data. This can be achieved by examining the difference directly from $d_{ij} = y_{ij} - x_i$. It is equal to the test of the goodness of fit of model Y = X, but not Y = a + bX. In this circumstance, the regression method with a = 0 or b = 1 uses the following equation to calculate $R^2$, e.g. $R^2 = 1 -$ (residual sum of squares/uncorrected total sum of squares) (SAS Institute, 1990b), to avoid situations where $R^2 > 1$ or $R^2 < 0$ (Aigner, 1971; Pindyck, 1981).

4.1. Test statistics

Two different statistics were used for this purpose:

4.1.1. Paired t-statistic

In testing simulated data against experimental data, the paired t-statistic was used to test the null hypothesis $d = x_i - y_i = 0$ (Reckhow et al., 1990). In our data, it is calculated as follows:

$$\text{paired } t = \bar{d}/s_{\bar{d}} \tag{11}$$

where $\bar{d}$ is the mean value for the difference variable shown in Eq. (1), and $s_{\bar{d}}$ is the standard error of the mean. It is important that the difference variable D (Y − X) should be normally distributed and be independent without considering equal variance in the paired t-test (Snedecor and Cochran, 1976). D data sets used in the paired t-test were transformed as discussed previously.
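A minimal sketch of Eq. (11) follows. The function name and data are hypothetical, the differences are assumed to have been transformed beforehand where needed, and the degrees of freedom are taken as the number of differences minus one, which is the usual convention rather than something stated above.

```python
import math
import statistics


def paired_t(measured, simulated):
    """Paired t-statistic: mean difference divided by its standard error.

    measured:  list of n lists of replicates y_ij
    simulated: list of n simulated values x_i
    Returns (t, degrees of freedom).
    """
    d = [y - x for reps, x in zip(measured, simulated) for y in reps]
    d_bar = statistics.mean(d)
    se = statistics.stdev(d) / math.sqrt(len(d))   # standard error of d_bar
    return d_bar / se, len(d) - 1


# Hypothetical data: differences are 1, 2 and 3.
t_value, df = paired_t([[2.0, 3.0, 4.0]], [1.0])
```

The sign of the statistic follows Eq. (1), so a positive value means the measurements exceed the simulation on average.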

Table 2

a *, ** refer to 0.05 and 0.01 significance levels.
b W

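4.1.2. $F_{LF}(Y=X)$ statistic

This still holds for the model Y = X (Whitmore, 1991), but the sums of squares of residuals SSR(Y=X) and $F_{LF}(Y=X)$ values are calculated differently from those in Eqs. (7) and (9):

$$
\begin{aligned}
\mathrm{SSR}(Y=X) &= \sum_i \sum_j (y_{ij} - x_i)^2, & \mathrm{DFR}(Y=X) &= nr \\
\mathrm{SSLF}(Y=X) &= \mathrm{SSR}(Y=X) - \mathrm{SSE}, & \mathrm{DFLF}(Y=X) &= n
\end{aligned}
\tag{12}
$$

where SSE is the same as given in Eq. (7). The variance ratio is

$$F_{LF}(Y=X) = \frac{\mathrm{SSLF}(Y=X)/\mathrm{DFLF}(Y=X)}{\mathrm{SSE}/\mathrm{DFE}} \tag{13}$$

Calculated values of paired t and $F_{LF}(Y=X)$ statistics using Eqs. (11) and (13) are recorded in Table 3, and used to test the difference D (Y − X).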
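Eqs. (12) and (13) differ from the regression case only in that no parameters are estimated, so the lack of fit keeps n degrees of freedom. A small sketch, with a hypothetical function name and invented data:

```python
def lack_of_fit_identity(measured, simulated):
    """Variance ratio F_LF(Y=X): lack of fit of the model Y = X relative
    to experimental error.

    measured:  list of n lists of replicates y_ij
    simulated: list of n simulated values x_i
    """
    n = len(measured)
    ssr = sum((y - x) ** 2
              for reps, x in zip(measured, simulated) for y in reps)
    sse = sum((y - sum(reps) / len(reps)) ** 2
              for reps in measured for y in reps)
    dfe = sum(len(reps) - 1 for reps in measured)   # n(r - 1) when balanced
    dflf = n                                        # nothing estimated in Y = X
    return ((ssr - sse) / dflf) / (sse / dfe)


# Hypothetical data: n = 3 measurement points, r = 2 replicates.
f_identity = lack_of_fit_identity([[1.0, 3.0], [2.0, 4.0], [5.0, 7.0]],
                                  [1.0, 3.0, 6.0])
```

When every simulated value equals the mean of its replicates, SSR(Y=X) equals SSE and the ratio is zero.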

4.2. Difference measures

Several simple difference measures have been developed for this purpose by Loague and Green (1991) and Willmott et al. (1985). Kabat et al. (1995) employed five of them in evaluating soil nitrogen simulation models. These statistics have the following features in common: (1) they measure difference, $d_{ij} = y_{ij} - x_i$, in several ways, including the sum of difference, $\sum (y_{ij} - x_i)$, the sum of absolute difference, $\sum |y_{ij} - x_i|$, and the sum of squares of difference, $\sum (y_{ij} - x_i)^2$; and (2) they define a statistical function of $d_{ij}$ as a measure of difference to examine model performance by characterising systematic under- or over-prediction. Bearing in mind these definitions, here we discuss six statistics as our alternative indicators for further evaluation of the N_ABLE model:

Mean error

$$E = \frac{1}{nr} \sum_i \sum_j (y_{ij} - x_i) \tag{14}$$

Fig. 4. Comparison of measured percent-N in dry matter against simulated values for each of the sampling dates, for simulations with different harvest times; dashed lines are Y = X, solid lines are Y = a + bX;

Mean absolute error

Table 3
Values of difference statistics (test statistics and difference measures) when assessing difference D (Y − X) for each state variable (a)

Th (days)  F_LF(Y=X) (b)  Paired t (c)  E (d)    MAE (d)  RMSE (d)  C (d)    EF (d)   EF1 (d)  No. of data
WDM
42         3.91**         1.84          0.01     0.12     0.19      0.1318   0.9575   0.8447   72
50         4.14**         4.89**        -0.08    0.19     0.30      0.1429   0.9366   0.8238   90
57         15.86**        13.58**       -0.30    0.33     0.48      0.1917   0.8866   0.7437   108
61         23.31**        14.55**       -0.38    0.40     0.56      0.1976   0.8640   0.7088   126
NU
42         4.23**         0.03          -3.12    6.34     9.91      0.1517   0.9397   0.8178   72
50         5.85**         5.85**        -7.00    9.41     14.31     0.1661   0.9137   0.7759   90
57         16.84**        11.08**       -13.43   14.13    20.84     0.2067   0.8498   0.6984   108
61         21.07**        12.09**       -15.64   16.86    23.83     0.2171   0.8197   0.6597   126
PN
42         9.28**         2.65**        -0.15    0.48     0.56      0.1003   0.2149   0.0487   72
50         5.88**         2.51**        -0.11    0.41     0.50      0.0887   0.5749   0.3233   90
57         4.83**         0.37          0.02     0.34     0.44      0.0787   0.7371   0.5022   108
61         3.77**         1.13          0.05     0.33     0.44      0.0774   0.7596   0.5539   126
NS
42         3.66**         4.66**        -23.7    34.4     47.88     0.2476   0.7285   0.5690   63 (e)
50         3.91**         5.07**        -23.3    34.5     47.39     0.2542   0.7293   0.5659   80
57         2.78**         3.93**        -16.2    30.1     43.99     0.2313   0.7679   0.6205   97
61         4.74**         3.85**        -13.1    28.3     40.53     0.2287   0.7877   0.6211   115

a *, ** refer to 0.05 and 0.01 significance levels.
b For $F_{LF}(Y=X)$ calculation, data sets have been transformed as in Table 1.
c For paired t-calculation, data sets D have been transformed based on selections made as follows: D data from WDM and NU were transformed by $1/X_i$; D data from PN were transformed by $1/X_i^{0.5}$; and D data from NS were transformed by $1/X_i^{0.25}$.
d No data transformation was made for calculation of E, MAE, RMSE, C, EF and EF1.
e There were missing values in the measured N

Coefficient of error

$$C = \frac{1}{nr} \sum_i \sum_j |y_{ij} - x_i| \,/\, \bar{y} = \mathrm{MAE}/\bar{y} \tag{19}$$

E is an indicator of whether the model predictions tend to over- or under-estimate measured data (Addiscott and Whitmore, 1987). MAE and RMSE have been used by Willmott et al. (1985) to evaluate their models. E, MAE and RMSE all take on units of $d_{ij} = y_{ij} - x_i$ and are mainly used as measures of accuracy to compare the output of the same variables, e.g. to compare WDM output of the model with measured data for different simulation dates, or to compare the output of the same variables among different models.

EF is a relative measure of error used by many authors (Loague and Green, 1991) and preferred by Loague and Freeze (1985). It is defined as for $R^2$ in regression analysis, but EF ≠ $R^2$. EF = 1 if $y_{ij} = x_i$ and EF < 1 for any realistic simulation. EF < 0 if the model predicted values are worse than simply using the measured mean of $y_{ij}$, whereas corresponding values of $R^2$ are $0 \le R^2 \le 1$. This is because EF is actually a statistic to test the goodness of fit of the model Y = X, and thus has the restrictions a = 0 and b = 1 in the model Y = a + bX. Under these conditions, the partition of SSY into SSU and SSR no longer holds in general: SSY ≠ SSU + SSR (Aigner, 1971). For this reason, we consider that $R^2$ is not a good statistic in testing the goodness of fit of the model Y = X. It is considered that quadratic residual functions, such as EF, are sensitive to outliers (Klepper and Rouse, 1991). To overcome this, we have defined another function EF1 by replacing the sum of squares of difference with the sum of absolute differences [Eq. (18)], where |EF1| ≤ |EF|.

C is another relative average measure of absolute difference which is expressed as a proportion of the mean of the measured variable, Y (Klepper and Rouse, 1991). C, EF and EF1 are all dimensionless indices (C ≥ 0, EF and EF1 ≤ 1), and they have been used to depict the degree to which $d_{ij} = y_{ij} - x_i$ approaches the null set, i.e. $d_{ij} = 0$. They have also been used to compare accuracy of model outputs for different variables. Detailed results of difference measures calculated using Eqs. (11), (13) and (14)-(19) are also listed in Table 3.
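The six difference measures can be sketched together as below. E and C follow Eqs. (14) and (19) directly. MAE is taken from the identity C = MAE/ȳ in Eq. (19). RMSE, EF and EF1 use the forms commonly given in the literature cited above (EF as one minus the ratio of the sum of squared differences to the sum of squared deviations of the measurements from their mean, and EF1 as its absolute-value analogue); these three forms are assumptions of this sketch, and the function name and data are hypothetical.

```python
import math


def difference_measures(measured, simulated):
    """Return E, MAE, RMSE, C, EF and EF1 for replicated measurements.

    measured:  list of n lists of replicates y_ij
    simulated: list of n simulated values x_i
    """
    pairs = [(y, x) for reps, x in zip(measured, simulated) for y in reps]
    n_obs = len(pairs)
    y_bar = sum(y for y, _ in pairs) / n_obs
    sum_d = sum(y - x for y, x in pairs)
    sum_abs = sum(abs(y - x) for y, x in pairs)
    sum_sq = sum((y - x) ** 2 for y, x in pairs)
    mae = sum_abs / n_obs
    return {
        "E": sum_d / n_obs,                  # sign shows under/over-estimation
        "MAE": mae,
        "RMSE": math.sqrt(sum_sq / n_obs),
        "C": mae / y_bar,                    # Eq. (19)
        # Assumed forms of EF and EF1 (see lead-in).
        "EF": 1.0 - sum_sq / sum((y - y_bar) ** 2 for y, _ in pairs),
        "EF1": 1.0 - sum_abs / sum(abs(y - y_bar) for y, _ in pairs),
    }


# Hypothetical data: n = 3 measurement points, r = 2 replicates.
m = difference_measures([[1.0, 3.0], [2.0, 4.0], [5.0, 7.0]], [1.0, 3.0, 6.0])
```

E, MAE and RMSE carry the units of the variable, whereas C, EF and EF1 are dimensionless and can be compared across variables.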

5. General discussion

It can be seen from Table 2 that most of the values of $F_{LF}$ and a and b are statistically significant at the 0.01 level, but the values are small, e.g. $F_{LF}$ < 9. In general, good fits for WDM and NU are obtained for early harvests (Days 42 and 50), and for PN at Days 57 and 61. There is only a small difference between the four lines for NS. The order of goodness of fit can be ranked by $R^2$ as WDM > NU > NS > PN (Table 2), indicating that the poorest linear relationship detected is for PN when $T_h$ is set to 42 days (Fig. 4a).

means of simulated and measured data. This conclusion is valuable because it cannot be deduced from a graphical display. It shows that all the differences between measured and simulated values in these regions of the response surfaces can be attributed to experimental error. All significant values of the paired t and $F_{LF}(Y=X)$ values indicate that further modification of the model is needed. This focuses attention on regions of WDM and NU for $T_h$ = 61 in the first instance because of the higher values of both $F_{LF}(Y=X)$ and paired t-values. $F_{LF}(Y=X)$ values calculated in Table 3 were based on the model Y = X, indicating the extent to which the differences $(y_{ij} - x_i)$ differ from experimental error. By comparison, the values of $F_{LF}$ in Table 2 were based on the model Y = a + bX, indicating the extent to which the residual errors $(y_{ij} - a - b x_i)$ differed from experimental error. For this reason, the $F_{LF}(Y=X)$ values are the real measures of the accuracy of model simulation.

Most of the E values are negative (Table 3), indicating that the model generally underestimated the measured data except for PN at Days 57 and 61 and WDM at Day 42. The magnitude of underestimation increased with time for the WDM and NU variables, but the reverse is true for PN and NS variables. MAE and RMSE showed that the largest difference occurred for NU when $T_h$ = 61. C values indicate that the largest relative error is <25% for all variables, meeting the tolerance limit of 26% given by Klepper and Rouse (1991). The order of goodness of match among the four variables is WDM > NU > NS > PN ranked by values of EF and EF1 (Table 3), giving a similar result to that drawn from $R^2$ in Table 2.

It has been found that the difference statistics in Table 3 are strongly correlated with the outcome of some tests but not others. The tests can be grouped according to the degree of correlation (Table 4). Thus, RMSE, MAE and C were strongly correlated with one another, but not with any of the others. Likewise, paired t and $F_{LF}(Y=X)$ were strongly correlated, but not with the other tests as were EF and EF1. A single linkage dendrogram for the correlation matrix ($r_{ij} > 0$) in Table 4, for eight difference statistics is shown in Fig. 5.

Table 4 shows that values of E have a strong negative correlation with C, MAE and RMSE. In this study, the negative correlations result from negative E values which are caused by X > Y in Eq. (1). In contrast, the other seven statistics are all positive irrespective of whether the difference D = (Y − X) is negative or positive. The correlations between E and C, MAE and RMSE are not universally applicable and for this reason the negative correlation has not been used in the cluster analysis in Fig. 5.

Table 4
Correlation coefficient matrix $r_{ij}$ for eight difference statistics listed in Table 3 (a)

F_LF(Y=X)   Paired t   E   MAE   RMSE   C   EF   EF1

We conclude from Fig. 5 that only one statistic from each group needs to be used to validate the model. Group 1 includes RMSE, MAE and C (grouped by $r_{ij} \ge 0.82$) and provides measures of the magnitude of the difference. Group 2 includes EF and EF1 (grouped by $r_{ij} \ge 0.99$), and gives measures of the goodness of match or the co-ordinate of the difference. Group 3 includes $F_{LF}(Y=X)$ and paired t-statistics (grouped by $r_{ij} \ge 0.89$) and gives a measure of the probability of significant difference. Group 4 includes only E and provides a measure of direction of difference (i.e. under- or over-estimation). The maximum correlation coefficient between Groups 1 and 2 is very small (i.e. $r_{ij}$ = 0.51), and between Groups 3 and 4 is only 0.07 (Table 4).
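The grouping idea can be sketched with a plain correlation matrix and a single-linkage rule: two statistics fall in the same group if they are joined by any chain of correlations at or above a threshold. The sketch below uses four columns of Table 3 (MAE, RMSE, EF and EF1 for the 16 rows), assuming the column order implied by the heading of Table 4. The function names and the 0.8 threshold are illustrative choices, not the procedure used to draw Fig. 5.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def group_by_correlation(columns, threshold):
    """Single-linkage grouping of named columns by correlation."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if pearson(columns[p], columns[q]) >= threshold:
                parent[find(p)] = find(q)   # merge the two groups
    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Rows: WDM, NU, PN, NS, each at Th = 42, 50, 57 and 61 (values from Table 3).
table3 = {
    "MAE": [0.12, 0.19, 0.33, 0.40, 6.34, 9.41, 14.13, 16.86,
            0.48, 0.41, 0.34, 0.33, 34.4, 34.5, 30.1, 28.3],
    "RMSE": [0.19, 0.30, 0.48, 0.56, 9.91, 14.31, 20.84, 23.83,
             0.56, 0.50, 0.44, 0.44, 47.88, 47.39, 43.99, 40.53],
    "EF": [0.9575, 0.9366, 0.8866, 0.8640, 0.9397, 0.9137, 0.8498, 0.8197,
           0.2149, 0.5749, 0.7371, 0.7596, 0.7285, 0.7293, 0.7679, 0.7877],
    "EF1": [0.8447, 0.8238, 0.7437, 0.7088, 0.8178, 0.7759, 0.6984, 0.6597,
            0.0487, 0.3233, 0.5022, 0.5539, 0.5690, 0.5659, 0.6205, 0.6211],
}
groups = group_by_correlation(table3, threshold=0.8)
```

One statistic could then be drawn from each resulting group, in the spirit of the recommendation above.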

6. Conclusions

Data from the nitrogen experiment were shown to violate normality and equal variance assumptions and transformation is necessary to ensure standard statistical analyses are carried out correctly. Shapiro and Wilk's W-test and White's $\chi^2$-test have been found to be useful methods for testing the normality and heteroskedasticity of $\varepsilon_i$. Two types of transformations were employed in this study to ensure both normality and equal variance, and $1/X_i$ was required for the WDM, NU and PN data, and $1/X^{0.5}$

Reckhow et al. (1990) suggested that a prediction $b_0$ could be used for testing $b_0$ but not for testing 1, where $b_0$ can be decided from a real subject (i.e. $b_0$ = 1, | | > 0 is an acceptable error). In addition, if the t-test shows that parameters $\alpha$ and $\beta$ do not deviate from $H_0$ when regression analysis is carried out, $R^2$ can be used to check the goodness of fit. Thus, if both of these tests are carried out, we can then draw a final conclusion, whereas a decision based on any one of them can be misleading.

Difference measures shown in Eqs. (14)-(19) have advantages over test statistics in that they are easy to interpret and do not need data transformation, but the methods are not fully developed. Many authors believe that there is no robust statistic which can be used to draw conclusions in model evaluations and therefore several methods need to be used together to give a double check. In this study, we found that eight difference statistics (Table 3) were strongly correlated with each other, allowing one to be chosen from each correlated group so as to save time without losing accuracy. We suggest that the same conclusion can be reached by using either RMSE, EF1, paired t and E or MAE, EF, $F_{LF}(Y=X)$ and E as the difference measures because each of them is present in Groups 1, 2, 3 and 4 (Fig. 5). In addition, graphical display should be used as a visual aid for comparison between simulation and measurement and for interpretation of the statistics used in the testing and evaluation processes.

Acknowledgements

Financial support through an ORS award from UK is gratefully acknowledged. We also thank Andrew Mead from Horticulture Research International, UK, for his statistical comments on the first draft of this paper.

References

Addiscott, T.M., Whitmore, A.P., 1987. Computer simulation of changes of soil mineral nitrogen and crop nitrogen during autumn, winter and spring. Journal of Agricultural Science, Cambridge 109, 141-157.

Aigner, D.J., 1971. Basic Econometrics. Prentice-Hall, Englewood Cliffs, NJ.

Bowker, D.W., 1993. Dynamic models of homogeneous systems. In: Fry, J.C. (Ed.), Biological Data Analysis: A Practical Approach. Oxford University Press, Oxford, pp. 313-343.

Greenwood, D.J., Draycott, A., 1989a. Experimental validation of an N-response model for widely different crops. Fertiliser Research 18, 153-174.

Greenwood, D.J., Draycott, A., 1989b. Quantitative relationships for growth and N content of different vegetable crops grown with and without ample fertiliser-N on the same soil. Fertiliser Research 18, 175-188.

Greenwood, D.J., Rahn, C.R., Draycott, A., Vaidyanathan, L.V., Paterson, C., 1996. Modelling and measurement of the effects of fertilizer-N and crop residue incorporation on N-dynamics in vegetable cropping. Soil Use and Management 12, 13-24.

Kabat, P., Marshall, B., van den Broek, B.J., 1995. Comparison of simulation results and evaluation of parameterization schemes. In: Kabat, P., Marshall, B., Van den Broek, B.J., Vos, J., Van Keulen, H. (Eds.), Modelling and Parameterization of the Soil-plant-atmosphere System. A Comparison of Potato Growth Models. Wageningen Pers, Wageningen, pp. 439-501.

Knepell, P.L., Arangno, D.C., 1993. Simulation Validation: A Confidence Assessment Methodology. IEEE Computer Society Press, Los Alamitos, CA.

Loague, K.M., Freeze, R.A., 1985. A comparison of rainfall-runoff modelling techniques on small upland catchments. Water Resources Research 21, 329-348.

Loague, K., Green, R.E., 1991. Statistical and graphical methods for evaluating solute transport models: overview and application. Journal of Contaminant Hydrology 7, 51-73.

O'Leary, G.J., Connor, D.J., 1996. A simulation model of the wheat crop in response to water and nitrogen supply: II. Model validation. Agricultural Systems 52, 31-55.

Pindyck, R.S., 1981. Econometric Models and Economic Forecasts. McGraw-Hill Inc, USA.

Reckhow, K.H., Clements, J.T., Dodd, R.C., 1990. Statistical evaluation of mechanistic water-quality models. Journal of Environmental Engineering 116, 250-268.

Royston, J.P., 1982. An extension of Shapiro and Wilk's W Test for Normality to large samples. Applied Statistics 31, 115-124.

SAS Institute Inc, 1989. SAS/STAT User Guide, Version 6, 4th Edition, Volume 2. SAS Institute, Cary, NC.

SAS Institute Inc, 1990a. SAS Procedures Guide. Version 6, 3rd Edition. Cary, NC, USA.

SAS Institute Inc, 1990b. SAS/ETS Software, Applications Guide 2, Version 6, First Edition: Econometric Modelling, Simulation, and Forecasting. Cary, NC, USA.

Shapiro, S.S., Wilk, M.B., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591-611.

Snedecor, G.W., Cochran, W.G., 1976. Statistical Methods, 6th Edition. The Iowa State University Press, Ames, IA, USA.

Sokal, R.R., Rohlf, F.J., 1987. Introduction to Biostatistics, 2nd Edition. W.H. Freeman, New York.

Sutherland, R.A., Wright, C.C., Verstraeten, L.M.J., Greenwood, D.J., 1986. The deficiency of the 'economic optimum' application for evaluating models which predict crop yield response to nitrogen fertilizer. Fertilizer Research 10, 251-262.

White, H., 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817-829.

Whitmore, A.P., 1991. A method for assessing the goodness of computer simulation of soil processes. Journal of Soil Science 42, 289-299.

Willmott, C.J., Ackleson, S.G., Davis, R.E., Feddema, J.J., Klink, K.M., Legates, D.R., O'Donnell, J., Rowe, C.M., 1985. Statistics for the evaluation and comparison of models. Journal of Geophysical Research 90, 8995-9005.

Yang, J., 1999. Testing and evaluation of a nitrogen simulation model, N_ABLE, using independent field data. PhD thesis, The University of Reading, Reading, UK.