Declaration 2 Publications and manuscripts
5.2 Materials and methods
Where d(λi) is the first-order derivative of reflectance at wavelength i, the midpoint between wavebands j and j+1; Rλ(j) is the reflectance at waveband j; Rλ(j+1) is the reflectance at waveband j+1; and ∆λ is the difference in wavelength between j and j+1.
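The derivative defined above can be illustrated with a short sketch. It assumes the usual finite-difference form implied by these definitions, d(λi) = (Rλ(j+1) − Rλ(j)) / ∆λ; the function name and the sample values are hypothetical and are not taken from the study.

```python
def first_derivative(wavelengths, reflectance):
    """Return (midpoint wavelengths, first-order derivatives) for adjacent wavebands."""
    if len(wavelengths) != len(reflectance):
        raise ValueError("wavelengths and reflectance must have the same length")
    midpoints, derivatives = [], []
    for j in range(len(wavelengths) - 1):
        delta = wavelengths[j + 1] - wavelengths[j]  # difference in wavelength
        midpoints.append((wavelengths[j] + wavelengths[j + 1]) / 2.0)
        derivatives.append((reflectance[j + 1] - reflectance[j]) / delta)
    return midpoints, derivatives


# Illustrative values only (not data from the study)
wl = [700.0, 710.0, 720.0]
refl = [0.10, 0.20, 0.35]
mid, d = first_derivative(wl, refl)
```

Each derivative value is assigned to the midpoint of the two wavebands it is computed from, so a spectrum of n wavebands yields n − 1 derivative values.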
Hyperion images were then geometrically registered to a Landsat TM image (7 May 2001) that had already been georeferenced (Universal Transverse Mercator, zone 36 South). A nearest neighbour algorithm was used to resample the Hyperion images at their original spatial resolution (30 m). A root mean square error (RMSE) of less than one pixel was considered an indicator of good geometric correction (Ferencz et al., 2004). Sugarcane canopy reflectance spectra were extracted for analysis from the single pixels located at the collected ground truth points.
5.2.2 Field data collection and chemical analysis
Twelve fields in the large-scale growing sector and six fields in the small-scale growing sector, all planted with 6–7-month-old sugarcane of variety N19, were visited for leaf sample collection within 10 days of image acquisition. Variety N19 was selected because it is the most common variety in the study area, representing about 39% of the total crop (South African Sugar Association: SASA, 2007, unpublished industry database). The third fully expanded leaf from the top was sampled for chemical analysis (20 leaves per sample). Two to three points were sampled in each field, giving 27 sample points in the large-scale growing sector and 15 in the small-scale growing sector (42 sample points in total).
However, due to cloud cover on both images, spectra could be extracted from cloud-free pixels at only 28 sampling points, and these 28 spectra were used for the analysis (n = 28).
The midribs of the collected leaf samples were removed from the leaf blades and discarded according to South African Sugarcane Research Institute (SASRI) recommended procedures (SASRI, 2003). The leaf blades were oven-dried at 75 °C for 24 hours, ground using a leaf grinder, and oven-dried again at 75 °C for 24 hours to ensure that the water content was close to zero. Leaf powder (0.11–0.25 g per sample) was then wrapped in tin foil and placed in the sample carousel of a TruSpec N® instrument (Leco, 2006), which uses a thermal combustion technique to measure leaf N concentration (%).
5.2.3 Data analyses
Spectral regions between 1356 and 1466 nm and between 1810 and 2012 nm, which are associated with strong water absorption features and show high levels of noise (Curran, 1994; Smith et al., 2003; Thenkabail et al., 2004; Abdel-Rahman et al., 2008a), were excluded from the analysis. Hence, only 163 of the 196 calibrated wavebands were used in this study.
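The band exclusion described above can be sketched as a simple wavelength filter. The sketch assumes that the stated bounds are inclusive; the function names and the sample wavelengths are hypothetical and are not the actual Hyperion band centres.

```python
# Water-absorption regions named in the text (nm); inclusive bounds are an assumption
EXCLUDED_REGIONS = [(1356.0, 1466.0), (1810.0, 2012.0)]


def keep_band(wavelength_nm, excluded=EXCLUDED_REGIONS):
    """True if the waveband lies outside every excluded region."""
    return not any(low <= wavelength_nm <= high for low, high in excluded)


def filter_spectrum(wavelengths, reflectance, excluded=EXCLUDED_REGIONS):
    """Drop wavebands (and their reflectance values) that fall in an excluded region."""
    kept = [(w, r) for w, r in zip(wavelengths, reflectance) if keep_band(w, excluded)]
    return [w for w, _ in kept], [r for _, r in kept]


# Illustrative values only (not data from the study)
wl = [1346.0, 1356.0, 1400.0, 1466.0, 1476.0, 1900.0, 2022.0]
refl = [0.30, 0.05, 0.02, 0.06, 0.28, 0.01, 0.20]
kept_wl, kept_refl = filter_spectrum(wl, refl)
```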
5.2.3.1 Random forest ensemble
A machine learning algorithm known as “random forest” was examined to predict sugarcane leaf nitrogen concentration using at-canopy reflectance data. Random forest grows many regression trees (ntree), each of which uses recursive partitioning to divide the data into homogeneous subsets, and then averages the results of all trees. Each tree is independently grown to maximum size, without any pruning, based on a bootstrap sample from the training data set (approximately 70%). In each tree, random forest introduces randomness into the regression process by selecting a random subset of variables (mtry) to determine the split at each node (Breiman, 2001). Each tree is then used to predict the data that were not used to grow it (the out-of-bag, OOB, data), and the mean square error of these predictions gives an error of prediction called the OOB error of estimate (Breiman, 2001; Maindonald and Braun, 2006; Prasad et al., 2006; Palmer et al., 2007). This error provides a measure of the importance of each variable by comparing how much the OOB error of estimate increases when that variable is permuted whilst all others are left unchanged (Archer and Kimes, 2008). Simply stated, “variable importance is evaluated based on how much worse the prediction would be if the data for that variable were permuted randomly” (Prasad et al., 2006). Moreover, the random forest algorithm performs well when a large number of input variables are analysed to build a model from a small number of samples (Breiman, 2001).
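The permutation idea behind the variable importance measure can be sketched as follows. This is a simplified illustration, not the randomForest implementation: it uses a hypothetical fixed predictor (toy_predict) in place of a fitted forest, and a single evaluation set in place of the per-tree OOB samples. All names and values are assumptions made for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean square error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each variable is permuted while the others are unchanged."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)  # permute one variable only
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in for a fitted model: uses only the first variable."""
    return 2.0 * row[0]


# Illustrative data only: y depends on variable 0 and not on variable 1
X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]
imp = permutation_importance(toy_predict, X, y)
```

In this toy case, permuting the first variable degrades the predictions, whereas permuting the second leaves them unchanged, so only the first variable receives a positive importance.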
The randomForest library (Liaw and Wiener, 2002), developed for the R statistical environment (R Development Core Team, 2008), was employed to implement the random forest algorithm. The two parameters, mtry and ntree, were optimised based on the root mean square error of prediction (RMSEP). The ntree values were tested at 500, 1000, 1500, and 2000 based on Prasad et al. (2006), while mtry was evaluated at all possible values up to the 163 wavebands.
The wavebands were ranked according to their importance in predicting sugarcane leaf nitrogen content as determined by the random forest algorithm. The 10 most important wavelengths were used to generate narrow-band NDVI-based vegetation indices from all possible two-band combinations (10 × 10). The random forest algorithm was re-calibrated in terms of ntree and mtry and used to predict the nitrogen concentration. The vegetation indices were ranked according to their importance, and a forward selection function (Kohavi and John, 1997; Guyon and Elisseeff, 2003) was implemented to determine the smallest number of vegetation indices that could predict nitrogen content with the greatest accuracy. Random forests were fitted iteratively; at each iteration, a new forest was developed after adding the vegetation indices to the model one by one, starting with the most important. The nested subset of vegetation indices with the lowest RMSEP was chosen as the optimal set of vegetation indices for leaf nitrogen estimation. RMSEP was used instead of the OOB error of estimate because the latter is biased and could not be used to evaluate the overall error of the selection procedure (Díaz-Uriarte and de Andrés, 2006).
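The generation of two-band indices can be sketched as below, assuming the standard normalised-difference form (Ri − Rj) / (Ri + Rj) for each pair of wavebands i and j. The function names, wavelengths, and reflectance values are hypothetical. The sketch enumerates ordered pairs with i ≠ j, since the diagonal of the band-by-band matrix is always zero and swapping the two bands only changes the sign of the index.

```python
from itertools import permutations


def normalised_difference(r_i, r_j):
    """NDVI-type index for two reflectance values (assumes r_i + r_j is non-zero)."""
    return (r_i - r_j) / (r_i + r_j)


def band_pair_indices(reflectance_by_band):
    """Map each ordered waveband pair (i, j), i != j, to its normalised-difference index."""
    bands = sorted(reflectance_by_band)
    return {
        (bi, bj): normalised_difference(reflectance_by_band[bi], reflectance_by_band[bj])
        for bi, bj in permutations(bands, 2)
    }


# Illustrative spectrum only (wavelength in nm -> reflectance), not data from the study
spectrum = {670.0: 0.1, 720.0: 0.3, 800.0: 0.5}
indices = band_pair_indices(spectrum)
```

With the 10 selected wavebands, this enumeration would give 90 ordered pairs, or 45 unique band pairs if sign is ignored.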
5.2.3.2 Stepwise multiple linear regression
Stepwise multiple linear regression was also performed to test the utility of the generated vegetation indices in predicting sugarcane leaf nitrogen concentration through a linear relationship. To avoid overfitting, only the five vegetation indices ranked by random forest as most important for estimating sugarcane leaf N concentration were used. Stepwise regression has been widely employed to relate remotely sensed data to vegetation variables (Kumar et al., 2003; Bortolot and Wynne, 2003; Zhao et al., 2003; Mutanga et al., 2004; Jain et al., 2007).
5.2.3.3 Validation
A leave-one-out cross-validation method (Kohavi, 1995) was applied to assess the performance of the predictive models developed. The data were treated as k samples (k = total number of samples used for the analysis), and the samples were removed one at a time. The model was calibrated k times, each time using all samples except the omitted one, and then used to predict the sugarcane leaf nitrogen concentration of the excluded sample. One-to-one relationships between measured and predicted nitrogen values were established, and coefficients of determination (R²) as well as RMSEP values were calculated.
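The validation procedure can be sketched as follows. For brevity, the sketch substitutes a single-predictor least-squares line for the random forest and stepwise regression models used in the study. R² is computed here as the squared correlation between measured and predicted values, which is one common reading of the one-to-one relationship. All function names and data values are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(x, y):
    """Calibrate k times, each time omitting one sample and predicting it."""
    predictions = []
    for k in range(len(x)):
        slope, intercept = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        predictions.append(slope * x[k] + intercept)
    return predictions


def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured))


def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(measured)
    mean_m, mean_p = sum(measured) / n, sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_m * var_p)


# Illustrative values only: a vegetation index (x) and leaf N concentration in % (y)
x = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
y = [1.20, 1.35, 1.60, 1.70, 1.95, 2.05]
predicted = leave_one_out(x, y)
error = rmsep(y, predicted)
r2 = r_squared(y, predicted)
```

Because every prediction is made for a sample that was withheld from calibration, the resulting RMSEP reflects prediction error rather than goodness of fit.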