Experiences with Applied DSM: Protocol, Availability, Quality and Capacity Building
10.5 Protocols or Methods for Developing Predictive Rules for DSM
that ancillary information describing regional-scale variation in environmental conditions such as bedrock and surficial geology, vegetation, climate, and physiography can usually be extracted from available secondary-source thematic maps or can, if necessary, be interpreted manually by visual analysis of available imagery and DEM derivatives. However, at finer resolutions, improved predictive abilities will require more reliable and spatially precise information on parent material texture and depth, perhaps obtained from analysis of airborne radiometric data (see Chapters 14 and 15); on local variation in climate, perhaps obtained from analysis of thermal or multi-temporal imagery; and on other subsurface conditions, such as soil depth, salinity or moisture content, that may be detected and mapped using proximal field sensing tools (see Chapters 2 and 13 for examples). Ancillary data sources can be vital inputs for predictive mapping. They can often be approximated by manual interpretation, if necessary, or obtained directly using additional sensing technologies, if available.
Of the two main types of data required to support DSM, it is therefore the author’s opinion (supported by comments in Sections 1.2 and 20.1) that the most limiting is that which permits description and elaboration of rules that describe the spatial arrangement of soils in the landscape and the conditions or criteria that control this distribution. Field observations of soil-landscape relationships, or well-developed tacit knowledge of these relationships, are essential to support construction, application and review of classification rules for predictive maps. In areas of sparse spatial data, the most important requirement may well be to find ways to collect and assimilate information on soil-landscape patterns so that this can be related to available input data layers in digital format to create predictive rules (examples of approaches to building knowledge of soil-landscape relationships are presented in Chapters 9, 20 and 25).
10.5 Protocols or Methods for Developing Predictive Rules for DSM
Fig. 10.3 Schematic diagram illustrating the most widely-used approaches for developing and applying rules for classifying soil entities using DSM methods
in several ways. The choice of which input variables and which sample locations to make available for input into unsupervised classification procedures can greatly influence the type and nature of classes that result. User identification of the total number of classes to predict also exerts an influence on the final classification, as does selection of the clustering or classification algorithm implemented by the procedures. Finally, users must exercise local knowledge or judgment when assigning descriptions or attributes to each output class produced by an unsupervised classification.
The principal advantages of an unsupervised classification approach are that it is systematic and unbiased and it generates classes that exhibit a maximum amount of difference with respect to the input variables used in the classification. The principal disadvantage is that, since it is data driven, it is almost impossible to produce classes that closely match those defined for a locally derived heuristic classification system.
Unsupervised classifications are also not generally capable of differentiating classes that exhibit only subtle, but often significant, differences with respect to one or more input variables or site conditions.
Unsupervised approaches are among the least commonly used protocols for developing and applying rules for producing soil class maps for operational use. However, the approach has been demonstrated to be capable of producing useful and meaningful classified maps by Burrough et al. (2000, 2001), Irwin et al. (1997) and others (see Chapter 26). If one does not have strong pre-existing knowledge of the main classes (of soil) that occur in an area and of the conditions or criteria that control the spatial distribution of these classes, then unsupervised approaches can help define optimum classes and map their spatial extent.
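The user choices discussed above (which input variables to supply, how many classes to request, and which clustering algorithm to run) can be made concrete with a minimal sketch. It assumes a plain k-means clustering, which is only one of many possible algorithms and is not claimed to be the method used in the studies cited; the function name `kmeans`, the two terrain attributes and their values are hypothetical.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Cluster points (tuples of floats) into k classes; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each cell to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Recompute each centroid as the mean of the cells assigned to it.
        new_centroids = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, labels

# Hypothetical standardized inputs per grid cell: (slope gradient, wetness index).
cells = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]
centroids, labels = kmeans(cells, k=2)  # the user decides that two classes are wanted
```

The output classes are only numbered groups. Attaching a soil description to each number remains a matter of local knowledge, and changing `k`, the input variables or the seed can change the classes that emerge.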
10.5.2 Supervised Classification Approaches
A number of different forms of supervised classification have been used to develop and apply rules to produce soil class maps. All of these approaches can be considered to represent a form of data mining. All data mining approaches extract rules for consistently recognizing class entities by analyzing training data from representative sites or example areas to detect relationships between each desired output class and a set of predictor variables or classes from a suite of user-selected input maps. In this approach, the map producer is called upon to exercise local expert judgment and opinion in selecting and classifying a sufficient number of reference sites or example locations for each and every class that may exist in an area. The map maker is considered able to recognize an instance of a class when it is encountered in the field, or observed on a display, but is not considered able to completely and systematically identify the environmental site conditions or rules that control the spatial distribution of any given class. In order to use this approach, a map maker must be able to obtain or generate a large number of reference sites for each class of interest. These reference sites should ideally encompass the full range of environmental conditions under which each class of interest is known to, or can, occur.
There must be a sufficient number of reference sites for each class and the reference sites must be selected in as valid a manner as possible so that they do not provide a biased representation of the conditions under which a class occurs.
The principal disadvantage of most supervised classification approaches is that they require assembly of large data sets of spatially-located reference or training data. This assembly can be time consuming and expensive if it requires field sampling. If it instead involves on-screen selection of training locations, the classes are not verified by field observations. Classified training sites can lead to misleading or false rules if the spatial size (support) of the reference locations is incompatible with the resolution and spatial accuracy of the main input data layers (chiefly the DEM grid mesh). The principal advantage of all supervised approaches is that they provide a formal, systematic framework to identify which values of which input variables or classes are most strongly associated with (predictive of) each desired output class.
These rules can uncover local tacit knowledge about where in the landscape certain soils are most likely to occur and why and can codify this knowledge systematically, formally and quantitatively (for examples see Chapters 2, 19 and 32).
Decision trees have been shown to be capable of extracting classification rules by analyzing patterns in input data values relative to classified reference areas (Bui et al., 1999; Bui and Moran, 2001; Lagacherie and Holmes, 1997; Moran and Bui, 2002; Scull et al., 2003, 2005; Zhou et al., 2005; Zambon et al., 2006; also Section 2.3). Decision trees work by splitting data sets into more homogeneous subsets. Splitting rules attempt to minimize entropy or variance within reference sites included in any node produced by a split. Different kinds of splitting rules can be used (Zambon et al., 2006) but the objective is always to identify splits that produce more homogeneous groupings for each node of the tree. Advantages of decision trees include the fact that they require no assumptions about the data, they can analyze both classified and continuous data, they can deal with non-linearity in input data and they are quite easy to interpret. Decision trees can be used to predict
the spatial patterns of either classed entities or continuous variables. These predictions can have attached to them an estimate of the likelihood of occurrence of the predicted class (or individual property value) at each node based on consideration of the proportion of each node that is occupied by instances of the class or value being predicted. Decision trees leave behind a set of splitting rules for each node that can be easily interpreted to understand the criteria used to define each node and the conditions that are likely to occur within each defined class. Decision trees may not produce desired results if the selected predictor variables do not exhibit a strong spatial relationship with the classes or values to be predicted (as, for example, in trying to use local measures of slope gradient and curvature to predict regional patterns typical of soil associations or complexes; see Moran and Bui, 2002 and Bui and Moran, 2003 for a discussion of this).
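As an illustration of how a single splitting rule is found and how node proportions yield a likelihood estimate, the following sketch searches one predictor for the threshold that minimizes the weighted entropy of the two resulting nodes. It is a simplified, assumed example of an entropy-based split, not the algorithm of any package used in the studies cited; the function names and the slope and soil class values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class labels in one node."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_entropy) for the best binary split on one variable."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

def node_proportions(labels):
    """Proportion of a node occupied by each class, usable as a likelihood estimate."""
    total = len(labels)
    return {cls: n / total for cls, n in Counter(labels).items()}

# Hypothetical reference sites: slope gradient (%) and the soil class observed there.
slope = [1.0, 2.0, 3.0, 8.0, 9.0, 12.0]
soil = ["A", "A", "A", "B", "B", "A"]

threshold, score = best_split(slope, soil)
right_node = node_proportions([s for v, s in zip(slope, soil) if v > threshold])
```

With these values the chosen rule is "slope ≤ 5.5", which isolates a pure node of class A. The remaining node is two-thirds class B, and that proportion is what would be reported as the likelihood of B at any location falling in that node. A full tree repeats the search recursively over all predictors.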
Bayesian analysis of evidence (BME) has been used to extract rules for recognizing soil classes by Bui et al. (1999); Cook et al. (1996); Corner et al. (1996) and Zhou et al. (2005). Bayesian analysis provides two useful sets of information that relate classes on input maps to predicted output classes. It provides a systematic method for quantifying the relative utility, or predictive strength, of each input layer relative to all other layers available to predict output classes. It also provides quantitative values for the probability of occurrence of any given evidence class given each of K possible output classes. Additionally, final predictions of expected output classes are constrained by consideration of a priori estimates of the relative proportions of each of K possible output classes, such that the final extent of each predicted output class matches the proportional extent provided by the estimate of prior probabilities. A limitation of BME is that it analyzes the frequency of occurrence of classes of input data relative to desired output classes. Therefore, input layers of continuous variables must first be generalized into classes, with decisions on the number of classes and the class boundaries having a potentially large effect on the subsequent results. Some advantages of BME are that it can produce estimates for the relative likelihood of occurrence of every defined output class at every location and it can identify and quantify which layers of input data are most useful for predicting output classes and which classes on each layer are associated with the highest likelihood of occurrence of any given output class of interest. BME is a powerful data mining tool that can uncover and quantify relationships between input data layers and output classes to be predicted (similar conclusion in Section 2.3.3).
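The quantities described above can be sketched in a few lines: continuous inputs are first generalized into classes, the probability of each evidence class given each output class is estimated from training sites, and these are combined with prior proportions. The sketch assumes conditional independence between input layers (a naive Bayes simplification) and add-one smoothing; both are assumptions made for illustration, not details taken from the studies cited, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def classify_value(value, breaks):
    """Generalize a continuous value into a class index using user-chosen breaks."""
    return sum(value > b for b in breaks)

def train(evidence_rows, outputs, smoothing=1.0):
    """Return a function giving P(evidence class | output class) for each layer."""
    class_counts = Counter(outputs)
    n_layers = len(evidence_rows[0])
    counts = [defaultdict(Counter) for _ in range(n_layers)]
    levels = [set() for _ in range(n_layers)]
    for row, out in zip(evidence_rows, outputs):
        for i, ev in enumerate(row):
            counts[i][out][ev] += 1
            levels[i].add(ev)

    def likelihood(layer, ev, out):
        # Smoothed relative frequency of an evidence class within an output class.
        return (counts[layer][out][ev] + smoothing) / (
            class_counts[out] + smoothing * len(levels[layer]))

    return likelihood

def posterior(row, priors, likelihood):
    """Combine priors with per-layer likelihoods and normalize to sum to 1."""
    scores = {}
    for out, prior in priors.items():
        score = prior
        for layer, ev in enumerate(row):
            score *= likelihood(layer, ev, out)
        scores[out] = score
    total = sum(scores.values())
    return {out: s / total for out, s in scores.items()}

# Hypothetical training sites: geology class, slope (%), observed output class.
geology = ["granite", "granite", "basalt", "basalt", "granite", "basalt"]
slopes = [2, 3, 20, 25, 4, 18]
outputs = ["wet", "wet", "dry", "dry", "wet", "dry"]
breaks = [5, 15]  # user-chosen slope class boundaries; results depend on this choice

rows = [(g, classify_value(s, breaks)) for g, s in zip(geology, slopes)]
likelihood = train(rows, outputs)
priors = {"wet": 0.3, "dry": 0.7}  # a priori proportions of each output class
post = posterior(("granite", classify_value(3, breaks)), priors, likelihood)
```

Here `post` holds a relative likelihood for every output class at the queried location. Changing `breaks` alters the evidence classes and can alter the result, which is the sensitivity to class boundaries noted in the text.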
Supervised approaches based on application of fuzzy logic have been described by Odeh et al. (1992) and De Gruijter and McBratney (1988), and in several implementations of the SOLIM approach of Zhu (1994) (e.g. Shi et al., 2004; see Chapter 20).
10.5.3 Heuristic (Expert Knowledge) Classification Approaches
Heuristic approaches can be used where the user has identified all classes that are to be predicted and also has a well developed set of criteria that describe the conditions under which each potential output class may occur. The expert knowledge about what classes exist and what criteria control their spatial distribution may arise from
extensive local field experience and analysis of local field observations to create locally appropriate classification rules (for examples see Qi and Zhu (2006) and Chapters 9, 19, 20 and 31). Alternatively, classes and classification criteria may be defined exclusively on the basis of theoretical knowledge and theoretical considerations about what parts of the landscape are likely to exhibit different environmental conditions and why.
Most implementations of the SOLIM approach of Zhu (1994) and Zhu et al. (2001) are based on capturing and applying local expert heuristic knowledge as fuzzy rules. SOLIM compares the values for predictor variables at each unclassified location to the values for those variables for a very limited number (1 or 2) of user-defined “instances” that define the central concepts of each class of soil to be mapped (Chapter 20). Each predictor variable is compared to the value of the variable for each defined instance and a degree of similarity is computed based on a similarity function. If the value of the predictor variable at the unclassified location is the same as for an instance, then the similarity of the two sites with respect to that variable is 1. Any difference in value can lead to calculation of a lower degree of similarity between the unclassified location and the instance with respect to that variable. In SOLIM, each unclassified site is compared to each instance with respect to each of N user-specified predictor variables. The lowest value of fuzzy similarity among all N predictor variables is used to establish the overall fuzzy similarity of an unclassified site to a particular reference site. The fuzzy similarity of a given unclassified location is computed for each of M available instances (usually fewer than 3) of a particular output class K, and the highest value of fuzzy similarity is chosen via a max function to represent the similarity of the unclassified site to that output class. A final hard classification is achieved by identifying which of the K output classes has the highest value for fuzzy similarity to any given instance and assigning this classification to that unclassified location. Published descriptions of these fuzzy methods do not make it clear whether the procedures can be made hierarchical so that every possible output class does not have to be predicted for every possible location in the data set.
However, Zhu (personal communication, 2006) has indicated that different sets of rules for different groupings of instances are commonly developed and applied within different major land areas and within different types and scales of landforms.
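The min and max logic described for SOLIM can be written out as a short sketch. The bell-shaped similarity function and its `widths` parameter are assumptions chosen for illustration (the text only requires that similarity equal 1 when values match and fall as they diverge); the function names, variables, instances and values are hypothetical and do not reproduce the SOLIM software.

```python
import math

def similarity(value, reference, width):
    """Assumed bell-shaped function: 1 at the reference value, decreasing with distance."""
    return math.exp(-((value - reference) / width) ** 2)

def instance_similarity(site, instance, widths):
    """Overall similarity to one instance: the lowest similarity over all N variables."""
    return min(similarity(site[v], instance[v], widths[v]) for v in instance)

def class_similarity(site, instances, widths):
    """Similarity to a class: the highest similarity over its M instances."""
    return max(instance_similarity(site, inst, widths) for inst in instances)

def classify(site, class_instances, widths):
    """Return the hard class (highest fuzzy similarity) and all class memberships."""
    memberships = {
        cls: class_similarity(site, insts, widths)
        for cls, insts in class_instances.items()
    }
    return max(memberships, key=memberships.get), memberships

# Hypothetical instances defining the central concept of each soil class.
class_instances = {
    "ridge": [{"slope": 20.0, "wetness": 4.0}],
    "valley": [{"slope": 2.0, "wetness": 12.0}, {"slope": 4.0, "wetness": 10.0}],
}
widths = {"slope": 5.0, "wetness": 3.0}  # assumed tolerance for each variable

site = {"slope": 4.0, "wetness": 10.0}
label, memberships = classify(site, class_instances, widths)
```

Because the minimum is taken across variables, a site must resemble an instance on every variable to score highly; because the maximum is taken across instances, matching any one instance of a class is sufficient. The retained `memberships` dictionary is the fuzzy result, and `label` is its hardened form.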
MacMillan et al. (2007) describe another example of a fuzzy heuristic approach that identifies and maps classes whose existence and defining criteria have been recognized based on extensive local field experience and analysis of field observations. In this subjective classification, the criteria and conditions that control the spatial distribution of the desired classes have been described in a classification field guide and the principal requirement is to translate this existing set of classification rules into a corresponding set of formal quantitative machine rules that can be applied to digital inputs to predict the required output classes. Implementation of this approach is not strikingly different from supervised classification approaches described above. Fuzzy semantic rules are developed, applied, evaluated and revised in an iterative manner until such time as the spatial distribution of predicted output classes corresponds closely to an expert’s expected distribution of those classes.
This is conceptually similar to asking the same expert to select a large number of training sites that are deemed to be representative of each class of interest except here, instead of using the training sites to develop rules, we use successive refinements of heuristic rules to classify entire areas that are then treated as a single large set of training data and are reviewed to see if the resulting patterns correspond with expert expectations.
Shary et al. (2002) and Sharaya and Shary (2004) describe examples of a comprehensive system of classification of surface curvatures based entirely on objective theoretical considerations of expected relationships between curvature classes and anticipated environmental conditions. This approach demonstrates that it is possible to impose a set of theoretical classification rules even without any local, empirical knowledge to guide definition of classes of interest. The resulting maps are anticipated to differentiate portions of the landscape that can be expected to exhibit significant differences in soil processes and in patterns of development of soil properties and soil classes. So, if local expert knowledge of actual patterns of soil distribution is weak or absent, it may still be possible to produce useful maps based on application of theoretical considerations only.