More often than not, the practitioner will have to compile data from several sources and assemble a single thematic dataset (or several) that covers the geographic area of the analysis. The checklist in Table 7.1 should serve as a guide for evaluating each of the datasets that have to be compiled into a thematic dataset for a Marxan analysis. In addition, the key consideration when compiling a dataset from several sources is consistency, e.g., with regard to coverage of the study area, collection protocol, units and digital format. The practitioner should strive to ensure a level of spatial, temporal and representational consistency such that the data adequately reflect reality and do not introduce unintended biases into the analysis. Finally, be certain to document each step of data preparation carefully (in metadata this is referred to as “lineage”), as this documentation is critical for repeatability and transparency in the planning process.
Spatial consistency relates to ensuring that the entire study area is sufficiently represented by the data being compiled.
• A spatial boundary must be delineated for the study area. Strive for adequate, proportional and representative data. The data should be corrected for observation effort in order to provide an unbiased picture. (IMPORTANT NOTE: If some datasets cover only part of the study area, using them will require setting up sub‐areas within the analysis, with associated targets / penalties for the features in those sub‐areas).
• It is important to distinguish between presence/absence data and presence‐only data (the latter usually consist of opportunistic presence records and lack “confirmed absence”). A feature should be considered absent from a particular area only because it was sampled for and not found, not because no sampling occurred there. This is a crucial distinction. It should be emphasised that all of the data types above are, in effect, a function of search effort. Ideally, such data should be corrected for equal search effort in space and time.
• At times implicit assumptions are made about the distribution of a feature or phenomenon. For instance, a spatial distribution may be obtained by interpolating sample data across a geographic area (e.g., depth sounding point data interpolated into a bathymetric surface). Not all data are appropriate to interpolate (e.g., sediment grab sample point data from the seafloor interpolated regionally into a substrate surface), so one should be aware of the assumptions used when creating such a dataset.
• For ocean data, depth strata (vertical resolution) can play a major role. Although satellite imagery is widely available for the sea surface, marine applications often require the inclusion of deeper parts of the water column and the seafloor. Keep in mind that data quality often degrades with depth, reflecting the greater inaccessibility of deeper waters.
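The distinction between confirmed absence and missing sampling, and the correction for search effort, can be illustrated with a minimal Python sketch. The planning unit identifiers, effort values and detection counts below are hypothetical, and detections per hour of search is only one simple way of expressing effort‐corrected data; it is not a method prescribed by Marxan.

```python
from typing import Optional


def classify_cell(effort_hours: float, detections: int) -> str:
    """Label a planning unit as 'present', 'absent' or 'unsurveyed'."""
    if effort_hours <= 0:
        # No sampling occurred, so absence cannot be claimed.
        return "unsurveyed"
    return "present" if detections > 0 else "absent"


def encounter_rate(effort_hours: float, detections: int) -> Optional[float]:
    """Detections per hour of search; None where no sampling occurred."""
    if effort_hours <= 0:
        return None
    return detections / effort_hours


# Hypothetical survey records: planning unit -> (effort in hours, detections)
surveys = {
    "PU1": (10.0, 5),
    "PU2": (2.0, 3),
    "PU3": (8.0, 0),   # searched, nothing found: confirmed absence
    "PU4": (0.0, 0),   # never searched: unknown, not absent
}

status = {pu: classify_cell(e, d) for pu, (e, d) in surveys.items()}
rates = {pu: encounter_rate(e, d) for pu, (e, d) in surveys.items()}

for pu in surveys:
    print(pu, status[pu], rates[pu])
```

In this example PU1 has more raw detections than PU2, but once effort is taken into account PU2 has the higher encounter rate, which is the kind of bias that uncorrected presence records can introduce.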
Temporal consistency in data relates to the time period over which the data were collected, particularly with features that are known to be dynamic and that may have seasonality associated with them.
• A feature may change with the seasons, and where possible it is important to identify which season(s) best represent the distribution of the feature for the purposes of the analysis. We suggest letting the data drive these temporal stratifications, rather than using a classification scheme that normalises the data into annual integers.
• Compiling data without regard to temporal variation may misrepresent the distribution of the features of interest, and can allow for key seasonal areas to be overlooked.
• Other considerations with regard to temporal variability of data are dealt with in Section 7.7 ‐ Data management and maintenance.
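The effect of pooling data across seasons can be shown with a small Python sketch. The records, planning unit identifiers and months are hypothetical, and grouping by calendar month is an assumption made for illustration; in practice the temporal strata should come from the data themselves.

```python
from collections import defaultdict

# Hypothetical records: (planning unit, month 1-12, count of the feature)
records = [
    ("PU1", 1, 40), ("PU1", 7, 0),
    ("PU2", 1, 0),  ("PU2", 7, 35),
    ("PU3", 1, 12), ("PU3", 7, 10),
]

annual = defaultdict(int)                        # pooled over the whole year
monthly = defaultdict(lambda: defaultdict(int))  # month -> planning unit -> count

for pu, month, count in records:
    annual[pu] += count
    monthly[month][pu] += count


def top_unit(counts):
    """Return the planning unit with the highest count."""
    return max(counts, key=counts.get)


peak_by_month = {month: top_unit(counts) for month, counts in monthly.items()}

print(dict(annual))
print(peak_by_month)
```

The annual totals for PU1 and PU2 look broadly similar, yet each unit holds the feature in only one of the two months. An analysis based on the pooled values alone would not reveal that the two units are important at different times of the year.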
Representational consistency relates to how data are reported and/or how a feature or phenomenon of interest is represented geographically. Within your data the same feature may be classified in various ways and be represented spatially as a point, line, area (polygon) or pixel, depending on the scale at which it was observed.
• Data being compiled from various sources that will make up a comprehensive dataset covering the project area can be reported into a common classification system where appropriate (e.g., classifying the seafloor based on primary factors such as bathymetry, geomorphology, and substrate). Relying on a known, standard classification system offers structure to the data being developed, can bring out patterns not seen in individual data layers, and may be more robust during peer review.
• However, we caution against the use of pre‐conceived classification schemes that do not necessarily represent observable conditions. Often it is advisable to use the underlying data to classify the information naturally. (NOTE: The number of classes and the method of aggregation are important decisions in terms of the amount of information going into a decision support tool and how defensible the ecological characterisation will be. Whether using natural breaks or more standardised classifications to compile the information, it is important to document your decisions clearly, as they should be based on the objectives of the project).
• Data transformation or normalisation may be required to make data comparable or consistent in the way they are represented. However, such data processing should be well documented, as it can introduce or remove skew and change the units of the data.
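As an illustration of documenting a transformation, the following Python sketch rescales two hypothetical layers reported in different units to a common 0 to 1 range and records each step in a simple lineage list. Min‐max scaling is only one possible normalisation, chosen here as an assumption; the layer names, values and units are invented for the example.

```python
def min_max(values):
    """Rescale values linearly to the range 0-1 (unitless)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


lineage = []  # plain-text record of every processing step

# Hypothetical layers from two sources, reported in different units
density_per_km2 = [2.0, 4.0, 10.0]
biomass_kg_per_ha = [50.0, 150.0, 250.0]

density_scaled = min_max(density_per_km2)
lineage.append("density: min-max scaled from individuals/km2 to unitless 0-1")

biomass_scaled = min_max(biomass_kg_per_ha)
lineage.append("biomass: min-max scaled from kg/ha to unitless 0-1")

print(density_scaled)
print(biomass_scaled)
print(lineage)
```

After scaling, the original units are lost and the relative spacing of values within each layer is preserved, so the lineage entries are the only record of what the numbers originally meant.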
Over large planning/analysis areas it is often difficult to attain a high level of consistency with regard to the above elements, simply because sampling is very uneven across the area. In such cases the planning area will need to be sub‐divided to account for data biases. Often smaller sub‐areas are constructed based on political or ecological divisions within the larger study area (eco‐regions or sections, watersheds or watershed councils, international jurisdictions) within which some level of data consistency can be obtained. Care is required, however, as such spatial stratification can suffer from a lack of understanding of the actual differences between sub‐areas (e.g., physical parameters such as surface current patterns) and may therefore misrepresent biological dispersal patterns.
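One way to picture sub‐area stratification is to treat a feature in each sub‐area as its own entry with its own target. The Python sketch below does this for hypothetical planning units, sub‐area names and amounts; the 25% target proportion is an arbitrary assumption for illustration, not a recommended value.

```python
from collections import defaultdict

# Hypothetical records: (planning unit, sub-area, feature, amount)
amounts = [
    ("PU1", "north", "kelp", 10.0),
    ("PU2", "north", "kelp", 30.0),
    ("PU3", "south", "kelp", 5.0),
    ("PU4", "south", "kelp", 15.0),
]

TARGET_PROPORTION = 0.25  # assumption for illustration only

# Total amount of each feature within each sub-area
totals = defaultdict(float)
for _pu, sub_area, feature, amount in amounts:
    totals[(feature, sub_area)] += amount

# One target per (feature, sub-area) pair instead of one for the whole region
targets = {key: TARGET_PROPORTION * total for key, total in totals.items()}

for (feature, sub_area), target in sorted(targets.items()):
    print(feature, sub_area, target)
```

Setting targets per sub‐area in this way prevents a well‐sampled sub‐area from satisfying the whole target on its own, although it does not by itself address the ecological differences between sub‐areas noted above.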