Remote Sensing of Environment 284 (2023) 113332
Available online 3 November 2022
0034-4257/Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Optimizing WorldView-2, -3 cloud masking using machine learning approaches
J.A. Caraballo-Vega a,*, M.L. Carroll a, C.S.R. Neigh b, M. Wooten b, B. Lee c, A. Weis b, M. Aronne a, W.G. Alemu b, Z. Williams a

a Computational and Information Science and Technology Office, Code 606, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA
b Biospheric Sciences Laboratory, Code 618, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA
c The Bren School of Environmental Science & Management, University of California Santa Barbara, Santa Barbara, CA 93117, USA
A R T I C L E  I N F O

Edited by Dr. Menghua Wang

Keywords: Cloud detection, Machine learning, Random forest, Convolutional neural network (CNN), VHR (very high resolution), WorldView, Cloud shadow

A B S T R A C T
The detection of clouds is one of the first steps in the pre-processing of remotely sensed data. At coarse spatial resolution (>100 m), clouds are bright and generally distinguishable from other landscape surfaces. At very high-resolution (<3 m), detecting clouds becomes a significant challenge due to the presence of smaller features, with spectral characteristics similar to other land cover types, and thin (partially transparent) cloud forms.
Furthermore, at this resolution, clouds can cover many thousands of pixels, making both the center and boundaries of the clouds prone to pixel contamination and variations in spectral intensity. Techniques that rely solely on the spectral information of clouds underperform in these situations. In this study, we propose a multi-regional and multi-sensor deep learning approach for the detection of clouds in very high-resolution WorldView satellite imagery. A modified UNet-like convolutional neural network (CNN) was used for the task of semantic segmentation in the regions of Vietnam, Senegal, and Ethiopia using only RGB + NIR spectral bands. In addition, we demonstrate the superiority of the CNN, with cloud mapping accuracies of 81–91%, over traditional methods such as Random Forest algorithms (57–88%). The best performing UNet model has an overall accuracy of 95% across all regions, while the Random Forest has an overall accuracy of 89%. We conclude with promising future research directions of the proposed methods for a global cloud cover implementation.
1. Introduction
Atmospheric moisture, clouds and aerosols can be dense and opaque, persistently obscuring the land surface of the Earth. This often makes monitoring of, and quantifying changes to, Earth surface features challenging with spaceborne optical imagers. In the Tropics, the presence of clouds affects 35–66% of all image data collected (Ju and Roy, 2008; Platnick et al., 2003). Furthermore, cloud coverage in high-latitude Arctic regions can often reach 70 to 90% (Garrett and Zhao, 2006). As a result, detecting clouds is one of the first steps in processing space-based remotely sensed visible and infrared data (Simpson et al., 2000; Yang et al., 2022).
National Aeronautics and Space Administration (NASA) Enterprise-class satellite missions that are super-spectral (>15 bands), such as the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Visible Infrared Imaging Radiometer Suite (VIIRS), have multi-spectral and thermal bands designed for the detection and characterization of
clouds (Platnick et al., 2003; Frey et al., 2008; Ackerman et al., 1998;
King et al., 2003). At coarse spatial resolution (>100 m), clouds are bright and generally distinguishable from oceans and vegetated surfaces (Ackerman et al., 1998). Primary features of confusion between clouds and bright surface features in visible wavelengths are snow, white sand beaches, and saltpans. In these cases, the thermal infrared bands can aid in separation of bright but hot surfaces from bright but cold surfaces.
The Landsat series of imagers have their own cloud detection methods (Foga et al., 2017; Vermote et al., 2016; Zhu and Woodcock, 2012).
Similar to MODIS and VIIRS, these methods use a combination of the visible reflectance and thermal infrared bands to determine cloud presence and cloud extent.
The proliferation of very-high-resolution (VHR; <3 m ground sampling distance, GSD) remote sensing constellations from companies like Maxar Technologies (WorldView), Planet (SkySat), and China Siwei (GaoJing), among others, has given us the opportunity to investigate smaller features of the landscape with greater precision. There are
* Corresponding author.
E-mail address: [email protected] (J.A. Caraballo-Vega).
https://doi.org/10.1016/j.rse.2022.113332
Received 15 March 2022; Received in revised form 24 October 2022; Accepted 26 October 2022
2019, 2022). The majority of traditional methods have established rule-based systems in which the spectral characteristics of individual pixels are taken into account when segmenting cloud and not-cloud pixels (Braaten et al., 2015; Fisher, 2014; Kwan et al., 2020; Li et al., 2017).
Fisher (2014) used morphological image processing techniques and thresholding to identify clouds. Kwan et al. (2020) detected clouds using thresholds on pre-processed same-date Landsat and WorldView imagery composites, adding additional complexity and data volume to the cloud characterization process. These pixel-only thresholding methods have proven not to be scalable or suitable for the classification of clouds in VHR imagery due to the large spectral variability between cloud and non-cloud regions, including cloud edges and seasonal change effects (Kwan et al., 2020; Li et al., 2017).
More accurate cloud masks are needed to address challenges that clouds impose on the interpretation of VHR data. Methods that adopt a spatial approach can account for both the spectral response of individual pixels as well as those from neighboring pixels. These methods are computationally more expensive but are increasingly possible due to the wide availability of high-end computational systems, including commercial cloud assets and specialized acceleration hardware such as GPUs. In recent years, with the rapid development in machine learning and its proven applicability to Earth Science (Carroll et al., 2010; Carroll and Loboda, 2017; Elders et al., 2022; Schnase and Carroll, 2022;
Shirmard et al., 2022b; Thessen, 2016; Thomas et al., 2020), supervised machine learning methods have shown great success in the processing of remote sensing imagery (DeVries et al., 2017; Diaz-Gonzalez et al., 2022; Hoffman-Hall et al., 2019; Shirmard et al., 2022a). In comparison with the threshold methods, machine learning based cloud detection methods have further enhanced classification systems by automating the detection process and improving detection performance (Xie et al., 2017).
Supervised machine learning algorithms (decision trees, random forests, neural networks, etc.) work by taking training data provided by the operator and teaching the computer to detect similar features in the data. This training dataset includes inputs and correct outputs, which allow the model to learn over time. For cloud segmentation, a variety of machine learning methods have been employed (Li et al., 2017, 2022; L.
Wang et al., 2018a; Xie et al., 2017; Yan et al., 2022; Zhan et al., 2017;
Zhang et al., 2022). Xie et al. (2017) and Chen et al. (2018) developed complex multi-stage deep neural networks with color space transformations for the segmentation of clouds. Zhan et al. (2017) developed a deep learning model with large extents of training data for the classification of both clouds and snow from Gaofen-1 imagery. Wang et al. (2018a) characterized clouds and snow as an object-based problem using an ensemble of convolutional neural networks. Segal-Rozenhaimer et al. (2020) developed a CNN for the detection of clouds mainly located in water bodies. Furthermore, Segal-Rozenhaimer et al. (2020) studied the transferability and adaptation of CNN models trained solely on WV-2 imagery applied to Sentinel-2 data. Matsunobu et al. (2021) studied the applicability of transfer learning techniques using CNNs for
spatial context in addition to spectral information to perform image classification (Hughes and Hayes, 2014). With “semantic segmentation”, each pixel is still interrogated and assigned an output class, but consideration is given to the “context” of the neighboring pixels before assigning the class. In practice, this could result in keeping visually contiguous features together (Wang et al., 2018b). Here we investigated the application of both RF and CNN algorithms to identify clouds in multispectral data from WorldView-2 and WorldView-3 (henceforth referred to as WV) satellites. Quantifying Earth surface features in high detail across broad areas allowed us to address persistent uncertainties caused by clouds in VHR observations. By manually labeling cloud and non-cloud pixels, we were able to establish reference data and apply the trained models across multiple regions and seasons.
Manual labelling of clouds included a diverse set of dense and thin cloud pixels to further increase the diversity of the training dataset.
Furthermore, we demonstrated how our deep learning-based system could be scaled across several regions and dates for the classification of clouds based on texture and features by relying solely on RGB + NIR bands and geometric transformations during training.
2. Study area
We selected three study areas to evaluate and characterize the proposed models for masking clouds. The Vietnam Mekong River Delta was the primary study area due to the tropical climate and monsoon season, which provides extensive cloud cover and changes to the land surface from seasonal flooding (Fig. 1 left). The selected Vietnam Mekong River Delta study area covered a total of 93 different dates from 2009 to 2019.
Each one of these dates represents a different combination of day, month, and year across different geospatial locations. Northern and Central Senegal (Fig. 1 middle) were assessed due to highly diverse land cover types and spectral variance between wet and dry seasons. This study area adds additional complexity to the validation of the model and was analyzed across 132 dates from 2011 to 2019. The third study area covers the Amhara Region located in Northwest Ethiopia between 2009 and 2021 (Fig. 1 right). This area was selected due to its complex topography (>4000 m relief difference) and extensive cloud cover during its summer monsoon season, and was assessed across 108 dates.
3. Data and computational resources
WorldView provided the VHR spaceborne imagery used to examine cloud masking methods and was available from the National Geospatial-Intelligence Agency (NGA), under the NextView license agreement (Neigh et al., 2013). The data was processed on the National Aeronautics and Space Administration's (NASA) Advanced Data Analytics PlaTform (ADAPT) at the NASA Center for Climate Simulation (NCCS, https://www.nccs.nasa.gov). A sub-cluster of ADAPT includes the NCCS Promoting Research In Science using Machine learning (PRISM) graphical processing units (GPUs), which were leveraged in CNN model training, development, and
inference (https://www.nccs.nasa.gov/systems/ADAPT/Prism). All scenes were processed through the Enhanced VHR processing tool (EVHR) (Neigh et al., 2019) to orthorectify, convert to top-of-atmosphere reflectance, and pan-sharpen multi-spectral imagery to 1 m spatial resolution GeoTIFFs. This processing provided consistent base imagery, as GSD differs across the WV constellation. Multi-spectral bands from WorldView-3 range from 1.24 to 1.38 m (20° off nadir) and from WorldView-2 range from 1.85 to 2.1 m (20° off nadir), respectively (DigitalGlobe, 2012; DigitalGlobe, 2014). Table 1 shows the bandwidth information of the WorldView-3 and WorldView-2 bands used in this work. We focused on 4-band imagery because that is how most commercial VHR data is collected.
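For reference, top-of-atmosphere reflectance is conventionally computed from calibrated at-sensor radiance as (the general form; EVHR's exact implementation may differ):

\rho_\lambda = \frac{\pi \, L_\lambda \, d^2}{ESUN_\lambda \, \cos\theta_s}

where L_\lambda is the at-sensor spectral radiance, d the Earth–Sun distance in astronomical units, ESUN_\lambda the band-averaged exoatmospheric solar irradiance, and \theta_s the solar zenith angle.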
Wet and dry seasons across all study areas were selected for the analysis of clouds. The training and validation data acquired from Vietnam, Senegal, and Ethiopia include multiple dry and wet season images. Several common land cover types present in the training data are agricultural and burned fields, flooded fields, and large extents of trees and shrubs. A significant set of bright structures covered by urban areas is included in the training and labeled as not cloud. The Vietnam training data locations in particular were focused on areas where land cover change fieldwork was conducted in 2019 within the study domain (Haynes, 2020). The inclusion of a diversified set of land cover types and dates allows the model to better generalize across regions and surface extents.
4. Methods
4.1. Training data development
A training dataset was created from a representative sample of the WV imagery by selecting 5000 × 5000 pixel subsets from four different images from each of the three study areas (Fig. 1). Subsets were manually annotated by domain experts to identify clouds/no-clouds (including labeling buildings and other bright features as not clouds) (Fig. 2). The annotated images served as a “superset” of training and validation data from which a random selection was made to train the respective model (RF or CNN). Approximately 15% and 20% of the annotated data were used to train the RF and CNN models, respectively.
Several iterations of annotation were performed to refine the edges of the cloud features and to add in bright features that caused false
positives when the trained models were applied to full scenes. In addition, an additive approach was taken in order to test the generalization and transferability potential of both models based on a diversified input of features. A combination of wet and dry season samples was used to further diversify input features for model training. Combinations of Vietnam-only and Vietnam-Senegal training data were studied when generating the training dataset.
It is important to note that the inputs used by RF and CNN as training data differ. RF models require individual pixels whereas the CNN requires contiguous “tiles” (see the sketch below). Although both RF and CNN draw their training data from the same superset of images, the pixels used by the two techniques may vary slightly (Fig. 2). In addition, we train our models from the ground up for both RF and CNN rather than starting with pretrained models from SpaceNet or similar (You et al., 2021). This is important because even though pretrained models may speed training, they come with liabilities, including limitations in the pre-processing, sizing of individual tiles, and image bit depth (8-bit vs 16-bit), that can degrade the final model. By training from the ground up, using a seed-controlled randomized weight initialization method, we retain full control over final model behavior.
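A minimal sketch of how the two kinds of training inputs can be drawn from the same annotated superset is given below; array shapes and helper names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sample_pixels(image, labels, n_per_class=1_000_000, seed=42):
    """Balanced per-pixel samples (rows = pixels, columns = bands + label) for the RF."""
    rng = np.random.default_rng(seed)
    rows = []
    for cls in (0, 1):  # 0 = not cloud, 1 = cloud
        r, c = np.nonzero(labels == cls)
        idx = rng.choice(len(r), size=min(n_per_class, len(r)), replace=False)
        rows.append(np.column_stack([image[:, r[idx], c[idx]].T,
                                     np.full(len(idx), cls)]))
    return np.concatenate(rows)

def sample_tiles(image, labels, n_tiles=1000, size=256, seed=42):
    """Random contiguous tiles (tile, height, width, bands) and label tiles for the CNN."""
    rng = np.random.default_rng(seed)
    _, H, W = image.shape  # image is (bands, H, W)
    rs = rng.integers(0, H - size, n_tiles)
    cs = rng.integers(0, W - size, n_tiles)
    X = np.stack([image[:, r:r+size, c:c+size].transpose(1, 2, 0) for r, c in zip(rs, cs)])
    y = np.stack([labels[r:r+size, c:c+size] for r, c in zip(rs, cs)])
    return X, y
```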
4.2. Model development
Fig. 1. Pool of training data sites, shown with yellow polygons, located in the Vietnam Mekong River Delta region (left), Senegal (center), and Amhara Region of Northwest Ethiopia (right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
Bandwidth information of the satellite imagery used (WorldView-2/WorldView-3 multi-spectral bands).
Band 1 (Blue): 0.450–0.510 μm
Band 2 (Green): 0.510–0.580 μm
Band 3 (Red): 0.630–0.690 μm
Band 4 (Near-Infrared): 0.770–0.895 μm

1) RF classification: A Random Forest model created with the NVIDIA RAPIDS cuML library was used to predict cloud and no-cloud categorical variables across all study areas. RF uses an ensemble of decision trees to determine the best outcome from a random selection of sample data and independent variables (Belgiu and Drăguţ, 2016; Breiman, 2001). The RF algorithm has been used extensively in remote sensing due to its ease of implementation and accuracy, which is determined by validating a subset of data not used for training, usually one-third of the total input dataset (Belgiu and Drăguţ, 2016; Hoffman-Hall et al., 2019; Huang et al., 2018; Vuolo et al., 2018). For this work, a dataset of 2 million randomly sampled pixels (one million each from Vietnam and Senegal) was generated in tabular form to serve as the training dataset. Care was taken to balance the number of cloud and not-cloud pixels in the training dataset. A total of 20 decision trees were used to train the model after extensive cross validation, with 20% of the data held back for training validation. During development, tests with more trees (50, 100 and 200) did not yield significantly different model results, so we settled on a small number of trees for the final model for efficiency. The maximum number of features, which is the number of features to consider when computing the best node split, was set to log base 2 (Breiman, 2001). The model was then trained using the tabular data and saved as a model object. This object was incorporated into a separate script and applied to the full raster dataset for map creation and evaluation of results. The GPU-accelerated average training time was <1 min.
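A minimal sketch of the RF configuration described above (20 trees, log2 maximum features, balanced pixel samples, 20% hold-out) is given below. It uses the scikit-learn interface, which the GPU RandomForestClassifier in NVIDIA RAPIDS cuML mirrors; hyperparameter names and defaults are assumptions rather than the authors' exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# `pixels`: (n_samples, n_bands + 1) table of band values plus a 0/1 cloud label,
# e.g. as produced by the sampling sketch above (synthetic stand-in shown here).
rng = np.random.default_rng(42)
pixels = np.column_stack([rng.random((10_000, 4)), rng.integers(0, 2, 10_000)])

X_train, X_val, y_train, y_val = train_test_split(
    pixels[:, :-1], pixels[:, -1], test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=20, max_features="log2", n_jobs=-1)
rf.fit(X_train, y_train)
print("held-out accuracy:", rf.score(X_val, y_val))

# For map creation, a (bands, H, W) raster is reshaped to (H*W, bands),
# passed to rf.predict, and reshaped back to (H, W).
```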
2) CNN classification: A slightly modified UNet-based (Ronneberger et al., 2015) CNN was developed in TensorFlow (an open-source machine learning framework) for the task of semantic segmentation (Fig. 3). The network consists of a contracting path (encoder) and an expansive path (decoder) for both learning and precise localization. Each block in the encoder path consists of two 3 × 3 convolutions, followed by batch normalization and a 2 × 2 max pooling operation with stride 2 for downsampling. Because of these layers, the size of the image gradually reduces, while the depth gradually increases, enabling the extraction of contextual information. The decoder section consists of an upsampling of the feature map followed by a 2 × 2 convolution, and a concatenation of encoding and decoding blocks for precise localization. These decoder blocks gradually increase the size of the image, while gradually decreasing its depth up to the final layer that returns the precise output. A set of 16,000 256 × 256 pixel “tiles” with some pixels of overlap were randomly subset from the larger superset of annotated imagery, where 80% of the tiles were used for training and 20% for training validation to monitor model performance. The Adam optimizer (Défossez et al., 2020) with a batch size of 128 was used during multi-GPU training. Several empirical trials showed that changing the initial learning rate of Adam did not significantly improve the training performance. The learning rate and the exponential decay rates for the first and second moments were set to 0.001, 0.9, and 0.999, respectively. The binary cross entropy loss was used for penalizing the model when training classifications did not match true labels (Zhang and Sabuncu, 2018). Test-time data augmentation techniques were employed using geometric transformations directly embedded in the training loop. Early stopping with a patience of 20 was used to monitor the validation loss of the model and to prevent overfitting. The epochs with the lowest training and validation loss defined the final model. The average training time using 4 NVIDIA V100 GPUs was 3 h after 152 epochs.
The trained model was then applied to the full raster dataset for map creation, as in the RF model.
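A minimal sketch of a UNet-like network consistent with the description above (two 3 × 3 convolutions with batch normalization per block, 2 × 2 max pooling, an upsampling decoder with skip connections, Adam with the stated learning rate and decay rates, binary cross entropy, and early stopping) is shown below. Filter counts, depth, and data-pipeline details are illustrative assumptions rather than the authors' exact model.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by batch normalization.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    return x

def build_unet(input_shape=(256, 256, 4), base_filters=32, depth=4):
    inputs = layers.Input(input_shape)
    x, skips = inputs, []
    for d in range(depth):                         # encoder
        x = conv_block(x, base_filters * 2 ** d)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 2 ** depth)   # bottleneck
    for d in reversed(range(depth)):               # decoder with skip connections
        x = layers.Conv2DTranspose(base_filters * 2 ** d, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skips[d]])
        x = conv_block(x, base_filters * 2 ** d)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel cloud probability
    return Model(inputs, outputs)

model = build_unet()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # beta_1=0.9, beta_2=0.999 are the defaults
              loss="binary_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                              restore_best_weights=True)
# model.fit(train_tiles, train_masks, validation_split=0.2,
#           batch_size=128, epochs=200, callbacks=[early_stop])
```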
4.3. Post processing
Post-processing was used to handle obvious errors that persisted after several iterations of model development with both RF and CNN models. These errors were mostly associated with small bright features.
To address this, post-processing procedures including sieve and median filtering were implemented to improve the representation of the clouds via edge-smoothing and small-object removal. The first post-processing step was to reduce false detections by performing area thresholding, which sieved out any wrongly segmented objects whose size was <50 pixels and replaced them with the value of their largest neighbor. Then, a median filter with a 25 × 25 kernel was applied by iterating through the output and replacing each value with the median value of its neighboring pixels. This second post-processing step allows for a more noise-free representation of clouds. The post-processing was applied equally to both CNN and RF results. Cloud shadows were not assessed in this study and are being studied in parallel to this manuscript. The initial pre-processing required for the work supporting this research did not include the analysis of cloud shadows, thus the authors prioritized the development and deployment of accurate and optimized models for the classification of clouds.

Fig. 2. Flow chart of the cloud detection scheme for both RF and CNN methods. Top-of-atmosphere (TOA) reflectance is calculated within the EVHR step. Post-processing steps are applied individually to each model output. The classification output is a Cloud Optimized GeoTIFF (COG) file with embedded cloud coverage statistics.

Fig. 3. Modified UNet architecture. Additional batch normalization layers are included in each convolutional block.
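A minimal sketch of the two post-processing steps (a sieve filter with a 50-pixel threshold followed by a 25 × 25 median filter) is shown below; the file path and single-band mask layout are illustrative assumptions.

```python
from osgeo import gdal
from scipy.ndimage import median_filter

ds = gdal.Open("cloud_mask.tif", gdal.GA_Update)
band = ds.GetRasterBand(1)

# Step 1: sieve connected objects smaller than 50 pixels, replacing them with the
# value of their largest neighbor (args: src band, mask, dst band, threshold,
# connectedness).
gdal.SieveFilter(band, None, band, 50, 8)

# Step 2: 25x25 median filter to smooth cloud edges and remove residual speckle.
mask = band.ReadAsArray()
band.WriteArray(median_filter(mask, size=25))
ds.FlushCache()
```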
4.4. Assessing map and model accuracy
Accuracy is measured in two ways: first, during model training some data are held back and used for cross validation to get the “model-derived accuracy”; second, an independent verification/validation is performed using the analyst-interpreted pixels described below. Verification is the assessment of the final map accuracy and is essential to understand the true performance of the map created from the model, as the model-derived accuracy only describes how well the model can reproduce the training data. Furthermore, validation quantitatively describes how well the trained model produces an accurate map. For verification of the maps generated by the models, we employed the best practices for estimation and accuracy assessment (Olofsson et al., 2014) protocols to produce scientifically rigorous and transparent estimates of accuracy and model performance. Using a pixel-based probability
sampling design, 970 individual pixels from 20 different dates across each study area were randomly selected, totaling 2910 individual pixels, to represent the area of interest and to constitute the validation dataset.
The spatial assessment unit in this work was a WV pixel (1 m × 1 m), with all pixels from the reference data being taken into consideration at the time of sampling to better simulate cloud cover conditions. These pixels were then manually interpreted by an analyst to ensure correctness and used to generate a confusion matrix with the accuracy assessment.
5. Results
The trained models were run on full scenes (i.e., the complete scenes from which the training squares were subset) for 333 samples across the three study areas (Fig. 1) to produce cloud masks for those images.
Overall, both models were able to identify clouds in the imagery with qualitatively reasonable outputs (Fig. 4). Closer inspection showed differences in features identified by each model (Figs. 5–9).
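One common way to apply a tile-trained segmentation model to a full scene is to slide a 256 × 256 window across the raster and threshold the predicted per-pixel cloud probability; the sketch below illustrates this mechanic and is an assumption rather than the authors' inference pipeline.

```python
import numpy as np

def predict_scene(model, scene, tile=256, threshold=0.5):
    """scene: (H, W, bands) array; returns a binary cloud mask of shape (H, W)."""
    H, W, _ = scene.shape
    mask = np.zeros((H, W), dtype=np.uint8)
    for r in range(0, H, tile):
        for c in range(0, W, tile):
            window = scene[r:r+tile, c:c+tile]
            pad_r, pad_c = tile - window.shape[0], tile - window.shape[1]
            window = np.pad(window, ((0, pad_r), (0, pad_c), (0, 0)))  # pad edge tiles
            prob = model.predict(window[np.newaxis], verbose=0)[0, ..., 0]
            mask[r:r+tile, c:c+tile] = (prob[:tile - pad_r, :tile - pad_c] > threshold)
    return mask
```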
5.1. Model results and map validation
In the experiments, RF and CNN methods were adopted to assess their effectiveness across all study areas. Both the RF and CNN models were trained iteratively with varying amounts (number of training pixels/tiles, respectively) of training and with different combinations of spectral bands and calculated spectral indices. Furthermore, we applied cross validation techniques to find the optimal parameters for both the
Fig. 4. True color (red, green, blue) WorldView imagery overlain with Random Forest (RF) and Convolutional Neural Network (CNN) results (left to right). Both models provide qualitatively reasonable outputs across the larger regions. © 2012, 2018 DigitalGlobe, Inc., a Maxar company, NextView License. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
RF and the CNN by considering overall accuracy and map validation output. Regardless of the combination of bands or the method there remained some features that the models themselves could not avoid.
Mostly these were small bright features such as bright roofs and bright land features (saltpans, sandy/rocky river edges, etc.) as shown in Fig. 5.
The features that carried through were essentially consistent, so standard methods for cleaning the results were applied, including sieve and median filtering. These filtering methods reduced false positives with a minimal loss of actual cloud features (Fig. 5).
A closer evaluation of 100 points that were incorrectly delineated as clouds in maps produced by both methods indicates that these pixels fall into three categories: (1) building structures within urban areas; (2) roads; (3) points in fields and sandy soils that are visually brighter than common land features. Fig. 5 shows several of these commission errors present in both RF and CNN. The CNN in particular identified as
‘cloud’ more heterogeneous pixels not completely covered by clouds.
Over half of the pixels misclassified as clouds by the RF and CNN were related to building structures as shown in Fig. 5.
A close-up of three cloud instances and a visual comparison of the RF and CNN results is shown in Fig. 6. The trend of RF omission at cloud boundaries was clearly visible in the presence of both dense and thin clouds. Comparing CNN results to RF, it is evident that the CNN detected cloud edges more precisely. Both RF and CNN models produced confidence values higher than 80% for the centroid of the cloud. Because of the lower confidence rating around the cloud edges, cloud boundaries are eventually left out. These segmented cloud confidence values are consistent across the different cloud types for both models. In most
instances, the CNN’s confidence values around cloud edges remained higher than 60% as shown in Fig. 6.
5.2. Regional results and map validation

5.2.1. Vietnam Mekong River Delta
The overall training accuracy of the RF and the CNN in the Vietnam Mekong River Delta region was 99% and 97%, respectively (Table 2). This first metric was calculated from the 20% of data withheld from training to monitor model performance and out-of-bag (OOB) accuracy. The lowest overall loss calculated from the withheld test data of the CNN was obtained after 23 epochs (0.137), thus the result from this epoch was chosen for the final model. Additional metrics such as precision, recall and F1 scores were evaluated to determine the efficacy of the models during training (Table 2). Accuracy is the total number of cloud pixels correctly classified, divided by the total number of clouds in the validation subset. Precision reflects the proportion of positive identifications of clouds that was correct, while ‘Recall’ identifies the proportion of actual positives that were identified correctly. Hence, the F1 score combines both precision and recall.
The RF model performed very well on the validation dataset withheld during training (Table 2). This high training performance could be due to the tendency of the RF to better predict areas of similar spectral response and to overfit. Nonetheless, the overall accuracy of the RF when applied to the wider validation dataset, which included additional dates and sites, was 87%, while the CNN overall accuracy was 94% (Table 3). Fig. 7 shows a qualitative observation of several RF and CNN classifications. Many pixels from neighboring areas with bright fields and buildings were misclassified as clouds. In many cases, the smaller structures were corrected by the post-processing applied to the final product, with a minimum mapping unit, although larger structures were still part of the commission error group presented by the RF model.

Fig. 5. True color (red, green, blue) WorldView imagery overlain with Random Forest (RF) and Convolutional Neural Network (CNN) results (left to right). The red mask outlines cloud delineated pixels before post-processing. Blue mask represents the final mask after post-processing is applied. More than half of the misclassified pixels belong to building structures. © 2012, 2019 DigitalGlobe, Inc., a Maxar company, NextView License. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Although the RF slightly outperformed the CNN model in the ‘recall’
metric, Fig. 7 shows how the RF identifies several thin cloud features in building areas not retrieved by the CNN. For the purposes of this study, these thin clouds are not considered because the underlying surface features were still clear and visible. Furthermore, the RF performs notably well in the validation dataset extracted from only four different dates (Table 2) but is outperformed by the CNN in precision and accuracy when classifying the additional twenty-three Vietnam locations (Table 3).
To further compare the performance of each model across the Vietnam study area, various confusion matrices were produced by classifying the subset of data annotated for validation. In general, a confusion matrix compares a class mapped by the model to its actual or reference class, allowing errors of commission and omission, as well as classification skill for each class, to be easily identified. Commission error refers to the number of pixels classified in an incorrect class, while omission error refers to those omitted from their correct class.
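As a worked example (not the authors' code), the user's, producer's, and overall accuracies reported below can be reproduced directly from the confusion matrix counts; the snippet uses the RF counts from Table 4, with rows as predicted classes and columns as reference classes.

```python
import numpy as np

# RF confusion matrix from Table 4 (rows = predicted, columns = reference):
#                 ref not-cloud  ref cloud
cm = np.array([[614,  92],   # predicted not cloud
               [ 31, 233]])  # predicted cloud

users_acc = np.diag(cm) / cm.sum(axis=1)      # per predicted class (1 - commission error)
producers_acc = np.diag(cm) / cm.sum(axis=0)  # per reference class (1 - omission error)
overall = np.trace(cm) / cm.sum()
print(users_acc, producers_acc, overall)      # ~[0.870 0.883] [0.952 0.717] 0.873
```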
The results of the accuracy assessment show that the overall accuracy of the RF and the CNN in Vietnam was 87.3% and 93.5% (Table 4)
respectively. To find the most problematic classification features, pixels from additional traits were identified. For the RF, ~50% of the pixels identified as errors of commission were due to the misclassification of building structures, with a few related to bright and flooded fields. In contrast, only a small proportion (~2%) of the pixels misclassified by the CNN were part of bright fields, and these were removed by the sieve filtering applied at the post-processing stage. Although the RF cloud producer's accuracy is higher than that of the CNN, this can be explained by the overestimation of the RF in areas where buildings and thin clouds are present. Overall, the CNN outperforms the RF both qualitatively and quantitatively on WV multi-spectral imagery in the Vietnam Mekong River Delta.
5.2.2. Senegal River Valley
The results of the accuracy assessment show that the overall accuracy of the RF and the CNN in Senegal, using a model trained solely with Vietnam data, was 87.5% and 93.8% respectively (Table 5). These metrics suggest generalization potential for both the RF and the CNN. At the same time, the cloud class user's accuracy of both the RF and the CNN (Table 5) decreased significantly, increasing the commission error. This decrease is a clear indicator that additional fine-tuning is required for these models to better classify clouds in the Senegal study area.
Models were then re-trained by combining the previously generated datasets from Vietnam with manually annotated data from Senegal. The data from Senegal included different dates across wet and dry seasons, along with the inclusion of pixels that covered visually inspected bright fields to improve producer's accuracy. At the same time, additional thin-cloud pixels were included in the training dataset to improve user's accuracy. As a result, the accuracy assessment shows that the overall accuracy of the RF and the CNN in Senegal, using a model trained with data from Vietnam and Senegal, was 88.4% and 97.3% respectively (Table 6). Furthermore, both the user's and producer's accuracy were improved (Table 6).

Fig. 6. True color (red, green, blue) WorldView imagery overlain with Random Forest (RF) and Convolutional Neural Network (CNN) probability maps (left to right). The red mask shows the probabilities for each model's classification of cloud pixels. The higher the probability, the more confident the model is in classifying the pixel as cloud. CNN outperforms RF in identifying cloud edges on both dense (top) and thin (middle) clouds. © 2017 DigitalGlobe, Inc., a Maxar company, NextView License. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 8 shows visual representations of selected scenes with the inclusion of Senegal training data. When compared with the Vietnam accuracy assessment, both the RF and CNN improved their ability to better classify building structures, urban areas, and bright fields present in Senegal. Furthermore, the inclusion of additional training data significantly improved the CNN's cloud user's accuracy. RF cloud user's accuracy remained low (54.2% with Vietnam-only training and 57.7% after re-training; Tables 5 and 6), with the RF clearly missing cloud pixels when compared to the CNN (Fig. 8).
Fig. 7. True color (red, green, blue) WorldView imagery overlain with Random Forest (RF) and Convolutional Neural Network (CNN) results (left to right). RF incorrectly classifies building structures and water sources as cloud pixels (top, delineated with red circles and ovals). CNN outperforms the RF in the correct identification of cloud edges and thin clouds (middle). RF incorrectly classifies bright building structures and canals as clouds (bottom). © 2013, 2016, 2019 DigitalGlobe, Inc., a Maxar company, NextView License. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
5.2.3. Northwest Ethiopia
We transferred the model trained with Vietnam-Senegal data to Northwest Ethiopia to further test its ability to generalize across an additional study area. The RF and CNN models were applied to a diverse set of individual samples that included the presence of arid locations prone to bright soil in fields, and urban areas where bright buildings are generally present. The validation dataset was designed to include different seasonality across individual study samples. The results of the accuracy assessment show that the overall accuracy of the RF and the CNN in Ethiopia, using a model trained with Vietnam and Senegal data, was 89.8% and 93.9% respectively (Table 7). These metrics suggest
generalization potential for both the RF and the CNN. Both the RF and CNN perform remarkably well in the proper identification of not-cloud pixels, with user's accuracies of 99.2% and 98.3% respectively.
Fig. 9 (top) shows several areas where the RF misses thin cloud pixels, directly associated with the low cloud user's accuracy of 63.2%. Fig. 9 (middle) shows a problematic area where both the RF and CNN miss several thin clouds neighboring additional thick clouds. The superior overall accuracy of the CNN over the RF is a clear indicator of the superior ability of the CNN to generalize over unseen samples across different study areas.
Fig. 8. True color (red, green, blue) WorldView imagery overlain with Random Forest (RF) and Convolutional Neural Network (CNN) results (left to right). RF incorrectly classifies cloud pixels as not-cloud pixels (top, delineated with red circle). RF misses thin cloud pixels and classifies them as not-cloud (middle). CNN outperforms the RF in the correct identification of cloud edges and thin clouds (bottom). © 2012, 2017, 2019 DigitalGlobe, Inc., a Maxar company, NextView License.
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
5.3. Overall validation
Comparing maps from different dates and locations exposed these models to a wider set of spectrally heterogeneous pixels to further assess their performance. The combined accuracies of the RF model across regions show 70.0% user's accuracy (reflecting errors of commission) and 83.8% producer's accuracy (reflecting errors of omission) for the cloud class (Table 8). In contrast, combined accuracies across regions for the cloud class from the CNN model show 86.1% user's accuracy and 94.2% producer's accuracy (Table 8). The overall accuracies of the RF and CNN across all study areas were 88.5% and 94.9% respectively.

Fig. 9. True color (red, green, blue) WorldView imagery overlain with Random Forest (RF) and Convolutional Neural Network (CNN) results (left to right). RF incorrectly classifies thin cloud pixels as not-cloud pixels (top). Both RF and CNN miss thin cloud pixels and classify them as not-cloud (middle, delineated with red circles). CNN outperforms the RF in the correct identification of cloud edges and thin clouds (bottom). © 2011, 2016, 2019 DigitalGlobe, Inc., a Maxar company, NextView License. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2
RF and CNN validation metrics from the withheld test data.
Model Type  Accuracy  Precision  Recall  F1-Score
RF          99%       99%        99%     99%
CNN         97%       94%        95%     95%

Table 3
RF and CNN validation metrics from twenty-three training squares over 10 distinct dates.
Model Type  Accuracy  Precision  Recall  F1-Score
RF          87%       72%        88%     79%
CNN         94%       90%        86%     88%
The performance of these models was further assessed between seasons. ~80% of the overall validation dataset was composed of dry season imagery, while the other ~20% included wet season data. The wet season portion of this study took place during June–November, when the integrated rainfall is at its maximum in the Vietnam Mekong River Delta, and June–September for the Senegal and Northern Ethiopia study areas. The RF and CNN had an overall accuracy of 90.5% and 94.5%
respectively over dry season data (Table 9). The results of the accuracy
assessment show that the RF and CNN models had an accuracy of 80.7%
and 90.8% respectively over the wet season data (Table 9). While this comparison has its limitations, given that target locations are not uniformly represented in the subset of favorable seasonal scenarios, the higher overall accuracy of the CNN serves as a reference to evaluate the robustness of CNNs over the RF in the presence of time-induced seasonal changes.
6. Discussion
Cloud detection and screening is a critical component of image pre-processing for land cover classification because clouds can obscure land surface features of interest, or they can be misclassified as a feature of

Table 4
Confusion matrix for accuracy assessment of Vietnam RF and CNN cloud detection maps using WorldView-2,-3 multi-spectral data.
Reference (from VHR)
RF CNN
Not Cloud Cloud User’s Acc. Not Cloud Cloud User’s Acc.
Predicted Not Cloud 614 92 87.0% 680 26 96.3%
Predicted Cloud 31 233 88.3% 37 227 86.0%
Producer’s Acc. 95.2% 71.7% 87.3% 94.8% 89.7% 93.5%
Table 5
Confusion matrix for accuracy assessment of Senegal RF and CNN cloud detection maps using WorldView-2,-3 multi-spectral data and Vietnam-only training data.
Reference (from VHR)
RF CNN
Not Cloud Cloud User’s Acc. Not Cloud Cloud User’s Acc.
Predicted Not Cloud 712 5 99.3% 703 14 98.0%
Predicted Cloud 116 137 54.2% 46 207 81.8%
Producer’s Acc. 86.0% 96.5% 87.5% 94.0% 94.0% 93.8%
Table 6
Confusion matrix for accuracy assessment of Senegal RF and CNN cloud detection maps using WorldView-2,-3 multi-spectral data and Vietnam-Senegal training data.
Reference (from VHR)
RF CNN
Not Cloud Cloud User’s Acc. Not Cloud Cloud User’s Acc.
Predicted Not Cloud 711 6 99.2% 714 3 99.6%
Predicted Cloud 107 146 57.7% 23 230 90.9%
Producer’s Acc. 86.9% 57.9% 88.4% 96.9% 98.7% 97.3%
Table 7
Confusion matrix for accuracy assessment of Ethiopia RF and CNN cloud detection maps using WorldView-2,-3 multi-spectral data and Vietnam-Senegal training data.
Reference (from VHR)
RF CNN
Not Cloud Cloud User’s Acc. Not Cloud Cloud User’s Acc.
Predicted Not Cloud 711 6 99.2% 705 12 98.3%
Predicted Cloud 93 160 63.2% 47 206 81.4%
Producer’s Acc. 88.4% 96.4% 89.8% 94.0% 94.5% 93.9%
Table 8
Confusion matrix for overall accuracy assessment of RF and CNN cloud detection maps using WorldView-2,-3 multi-spectral data and Vietnam-Senegal training data.
Reference (from VHR)
RF CNN
Not Cloud Cloud User’s Acc. Not Cloud Cloud User’s Acc.
Predicted Not Cloud 2036 104 95.1% 2099 41 98.1%
Predicted Cloud 231 539 70.0% 107 663 86.1%
Producer’s Acc. 89.8% 83.8% 88.5% 95.1% 94.2% 94.9%
tections by spectrally and spatially heterogeneous data, which is enhanced with VHR data. CNN workflows are more complex to initiate but can provide more consistent output due to the inclusion of spatial context kernels in model development.
6.1. Wide applicability of machine learning cloud detection with VHR VNIR
In this work we relied solely on the use of visible through near-infrared (VNIR) spectral bands from WV to reduce the data dependencies and make these methods more widely applicable. However, there is an important cost to reducing data dependencies. The lack of thermal bands to assist in the detection of clouds in VHR data likely reduces accuracy significantly. WorldView-3 does carry 8 shortwave infrared (SWIR; 3.7 m GSD, 1195–2365 nm) bands and 12 experimental Clouds, Aerosols, Water Vapor, Ice and Snow (CAVIS) atmospheric correction bands (30 m GSD, 405–2245 nm); unfortunately, these data are available for <10% of the imagery collected. Future studies could compare RF and CNN model performance of cloud detection with VNIR vs. VSWIR and CAVIS. However, Maxar currently does not plan to include CAVIS and SWIR on their next generation constellation WorldView Legion.
The initial training annotations used in this study included a few labeled clouds that were refined over several iterations to reduce the amount of time spent manually annotating large batches of data.
Moreover, the amount of data used for training was relatively small in comparison to the overall area of study. This makes it feasible to produce a satisfactory image classification for the entirety of the study domain in a small fraction of the time. This is possible due to the availability of high-performance computing resources and the accuracy of the developed models. By studying the transferability of these models across several study areas, one can determine when the manual annotation of data is sufficient for the accuracy requirements.
6.2. CNNs improve feature delineation in spectrally heterogeneous extents

A diverse validation dataset was produced following the “Best Practices” methodology (Olofsson et al., 2014) to benchmark the classification performance of both methods. The RF and CNN models were implemented and refined over several iterations by modifying individual hyperparameters and features. These methods have their unique traits, but the CNN proved to be superior in this context due to its ability to combine individual spectral responses with neighboring features. The RF model can accurately identify and segment large extents of clouds in the study domain. However, it can be easily misled by the spatial heterogeneity of clouds at VHR resolution. Small features, edges, and internal cloud variations are often missed. In addition, false positives due to bright features (buildings, sandy soil, sun glint, etc.) are high and most of the time unrecoverable. This translates into the RF being
RF routinely includes. Both models still have spurious false positives, but the overall validation accuracy of the CNN (94.9%) is higher than the RF (88.5%). The CNN was able to better delineate areas that included buildings and bright features by taking into consideration both the spectral response and the spatial context of the features. The texture and neighboring pixels of clouds were clearly considered in the activation maps of the CNN model, particularly observed in areas where the RF delineated individual cloud pixels in wide-open and clear fields. This represented a remarkable improvement over the RF in the correct classification of cloud edges, thin clouds, high spectral response pixels, and overall area of substantially larger clouds. We decreased the number of errors by ~5% in maps generated by the CNN, hence producing actionable maps with more precise and accurate observations across time and space.
The inclusion of additional data from Senegal allowed both models to improve their classification performance regarding the misclassification of bright objects. Sieve and median filtering applied to both the RF and CNN outputs improved results. These post-processing techniques removed undesirable pixels and improved the texture of the classified features. While many misclassified pixels were removed, the RF incorrectly classified large extents of features that made post-processing ineffective at times. This is a clear indicator of the superior ability of the CNN to generalize across unseen samples.
6.3. CNN seasonality robustness over 12-year period
The trained CNN model was applied across the three study regions to
>300 scenes. Given the tropical monsoonal climate found at the three study areas, we were able to study the performance of these models under wet and dry season imagery of approximately equal length. The robustness of the CNN model can be further distinguished when performing a seasonality-based analysis. By building the training data for the model from different dates, seasons, and sites, the CNN model was able to generalize across the selected study regions through a span of 12 years (2009–2021). While both models achieve competitive overall accuracies, the CNN (dry: 94.5%, wet: 90.8%) clearly outperforms the RF (dry: 90.5%, wet: 80.7%) in the creation of maps across different seasons. Again, the CNN shows more consistent score performance across user's and producer's accuracy with 81%–98% compared to 45%–98%
from the RF. Furthermore, the performance of the CNN shows the potential of combined location-time training over seasonal variances.
The overall impact of both the RF and the CNN in the delineation of clouds in VHR imagery is substantial. By leveraging these techniques, we can preprocess scenes without the intervention of annotation specialists in a matter of minutes. Furthermore, the performance presented by the CNN will continue to play a big role in the development of deep learning applications for image classification of VHR imagery. Statistically, during model development the RF model is superior to the CNN. For the purposes of product validation, these statistics are irrelevant and serve only for monitoring model performance. In practical application and through image validation, the CNN outperforms the RF algorithm in this use case thanks to its superior spatial awareness and lower error. Anecdotally, the results enabled further investigation of land cover classification in applications where a distinct set of images is used during training to allow the model to generalize over larger areas. Furthermore, the use of test-time data augmentation, particularly geometric transformations, makes the model more robust and notably better at generalizing across unseen samples. This feature has proven to be crucial in the development of single models capable of classifying large extents of data across different locations and seasons, removing the need for extensive annotation efforts to map Earth's surface.
We have shown that:
1) the developed CNN cloud model is robust to seasonality effects over a 12-year period across wet and dry seasons in selected Vietnam, Senegal, and Ethiopia study areas;
2) the preprocessing and training techniques used in this work make the CNN cloud model resilient to differences in WV VNIR sensors including WV-2,-3;
3) the CNN cloud model is robust enough to leverage a diversified location-time training dataset that enhances transfer learning and model generalization when applied in the tropics;
Overall, the CNN is remarkably better in this task and has proven to be superior to the RF in the classification of very-high resolution data.
7. Conclusion
Our study evaluated two methods for identifying clouds in WorldView imagery: RF and CNN. The RF and UNet-based CNN models used in this work were assessed over a span of 12 years and >300 scenes across three different study areas. Data from both wet and dry seasons were included to further assess the overall model performance in the presence of varying seasonality. To reduce data dependencies and make these methods more widely applicable, we only used VNIR spectral bands from WV imagery. Furthermore, in comparison to the overall area of study, the amount of data used for training was quite small, allowing us to produce a satisfactory image classification for the entire study domain in a short amount of time. At the same time, the availability of data allowed us to train our models from the ground up, avoiding the unexpected artifacts, dependencies, and constraints on development freedom that come with pretrained models.
Our results show that the CNN outperformed the RF model both qualitatively and quantitatively across all study areas. The CNN models were able to generalize consistently through both space and time, with a superior overall accuracy of 94.9% based on assessment of maps generated from the models, which was 6.4 percentage points higher than the RF. The development and implementation of AI/ML based algorithms for the delineation of clouds is a substantial improvement over methods based on individual spectral indices and density slicing methods. The applicability of these techniques will play a key role in the development of
more than regional models for the identification of clouds and other land cover features in VHR data.
Although the CNN outperforms the RF due to its spatial awareness, the RF can serve as a starting point due to its simplicity and speed. The spatial context is important in the classification of VHR imagery due to spectral heterogeneity within contiguous features and different seasonality. Thus, the CNN is more transferable across a set of diverse study sites of significantly different land surface properties. In the study domain, the RF model can accurately identify and segment large areas of clouds. The spatial heterogeneity of clouds at VHR resolution, on the other hand, can easily mislead it. Overall, the CNN maintains the fidelity of texture-based features while omitting some smaller features that the RF includes routinely. Finally, we have demonstrated how the combination of diverse spatial and temporal training data can enhance these models to further generalize across large extents of land cover.
CRediT authorship contribution statement
J.A. Caraballo-Vega: Methodology, Software, Writing – original draft, Visualization. M.L. Carroll: Supervision, Conceptualization, Methodology, Writing – original draft, Writing – review & editing. C.S.
R. Neigh: Supervision, Conceptualization, Methodology, Writing – re- view & editing. M. Wooten: Data curation, Methodology. B. Lee:
Methodology, Validation. A. Weis: Methodology, Software. M. Aronne:
Data curation, Validation. W.G. Alemu: Data curation, Validation. Z.
Williams: Data curation, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
The authors are unable or have chosen not to specify which data has been used.
Acknowledgements
This work is the result of a collaborative effort between several members of NASA's Land Cover Land Use Change program and the Computational & Information Sciences and Technology Office (CISTO) Innovation Lab team. DigitalGlobe/Maxar data were provided by NASA's Commercial Archive Data for NASA investigators (http://cad4nasa.gsfc.nasa.gov) under the National Geospatial-Intelligence Agency's NextView license agreement. We acknowledge all the people that contributed to and supported this work. We acknowledge the computational resources provided by CISTO through the NASA Center for Climate Simulation high-performance computing center. This work was supported by NASA's Land Cover Land Use Change program [grant numbers NNH16ZDA001N-LCLUC, NNH20ZDA001N-LCLUC].
Appendix A. Vietnam Mekong River Delta study area unique dates
Dry Season (December–May):
2009-12-30, 2010-01-07, 2010-01-15, 2011-02-01, 2011-04-08, 2011-04-30, 2011-05-22, 2012-01-30, 2012-02-18, 2012-02-29, 2012-05-31, 2012-12-31, 2013-04-14, 2013-12-06, 2013-12-09, 2015-12-03, 2016-01-23, 2016-02-17, 2016-03-04, 2016-03-07, 2016-03-18, 2016-12-06, 2017-01-02, 2017-01-04, 2017-01-20, 2017-02-17, 2017-03-04, 2017-03-29, 2017-04-11, 2017-04-17, 2017-04-28, 2017-05-06, 2017-05-30, 2017-12-06, 2018-01-11, 2018-01-19, 2018-01-22, 2018-03-04, 2018-03-06, 2019-02-22, 2019-03-11, 2019-03-19, 2019-05-29, 2019-12-02, 2019-12-11, 2019-12-16

Wet Season (June–November):
2011-11-23, 2012-10-14, 2013-11-06, 2013-11-14, 2014-08-10, 2015-09-06, 2015-11-17, 2015-11-21, 2016-07-20, 2016-09-18, 2016-09-21, 2016-10-15, 2016-10-21, 2016-10-26, 2016-10-31, 2016-11-09, 2016-11-18, 2017-11-28, 2018-08-04, 2019-10-12, 2019-10-18, 2019-11-27
Appendix C. Northwest Ethiopia study area unique dates
Dry Season (October–May):
2010-01-08, 2010-01-16, 2010-01-30, 2010-02-04, 2010-02-07, 2014-11-11, 2014-11-13, 2014-12-27, 2015-03-24, 2015-03-25, 2015-03-31, 2015-04-08, 2015-11-20, 2016-04-27, 2016-10-06, 2017-12-08, 2017-12-15, 2019-03-05, 2019-03-14, 2020-11-01, 2020-12-14, 2021-01-02

Wet Season (June–September):
2011-09-11, 2011-09-30, 2012-08-16, 2012-09-09, 2012-09-12, 2013-06-21, 2014-06-03, 2014-07-11, 2016-09-23, 2017-08-12, 2019-06-26
References
Ackerman, S.A., Strabala, K.I., Menzel, W.P., Frey, R.A., Moeller, C.C., Gumley, L.E., 1998. Discriminating clear sky from clouds with MODIS. J. Geophys. Res. 103, 32141–32157. https://doi.org/10.1029/1998JD200032.
Belgiu, M., Drăguţ, L., 2016. Random forest in remote sensing: a review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 114, 24–31. https://doi.org/10.1016/j.isprsjprs.2016.01.011.
Belward, A.S., Skøien, J.O., 2015. Who launched what, when and why; trends in global land-cover observation capacity from civilian earth observation satellites. ISPRS J. Photogramm. Remote Sens. 103, 115–128. https://doi.org/10.1016/j.isprsjprs.2014.03.009.
Braaten, J.D., Cohen, W.B., Yang, Z., 2015. Automated cloud and cloud shadow identification in Landsat MSS imagery for temperate ecosystems. Remote Sens. Environ. 169, 128–138. https://doi.org/10.1016/j.rse.2015.08.006.
Breiman, L., 2001. Random forests. Machine Learn. 45, 5–32.
Carroll, M., Loboda, T., 2017. Multi-decadal surface water dynamics in North American tundra. Remote Sens. 9, 497. https://doi.org/10.3390/rs9050497.
Carroll, M., Townshend, J., Hansen, M., DiMiceli, C., Sohlberg, R., Wurster, K., 2010. MODIS vegetative cover conversion and vegetation continuous fields. In: Ramachandran, B., Justice, C.O., Abrams, M.J. (Eds.), Land Remote Sensing and Global Environmental Change, Remote Sensing and Digital Image Processing. Springer, New York, NY, pp. 725–745. https://doi.org/10.1007/978-1-4419-6749-7_32.
Chen, Y., Fan, R., Bilal, M., Yang, X., Wang, J., Li, W., 2018. Multilevel cloud detection for high-resolution remote sensing imagery using multiple convolutional neural networks. IJGI 7, 181. https://doi.org/10.3390/ijgi7050181.
Défossez, A., Bottou, L., Bach, F., Usunier, N., 2020. A Simple Convergence Proof of Adam and Adagrad. arXiv:2003.02395 [cs, stat].
DeVries, B., Huang, C., Lang, M., Jones, J., Huang, W., Creed, I., Carroll, M., 2017.
Automated quantification of surface water inundation in wetlands using optical satellite imagery. Remote Sens. 9, 807. https://doi.org/10.3390/rs9080807.
Diaz-Gonzalez, F.A., Vuelvas, J., Correa, C.A., Vallejo, V.E., Patino, D., 2022. Machine learning and remote sensing techniques applied to estimate soil indicators – review. Ecol. Indic. 135, 108517. https://doi.org/10.1016/j.ecolind.2021.108517.
DigitalGlobe, 2012. WorldView-2 Data Sheet. https://www.spaceimagingme.com/downloads/sensors/datasheets/WorldView2-DS-WV2-Web.pdf. (Accessed 8 January 2021).
DigitalGlobe, 2014. WorldView-3 Data Sheet. https://www.spaceimagingme.com/downloads/sensors/datasheets/WorldView2-DS-WV2-Web.pdf. (Accessed 8 January 2021).
Elders, A., Carroll, M.L., Neigh, C.S.R., D'Agostino, A.L., Ksoll, C., Wooten, M.R., Brown, M.E., 2022. Estimating crop type and yield of small holder fields in Burkina Faso using multi-day Sentinel-2. Remote Sens. Appl. Soc. Environ. 27, 100820. https://doi.org/10.1016/j.rsase.2022.100820.
Fisher, A., 2014. Cloud and cloud-shadow detection in SPOT5 HRG imagery with automated morphological feature extraction. Remote Sens. 6, 776–800. https://doi.org/10.3390/rs6010776.
Foga, S., Scaramuzza, P.L., Guo, S., Zhu, Z., Dilley, R.D., Beckmann, T., Schmidt, G.L., Dwyer, J.L., Joseph Hughes, M., Laue, B., 2017. Cloud detection algorithm comparison and validation for operational Landsat data products. Remote Sens. Environ. 194, 379–390. https://doi.org/10.1016/j.rse.2017.03.026.
Frey, R.A., Ackerman, S.A., Liu, Y., Strabala, K.I., Zhang, H., Key, J.R., Wang, X., 2008.
Cloud detection with MODIS. Part I: improvements in the MODIS cloud mask for