
3.3. Methodology

3.3.1 Correction of SMAP SSS

In-situ SSS data were selected by colocation with the SMAP L2B data products. A match-up was accepted when the distance between the satellite pixel and the in-situ measurement was less than 12.5 km (half the 25 km SMAP pixel size) and the time difference was less than ±6 hours. Multiple in-situ measurements falling within the same pixel were averaged. A total of 98,471 in-situ SSS data distributed worldwide between April 2015 and December 2020 were used in this study.
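The match-up criteria above can be sketched as follows. This is a minimal illustration, and the array names (sat_xy, insitu_xy, etc.), the local-projection coordinates, and the simple loop are assumptions rather than the exact matchup code used in this study.

```python
import numpy as np

def colocate(sat_xy, sat_time, insitu_xy, insitu_time, insitu_sss,
             max_dist_km=12.5, max_dt_hours=6.0):
    """Average all in-situ SSS samples that fall within one SMAP pixel.

    A pixel and an in-situ sample are matched when their distance is below
    12.5 km (half the 25 km SMAP pixel size) and their time difference is
    below 6 hours, following the criteria in Section 3.3.1.
    Positions are assumed to be in km on a local projection, times in hours.
    """
    matched_sss = np.full(len(sat_xy), np.nan)
    for i, (xy, t) in enumerate(zip(sat_xy, sat_time)):
        dist = np.hypot(insitu_xy[:, 0] - xy[0], insitu_xy[:, 1] - xy[1])
        dt = np.abs(insitu_time - t)
        mask = (dist < max_dist_km) & (dt < max_dt_hours)
        if mask.any():
            matched_sss[i] = insitu_sss[mask].mean()  # average multiple matches
    return matched_sss
```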

Nine input parameters (SMAP SSS, Tbs of H-pol and V-pol, the Tb ratio between H- and V-pol, HYCOM SSS, OISST, NCEP wind speed, distance from land, and GPM precipitation) were used to correct SMAP SSS (Figure 3.2 and Table 3.1). OISST and the Tbs of H-pol and V-pol are used directly in the SSS retrieval algorithm (Fore et al., 2020). Since the sensitivity of the Tbs of both H- and V-pol varies with the environment (i.e., SST, SSS, and wind speed) (Hollinger, 1971), the Tb ratio between H- and V-pol was also considered as an input variable. Unlike SMAP SSS, which is retrieved in real time, HYCOM SSS is estimated using an ocean general circulation model that relies on climatology (Cummings and Smedstad, 2014). SMAP SSS and HYCOM SSS can therefore compensate for each other’s shortcomings; their relationship is discussed in detail in Chapter 3.4. Wind speed and precipitation affect sea surface roughness and were therefore used to correct SMAP SSS (Rajabi-Kiasari et al., 2020; Qin et al., 2020). Since the uncertainty of SSS products retrieved by L-band microwave sensors increases in coastal regions due to RFI and LSC (Qin et al., 2020; Reul et al., 2012; Tang et al., 2017), this study also considered distance from land as an input variable.
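As an illustration of how these nine predictors could be assembled, the sketch below builds a feature matrix from a colocated match-up table; the column names are hypothetical, and only the Tb H/V ratio is derived here.

```python
import pandas as pd

def build_features(df: pd.DataFrame):
    """Assemble the nine input parameters used to correct SMAP SSS.

    `df` is an assumed match-up table holding the colocated variables;
    the Tb ratio between H- and V-pol is computed from the two Tbs.
    """
    X = pd.DataFrame({
        "smap_sss": df["smap_sss"],              # SMAP SSS
        "tb_h": df["tb_h"],                      # Tb H-pol
        "tb_v": df["tb_v"],                      # Tb V-pol
        "tb_hv_ratio": df["tb_h"] / df["tb_v"],  # Tb H/V
        "hycom_sss": df["hycom_sss"],            # HYCOM SSS
        "oisst": df["oisst"],                    # OISST
        "ncep_wind": df["ncep_wind"],            # NCEP wind speed
        "dist_from_land": df["dist_from_land"],  # distance from land
        "gpm_precip": df["gpm_precip"],          # GPM precipitation
    })
    y = df["insitu_sss"]                         # in-situ SSS (target)
    return X, y
```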


Figure 3.2. Flow chart for correcting SMAP sea surface salinity (SSS).

Table 3.1. Summary of the input data used in this study for improving SMAP SSS.

| Input Data | Data Source | Variable |
| --- | --- | --- |
| Distance from land | TRMM TMPA | Distance from land |
| SMAP SSS | SMAP | Sea Surface Salinity |
| Tb H-pol | SMAP | Brightness temperature of H-polarization |
| Tb V-pol | SMAP | Brightness temperature of V-polarization |
| Tb H/V | SMAP | Tb ratio between H- and V-pol |
| HYCOM SSS | HYCOM model | Sea Surface Salinity |
| OISST | NOAA OISST | Sea Surface Temperature |
| NCEP wind speed | NOAA NCEP GFS model | Wind speed |
| GPM precipitation | GPM 3IMERGDL | Precipitation |

Samples were divided into two groups to validate each model. Eighty percent of the samples were used as training data to optimize the hyperparameters of each model, while the remaining twenty percent were used to validate the performance of the model in correcting SMAP SSS. Two types of 80:20 holdout validation were conducted on the total of 98,471 samples: 1) datasets selected based on DOY (78,418 for calibration and 20,053 for validation) and 2) datasets selected randomly (78,781 for calibration and 19,690 for validation). Three metrics were used to assess the model results: RMSE, R2, and MB (the equations are given in Chapter 2.3.1). SHAP values were used to analyze the contribution of the input parameters to the machine learning model predictions (García and Aznarte, 2020; Lundberg and Lee, 2017). Detailed information about SHAP values is provided in Chapter 2.3.1.
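A minimal sketch of the two 80:20 holdout schemes and the three evaluation metrics is given below; the DOY-based split shown here is a simplified stand-in for the actual selection, and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def split_random(X, y, seed=0):
    """Random 80:20 holdout validation."""
    return train_test_split(X, y, test_size=0.2, random_state=seed)

def split_by_doy(X, y, doy, holdout_doys):
    """DOY-based 80:20 holdout: samples whose day of year is in
    `holdout_doys` are reserved for validation (simplified version)."""
    mask = np.isin(doy, holdout_doys)
    return X[~mask], X[mask], y[~mask], y[mask]

def evaluate(y_true, y_pred):
    """RMSE, R2, and mean bias (MB) between corrected and in-situ SSS."""
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    r2 = r2_score(y_true, y_pred)
    mb = float(np.mean(y_pred - y_true))
    return rmse, r2, mb
```

For a fitted tree-based model, SHAP values can be obtained with, for example, shap.TreeExplainer(model).shap_values(X_valid) from the shap Python package.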


3.3.2 Machine learning approaches

We chose seven empirical models widely used in SSS applications and remote sensing fields (Figure 3.2). Detailed information about SVR and ANN is provided in Chapter 2.3.2.

K-nearest neighbor (KNN) regression is a well-known data-driven regression technique that does not require a predetermined parametric relation between predictors and target (Zhang et al., 2017; Modaresi et al., 2018). The KNN algorithm has also shown successful performance for classification and regression problems in the remote sensing field (Meng et al., 2007; Alimjan et al., 2018). The main idea of KNN is similarity based on the distance to the pre-trained data in the feature space (Zhang et al., 2017). The performance of the KNN model is affected by the number of neighbors (K) and the predictor weights, which are assigned by kernel functions such as uniform weighting, Euclidean distance, and the Nadaraya-Watson kernel (Zhang et al., 2017; Modaresi et al., 2018; Zhang et al., 2018). KNN was performed using the “KNeighborsRegressor” function of the scikit-learn library in Python.
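A sketch of this KNN setup with scikit-learn is shown below; the number of neighbors and the distance weighting are illustrative choices rather than the tuned hyperparameters, and X_train/y_train refer to the 80 % calibration subset from the split sketched earlier.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Distance-weighted KNN; the predictors are standardized because KNN
# relies on distances in the feature space.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=10, weights="distance"),
)
knn.fit(X_train, y_train)            # 80 % calibration samples (assumed names)
sss_corrected = knn.predict(X_valid)
```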

This study used four tree-based machine learning approaches: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Model (LGBM), and Gradient Boosted Regression Trees (GBRT). These are ensemble-based machine learning methods that operate on multiple decision trees (Breiman, 2001). Each decision tree is constructed from multiple nodes that repeatedly divide the input data to predict the target value (Jang et al., 2017). RF constructs numerous independent trees using two randomized components: a random subset of the training samples for each tree and a random subset of the input parameters at each split node (Altman and Krzywinski, 2017; He et al., 2021). It is effective against overfitting because the predicted values from the individual trees are averaged. The other three methods use the boosting technique.

The boosting technique constructs trees by giving more weight to the parts where the error is significant, unlike the bagging technique in which each tree is independent (Park et al., 2021). Although boosting is more vulnerable to overfitting than bagging, overfitting can be prevented by optimizing the various hyperparameters of each algorithm. Gradient boosting is a gradient descent algorithm in which the gradient of the loss function is reduced by changing the variables in the model to minimize a cost function (Callens et al., 2020; He et al., 2021). The trees in GBRT are added gradually to the previous ensemble of trees; each new tree is trained to fit the error of the previous model using the gradient of the residual reduction (Chen et al., 2021). XGBoost is an extension of GBRT proposed by Chen and Guestrin (2016). It is more robust to overfitting because a weighted regularization term that controls the complexity of the model is added to the loss function (Chen and Guestrin, 2016; Ribeiro and dos Santos Coelho, 2020). Its learning time is also shorter than that of GBRT owing to parallel tree learning. LGBM uses a leaf-wise strategy that repeatedly splits the node with the maximum loss reduction, instead of the level-wise strategy used in GBRT and XGBoost (Su et al., 2021). Because the leaf node with the maximum split gain is split without balancing the tree, the tree grows deeper and becomes asymmetric, which reduces the time and memory required to grow the tree. The four tree-based machine learning models were implemented in Python using the “RandomForestRegressor” (RF, scikit-learn), “GradientBoostingRegressor” (GBRT, scikit-learn), “XGBRegressor” (XGBoost), and “lgb” (LGBM) functions.
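A sketch instantiating the four tree-based regressors named above is given below; the hyperparameter values are placeholders rather than the optimized settings, LGBM is accessed through lightgbm’s scikit-learn interface (LGBMRegressor), and the evaluate helper and X_train/y_train names from the earlier sketches are reused.

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

models = {
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),
    "GBRT": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05),
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05,
                            reg_lambda=1.0),        # regularized loss
    "LGBM": LGBMRegressor(n_estimators=500, learning_rate=0.05,
                          num_leaves=63),           # leaf-wise growth
}

for name, model in models.items():
    model.fit(X_train, y_train)                     # 80 % calibration samples
    rmse, r2, mb = evaluate(y_valid, model.predict(X_valid))
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}, MB={mb:+.3f}")
```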
