Minso Shin

Estimation of ground-level nitrogen dioxide (NO2) and ozone (O3) concentrations using machine learning based on Real-Time Learning (RTL)

Nitrogen dioxide (NO2) and ozone (O3) are major gaseous air pollutants that are harmful to human health. In this study, estimation models were developed for ground-level O3 and NO2 concentrations that are spatially continuous over land and ocean.

The models were built with an RTL-based machine learning technique, using various satellite data and numerical model data as input variables. Because NO2 and O3 have relatively short lifetimes, a real-time learning model is effective for estimating ground-level concentrations accurately.

Introduction

For O3, estimating the ground-level concentration from the satellite-based total column concentration is limited (Zoogman et al., 2014), so few studies have focused on satellite data. The CTM-based proportional method was first applied to particulate matter estimation and was then applied to NO2 in the same way (Y. Zhang et al., 2018). In addition, statistically based empirical models such as geographically and temporally weighted regression (GTWR) (Qin et al., 2017) and Bayesian maximum entropy (Jiang & Christakos, 2018) have been used to estimate ground-level NO2.

Study area and data

Study area

Data

Along with TROPOMI data, GOCI aerosol products including aerosol optical depth (AOD), fine mode fraction (FMF), Ångström exponent (AE) and single scattering albedo (SSA) were used. In addition, two Moderate Resolution Imaging Spectroradiometer (MODIS) products were used: the 16-day Normalized Difference Vegetation Index (NDVI) and the land cover product. From the MODIS land cover product, which provides 17 classes, four land cover ratio variables (barren, crop, vegetation and urban) were calculated.

Each class was reclassified as a binary map, and the ratio of that class within a 3 km radius was calculated and used as an input variable. For vegetation, 10 classes, including forests, shrublands, savannas, and grasslands, were grouped into a single class. Wind speed (WS) and wind direction (Wsin, Wcos) were calculated from the U-wind and V-wind variables.
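The wind variables can be illustrated with a minimal sketch. The source does not state which direction convention was used, so the meteorological convention below (direction the wind blows from, clockwise from north) and the function name `wind_features` are assumptions made for illustration only.

```python
import math

def wind_features(u, v):
    """Return (WS, Wsin, Wcos) from U (eastward) and V (northward) wind in m/s.

    Assumption: direction follows the meteorological convention, i.e. the
    direction the wind blows FROM, measured clockwise from north.
    """
    ws = math.hypot(u, v)            # wind speed: sqrt(u^2 + v^2)
    theta = math.atan2(-u, -v)       # direction in radians (assumed convention)
    # Encoding the angle as sine and cosine avoids the 0/360 degree jump.
    return ws, math.sin(theta), math.cos(theta)
```

Encoding the direction as a sine and cosine pair keeps northerly winds just either side of 0° numerically close, which a raw angle in degrees would not.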

A total of 22 model-based meteorological variables were used as input variables: 18 RDAPS variables strongly associated with air pollutants, plus the calculated cumulative maximum wind speed for 1, 3, 5, and 7 days. In addition to the satellite-based and numerical model-based data, population density, road density, day of the year (DOY) and hour of the day (HOD) were used as auxiliary variables. The population density for each year was calculated by linear interpolation between the 2015 and 2020 data of the Gridded Population of the World (GPW) v4.
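Two of these derived variables can be sketched as follows. The source does not give the exact definitions, so the sketch assumes that the cumulative maximum wind speed is the maximum over the last n days up to and including the current day, and that the population density is interpolated per grid cell. The function names are hypothetical.

```python
def cumulative_max_wind(daily_max_ws, windows=(1, 3, 5, 7)):
    """For each window n, return the running maximum of the daily maximum
    wind speed over the last n days (current day included).

    Assumption: the series is daily and ordered in time; early days use
    however many days are available.
    """
    out = {}
    for n in windows:
        out[n] = [
            max(daily_max_ws[max(0, i - n + 1): i + 1])
            for i in range(len(daily_max_ws))
        ]
    return out

def interpolate_population(pop_2015, pop_2020, year):
    """Linear interpolation between the 2015 and 2020 population densities.

    Years after 2020 (e.g. 2021) are linearly extrapolated; whether the
    original study did the same is not stated.
    """
    frac = (year - 2015) / (2020 - 2015)
    return pop_2015 + frac * (pop_2020 - pop_2015)
```

For example, with a hypothetical cell at 100 persons/km² in 2015 and 200 in 2020, the interpolated 2018 value is 160.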

DOY and HOD were converted to values between -1 and 1 with a sine function to reflect the continuity of time.
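A minimal sketch of such a transform is given below. The exact formula is not stated in the source; the periods of 365 days and 24 hours, and the function name `cyclic_sine`, are assumptions.

```python
import math

def cyclic_sine(value, period):
    """Map a cyclic time value (e.g. DOY or HOD) to [-1, 1] with a sine.

    Assumption: one full sine cycle per period (365 for DOY, 24 for HOD).
    """
    return math.sin(2.0 * math.pi * value / period)

# Hypothetical usage: day 200 of the year, 15:00 local time
doy_feature = cyclic_sine(200, 365)
hod_feature = cyclic_sine(15, 24)
```

With this encoding, the last day of a year and the first day of the next year receive nearly identical values, which a raw day count would not provide.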

Table 3 shows the meteorological data from the Unified Model-Regional Data Assimilation and Prediction System (UM-RDAPS) numerical model required for estimating ground-level NO2 and O3.

Methodology

Machine learning (random forest)
Oversampling and subsampling
Offline model
Real-Time Learning (RTL)-based dataset construction
Land and ocean modeling
Model validation

To balance the concentration distribution of the unbalanced data set, the NO2 and O3 samples were adjusted by over- and undersampling. Undersampling was applied to concentration intervals with many samples, and oversampling to intervals with few samples. For NO2, undersampling was applied to the low-concentration range and oversampling to the high-concentration range.
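The balancing step can be sketched as below. The source does not specify the bin width, the target count per interval, or how oversampled records were generated. The sketch therefore assumes fixed-width concentration bins, a single target count, random subsampling without replacement, and oversampling by simple duplication. All names are hypothetical.

```python
import random
from collections import defaultdict

def rebalance(samples, key, bin_width, target, seed=0):
    """Balance samples across concentration bins.

    Bins with more than `target` samples are randomly undersampled;
    bins with fewer are oversampled by drawing duplicates with replacement.
    Assumption: duplication stands in for whatever oversampling the study used.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for s in samples:
        bins[int(key(s) // bin_width)].append(s)

    balanced = []
    for b in sorted(bins):
        members = bins[b]
        if len(members) > target:
            balanced.extend(rng.sample(members, target))       # undersample
        else:
            balanced.extend(members)                           # keep originals
            balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced
```

Applied to a data set with many low-concentration samples and few high-concentration samples, every occupied bin ends up with `target` samples.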

For O3, only oversampling was applied, in both the low- and high-concentration ranges. The offline model was built using samples from the entire study period, May 2018 to March 2021. In particular, the ratio of high- and low-concentration samples was additionally considered, to compensate for weak estimation ability when samples with high or low concentrations arrive suddenly.

Existing studies of ground-level air pollutant concentrations present information for land areas only. Because land-only variables have a value of NaN over the ocean, an estimate built on them yields no concentration values there. For models based on real-time learning, the validation method used for existing offline models (80% of samples for model construction, 20% for model validation) is difficult to apply, since all sample data are used to train the model.

In the case of ground air pollution concentration, it is difficult to confirm marine areas because in-.

Figure 3. Sample distribution of O3 and NO2 concentration

Results and discussion

Comparison of the offline models and the RTL-based models

The distributions were obtained by collecting coastal station samples from the ocean model and inland station samples from the land model. As shown in Figure 8, the prediction for the day not used in training was underestimated relative to the model validation result. Although the concentration distribution of the offline model's data set was adjusted by over- and undersampling, the estimation accuracy for the untrained day was poor (R2 = 0.50, RMSE = 14.08 ppb, rRMSE = 36.0%).
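The three accuracy measures can be computed as in the following sketch. The source does not define them, so the sketch assumes R2 is the coefficient of determination and rRMSE is the RMSE divided by the mean observed concentration, expressed as a percentage. Under that assumed definition, the reported RMSE of 14.08 ppb and rRMSE of 36.0% would correspond to a mean observed O3 of roughly 39 ppb.

```python
import math

def evaluate(obs, pred):
    """Return R2, RMSE and rRMSE (%) for paired observations and estimates.

    Assumptions: R2 = 1 - SS_res / SS_tot; rRMSE = 100 * RMSE / mean(obs).
    """
    n = len(obs)
    mean_obs = sum(obs) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "rrmse": 100.0 * rmse / mean_obs,
    }
```

Because rRMSE is normalized by the mean, data sets with a lower mean concentration can show a higher rRMSE even when the RMSE is comparable.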

In the prediction results of the ground-level O3 estimation model, overestimation appeared in the low-concentration range and underestimation in the high-concentration range, both of which had been oversampled. Increasing the number of samples in the low- and high-concentration ranges by oversampling contributed greatly to accuracy in model validation, but its contribution appears to be low in predictions for dates not used in training. The RTL-based models show higher accuracy than the offline model regardless of the cumulative period (Figure 9).
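One way to read the cumulative period is as the length of the training window that precedes each target time. The sketch below is an illustration under that assumption only. The source text here does not describe how the RTL data set was actually constructed, and the function name and the sample layout (a dict with a `time` key) are hypothetical.

```python
from datetime import datetime, timedelta

def rtl_training_window(samples, target_time, cumulative_days):
    """Select samples from the `cumulative_days` days before `target_time`.

    Assumption: an RTL model is retrained for each target time using only
    the most recent samples, and the target time itself is excluded.
    """
    start = target_time - timedelta(days=cumulative_days)
    return [s for s in samples if start <= s["time"] < target_time]

# Hypothetical usage: one sample per day for 40 days
samples = [
    {"time": datetime(2020, 1, 1) + timedelta(days=d), "value": float(d)}
    for d in range(40)
]
target = datetime(2020, 2, 5)
recent_3 = rtl_training_window(samples, target, 3)
recent_30 = rtl_training_window(samples, target, 30)
```

A shorter window tracks recent conditions more closely at the cost of fewer training samples, which is the trade-off that comparing several cumulative periods explores.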

In the prediction results of the ground-level NO2 estimation model, underestimation occurred in the high-concentration range. The RTL-based models for NO2 perform better than the offline model regardless of the cumulative period (Figure 11). The performance is not as good as for ozone, but it is a significant improvement over the offline NO2 model.

For both O3 and NO2, the RTL-based model results were significantly more accurate than the offline model validation results.

Figure 8. The model validation and prediction results of the O3 estimation using the offline model.

Station-based 10-fold cross-validation

The offline model output shows a similar scatterplot distribution to the RTL-based model output with 30- and 15-day cumulative periods. Unlike the model validation results, the differences were large in the 10-fold cross-validation results of the RTL-based NO2 model. The RTL-based model results for the coast and the 10-fold cross-validation results were verified together.

The black indicators in the lower right corner of each figure are the accuracies of the combined models, including the coastal samples from the ocean model and inland samples from the land model.

Model validation and 10-fold cross-validation results of RTL-based O3 estimation model with inland and coastal samples.

Model validation and 10-fold cross-validation results of RTL-based NO2 estimation model with inland and coastal samples.

In the 10-fold cross-validation results, the coastal samples from the ocean model have a relatively high rRMSE because their concentration range is narrower than that of the inland samples from the land model. For O3, both the coastal samples from the ocean model and the inland samples from the land model were concentrated along the 1:1 line. In the RTL-based model results, the model validation accuracy was similar for the coastal and inland samples.

In the 10-fold cross-validation results, on the other hand, the accuracy of the coastal station samples was slightly lower than that of the inland station samples.
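Station-based cross-validation holds out whole stations rather than individual samples, so the model is tested at locations it has not seen. A minimal sketch of the fold assignment follows. The random assignment of stations to folds and the function name are assumptions, since the source does not describe how stations were divided.

```python
import random

def station_folds(station_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs where each fold holds out a
    disjoint group of stations.

    station_ids: one station identifier per sample.
    Assumption: stations are shuffled and dealt round-robin into folds.
    """
    stations = sorted(set(station_ids))
    random.Random(seed).shuffle(stations)
    fold_of = {st: i % n_folds for i, st in enumerate(stations)}
    for k in range(n_folds):
        test_idx = [i for i, st in enumerate(station_ids) if fold_of[st] == k]
        train_idx = [i for i, st in enumerate(station_ids) if fold_of[st] != k]
        yield train_idx, test_idx

# Hypothetical usage: 200 samples from 25 stations
ids = ["S%d" % (i % 25) for i in range(200)]
folds = list(station_folds(ids))
```

Because no station contributes to both training and testing within a fold, this estimate reflects spatial generalization and is typically lower than a random sample-based split.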

Table 5. The 10-fold cross-validation results of offline and RTL-based models for estimating O3 and NO2 concentration

Spatial and temporal distribution

Although the validation was performed using coastal stations (1 km), it is difficult to say that these station data directly represent ocean observations. They are in practice land stations, and the influence of nearby coastal cities is not small. In the O3 estimation model, the difference between the ocean and land models was not as large as the model validation result.

In particular, high concentration values appeared along the east coast of Japan in spring. In the seasonal mean concentration maps for O3, the spatial pattern was not prominent, but the seasonal characteristic of high concentrations in spring and summer was clear. For NO2, seasonal and spatial patterns were evident on land, while the concentration at sea did not change significantly.

These maps make it possible to monitor how air pollutant concentrations move. Hourly concentration maps can be produced, but clouds make the spatial pattern of each hour difficult to see. The GOCI data contributed to improving the accuracy of the model.

In the spatial distribution analysis, however, GOCI left a very large number of empty samples, which made the spatial analysis difficult.

Figure 20. Annual map of O3 estimation using RTL-based model (3 days)

Conclusion

References

Derivation of nitrogen dioxide concentrations at ground level with fine spatial resolution applied to the TROPOMI satellite instrument.
Development of Western European PM2.5 and NO2 land use regression models using satellite-derived and chemical transport modeling data.
Derivation of surface concentrations of nitrogen dioxide (NO2) in Colombia from tropospheric columns of the Ozone Monitoring Instrument (OMI).
Satellite NO2 data improve national land use regression models for ambient NO2 in a small densely populated country.
Assessing the magnitude and recent trends in satellite-derived near-ground nitrogen dioxide over North America.
Evaluation of modeling of NO2 concentrations driven by satellite-derived and bottom-up emission inventories using in situ measurements over China.
Spatial variations in SO2 and NO2 emissions caused by heating over the Beijing-Tianjin-Hebei region constrained by an adaptive nudging method with OMI data.
Estimation of ground-level NO2 concentrations over central-eastern China using a satellite-based geographically and temporally weighted regression model.
Satellite-based estimates of daily NO2 exposure in China using hybrid random forest and spatio-temporal Kriging model.