Multiple linear regression and regression with time series error models in forecasting PM10 concentrations in Peninsular Malaysia
Kar Yong Ng & Norhashidah Awang
Received: 28 March 2017 / Accepted: 19 December 2017
© Springer International Publishing AG, part of Springer Nature 2018
Abstract Frequent haze occurrences in Malaysia have made the management of PM10 (particulate matter with aerodynamic diameter less than 10 μm) pollution a critical task. This requires knowledge of the factors associated with PM10 variation and good forecasts of PM10 concentrations. Hence, this paper demonstrates the prediction of 1-day-ahead daily average PM10 concentrations based on predictor variables including meteorological parameters and gaseous pollutants. Three different models were built: a multiple linear regression (MLR) model with lagged predictor variables (MLR1), an MLR model with lagged predictor variables and PM10 concentrations (MLR2), and a regression with time series error (RTSE) model. The findings revealed that humidity, temperature, wind speed, wind direction, carbon monoxide and ozone were the main factors explaining the PM10 variation in Peninsular Malaysia. Comparison among the three models showed that the MLR2 model was on a par with the RTSE model in terms of forecasting accuracy, while the MLR1 model was the worst.
Keywords Multiple linear regression · Regression with time series error · PM10 · Forecast
Introduction
PM10 has been a notorious air pollutant, associated especially with detrimental health impacts. Studies have shown that PM10 impairs respiratory and cardiopulmonary functions and increases the morbidity and mortality rates of related diseases (Hassan et al. 2016). Furthermore, exposure to PM10 during pregnancy may lead to birth abnormalities (Vinceti et al. 2016). Populations such as children, the elderly, and those with asthma and cardiovascular problems are particularly susceptible to PM10 pollution (Hassan et al. 2016).
Owing to its adverse impacts, PM10 is one of the major air pollutants regulated by the Malaysian government and is included in the calculation of the Air Pollution Index (API), which is a yardstick of air quality in Malaysia. The Malaysian Ambient Air Quality Guidelines (MAAQG) set the limits at 150 and 50 μg m−3 for the 24-h average and annual average PM10, respectively. Once the daily average PM10 concentration exceeds the 150 μg m−3 level, indicating unhealthy status in the API, high-risk populations are advised to stay indoors (DOE 2000). Three interim targets have been set to reduce PM10 levels, with the final stage in 2020 at a limit of 100 μg m−3 (DOE 2017). As reported in the Malaysia Environmental Quality Report 2014, PM10 was one of the crucial air pollutants contributing to most of the unhealthy days in Peninsular Malaysia. The sources were primarily local and transboundary haze. Moreover, industrial and urban areas recorded comparatively higher PM10 concentrations than suburban and rural areas (DOE 2015).
K. Y. Ng (*) · N. Awang
School of Mathematical Sciences, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia
e-mail: [email protected]
N. Awang
e-mail: [email protected]
Since PM10 has been shown to be dangerous to health and to degrade air quality, it has to be studied and analyzed for a better understanding of the mechanism of PM10 formation. In Malaysia, different statistical methods have been employed to investigate the relationship between PM10 and meteorological and gaseous variables. For instance, Shaharuddin et al. (2008) combined the wavelet transform method with the correlation coefficient to analyze the relationship between PM10 and meteorological parameters (i.e., temperature, rainfall and wind speed) using data from the Petaling Jaya station. The authors found that PM10 was positively associated with temperature and negatively associated with rainfall. Wind speed correlated positively and negatively with PM10 at time scales of 256 and 128 days, respectively. In a nutshell, the correlations between PM10 and the three variables were significant only at low frequency, with time scales approaching a year. This suggested a seasonal effect of these variables on the PM10 variation, and a linkage to transboundary fires as the key source of PM10 from 2000 to 2004.
Liew et al. (2011) used simple multiple linear regression (MLR) to examine the relation between PM10 during the summer monsoon and local and synoptic meteorological parameters, as well as local and foreign hotspot numbers, at six stations in the Klang Valley area. The results showed that local surface temperature, humidity and wind speed, together with synoptic weather patterns and foreign hotspots, were significantly correlated with PM10 variation.
Ul-Saufie et al. (2013) predicted future daily PM10 concentrations at the Nilai station by combining principal component analysis (PCA) with MLR and a feed-forward back-propagation neural network (FFBP). The parameters used included meteorological and air pollutant variables. The findings showed that PCA improved the prediction accuracy of both types of models.
On the other hand, with a focus on high PM10 concentrations, Shaadan et al. (2015) applied anomaly detection to study the fluctuation of PM10 concentrations and the potential factors affecting PM10 anomalies at three stations in the Klang Valley. They found more PM10 anomalies during the southwest and northeast monsoons, and on weekdays rather than weekends. Furthermore, wind speed was found to be positively associated with PM10 concentrations.
Even though numerous studies have sought to explain the relationship between PM10 and exogenous variables, most of them focused on the Klang Valley region and considered only meteorological parameters in their models. Therefore, this paper intends to forecast the daily average PM10 concentrations in Peninsular Malaysia based on lagged meteorological and gaseous variables by applying simple MLR. In addition, separate MLR models are constructed with the inclusion of previous days' PM10 concentrations, as implemented by Ul-Saufie et al. (2013) and Afzali et al. (2014). For comparison, the regression with time series error (RTSE) model, which is a modified Box-Jenkins method, is employed. This approach is selected because it allows for autocorrelation among the residuals or dependent variables (Liu 2009), which is common when pollutant time series are used. To the best of our knowledge, no air pollution study in Malaysia has yet used RTSE. The paper is organized as follows. The subsequent section describes the data and methods used. Next, the results and discussion are presented, and the conclusion is drawn in the final section.
Materials
Hourly data of PM10, NO2 (nitrogen dioxide), NO (nitrogen monoxide), SO2 (sulfur dioxide), CO (carbon monoxide) and O3 (ozone), along with temperature, humidity, wind speed and wind direction, for the year 2014 were obtained from the Malaysia Department of Environment (DOE). To ensure the continuity of wind direction, two wind parameters, the wind-x (Wx) and wind-y (Wy) components, were generated from the wind speed, WS, and wind direction, θ (Siwek and Osowski 2012), where
\[
W_x = -\mathrm{WS}\,\sin\theta, \qquad W_y = -\mathrm{WS}\,\cos\theta. \tag{1}
\]
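As a minimal sketch of Eq. 1, the conversion can be written as a small Python function. The function name and the assumption that θ is supplied in degrees are illustrative choices, not details given in the paper. The point of the conversion is that directions such as 359° and 1° are numerically far apart but map to nearly identical components.

```python
import math


def wind_components(ws, theta_deg):
    """Return (Wx, Wy) following Eq. 1: Wx = -WS*sin(theta), Wy = -WS*cos(theta).

    ws        -- wind speed (any consistent unit, e.g. m/s)
    theta_deg -- wind direction in degrees (assumed unit; the paper does not state it)
    """
    theta = math.radians(theta_deg)
    return -ws * math.sin(theta), -ws * math.cos(theta)


# Two nearly identical directions give nearly identical components.
print(wind_components(2.0, 359.0))
print(wind_components(2.0, 1.0))
```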
These data were then processed to obtain 24-h daily averages. Only days with fewer than 12 h of missing data were averaged; all other days were treated as missing values. The missing values were imputed by linear interpolation. This imputation method is simple and has been shown to be efficient when fewer than 5% of the values are missing (Junninen et al. 2004; Norazian et al. 2008). The stations under study were selected carefully so that the variables at all stations were at least 95% complete. This led to the selection of five stations located in different environments, namely Seberang Jaya (CA09), Petaling Jaya (CA16), Sungai Petani (CA17), Seremban (CA47) and Batu Muda (CA58). Stations CA09 and CA17 represent suburban background, CA16 represents industrial background, and CA47 and CA58 represent urban background. Of the year of data, the first 351 days were used as the modeling dataset and the remaining 14 days as the validation dataset. Table 1 summarizes the variables used in modeling and their corresponding symbols and units.
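The averaging and imputation rules described above can be sketched as follows. This is an illustrative Python version only (the study itself used R); the function names, the use of `None` for a missing value, and the choice to leave leading or trailing gaps unfilled are assumptions made for the example.

```python
def daily_average(hourly):
    """Average 24 hourly readings; None marks a missing hour.

    Returns None (a missing day) when 12 or more hours are missing,
    following the rule that only days with fewer than 12 missing hours count.
    """
    if len(hourly) != 24:
        raise ValueError("expected exactly 24 hourly values")
    valid = [v for v in hourly if v is not None]
    if 24 - len(valid) >= 12:
        return None
    return sum(valid) / len(valid)


def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between observed neighbours.

    Gaps at the very start or end are left as None (an assumption of this sketch).
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# A daily series with a two-day gap between 40 and 55.
print(interpolate_linear([40.0, None, None, 55.0]))
```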
These variables were used in MLR and RTSE modeling. Two separate MLR models were constructed for each station. One used only the previous day's gaseous and meteorological variables (MLR1), while the other additionally included the PM10 concentrations of previous days (MLR2). Next, the RTSE models were established, since the residuals from MLR1 were found not to be independent (as explained and demonstrated in later sections). All modeling procedures were performed using R software at the 5% significance level. Finally, the performances of the three models were compared.
Methods
Multiple linear regression
MLR is widely used to explain the relationship between a dependent variable and several independent predictors. However, the relationship is not necessarily causal, i.e., the dependent variable may not be caused by the independent variables. The MLR model is defined as
\[
Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \cdots, n, \tag{2}
\]
where Y is the dependent variable, β0, β1, ⋯, βk are the (k + 1) parameters, X1, ⋯, Xk are the k independent variables, ε are the independent and identically Gaussian-distributed errors, and n is the number of observations. The parameters are estimated by the ordinary least squares method.
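To make the ordinary least squares step in Eq. 2 concrete, the sketch below solves the normal equations (X'X)β = X'y with plain Python. It is an illustration under simplifying assumptions: the function names are hypothetical, the data in the usage lines are invented, and a statistical package (such as the R software used in the study) would normally be preferred for numerical stability.

```python
def ols_fit(X, y):
    """Estimate [b0, b1, ..., bk] of Eq. 2 by solving the normal equations.

    X -- list of rows, each with k predictor values (no intercept column)
    y -- list of responses, same length as X
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept term
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (perfectly collinear predictors)")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


def ols_predict(beta, row):
    """Fitted value b0 + b1*x1 + ... + bk*xk for one row of predictors."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Invented data generated from y = 1 + 2*x1 - 0.5*x2 (no noise).
X = [[0, 1], [1, 0], [2, 2], [3, 1], [4, 3]]
y = [1 + 2 * a - 0.5 * b for a, b in X]
print(ols_fit(X, y))
```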
In order to make a 1-day-ahead forecast, the dependent variable was set to be today's PM10, while the predictors were the previous day's values of the other variables listed in Table 1. A logarithmic transformation was applied to PM10 (lnPM10) at all stations because the concentrations were not Gaussian distributed. Before modeling, the stepwise (SW) and all possible model (APM) methods were used to select suitable subsets of predictor variables for every station. The SW method selects variables by dropping and including them iteratively, and the process stops when the Akaike Information Criterion (AIC) reaches its minimum. The APM technique, on the other hand, tries all combinations of variables and chooses the subset yielding the highest adjusted coefficient of multiple determination, R2adj. The larger of the two subsets was used in modeling. The variance inflation factor (VIF) was used to examine the presence of multicollinearity among predictors. A VIF value below 10 suggests that the specific predictor is not strongly linearly correlated with the other predictors in the model (Neter et al. 1985). The models were built based on R2adj. To obtain a parsimonious model, the most insignificant variable was removed iteratively until a large decline in R2adj was observed or all variables were significant at the 5% level, whichever came first.
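The 1-day-ahead setup, in which ln(PM10) on day t is paired with predictor values from day t−1, can be sketched as below. The function name, the argument layout and the sample numbers are hypothetical; the optional flag mimics the MLR2-style inclusion of the previous day's ln(PM10).

```python
import math


def make_lagged_rows(pm10, predictors, include_lagged_pm10=False):
    """Pair ln(PM10) on day t with predictor values from day t-1.

    pm10                -- daily average PM10 values (all > 0)
    predictors          -- one row of predictor values per day, same length as pm10
    include_lagged_pm10 -- if True, append ln(PM10) of day t-1 to each row
    Returns (X, y), each with len(pm10) - 1 entries.
    """
    if len(pm10) != len(predictors):
        raise ValueError("pm10 and predictors must cover the same days")
    X, y = [], []
    for t in range(1, len(pm10)):
        row = list(predictors[t - 1])
        if include_lagged_pm10:
            row.append(math.log(pm10[t - 1]))
        X.append(row)
        y.append(math.log(pm10[t]))
    return X, y


# Invented three-day example with two predictors per day (e.g. humidity, CO).
pm10 = [40.0, 50.0, 60.0]
predictors = [[80.0, 0.5], [75.0, 0.6], [70.0, 0.7]]
print(make_lagged_rows(pm10, predictors))
print(make_lagged_rows(pm10, predictors, include_lagged_pm10=True))
```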
Regression with time series error
Table 1 Summary of variables included in regression modeling
Variable                               Symbol   Unit
Daily average PM10 concentration       PM10     μg m−3
Daily average NO2 concentration        NO2      ppm
Daily average NO concentration         NO       ppm
Daily average SO2 concentration        SO2      ppm
Daily average CO concentration         CO       ppm
Daily average O3 concentration         O3       ppm
Daily average temperature              Temp     °C
Daily average humidity                 RH       %RH
Daily average wind-x component         Wx       m s−1
Daily average wind-y component         Wy       m s−1
Daily average wind speed               WS       m s−1
The RTSE model may be a good alternative to simple MLR when time series variables are used. This is because the residuals from the MLR model will probably violate the assumption of independent errors and thus lead to incorrect parameter estimates and standard errors. Unlike MLR, the RTSE model assumes that the residuals from the regression follow an ARMA (autoregressive moving average) or seasonal ARMA (SARMA) process, which allows the errors to be serially correlated. The RTSE has the form
\[
Y_t = \beta_0 + \beta_1 X_{1t} + \ldots + \beta_k X_{kt} + \eta_t, \qquad
\phi_p(B)\,\Phi_P(B^S)\,\eta_t = \varphi_q(B)\,\Psi_Q(B^S)\,\varepsilon_t, \qquad t = 1, 2, \ldots, n, \tag{3}
\]
where Y is the dependent variable, β0, β1, …, βk are the (k + 1) parameters, X1, …, Xk are the k independent variables, η are the errors characterized by an ARMA or SARMA process, and n is the number of observations. The second equation is a general SARMA(p,q)(P,Q) model with
\[
\phi_p(B) = 1 - \phi_1 B - \ldots - \phi_p B^p, \qquad
\varphi_q(B) = 1 + \varphi_1 B + \ldots + \varphi_q B^q,
\]
\[
\Phi_P(B^S) = 1 - \phi_S B^S - \ldots - \phi_{PS} B^{PS}, \qquad
\Psi_Q(B^S) = 1 + \varphi_S B^S + \ldots + \varphi_{QS} B^{QS},
\]
where B is the backward shift operator such that B^a Y_t = Y_{t−a}, S is the seasonal period, ϕ1, …, ϕp, ϕS, …, ϕPS, φ1, …, φq, φS, …, φQS are the parameters to be estimated, and ε are the Gaussian white noise errors. When P = Q = 0, the equation simplifies to the ARMA(p,q) model. The maximum likelihood method is used to estimate the regression and ARMA parameters simultaneously.
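As a hedged illustration of how the error structure in Eq. 3 enters a forecast, consider the simplest special case with AR(1) errors (p = 1, q = 0, no seasonal terms), so that η_t = ϕ1 η_{t−1} + ε_t. The one-step forecast is then the regression part plus ϕ1 times the last observed regression error. The function name and all numerical values below are hypothetical; in practice the parameters would come from a maximum likelihood fit.

```python
def rtse_ar1_forecast(beta, x_next, x_last, y_last, phi1):
    """One-step forecast for a regression with AR(1) errors.

    beta   -- [b0, b1, ..., bk] regression coefficients
    x_next -- predictor values for the day being forecast
    x_last -- predictor values for the last observed day
    y_last -- last observed response
    phi1   -- AR(1) coefficient of the error process
    """
    def regression_part(row):
        return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

    eta_last = y_last - regression_part(x_last)  # last regression error
    return regression_part(x_next) + phi1 * eta_last


# Hypothetical numbers: the last error was +1, so half of it carries forward.
print(rtse_ar1_forecast(beta=[1.0, 2.0], x_next=[3.0], x_last=[2.0], y_last=6.0, phi1=0.5))
```

Setting `phi1` to zero recovers the plain regression forecast, which shows that the MLR forecast is the special case with uncorrelated errors.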
One important assumption of the RTSE model is the stationarity of the data series, which means that the mean level, variance and autocorrelation function of the series are constant over time. Stationarity in the mean was tested using the Augmented Dickey-Fuller (ADF) test. If Y or any one of the X's is not stationary, all series are differenced d times before modeling. If a seasonal pattern is observed, the series are differenced D times at seasonal lag S. This leads to regression at difference level with SARMA(p,q)(P,Q) errors:
\[
Y'_t = \beta_1 X'_{1t} + \ldots + \beta_k X'_{kt} + \eta'_t, \qquad
\phi_p(B)\,\Phi_P(B^S)\,\eta'_t = \varphi_q(B)\,\Psi_Q(B^S)\,\varepsilon_t, \qquad t = 1, 2, \ldots, n, \tag{4}
\]
where the prime denotes a differenced series, or equivalently, regression at level with SARIMA(p,d,q)(P,D,Q) errors:
\[
Y_t = \beta_0 + \beta_1 X_{1t} + \ldots + \beta_k X_{kt} + \eta_t, \qquad
\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d (1-B^S)^D\,\eta_t = \varphi_q(B)\,\Psi_Q(B^S)\,\varepsilon_t, \qquad t = 1, 2, \ldots, n. \tag{5}
\]
The same subsets of variables selected for MLR1 were used for RTSE modeling. Similarly, a logarithmic transformation was applied to PM10 to achieve stationary and Gaussian series. The optimum model was obtained by minimizing the AIC.
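The differencing operators (1 − B)^d and (1 − B^S)^D that appear in Eq. 5 can be sketched with one short helper. The function name and the sample series are illustrative assumptions.

```python
def difference(series, lag=1, times=1):
    """Apply (1 - B^lag) to a series `times` times.

    lag=1 gives ordinary differencing; lag=S gives seasonal differencing.
    Each pass shortens the series by `lag` observations.
    """
    out = list(series)
    for _ in range(times):
        out = [out[i] - out[i - lag] for i in range(lag, len(out))]
    return out


squares = [1, 4, 9, 16, 25]
print(difference(squares))                 # first difference (d = 1)
print(difference(squares, times=2))        # second difference (d = 2)
print(difference([1, 2, 3, 4, 5, 6], lag=3))  # seasonal difference with S = 3
```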
Performance evaluation measures
The performance of the models at each station was evaluated by six indices, namely the root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), maximum absolute percentage error (maxAPE), fractional bias (FB) and percent bias (PBIAS). A good model is indicated by values of RMSE, MAE, MAPE and maxAPE that are low, i.e., close to zero. FB and PBIAS approaching zero also indicate good agreement between observed and predicted values. Positive FB and negative PBIAS reflect underestimation, and vice versa. Let O_i be the real observation, P_i the predicted value, and n the number of observations. The formulae of the performance indices are as follows.
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (O_i - P_i)^2}, \tag{6}
\]
\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert O_i - P_i \rvert, \tag{7}
\]
\[
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left\lvert \frac{O_i - P_i}{O_i} \right\rvert \times 100\%, \tag{8}
\]
\[
\mathrm{maxAPE} = \max_i \left\lvert \frac{O_i - P_i}{O_i} \right\rvert \times 100\%, \tag{9}
\]
\[
\mathrm{FB} = \frac{2\sum_{i=1}^{n} (O_i - P_i)}{\sum_{i=1}^{n} (O_i + P_i)}, \tag{10}
\]
\[
\mathrm{PBIAS} = \frac{\sum_{i=1}^{n} (P_i - O_i)}{\sum_{i=1}^{n} O_i}. \tag{11}
\]
Moreover, the overall performance of the models was assessed by combining the prediction results of all stations.
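Equations 6 to 11 translate directly into code. The sketch below is one possible Python rendering; the function name, the dictionary layout and the sample values are assumptions for illustration. Observations must be non-zero because MAPE and maxAPE divide by them.

```python
import math


def forecast_metrics(observed, predicted):
    """Compute RMSE, MAE, MAPE, maxAPE, FB and PBIAS as in Eqs. 6-11."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]          # O_i - P_i
    ape = [abs(e / o) * 100.0 for e, o in zip(errors, observed)]   # absolute % errors
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": sum(ape) / n,
        "maxAPE": max(ape),
        "FB": 2.0 * sum(errors) / sum(o + p for o, p in zip(observed, predicted)),
        "PBIAS": -sum(errors) / sum(observed),                     # sum(P - O) / sum(O)
    }


# Invented example in which the forecasts are too low on balance:
# FB comes out positive and PBIAS negative, the signature of underestimation.
print(forecast_metrics([10.0, 20.0, 30.0], [8.0, 22.0, 27.0]))
```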
Results and discussion
Summary statistics of PM10 concentrations
Figure 1 and Table 2 display the boxplot and descriptive statistics, respectively, of PM10 concentrations in the modeling datasets at the five stations.
Figure 1 shows that the distributions of PM10 concentrations at all stations are skewed to the right, indicating that there were several days with extreme PM10 concentrations. According to Table 2, station CA16 had the highest number of days on which PM10 concentrations surpassed the standard limit of 150 μg m−3 set by MAAQG, whereas the stations in suburban areas had relatively fewer extreme days, as expected. In general, the two stations located in the Klang Valley area, CA16 and CA58, experienced high PM10 concentrations (> 100 μg m−3) more frequently. All stations except CA17 have fairly consistent means, ranging from 49.85 to 55.86 μg m−3, with standard deviations from 23.38 to 31.87 μg m−3. Surprisingly, CA17, which is classified as a suburban station, has the highest mean and median values. This may be due to the fact that it is situated near industrial areas. Despite having the highest central tendency values, CA17 shows the smallest standard deviation, suggesting that this location constantly experienced rather high concentrations. This is further supported by the exceptionally high percentage of days (76.92%) on which PM10 concentrations exceeded 50 μg m−3. This concentration level is also a guideline limit provided by WHO (WHO 2006). Further scrutiny shows that the maximum concentrations at the stations in the central region of Peninsular Malaysia (CA16, CA47 and CA58) occurred in March, while the stations in the northern region (CA09 and CA17) had their maxima in July. This implies that haze occurred during the northeast and southwest monsoons. This result is consistent with the findings of Shaadan et al. (2015).
Regression modeling
The modeling process began with checking the correlations between the dependent variable (lnPM10) and the predictor variables (1-previous-day meteorological and gaseous parameters), both to verify the linear associations between lnPM10 and its predictors and to help decide whether to include these predictors. Table 3 shows Pearson's correlation coefficients, R, between lnPM10 and its predictors for all stations using the modeling datasets.
In general, all pollutants except NO are positively associated with lnPM10, and most of them are appreciably correlated with lnPM10 (R > 0.3). lnPM10 is positively correlated with temperature and negatively correlated with humidity at all stations. This is because high temperature usually promotes the formation of PM10 precursors as well as burning activities. In contrast, high humidity is normally linked to more rainfall, which washes out PM10. Even though NO, wind speed and the wind vectors are less strongly associated with lnPM10, many studies have pointed out that nitrogen oxides, which include NO, are the main cause of PM10 formation (Bhattacharjee et al. 1999), and that PM10 can be influenced by wind speed and direction (Shaharuddin et al. 2008; Liew et al. 2011). Hence, all the abovementioned predictors were considered in the variable selection. Only the variables selected by the SW and APM methods were finally included in MLR modeling.
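The correlation check described here, between lnPM10 on day t and a predictor on day t−1, can be sketched in a few lines of Python. The function names and the sample values are hypothetical, and the sketch does not reproduce the significance test reported in Table 3.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_log_correlation(pm10, predictor):
    """Correlation between ln(PM10) on day t and the predictor on day t-1."""
    ln_pm10_today = [math.log(v) for v in pm10[1:]]
    predictor_yesterday = list(predictor[:-1])
    return pearson_r(predictor_yesterday, ln_pm10_today)


# Invented series: humidity falls while PM10 rises the following day.
pm10 = [45.0, 50.0, 58.0, 66.0, 80.0]
humidity = [85.0, 80.0, 76.0, 70.0, 65.0]
print(lagged_log_correlation(pm10, humidity))
```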
After the MLR1 models were estimated, the residuals were checked for compliance with the Gaussian white noise assumption before further investigation. None of the models is deemed to have a multicollinearity problem, as the VIF values are all below 10. The normal Q-Q plot was used to examine the normality of residuals. In addition, Fig. 2(a)–(e) shows the plots of residuals against fitted values, used to evaluate the constant variance assumption for all stations.
Fig. 1 Boxplot of PM10 concentrations in modeling datasets for all stations
From the normal Q-Q plots (not displayed), the residuals lie around the straight line, apart from slight deviations at the two tails, supporting the normality of residuals for all stations. The residuals at all stations also show constant variance, as no specific patterns were observed in the plots in Fig. 2. However, the Breusch-Godfrey Lagrange multiplier (LM) tests at lag 1 show that the residuals are serially correlated. The test statistics (LM(1)) are significant at the 5% level, as shown in Table 4, rejecting the null hypothesis of independent residuals. This violation of assumption may cause incorrect coefficient and standard error estimates and hence wrong inference. Two modifications were therefore made to remedy the situation. One was to include previous days' PM10 concentrations in the regression models. The other was to relax the uncorrelated residual assumption and treat the residuals as a time series, i.e., an ARMA process, which gives the RTSE model. Similar residual diagnostics (not shown here) show that the residuals from MLR2 satisfy the white noise assumption. The autocorrelation of residuals in the RTSE models was checked by the Ljung-Box (LB) test with six lags. The failure to reject the null hypothesis confirms that the uncorrelated residual assumption was met. Table 4 presents the estimated coefficients of the MLR1, MLR2 and RTSE models for all stations. The asterisk (*) indicates a parameter significant at the 5% level.
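For readers unfamiliar with the Ljung-Box statistic, a minimal sketch of its standard form, Q = n(n + 2) Σ r_k² / (n − k) summed over lags k = 1, …, h, is given below. The function name and sample residuals are assumptions for illustration, and the sketch stops at the statistic itself: judging significance requires comparing Q with a chi-square quantile, which is left to a statistical package.

```python
def ljung_box_q(residuals, max_lag=6):
    """Ljung-Box Q statistic over lags 1..max_lag.

    r_k is the lag-k sample autocorrelation of the residuals.
    Large Q values point to remaining serial correlation.
    """
    n = len(residuals)
    if n <= max_lag:
        raise ValueError("need more observations than lags")
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# A strictly alternating series is strongly (negatively) autocorrelated at lag 1.
alternating = [1.0, -1.0] * 5
print(ljung_box_q(alternating, max_lag=1))
```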
As displayed in Table 4, the regression parameters of MLR1 explain 39–58% of the PM10 variations across the stations. The previous day's meteorological and pollutant variables explain a higher percentage of the PM10 variations in suburban areas than in other areas. On the other hand, the lower R2adj for station CA16 suggests that other factors might govern the PM10 formation there. For all stations, the inclusion of previous days' PM10 concentrations greatly increases the explained proportion of PM10 variations to over half. In MLR2 and RTSE, the number of significant variables is also reduced.
For MLR1, it can be seen that humidity, CO and O3 were important factors linked to PM10 concentrations at all stations. Furthermore, NO2, temperature, wind speed and direction (despite the lack of linear association) were also significantly related to PM10 concentrations. Overall, PM10 concentrations were positively correlated with the air pollutants and temperature, and negatively correlated with humidity and wind speed. High wind speed helps to disperse PM10 at a location, which explains the inverse relationship between them. CO, NO2 and O3 share a common source associated with fuel combustion. Therefore, it can be deduced that the PM10 concentrations at these stations were most likely related to vehicular emissions and meteorological factors.
Table 2 Descriptive statistics of PM10 concentrations in modeling datasets for all stations
                                CA09     CA16     CA17     CA47     CA58
Mean, μg m−3                    49.85    55.57    64.71    49.88    55.86
Std. Dev., μg m−3               23.38    31.87    19.33    25.77    31.85
Min, μg m−3                     15.39    18.59    32.48    20.96    15.88
Max, μg m−3                     174.67   302.92   155.75   273.29   300.63
% of days exceeding 50 μg m−3   40.17    42.74    76.92    32.76    44.44
% of days exceeding 100 μg m−3  3.99     7.12     4.84     4.27     7.69
% of days exceeding 150 μg m−3  0.57     2.56     0.28     1.14     1.42
Table 3 Pearson's correlation coefficients between lnPM10 at time t and 1-previous-day predictors for all stations
Predictor   CA09     CA16     CA17     CA47     CA58
NO2,t-1     0.39     0.21     0.38     0.10*    0.32
NOt-1       −0.15    −0.25    −0.09*   −0.22    −0.22
SO2,t-1     0.34     0.30     0.35     0.28     0.35
COt-1       0.33     0.15     0.47     0.44     0.36
O3,t-1      0.61     0.37     0.62     0.47     0.47
Tempt-1     0.41     0.46     0.39     0.22     0.42
RHt-1       −0.44    −0.49    −0.48    −0.24    −0.36
Wx,t-1      0.08*    −0.18    −0.06*   0.08*    −0.04*
Wy,t-1      0.20     0.05*    −0.15    0.13     0.17
WSt-1       0.11     0.21     0.01*    0.04*    0.12
*Non-significant at 5% level (remaining are significant)
In MLR2, humidity remains a major factor at all stations, besides the previous days' PM10 concentrations. Other important factors include wind speed and direction, NO and O3. Two previous days of PM10 concentrations were included in the regression for station CA47 in order to achieve independent residuals. Yet, the negative coefficient of lnPM10,t-2 is contrary to expectation. This may be because it was highly correlated with the 1-previous-day PM10 concentration, but this would not affect the precision of forecasting (Neter et al. 1985).
In the RTSE models, at least one of the variable series included in the regression was not stationary at every station except CA09, so all the series, including the dependent and predictor variables, were differenced once before modeling, as indicated by d = 1 in the RTSE models. All RTSE models show significant autocorrelation among PM10 concentrations, since the ARMA parameters are significant at the 5% level. As before, humidity remains a key factor. In addition, temperature, CO and O3 were substantially correlated with the PM10 variations, but wind speed was no longer significant. This agrees with the small R between lnPM10 and wind speed. However, wind speed may still be an important factor that correlates with PM10 concentrations nonlinearly. All in all, it can be inferred that humidity, temperature, wind speed and direction were the main meteorological factors dominating PM10 concentrations, while CO and O3 signify that PM10 concentrations in Peninsular Malaysia were due to vehicular emissions. These results are in line with the findings of Liew et al. (2011) and Ul-Saufie et al. (2013), which demonstrated that meteorological parameters explained a large proportion of PM10 variations. Moreover, our findings concur with the study of Afroz et al. (2003), which indicated that transportation emissions contributed more than 70% of air pollution during non-haze periods.
Fig. 2 Plots of residuals from MLR1 versus fitted values for stations (a) CA09, (b) CA16, (c) CA17, (d) CA47 and (e) CA58
Table 4 Estimated coefficients for MLR1, MLR2 and RTSE models
Variables CA09 CA16 CA17 CA47 CA58
MLR1
NO2,t-1 12.636* 14.851* 11.630*
NOt-1 10.249* 7.591 −10.220
SO2,t-1 38.669* 30.462* 27.973
COt-1 0.462* 0.203* 0.429* 0.367* 0.752*
O3,t-1 24.250* 10.305* 12.099* 19.530* 10.150*
Tempt-1 0.034* 0.043 −0.100* 0.048
RHt-1 −0.011* −0.023* −0.012* −0.023* −0.008*
Wx,t-1 −0.017 −0.033*
Wy,t-1 −0.014* 0.038* 0.069*
WSt-1 −0.087* −0.061* −0.060* 0.060
Intercept 2.476* 3.883* 4.637* 8.039* 2.090*
LM(1) 25.759* 68.047* 47.318* 94.813* 31.670*
R2adj 0.529 0.387 0.580 0.430 0.433
MLR2
lnPM10,t-1 0.546* 0.599* 0.551* 0.799* 0.680*
lnPM10,t-2 −0.129*
NO2,t-1 −11.935
NOt-1 9.431* 9.551*
SO2,t-1 25.481*
COt-1 0.105* −0.219
O3,t-1 11.412* 5.676*
Tempt-1 −0.067* 0.034
RHt-1 −0.009* −0.014* −0.007* −0.013* −0.007*
Wx,t-1 −0.010 0.014* 0.021*
Wy,t-1 0.010 0.020* 0.029*
WSt-1 −0.058* −0.035*
Intercept 2.028* 2.853* 2.388* 4.242* 1.094
LM(1) 0.334 1.021 1.090 0.241 0.016
R2adj 0.604 0.560 0.669 0.645 0.536
RTSE
NO2,t-1 9.740* 16.888*
SO2,t-1 −48.506*
COt-1 0.184 0.232* 0.133
O3,t-1 10.748* 5.648* 4.529
Tempt-1 0.071* −0.039 0.050*
RHt-1 −0.010* −0.009 −0.008* −0.007* −0.012*
Wx,t-1 −0.023* 0.015
Wy,t-1 −0.011 0.019*
WSt-1 −0.086*
Intercept 4.211*
AR(1) 0.602* 0.507* −0.179*
AR(2) −2.990*
AR(3) −0.230*
AR(4) −0.157*
SAR(1) 0.158*
MA(1) −0.508* −0.924* −0.439*
MA(2) −0.178* −0.295*
d – 1 1 1 1
LB Q-stat 4.219 1.152 0.752 3.516 5.122
*Significant at 5% level
Figure 3(a)–(e) compares the forecasts by the three methods with the actual observations from 18 to 31 December 2014 for all stations. Table 5 presents the individual and collective results of the model performance for all stations.
A rough inspection of Fig. 3 shows that all models are able to capture the overall patterns of the series. MLR1 produced forecasts with the largest variation, while RTSE yielded forecasts with the smallest fluctuation. The smaller variation of the RTSE forecasts may be attributed to the stationarity of the ARMA model, which tends to produce forecasts oscillating around the long-term mean value. Thus, MLR1 generally works well in scenarios where PM10 concentrations fluctuate strongly, particularly at stations CA16 and CA58. Nonetheless, it sometimes tends to overestimate the peaks and underestimate the troughs. RTSE, in contrast, performs quite well where PM10 concentrations vary only slightly, such as at CA09 and CA17. An additional advantage of RTSE is that it captures the dynamic patterns in a more timely manner than both MLR models. MLR2 also generated quite accurate forecasts for all stations, but they visibly lag 1 day behind owing to the inclusion of the 1-previous-day PM10 concentration. The overall evaluation statistics suggest that MLR2 and RTSE are comparable in their predictive ability. The lowest values of RMSE, MAPE and maxAPE for the MLR2 model indicate that it is the best model by these measures. On the other hand, RTSE is preferred over the two MLR models by the MAE, FB and PBIAS. In addition, the positive FB and negative PBIAS indicate that all three models underestimated the actual concentrations in general. The RTSE model does not show any superiority over the simple MLR model. Therefore, in our case, a simple MLR model that includes the lagged PM10 is sufficient to account for the autocorrelation among PM10.
Fig. 3 Actual observations versus forecasts by the three models from 18 to 31 December 2014 for stations (a) CA09, (b) CA16, (c) CA17, (d) CA47 and (e) CA58 (x-axis: date, 18/12 to 31/12; y-axis: PM10 concentration, μg m−3; series: actual, MLR1, MLR2, RTSE)
Conclusion
This study used three different dynamic regression models to forecast the 1-day-ahead daily average PM10 concentrations at five selected stations in Peninsular Malaysia. When the regression equations included only the lagged predictor variables, the residuals were autocorrelated. Including the lagged PM10 concentrations, or allowing the model errors to be correlated, not only solved this problem but also increased the forecast precision. The outcomes showed that, after correcting for the autocorrelated residuals, humidity, temperature, wind speed, wind direction, CO and O3 were essential parameters influencing PM10 concentrations in Peninsular Malaysia. Moreover, the gaseous pollutants suggested traffic emissions as the main source of PM10 pollution. These results accord with the previous studies of Liew et al. (2011), Ul-Saufie et al. (2013) and Afroz et al. (2003). In addition, the prediction results showed that MLR2 is equivalent to RTSE, while MLR1 gave the least accurate forecasts.
Table 5 Comparative results of forecast performance by the three methods at all stations and their overall performance
                 CA09      CA16      CA17      CA47      CA58      Overall
RMSE     MLR1    11.426    7.825     7.910     4.991     10.006*   8.711
         MLR2    9.066     7.345*    5.298*    4.078*    10.944    7.754*
         RTSE    7.942*    8.860     5.335     4.465     10.695    7.802
MAE      MLR1    9.013     6.472     7.134     3.382     8.681     6.936
         MLR2    6.842     5.802*    4.289     3.225*    9.503     5.932
         RTSE    5.642*    7.835     4.035*    4.098     8.045*    5.931*
MAPE     MLR1    19.640    19.342    13.098    9.377*    21.782    16.648
         MLR2    14.750    17.315*   7.670     9.509     24.742    14.797*
         RTSE    11.845*   25.973    7.192*    11.747    19.172*   15.186
maxAPE   MLR1    48.003    47.898    26.418    33.493    39.071    38.977
         MLR2    39.251    44.464*   16.596*   22.820*   46.283    33.883*
         RTSE    34.458*   72.767    18.076    23.693    28.883*   35.575
FB       MLR1    0.144     −0.034    −0.045    0.091     0.088     0.041
         MLR2    0.1043    −0.010*   0.046     −0.049    −0.005*   0.022
         RTSE    0.1041*   −0.151    −0.005*   −0.041*   0.127     0.006*
PBIAS    MLR1    −0.135    0.034     0.046     −0.087    −0.084    −0.040
         MLR2    −0.0991   0.010*    −0.045    0.051     0.005*    −0.021
         RTSE    −0.0989*  0.163     0.005*    0.042*    −0.120    −0.006*
*Lowest error/bias among the MLR1, MLR2 and RTSE models
Acknowledgements We would like to acknowledge Malaysia Department of Environment (DOE) for providing data to accom- plish this study.
Funding information We would like to extend our gratitude to Universiti Sains Malaysia for Fellowship Scheme and Malaysia Ministry of Higher Education for financial assistance.
References
Afroz, R., Hassan, M. N., Ibrahim, N. A., et al. (2003). Review of air pollution and health impacts in Malaysia. Environmental Research, 92(2), 71–77. https://doi.org/10.1016/S0013-9351(02)00059-2.
Afzali, A., Rashid, M., Sabariah, B., Ramli, M., et al. (2014). PM10 pollution: its prediction and meteorological influence in Pasir Gudang, Johor. Proceedings of the IOP Conference Series: Earth and Environmental Sciences, 18.
Bhattacharjee, H., Drescher, M., Good, T., Hartley, Z., Leza, J-D., Lin, B., Moss, J., Massey, R., Nishino, T., Ryder, S., Sachs, N., Tozan, Y., Taylor, C., & Wu, D. (1999). Section 1: Introduction. In Particulate matter in New Jersey. Princeton: Princeton University.
DOE Department of Environment. (2000). A guide to Air Pollution Index (API) in Malaysia. Ministry of Science, Technology and the Environment.
DOE Department of the Environment. (2015). Malaysia Environmental Quality Report 2014. Ministry of Natural Resources and Environment Malaysia.
DOE Department of the Environment. (2017). Air quality standards: new Malaysia ambient air quality standard. https://www.doe.gov.my/portalv1/en/info-umum/english-air-quality-trend/108. Accessed 24 Feb 2017.
Hassan, N. A., Hashim, Z., Hashim, J. H., et al. (2016). Impact of climate change on air quality and public health in urban areas. Asia Pacific Journal of Public Health, 28(2S), 38S–48S. https://doi.org/10.1177/1010539515592951.
Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., Kolehmainen, M., et al. (2004). Methods for imputation of missing values in air quality data sets. Atmospheric Environment, 38(18), 2895–2907. https://doi.org/10.1016/j.atmosenv.2004.02.026.
Liew, J., Latif, M. T., Tangang, F. T., et al. (2011). Factors influencing the variations of PM10 aerosol dust in Klang Valley, Malaysia during the summer. Atmospheric Environment, 45, 4370–4378.
Liu, P.-W. G. (2009). Simulation of the daily average PM10 concentrations at Ta-Liao with Box-Jenkins time series models and multivariate analysis. Atmospheric Environment, 43(13), 2104–2113. https://doi.org/10.1016/j.atmosenv.2009.01.055.
Neter, J., Wasserman, W., & Kutner, M. H. (1985). Applied linear statistical models: regression, analysis of variance, and experimental designs (2nd ed.). Homewood, Illinois: R. D. Irwin.
Norazian, M. N., Ahmad Shukri, Y., Nor Azam, R., Al Bakri, A. M. M., et al. (2008). Estimation of missing values in air pollution data using single imputation techniques. ScienceAsia, 34(3), 341–345. https://doi.org/10.2306/scienceasia1513-1874.2008.34.341.
Shaadan, N., Jemain, A. A., Latif, M. T., Mohd Deni, S., et al. (2015). Anomaly detection and assessment of PM10 functional data at several locations in the Klang Valley, Malaysia. Atmospheric Pollution Research, 6(2), 365–375. https://doi.org/10.5094/APR.2015.040.
Shaharuddin, M., Zaharim, A., Mohd Nor, M. J., Karim, O. A., Sopian, K., et al. (2008). Application of wavelet transform on airborne suspended particulate matter and meteorological temporal variations. WSEAS Transactions on Environment and Development, 4(2), 89–98.
Siwek, K., & Osowski, S. (2012). Improving the accuracy of prediction of PM10 pollution by the wavelet transformation and an ensemble of neural predictors. Engineering Applications of Artificial Intelligence, 25(6), 1246–1258. https://doi.org/10.1016/j.engappai.2011.10.013.
Ul-Saufie, A. Z., Yahaya, A. S., Ramli, N. A., Rosaida, N., Abdul Hamid, H., et al. (2013). Future daily PM10 concentrations prediction by combining regression models and feedforward backpropagation models with principle component analysis (PCA). Atmospheric Environment, 77, 621–630. https://doi.org/10.1016/j.atmosenv.2013.05.017.
Vinceti, M., Malagoli, C., Malavolti, M., Cherubini, A., Maffeis, G., Rodolfi, R., Heck, J. E., Astolfi, G., Calzolari, E., Nicolini, F., et al. (2016). Does maternal exposure to benzene and PM10 during pregnancy increase the risk of congenital anomalies? A population-based case-control study. Science of the Total Environment, 541, 444–450. https://doi.org/10.1016/j.scitotenv.2015.09.051.
WHO World Health Organization. (2006). WHO Air Quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide: Global update 2005, Summary of risk assessment.