
Multiple linear regression and regression with time series error models in forecasting PM10 concentrations in Peninsular Malaysia

Kar Yong Ng & Norhashidah Awang

Received: 28 March 2017 / Accepted: 19 December 2017

© Springer International Publishing AG, part of Springer Nature 2018

Abstract Frequent haze occurrences in Malaysia have made the management of PM10 (particulate matter with aerodynamic diameter less than 10 μm) pollution a critical task. This requires knowledge of the factors associated with PM10 variation and good forecasts of PM10 concentrations. Hence, this paper demonstrates the prediction of 1-day-ahead daily average PM10 concentrations based on predictor variables including meteorological parameters and gaseous pollutants. Three different models were built: a multiple linear regression (MLR) model with lagged predictor variables (MLR1), an MLR model with lagged predictor variables and lagged PM10 concentrations (MLR2), and a regression with time series error (RTSE) model. The findings revealed that humidity, temperature, wind speed, wind direction, carbon monoxide and ozone were the main factors explaining the PM10 variation in Peninsular Malaysia. Comparison among the three models showed that the MLR2 model was on the same level as the RTSE model in terms of forecasting accuracy, while the MLR1 model was the worst.

Keywords Multiple linear regression . Regression with time series error . PM10 . Forecast

Introduction

PM10 is a notorious air pollutant, associated especially with detrimental health impacts. Studies have shown that PM10 aggravates respiratory and cardiopulmonary conditions and increases the morbidity and mortality rates of related diseases (Hassan et al. 2016). Furthermore, exposure to PM10 during pregnancy may lead to birth abnormalities (Vinceti et al. 2016). Populations such as children, the elderly, and those with asthma and cardiovascular problems are particularly susceptible to PM10 pollution (Hassan et al. 2016).

Owing to its adverse impacts, PM10 is one of the major air pollutants regulated by the Malaysian government and is included in the calculation of the Air Pollution Index (API), which is a yardstick of air quality in Malaysia. The Malaysian Ambient Air Quality Guidelines (MAAQG) set the limits at 150 and 50 μg m⁻³ for 24-h average and annual average PM10, respectively. Once the daily average PM10 concentration exceeds the 150 μg m⁻³ level, indicating unhealthy status in the API, high-risk populations are advised to stay indoors (DOE 2000). Three interim targets have been set to reduce PM10 levels, with the final stage in 2020 at a limit of 100 μg m⁻³ (DOE 2017). As reported in the Malaysia Environmental Quality Report 2014, PM10 was one of the crucial air pollutants contributing to most of the unhealthy days in Peninsular Malaysia. The sources were primarily local and transboundary haze. Moreover, industrial and urban areas recorded comparatively higher PM10 concentrations than suburban and rural areas (DOE 2015).

K. Y. Ng (*) : N. Awang
School of Mathematical Sciences, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia
e-mail: [email protected]
N. Awang
e-mail: [email protected]


Since PM10 has been shown to be dangerous to health and degrades air quality, it has to be studied and analyzed for a better understanding of the mechanism of PM10 formation. In Malaysia, different statistical methods have been employed to investigate the relationship between PM10 and meteorological and gaseous variables. For instance, Shaharuddin et al. (2008) combined the wavelet transform method with the correlation coefficient to analyze the relationship between PM10 and meteorological parameters (i.e., temperature, rainfall and wind speed) using data from the Petaling Jaya station. The authors found that PM10 was positively associated with temperature and negatively associated with rainfall. Wind speed correlated positively and negatively with PM10 at time scales of 256 and 128 days, respectively. In a nutshell, the correlations between PM10 and the three variables were significant only at low frequency, with time scales approaching a year. This suggested a seasonal effect of these variables on the PM10 variation, and a linkage to transboundary fires as the key source of PM10 from 2000 to 2004.

Liew et al. (2011) used simple multiple linear regression (MLR) to examine the relation between PM10 during the summer monsoon and local and synoptic meteorological parameters as well as local and foreign hotspot numbers at six stations in the Klang Valley area. The results showed that local surface temperature, humidity and wind speed, together with synoptic weather patterns and foreign hotspots, were significantly correlated with PM10 variation. Ul-Saufie et al. (2013) predicted future daily PM10 concentrations at the Nilai station by combining principal component analysis (PCA) with MLR and a feed-forward back-propagation neural network (FFBP). The parameters used included meteorological and air pollutant variables. The findings showed that PCA improved the prediction accuracy for both types of models.

Motivated by concern over high PM10 concentrations, Shaadan et al. (2015) applied anomaly detection to study the fluctuation of PM10 concentrations and the potential factors affecting PM10 anomalies at three stations in the Klang Valley. They identified more PM10 anomalies during the southwest and northeast monsoons and on weekdays than on weekends. Furthermore, wind speed was found to be positively associated with PM10 concentrations.

Even though numerous studies have examined the relationship between PM10 and exogenous variables, most of them focused on the Klang Valley region and considered only meteorological parameters in their models. Therefore, this paper aims to forecast the daily average PM10 concentrations in Peninsular Malaysia based on lagged meteorological and gaseous variables by applying simple MLR. In addition, separate MLR models are constructed with the inclusion of previous days' PM10 concentrations, as implemented by Ul-Saufie et al. (2013) and Afzali et al. (2014). For comparison, the regression with time series error (RTSE) model, which is a modified Box-Jenkins method, is employed. This approach is selected because it allows for autocorrelation among the residuals or dependent variables (Liu 2009), which is common when pollutant time series are used. To our knowledge, no air pollution study in Malaysia has yet used RTSE. The paper is organized as follows. The subsequent section describes the data and methods used. Next, the results and discussion are presented, and the conclusion is drawn in the final section.

Materials

Hourly data of PM10, NO2 (nitrogen dioxide), NO (nitrogen monoxide), SO2 (sulfur dioxide), CO (carbon monoxide) and O3 (ozone), along with temperature, humidity, wind speed and wind direction, for the year 2014 were obtained from the Malaysia Department of Environment (DOE). To ensure the continuity of wind direction, two wind parameters, the wind-x (W_x) and wind-y (W_y) components, were generated from wind speed, WS, and wind direction, θ (Siwek and Osowski 2012), where

\[ W_x = -WS \sin\theta, \qquad W_y = -WS \cos\theta. \qquad (1) \]
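The wind decomposition in Eq. (1) can be illustrated with a minimal Python sketch. It assumes that the wind direction θ is recorded in degrees, which the text does not state, and the function name `wind_components` is hypothetical.

```python
import math


def wind_components(ws, theta_deg):
    """Return (W_x, W_y) as in Eq. (1): W_x = -WS sin(theta), W_y = -WS cos(theta).

    Assumption: theta_deg is the wind direction in degrees.
    """
    theta = math.radians(theta_deg)
    return -ws * math.sin(theta), -ws * math.cos(theta)


# Two nearly identical winds whose raw directions (359 and 1 degrees) look far apart
wx_a, wy_a = wind_components(2.0, 359.0)
wx_b, wy_b = wind_components(2.0, 1.0)
print(wx_a, wy_a)
print(wx_b, wy_b)
```

The two example winds differ by 358 in raw direction but have almost the same components, which is the continuity property the transformation is meant to provide.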

These data were then processed to obtain 24-h daily averages. Only days with fewer than 12 h of missing data were averaged; the remaining days were treated as missing values. The missing values were imputed by linear interpolation. This imputation method is simple and has been shown to be efficient when fewer than 5% of the values are missing (Junninen et al. 2004; Norazian et al. 2008). The stations under study were selected carefully so that the variables at all stations were at least 95% complete. This led to the selection of five stations located in different environments, namely Seberang Jaya (CA09), Petaling Jaya (CA16), Sungai Petani (CA17), Seremban (CA47) and Batu Muda (CA58). Stations CA09 and CA17 represent suburban backgrounds, CA16 represents an industrial background, and CA47 and CA58 represent urban backgrounds. Out of the year of data, the first 351 days and the remaining 14 days were used as the modeling and validation datasets, respectively. Table 1 summarizes the variables used in modeling and their corresponding symbols and units.

These variables were used in MLR and RTSE modeling. Two separate MLR models were constructed for each station. One used only the previous day's gaseous and meteorological variables (MLR1), while the other additionally included previous days' PM10 concentrations (MLR2). Next, the RTSE models were established because the residuals from MLR1 were found not to be independent (as explained and demonstrated in later sections). All modeling procedures were performed using R software, and a significance level of 5% was adopted. Finally, the performances of the three models were compared.
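The preprocessing steps described in this section (daily averaging with a missing-hour threshold, linear interpolation of missing days, and the 351/14 split) can be sketched in plain Python. This is only an illustration under stated assumptions: missing readings are represented by `None`, gaps at the very start or end of a series are left unfilled, and all function names are hypothetical. The study itself used R.

```python
def daily_average(hourly, max_missing=11):
    """Average one day's hourly readings (None marks a missing hour).

    Returns None when 12 or more hours are missing, so the day
    is treated as a missing value.
    """
    valid = [v for v in hourly if v is not None]
    if len(hourly) - len(valid) > max_missing:
        return None
    return sum(valid) / len(valid)


def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between observed neighbours.

    Assumption: leading and trailing gaps are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def split_series(daily, n_model=351):
    """Split a daily series into modeling and validation parts."""
    return daily[:n_model], daily[n_model:]
```

With a 365-day series, `split_series` returns the first 351 days for modeling and the last 14 days for validation.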

Methods

Multiple linear regression

MLR is widely used to explain the relationship between a dependent variable and several independent predictors. However, the relationship is not necessarily causal, i.e., the dependent variable may not be caused by the independent variables. The MLR model is defined as

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \ldots, n \qquad (2) \]

where Y is the dependent variable, β0, β1, ⋯, βk are the (k + 1) parameters, X1, ⋯, Xk are the k independent variables, ε are the independent, identically and Gaussian-distributed errors, and n is the number of observations. The parameters are estimated by the ordinary least squares method.
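As a brief aside, the least squares estimate for Eq. (2) has a well-known closed form. The matrix notation below is not used in the original text and is introduced here only for illustration: \(\mathbf{y}\) is the n-vector of responses and \(\mathbf{X}\) is the n × (k + 1) design matrix whose first column is all ones.

```latex
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}
    \left( Y_i - \beta_0 - \beta_1 X_{1i} - \cdots - \beta_k X_{ki} \right)^2
  = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}
```

The closed form requires \(\mathbf{X}^{\top}\mathbf{X}\) to be invertible, that is, no predictor may be an exact linear combination of the others.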

To make a 1-day-ahead forecast, the dependent variable was set to today's PM10, while the predictors were the previous day's values of the other variables listed in Table 1. A logarithmic transformation was applied to PM10 (lnPM10) for all stations because the concentrations were not Gaussian distributed. Before modeling, the stepwise (SW) and all possible model (APM) methods were used to select suitable subsets of predictor variables for every station. The SW method selects variables by dropping and including them iteratively, and the process stops when the Akaike Information Criterion (AIC) reaches its minimum. The APM technique, on the other hand, tries all combinations of variables and chooses the subset yielding the highest adjusted coefficient of multiple determination, R²_adj. The larger of the subsets chosen by the two methods was used in modeling. The variance inflation factor (VIF) was used to examine the presence of multicollinearity among predictors. A VIF value below 10 suggests that the specific predictor is not strongly linearly correlated with the other predictors in the model (Neter et al. 1985). The models were built based on R²_adj. The most insignificant variable was removed iteratively until a large decline in R²_adj was observed or all variables were significant at the 5% significance level, whichever occurred first, to obtain a parsimonious model.
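For reference, the three selection and diagnostic quantities named above have standard textbook definitions, sketched below. The paper does not print these formulas, so the notation is an assumption: R² is the coefficient of determination of the fitted model, R_j² is the coefficient of determination obtained by regressing predictor j on the remaining predictors, m is the number of estimated parameters, and \(\hat{L}\) is the maximized likelihood. Software implementations may differ from the AIC expression by an additive constant.

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{AIC} = 2m - 2\ln\hat{L}
```

Under this definition, the threshold VIF below 10 corresponds to R_j² below 0.9, i.e., less than 90% of the variation of predictor j is explained linearly by the other predictors.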

Table 1 Summary of variables included in regression modeling

Variable                              Symbol   Unit
Daily average PM10 concentration      PM10     μg m⁻³
Daily average NO2 concentration       NO2      ppm
Daily average NO concentration        NO       ppm
Daily average SO2 concentration       SO2      ppm
Daily average CO concentration        CO       ppm
Daily average O3 concentration        O3       ppm
Daily average temperature             Temp     °C
Daily average humidity                RH       %RH
Daily average wind-x component        Wx       m s⁻¹
Daily average wind-y component        Wy       m s⁻¹
Daily average wind speed              WS       m s⁻¹

Regression with time series error

The RTSE model may be a good alternative to simple MLR when time series variables are used. This is because the residuals from the MLR model will probably violate the assumption of independent errors and thus lead to incorrect parameter estimates and standard errors. Unlike MLR, the RTSE model assumes that the regression residuals follow an ARMA (autoregressive moving average) or seasonal ARMA (SARMA) process, which allows the errors to be serially correlated. The RTSE model has the form

\[ Y_t = \beta_0 + \beta_1 X_{1t} + \ldots + \beta_k X_{kt} + \eta_t, \qquad \phi_p(B)\,\Phi_P(B^S)\,\eta_t = \varphi_q(B)\,\Psi_Q(B^S)\,\varepsilon_t, \qquad t = 1, 2, \ldots, n \qquad (3) \]

where Y is the dependent variable, β0, β1, …, βk are the (k + 1) parameters, X1, …, Xk are the k independent variables, η are the errors characterized by an ARMA or SARMA process, and n is the number of observations. The second equation is a general SARMA(p,q)(P,Q) model with

\[ \phi_p(B) = 1 - \phi_1 B - \ldots - \phi_p B^p, \qquad \varphi_q(B) = 1 + \varphi_1 B + \ldots + \varphi_q B^q, \]

\[ \Phi_P(B^S) = 1 - \phi_S B^S - \ldots - \phi_{PS} B^{PS}, \qquad \Psi_Q(B^S) = 1 + \varphi_S B^S + \ldots + \varphi_{QS} B^{QS}, \]

where B is the backward shift operator such that B^a Y_t = Y_{t−a}, S is the seasonal period, ϕ1, …, ϕp, ϕS, …, ϕPS, φ1, …, φq, φS, …, φQS are the parameters to be estimated, and ε are the Gaussian white noise errors. When P = Q = 0, the equation simplifies to an ARMA(p,q) model. The maximum likelihood method is used to estimate the regression and ARMA parameters simultaneously.
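To make the operator notation concrete, consider the simplest non-trivial case, chosen here purely as an illustration: p = q = 1 and P = Q = 0, i.e., ARMA(1,1) errors. Expanding the backward shift operator in the error equation gives an explicit recursion for the regression errors.

```latex
(1 - \phi_1 B)\,\eta_t = (1 + \varphi_1 B)\,\varepsilon_t
\quad\Longleftrightarrow\quad
\eta_t = \phi_1 \eta_{t-1} + \varepsilon_t + \varphi_1 \varepsilon_{t-1}
```

Each regression error therefore depends on the previous day's error and on the previous day's white noise shock, which is how the model accommodates serial correlation that MLR assumes away.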

One important assumption of the RTSE model is the stationarity of the data series, which means that the mean level, variance and autocorrelation function of the data series are constant over time. Stationarity in the mean was tested using the Augmented Dickey-Fuller (ADF) test. If Y or any one of the X's is not stationary, all series are differenced d times before modeling. If a seasonal pattern is observed, the series are differenced D times at seasonal lag S. This leads to regression at the difference level with SARMA(p,q)(P,Q) errors, where a prime denotes a differenced series:

\[ Y'_t = \beta_1 X'_{1t} + \ldots + \beta_k X'_{kt} + \eta'_t, \qquad \phi_p(B)\,\Phi_P(B^S)\,\eta'_t = \varphi_q(B)\,\Psi_Q(B^S)\,\varepsilon_t, \qquad t = 1, 2, \ldots, n, \qquad (4) \]

or equivalently, regression at level with SARIMA(p,d,q)(P,D,Q) errors:

\[ Y_t = \beta_0 + \beta_1 X_{1t} + \ldots + \beta_k X_{kt} + \eta_t, \qquad \phi_p(B)\,\Phi_P(B^S)\,(1 - B)^d (1 - B^S)^D \eta_t = \varphi_q(B)\,\Psi_Q(B^S)\,\varepsilon_t, \qquad t = 1, 2, \ldots, n. \qquad (5) \]

The same subsets of variables selected for MLR1 were used for RTSE modeling. Similarly, a logarithmic transformation was applied to PM10 to achieve a stationary and Gaussian series. The optimum model was obtained by minimizing the AIC.
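The differencing step that precedes RTSE modeling can be sketched as follows. The function name `difference` and its arguments are hypothetical; `lag=1` corresponds to ordinary differencing (applied d times) and `lag=S` to seasonal differencing (applied D times).

```python
def difference(series, lag=1, times=1):
    """Apply the operator (1 - B^lag) to a series `times` times.

    Each pass shortens the series by `lag` observations.
    """
    out = list(series)
    for _ in range(times):
        out = [out[i] - out[i - lag] for i in range(lag, len(out))]
    return out


print(difference([1, 4, 9, 16]))           # first differences
print(difference([1, 4, 9, 16], times=2))  # second differences
```

Differencing a constant series gives zeros, which is consistent with the intercept β0 being absent from the difference-level regression in Eq. (4).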

Performance evaluation measures

The performance of the models at each station was evaluated by six indices, namely the root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), maximum absolute percentage error (maxAPE), fractional bias (FB) and percent bias (PBIAS). A good model is indicated by values of RMSE, MAE, MAPE and maxAPE near zero. FB and PBIAS approaching zero also indicate good agreement between observed and predicted values. Positive FB and negative PBIAS reflect underestimation, and vice versa. Let O_i be the actual observation, P_i the predicted value, and n the number of observations. The formulae of the performance indices are as follows.

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2}, \qquad (6) \]

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |O_i - P_i|, \qquad (7) \]

\[ \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{O_i - P_i}{O_i} \right| \times 100\%, \qquad (8) \]

\[ \mathrm{maxAPE} = \max_i \left| \frac{O_i - P_i}{O_i} \right| \times 100\%, \qquad (9) \]

\[ \mathrm{FB} = \frac{2 \sum_{i=1}^{n} (O_i - P_i)}{\sum_{i=1}^{n} (O_i + P_i)}, \qquad (10) \]

\[ \mathrm{PBIAS} = \frac{\sum_{i=1}^{n} (P_i - O_i)}{\sum_{i=1}^{n} O_i}. \qquad (11) \]

Moreover, the overall performance of the models was assessed by combining the prediction results of all stations.
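The six indices in Eqs. (6) to (11) translate directly into code. The sketch below is a plain-Python illustration with a hypothetical function name and made-up example values; it follows the sign conventions of the equations, so PBIAS is a fraction rather than a percentage.

```python
import math


def forecast_metrics(observed, predicted):
    """Compute RMSE, MAE, MAPE, maxAPE, FB and PBIAS as in Eqs. (6)-(11)."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    # Absolute percentage error of each forecast, in percent
    ape = [abs(e / o) * 100.0 for e, o in zip(errors, observed)]
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": sum(ape) / n,
        "maxAPE": max(ape),
        "FB": 2.0 * sum(errors) / sum(o + p for o, p in zip(observed, predicted)),
        "PBIAS": sum(p - o for o, p in zip(observed, predicted)) / sum(observed),
    }


# Made-up values: the first forecast is 5 units too low, the second is exact
metrics = forecast_metrics([50.0, 40.0], [45.0, 40.0])
print(metrics)
```

In this example the forecasts underestimate on average, so FB is positive and PBIAS is negative, matching the interpretation given in the text.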

(5)

Results and discussion

Summary statistics of PM10 concentrations

Figure 1 and Table 2 display the boxplot and descriptive statistics, respectively, of PM10 concentrations in the modeling datasets at the five stations.

From Fig. 1, it can be seen that the distributions of PM10 concentrations at all stations are skewed to the right, indicating that there were several days with extreme PM10 concentrations. From Table 2, station CA16 had the highest number of days on which PM10 concentrations surpassed the standard limit of 150 μg m⁻³ set by MAAQG, whereas the stations in suburban areas had relatively fewer extreme days, as expected. In general, the two stations located in the Klang Valley area, CA16 and CA58, experienced high PM10 concentrations (> 100 μg m⁻³) more frequently. All stations except CA17 have quite consistent means, ranging from 49.85 to 55.86 μg m⁻³, with standard deviations from 23.38 to 31.87 μg m⁻³. It is surprising that CA17, which is classified as a suburban station, has the highest mean and median values. This may be due to the fact that it is situated near industrial areas. Despite having the highest central tendency values, CA17 shows the smallest standard deviation, suggesting that this location constantly experienced rather high concentrations. This is further supported by the exceptionally high percentage of days (76.92%) on which PM10 concentrations exceeded 50 μg m⁻³. This concentration level is also a guideline limit provided by WHO (WHO 2006). Further scrutiny shows that the maximum concentrations for the stations in the central region of Peninsular Malaysia (CA16, CA47 and CA58) occurred in March, while the stations in the northern region (CA09 and CA17) had their maxima in July. This implies that haze occurred during the northeast and southwest monsoons. This result is consistent with the findings of Shaadan et al. (2015).

Regression modeling

The modeling process began with checking the correlations between the dependent variable (lnPM10) and the predictor variables (1-previous-day meteorological and gaseous parameters) to verify the linear associations between lnPM10 and its predictors and to help decide whether to include these predictors. Table 3 shows Pearson's correlation coefficients, R, between lnPM10 and its predictors for all stations using the modeling datasets.

In general, all pollutants except NO are positively associated with lnPM10, and most of them are appreciably correlated with lnPM10 (R > 0.3). lnPM10 is positively correlated with temperature and negatively correlated with humidity at all stations. This is because high temperature usually promotes the formation of PM10 precursors as well as burning activities. In contrast, high humidity is normally linked to more rainfall, which washes out PM10. Even though NO, wind speed and the wind vectors are less strongly associated with lnPM10, much of the literature has pointed out that nitrogen oxides, which include NO, are a main cause of PM10 formation (Bhattacharjee et al. 1999), and that PM10 could be influenced by wind speed and direction (Shaharuddin et al. 2008; Liew et al. 2011). Hence, all the abovementioned predictors were considered in the variable selection. Only the variables selected by the SW and APM methods were finally included in MLR modeling.
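The lagged correlations of the kind reported in Table 3 pair today's lnPM10 with the previous day's predictor value. A minimal plain-Python sketch of that pairing is given below; the function names are hypothetical and no significance test is included.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(pm10, predictor, lag=1):
    """Correlate ln(PM10) at time t with the predictor at time t - lag."""
    ln_pm10_today = [math.log(v) for v in pm10[lag:]]
    predictor_before = predictor[:len(predictor) - lag]
    return pearson_r(predictor_before, ln_pm10_today)
```

Both input series are assumed to be daily values aligned on the same dates, so dropping the first `lag` PM10 values and the last `lag` predictor values lines up each day with its predecessor.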

After estimating the MLR1 models, the residuals were checked for compliance with the Gaussian white noise assumption before further investigation. None of the models has a multicollinearity problem, as the VIF values are all below 10. The normal Q-Q plot was used to examine the normality of the residuals. In addition, Fig. 2(a)–(e) shows the plots of residuals against fitted values to evaluate the constant variance assumption for all stations.

Fig. 1 Boxplot of PM10 concentrations in modeling datasets for all stations

From the normal Q-Q plots (not displayed), the residuals lie around the straight line, except for some slight deviations at the two tails, supporting the normality of residuals for all stations. The residuals at all stations also show constant variance, as no specific patterns were observed in the plots in Fig. 2. However, the Breusch-Godfrey Lagrange multiplier (LM) tests at lag 1 show that the residuals are serially correlated. The test statistics (LM(1)) are significant at the 5% level, as shown in Table 4, rejecting the null hypothesis of independent residuals. This violation of the assumption may cause incorrect coefficient and standard error estimates, leading to wrong inference. Hence, two modifications were made to remedy this situation. One was to include previous days' PM10 concentrations in the regression models. The other was to relax the uncorrelated residual assumption and to treat the residuals as a time series, i.e., an ARMA process, which gives the RTSE model. Similar residual diagnostics (not shown here) show that the residuals from MLR2 satisfy the white noise assumption. The autocorrelation of residuals in the RTSE models was checked by the Ljung-Box (LB) test with six lags. The failure to reject the null hypothesis confirms that the uncorrelated residual assumption was met. Table 4 presents the estimated coefficients for the MLR1, MLR2 and RTSE models for all stations. The asterisk (*) indicates a parameter significant at the 5% significance level.
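The Ljung-Box statistic used for the RTSE residuals has the standard form Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1, …, h, where r_k is the lag-k sample autocorrelation. A plain-Python sketch follows; the function names are hypothetical, the default of six lags mirrors the text, and the p-value (which requires a chi-square distribution with appropriately adjusted degrees of freedom) is deliberately not computed.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    num = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean)
        for t in range(lag, n)
    )
    return num / denom


def ljung_box_q(residuals, max_lag=6):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )
```

A small Q indicates that the residual autocorrelations up to the chosen lag are jointly close to zero, which is the outcome the text describes for the RTSE residuals.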

As displayed in Table 4, the regression parameters of MLR1 explain 39–58% of the PM10 variation across the stations. A higher percentage of the PM10 variation can be explained by the previous day's meteorological and pollutant variables in suburban areas than in other areas. On the other hand, the lower R²_adj for station CA16 suggests that there might be other factors governing PM10 formation. For all stations, the inclusion of previous days' PM10 concentrations greatly enhances the explained proportion of PM10 variation to over half. In MLR2 and RTSE, the number of significant variables is also reduced.

Table 2 Descriptive statistics of PM10 concentrations in modeling datasets for all stations

Statistic                        CA09     CA16     CA17     CA47     CA58
Mean, μg m⁻³                     49.85    55.57    64.71    49.88    55.86
Std. Dev., μg m⁻³                23.38    31.87    19.33    25.77    31.85
Min, μg m⁻³                      15.39    18.59    32.48    20.96    15.88
Max, μg m⁻³                      174.67   302.92   155.75   273.29   300.63
% of days exceeding 50 μg m⁻³    40.17    42.74    76.92    32.76    44.44

Table 3 Pearson's correlation coefficients between lnPM10 at time t and 1-previous-day predictors for all stations

Predictor    CA09     CA16     CA17     CA47     CA58
NO2,t-1      0.39     0.21     0.38     0.10*    0.32
NOt-1        −0.15    −0.25    −0.09*   −0.22    −0.22
SO2,t-1      0.34     0.30     0.35     0.28     0.35
COt-1        0.33     0.15     0.47     0.44     0.36
O3,t-1       0.61     0.37     0.62     0.47     0.47
Tempt-1      0.41     0.46     0.39     0.22     0.42
RHt-1        −0.44    −0.49    −0.48    −0.24    −0.36
Wx,t-1       0.08*    −0.18    −0.06*   0.08*    −0.04*
Wy,t-1       0.20     0.05*    −0.15    0.13     0.17
WSt-1        0.11     0.21     0.01*    0.04*    0.12

*Non-significant at 5% level (remaining are significant)

For MLR1, humidity, CO and O3 were important factors linked to PM10 concentrations at all stations. Furthermore, NO2, temperature, wind speed and direction (despite the lack of linear association) were also significantly related to PM10 concentrations. Overall, PM10 concentrations were positively correlated with the air pollutants and temperature, and negatively correlated with humidity and wind speed. High wind speed helps to disperse PM10 at a location, which explains the inverse relationship between them. CO, NO2 and O3 share a common source associated with fuel combustion. Therefore, it can be deduced that the PM10 concentrations at these stations were most likely related to vehicular emissions and meteorological factors.

In MLR2, humidity remains a major factor at all stations, in addition to the previous days' PM10 concentrations. Other important factors include wind speed and direction, NO and O3. Two previous days of PM10 concentrations were included in the regression for station CA47 in order to achieve independent residuals. Yet, the negative coefficient of lnPM10,t-2 is against expectation. This may be because it was highly correlated with the 1-previous-day PM10 concentration, but this would not affect the precision of forecasting (Neter et al. 1985).

In the RTSE models, at least one of the variable series included in the regression was not stationary at every station except CA09, so all the series, including the dependent and predictor variables, were differenced once before modeling at those stations, as indicated by d = 1 in the RTSE models. All RTSE models show significant autocorrelation among PM10 concentrations, since the ARMA parameters are significant at the 5% level. Similarly, humidity remains a key factor. In addition, temperature, CO and O3 were substantially correlated with the PM10 variation, but wind speed was no longer significant. This agrees with the small R between lnPM10 and wind speed. However, wind speed may still be an important factor that correlates with PM10 concentrations nonlinearly.

All in all, it can be inferred that humidity, temperature, wind speed and direction were the main meteorological factors governing PM10 concentrations, and the significance of CO and O3 suggests that PM10 concentrations in Peninsular Malaysia were due to vehicular emissions. These results are in line with the findings of Liew et al. (2011) and Ul-Saufie et al. (2013), which demonstrated that meteorological parameters explained a large proportion of PM10 variation. Moreover, our findings concur with the study of Afroz et al. (2003), which indicated that transportation emissions contributed more than 70% of air pollution during non-haze periods.

Figure 3(a)–(e) shows the comparative graphs of the forecasts by the three methods and the actual observations from 18 to 31 December 2014 for all stations. Table 5 presents the individual and collective results of the model performance for all stations.

Fig. 2 Plots of residuals from MLR1 versus fitted values for stations (a) CA09, (b) CA16, (c) CA17, (d) CA47 and (e) CA58

Table 4 Estimated coefficients for MLR1, MLR2 and RTSE models

Variables CA09 CA16 CA17 CA47 CA58

MLR1
NO2,t-1 12.636* 14.851* 11.630*
NOt-1 10.249* 7.591 −10.220
SO2,t-1 38.669* 30.462* 27.973
COt-1 0.462* 0.203* 0.429* 0.367* 0.752*
O3,t-1 24.250* 10.305* 12.099* 19.530* 10.150*
Tempt-1 0.034* 0.043 −0.100* 0.048
RHt-1 −0.011* −0.023* −0.012* −0.023* −0.008*
Wx,t-1 −0.017 −0.033*
Wy,t-1 −0.014* 0.038* 0.069*
WSt-1 −0.087* −0.061* −0.060* 0.060
Intercept 2.476* 3.883* 4.637* 8.039* 2.090*
LM(1) 25.759* 68.047* 47.318* 94.813* 31.670*
R²_adj 0.529 0.387 0.580 0.430 0.433

MLR2
lnPM10,t-1 0.546* 0.599* 0.551* 0.799* 0.680*
lnPM10,t-2 −0.129*
NO2,t-1 −11.935
NOt-1 9.431* 9.551*
SO2,t-1 25.481*
COt-1 0.105* −0.219
O3,t-1 11.412* 5.676*
Tempt-1 −0.067* 0.034
RHt-1 −0.009* −0.014* −0.007* −0.013* −0.007*
Wx,t-1 −0.010 0.014* 0.021*
Wy,t-1 0.010 0.020* 0.029*
WSt-1 −0.058* −0.035*
Intercept 2.028* 2.853* 2.388* 4.242* 1.094
LM(1) 0.334 1.021 1.090 0.241 0.016
R²_adj 0.604 0.560 0.669 0.645 0.536

RTSE
NO2,t-1 9.740* 16.888*
SO2,t-1 −48.506*
COt-1 0.184 0.232* 0.133
O3,t-1 10.748* 5.648* 4.529
Tempt-1 0.071* −0.039 0.050*
RHt-1 −0.010* −0.009 −0.008* −0.007* −0.012*
Wx,t-1 −0.023* 0.015
Wy,t-1 −0.011 0.019*
WSt-1 −0.086*
Intercept 4.211*
AR(1) 0.602* 0.507* −0.179*
AR(2) −2.990*
AR(3) −0.230*
AR(4) −0.157*
SAR(1) 0.158*
MA(1) −0.508* −0.924* −0.439*
MA(2) −0.178* −0.295*
d – 1 1 1 1
LB Q-stat 4.219 1.152 0.752 3.516 5.122

*Significant at 5% level

Fig. 3 Actual observations versus forecasts by the three models (actual, MLR1, MLR2, RTSE) from 18 to 31 December 2014 for stations (a) CA09, (b) CA16, (c) CA17, (d) CA47 and (e) CA58; horizontal axis: date (18/12 to 31/12), vertical axis: PM10 concentration, μg m⁻³

Viewed broadly in Fig. 3, all models are able to capture the overall patterns of the series. MLR1 produced forecasts with the largest variation, while RTSE yielded forecasts with the smallest fluctuation. The smaller variation of the RTSE forecasts may be attributed to the stationarity of the ARMA model, which tends to produce forecasts oscillating around the long-term mean value. Thus, MLR1 generally works well in scenarios where PM10 concentrations fluctuate a lot, particularly at stations CA16 and CA58. Nonetheless, it sometimes tends to overestimate the peaks and underestimate the troughs. RTSE, in contrast, performs quite well where PM10 concentrations vary only slightly, such as at CA09 and CA17. An additional advantage of RTSE is that it captures the dynamic patterns in a more timely manner than both MLR models. MLR2 also generated quite accurate forecasts for all stations, but they clearly lag 1 day behind due to the inclusion of the 1-previous-day PM10 concentration. The overall evaluation statistics suggest that MLR2 and RTSE are comparable in their predictive ability. The lowest overall values of RMSE, MAPE and maxAPE favor the MLR2 model, whereas the MAE, FB and PBIAS favor the RTSE model over the two MLR models. In addition, the positive FB and negative PBIAS indicate that all three models underestimated the actual concentrations in general. The RTSE model does not show any superiority over the simple MLR model. Therefore, in our case, a simple MLR model that includes the lagged PM10 is sufficient to account for the autocorrelation in PM10.

Table 5 Comparative results of forecast performance by the three methods at all stations and their overall performance

Index     Model   CA09      CA16     CA17     CA47     CA58     Overall
RMSE      MLR1    11.426    7.825    7.910    4.991    10.006   8.711
RMSE      MLR2    9.066     7.345    5.298    4.078    10.944   7.754
RMSE      RTSE    7.942     8.860    5.335    4.465    10.695   7.802
MAE       MLR1    9.013     6.472    7.134    3.382    8.681    6.936
MAE       MLR2    6.842     5.802    4.289    3.225    9.503    5.932
MAE       RTSE    5.642     7.835    4.035    4.098    8.045    5.931
MAPE      MLR1    19.640    19.342   13.098   9.377    21.782   16.648
MAPE      MLR2    14.750    17.315   7.670    9.509    24.742   14.797
MAPE      RTSE    11.845    25.973   7.192    11.747   19.172   15.186
maxAPE    MLR1    48.003    47.898   26.418   33.493   39.071   38.977
maxAPE    MLR2    39.251    44.464   16.596   22.820   46.283   33.883
maxAPE    RTSE    34.458    72.767   18.076   23.693   28.883   35.575
FB        MLR1    0.144     −0.034   −0.045   0.091    0.088    0.041
FB        MLR2    0.1043    −0.010   0.046    −0.049   −0.005   0.022
FB        RTSE    0.1041    −0.151   −0.005   −0.041   0.127    0.006
PBIAS     MLR1    −0.135    0.034    0.046    −0.087   −0.084   −0.040
PBIAS     MLR2    −0.0991   0.010    −0.045   0.051    0.005    −0.021
PBIAS     RTSE    −0.0989   0.163    0.005    0.042    −0.120   −0.006

The bold values indicate the lowest error/bias among the MLR1, MLR2 and RTSE models

Conclusion

This study used three different dynamic regression models to forecast the 1-day-ahead daily average PM10 concentrations at five selected stations in Peninsular Malaysia. When the regression equations included only the lagged predictor variables, the residuals were autocorrelated. Including the lagged PM10 concentrations or allowing the model errors to be correlated not only solved this problem but also increased the forecast precision. The outcomes revealed that, after correcting for the autocorrelated residuals, humidity, temperature, wind speed, wind direction, CO and O3 were the essential parameters associated with the variation of PM10 concentrations in Peninsular Malaysia. Moreover, the gaseous pollutants suggested traffic emissions as the main source of PM10 pollution. These results are in accordance with the previous studies of Liew et al. (2011), Ul-Saufie et al. (2013) and Afroz et al. (2003). In addition, the prediction results showed that MLR2 is equivalent to RTSE in accuracy, while MLR1 gave the least accurate forecasts.

Acknowledgements We would like to acknowledge the Malaysia Department of Environment (DOE) for providing the data for this study.

Funding information We would like to extend our gratitude to Universiti Sains Malaysia for the Fellowship Scheme and the Malaysia Ministry of Higher Education for financial assistance.

References

Afroz, R., Hassan, M. N., Ibrahim, N. A., et al. (2003). Review of air pollution and health impacts in Malaysia. Environmental Research, 92(2), 71–77. https://doi.org/10.1016/S0013-9351(02)00059-2.

Afzali, A., Rashid, M., Sabariah, B., Ramli, M., et al. (2014). PM10 pollution: its prediction and meteorological influence in Pasir Gudang, Johor. Proceedings of the IOP Conference Series: Earth and Environmental Sciences, 18.

Bhattacharjee, H., Drescher, M., Good, T., Hartley, Z., Leza, J-D., Lin, B., Moss, J., Massey, R., Nishino, T., Ryder, S., Sachs, N., Tozan, Y., Taylor, C., & Wu, D. (1999). Section 1: Introduction. In Particulate matter in New Jersey. Princeton: Princeton University.

DOE Department of Environment. (2000). A guide to Air Pollution Index (API) in Malaysia. Ministry of Science, Technology and the Environment.

DOE Department of the Environment. (2015). Malaysia Environmental Quality Report 2014. Ministry of Natural Resources and Environment Malaysia.

DOE Department of the Environment. (2017). Air quality standards: new Malaysia ambient air quality standard. https://www.doe.gov.my/portalv1/en/info-umum/english-air-quality-trend/108. Accessed 24 Feb 2017.

Hassan, N. A., Hashim, Z., Hashim, J. H., et al. (2016). Impact of climate change on air quality and public health in urban areas. Asia Pacific Journal of Public Health, 28(2S), 38S–48S. https://doi.org/10.1177/1010539515592951.

Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., Kolehmainen, M., et al. (2004). Methods for imputation of missing values in air quality data sets. Atmospheric Environment, 38(18), 2895–2907. https://doi.org/10.1016/j.atmosenv.2004.02.026.

Liew, J., Latif, M. T., Tangang, F. T., et al. (2011). Factors influencing the variations of PM10 aerosol dust in Klang Valley, Malaysia during the summer. Atmospheric Environment, 45, 4370–4378.

Liu, P.-W. G. (2009). Simulation of the daily average PM10 concentrations at Ta-Liao with Box-Jenkins time series models and multivariate analysis. Atmospheric Environment, 43(13), 2104–2113. https://doi.org/10.1016/j.atmosenv.2009.01.055.

Neter, J., Wasserman, W., & Kutner, M. H. (1985). Applied linear statistical models: regression, analysis of variance, and experimental designs (2nd ed.). Homewood, Illinois: R. D. Irwin.

Norazian, M. N., Ahmad Shukri, Y., Nor Azam, R., Al Bakri, A. M. M., et al. (2008). Estimation of missing values in air pollution data using single imputation techniques. ScienceAsia, 34(3), 341–345. https://doi.org/10.2306/scienceasia1513-1874.2008.34.341.

Shaadan, N., Jemain, A. A., Latif, M. T., Mohd Deni, S., et al. (2015). Anomaly detection and assessment of PM10 functional data at several locations in the Klang Valley, Malaysia. Atmospheric Pollution Research, 6(2), 365–375. https://doi.org/10.5094/APR.2015.040.

Shaharuddin, M., Zaharim, A., Mohd Nor, M. J., Karim, O. A., Sopian, K., et al. (2008). Application of wavelet transform on airborne suspended particulate matter and meteorological temporal variations. WSEAS Transactions on Environment and Development, 4(2), 89–98.

Siwek, K., & Osowski, S. (2012). Improving the accuracy of prediction of PM10 pollution by the wavelet transformation and an ensemble of neural predictors. Engineering Applications of Artificial Intelligence, 25(6), 1246–1258. https://doi.org/10.1016/j.engappai.2011.10.013.

Ul-Saufie, A. Z., Yahaya, A. S., Ramli, N. A., Rosaida, N., Abdul Hamid, H., et al. (2013). Future daily PM10 concentrations prediction by combining regression models and feedforward backpropagation models with principle component analysis (PCA). Atmospheric Environment, 77, 621–630. https://doi.org/10.1016/j.atmosenv.2013.05.017.

Vinceti, M., Malagoli, C., Malavolti, M., Cherubini, A., Maffeis, G., Rodolfi, R., Heck, J. E., Astolfi, G., Calzolari, E., Nicolini, F., et al. (2016). Does maternal exposure to benzene and PM10 during pregnancy increase the risk of congenital anomalies? A population-based case-control study. Science of the Total Environment, 541, 444–450. https://doi.org/10.1016/j.scitotenv.2015.09.051.

WHO World Health Organization. (2006). WHO Air Quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide: Global update 2005, Summary of risk assessment.