
Computational methods for air quality prediction

4.1 Introduction

The prediction of air quality is an important activity for a number of reasons: individuals are interested in being able to ensure safety in their living, working and leisure environments; organisations are interested in ensuring compliance with regulations; and governments are interested for all of the above reasons. Any comprehensive indication of air quality in an area is useful to aid decisions, implement actions and prevent illness [Friedman et al., 2001]. As part of an overall trend towards "green cities", efficient and accurate prediction also aids a number of other environmental initiatives.

This chapter outlines computational methods used for air quality prediction, how they work in a theoretical sense, and how they have performed in existing tasks. Prediction methods are commonly categorised into two groups: statistical models and artificial intelligence models. There are many computing techniques currently in use for air quality; the most prominent today is machine learning. Statistical models use historical data and are useful for small time-frames, while newer computational models have the advantage of being able to produce longer time-frame predictions, on account of being able to consume larger amounts of data and process them more efficiently. A generic expression of this sort of model is:

$Y_t = f(X_t), \quad (4.1)$

where $Y_t$ is the observation value at time instance $t$ and $X_t$ is a collection of environmental variables. A distinctive feature of AI models is that they use multiple variables, whereas a statistical model looks at a single variable as the input.
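To make equation 4.1 concrete, the sketch below learns a mapping $f$ from a vector of environmental variables $X_t$ to an observation $Y_t$. It is a minimal illustration only: the synthetic data, the three feature columns and the use of scikit-learn's RandomForestRegressor are assumptions for the example, not methods prescribed by this chapter.

```python
# Minimal sketch of Y_t = f(X_t): learn f from historical (X_t, Y_t) pairs.
# Features, data and model choice are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # X_t: e.g. temperature, humidity, wind speed
# Y_t: synthetic pollutant observation generated from the variables plus noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

f = RandomForestRegressor(n_estimators=100).fit(X[:400], y[:400])
print("predicted Y_t:", f.predict(X[400:405]))   # multivariate input, single output
```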

4.2 Historical context

[Samuel, 1959] is credited with first describing machine learning as "the use of statistical techniques which give computers the ability to "learn" (i.e. progressively improve performance on a specific task) with data and without being specifically programmed". It therefore follows that machine learning can be used as a method to better fine-tune air quality prediction.

[Stingone et al., 2017] utilised machine learning techniques to assess the impact of multiple air pollutants on health outcomes. A two-stage approach was used with a view to identifying correlations between air pollutant exposure profiles and children's cognitive skills. This demonstrated that machine learning could identify relevant exposure profiles in complex data related to air quality. The approach used Classification and Regression Trees (CART); while many algorithms could have been used in this instance, the R package rpart [Therneau et al., 2018] was used.

[Ren et al., 2018] utilised machine learning to determine the correlation between maternal $PM_{10}$ exposure and birth defects. Random forest (RF) and gradient boosting (GB) models were compared with logistic regression. Their performance indicated that the machine learning models performed better than logistic regression, except on training data.

[Li et al., 2018] conducted research on air pollutant concentration using a machine learning prediction method which was based on self-adaptive neuro-fuzzy weighted extreme learning machines, using data from air monitoring stations in Taiwan.

4.3 Hidden Markov Models (HMM)

Hidden Markov Models offer a richer mathematical structure compared to typical non-linear regression and Artificial Neural Networks. Used for mathematical functional mapping between hidden patterns and observations, they have become increasingly popular and effective in other artificial intelligence tasks such as speech recognition and handwritten character recognition. Late in the 1980s they began to be used increasingly for other purposes such as biological analysis, machine translation, gene prediction, activity recognition and protein folding [Sun et al., 2013].

A hidden Markov model (HMM) is a simple dynamic type of Bayesian network intended for modelling a stochastic process in optimal nonlinear filtering problems [Baum and Petrie, 1966], [Baum and Eagon, 1967], [Baum and Sell, 1968]. The model generally uses a probabilistic framework to model a time series of multivariate observations. The inputs of the HMM include the sequence $X \in \mathbb{R}^T$, the transition probabilities and the initial probabilities [Song, 2014].

Given data $X = x_1, x_2, \ldots, x_t, \ldots, x_T$, each $x_t$ is generated by a hidden state $s_t$. The states follow a Markov chain in which the future state, given the present, is independent of the past. The HMM can be expressed as:

$P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t). \quad (4.2)$

With the transition probabilities given as

$a_{k,i} = P(s_{t+1} = i \mid s_t = k), \quad (4.3)$

where $k, i = 1, 2, \ldots, M$ and $M$ is the total number of states. The initial probabilities of the states can be represented as $\pi_k$, with


$\sum_{i=1}^{M} a_{k,i} = 1 \ \text{for any } k, \qquad \sum_{k=1}^{M} \pi_k = 1. \quad (4.4)$

The probability of a state sequence can then be written in terms of the transition probabilities as

$P(s_1, s_2, \ldots, s_T) = P(s_1) P(s_2 \mid s_1) P(s_3 \mid s_2) \cdots P(s_T \mid s_{T-1}) = \pi_{s_1} a_{s_1, s_2} \cdots a_{s_{T-1}, s_T}. \quad (4.5)$

The transition probabilities can be estimated from two conditional probabilities, $L_k(t)$ and $H_{k,i}(t)$. Firstly, for state $k$ at position $t$, given the observed sequence $X$, let

$L_k(t) = P(s_t = k \mid X) = \sum_{s} P(s \mid X)\, I(s_t = k), \quad (4.6)$

then, for state $k$ at position $t$ and state $i$ at position $t+1$, given $X$,

$H_{k,i}(t) = P(s_t = k, s_{t+1} = i \mid X) = \sum_{s} P(s \mid X)\, I(s_t = k)\, I(s_{t+1} = i). \quad (4.7)$

Finally, the state at $X_{T+1}$ can be predicted from the estimated transition probabilities and the estimated state probabilities (instead of the initial probabilities). A model can then be trained, and $Y$ can be calculated from the state at $X_{T+1}$.
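As an illustration of equations 4.3-4.5, the sketch below estimates a transition matrix and state probabilities by counting over a known state sequence and then predicts the most probable next state. The state sequence is hypothetical, and in a real HMM the hidden states would themselves be inferred (e.g. by Baum-Welch training) rather than observed directly.

```python
# Counting-based estimate of HMM transition probabilities (eqs. 4.3-4.4),
# assuming a known/decoded state sequence; real training would infer states.
import numpy as np

states = np.array([0, 0, 1, 2, 1, 0, 1, 2, 2, 1, 0, 1])  # hypothetical sequence
M = 3  # total number of states

A = np.zeros((M, M))                        # A[k, i] = P(s_{t+1} = i | s_t = k)
for k, i in zip(states[:-1], states[1:]):
    A[k, i] += 1
A /= A.sum(axis=1, keepdims=True)           # each row sums to 1 (eq. 4.4)

pi = np.bincount(states, minlength=M) / len(states)  # estimated state probabilities

next_state = A[states[-1]].argmax()         # most probable next state (eq. 4.3)
print(A, pi, next_state)
```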

Figure 4.1: Hidden Markov Model

Figure 4.1 depicts how one state transitions to the next. Initially a hidden state distribution is initialised, at time $t_1$. Then the hidden state transfers from the initial state to the next state according to a state-transition probability matrix ($A$). Each state emits observations according to an emission probability ($B$), creating an observation sequence. The sequence ends at time $t_l$, where $l$ corresponds to the length of the observation sequence.

HMMs have been employed in many air pollution studies given their ability to produce the probabilities of future states. Research by [Couvreur et al., 1998] indicates that approaches using HMMs show good performance when tasked with the prediction of emission events, such as those from cars, trucks, mopeds, aircraft and trains, taken from the MADRAS database [Song, 2014]. An HMM prediction study by [Dong et al., 2009] was able to predict $PM_{2.5}$ for a Chicago airport using mass spectrometer data with good success, as was a study by [Senawongse et al., 2005] that predicted phosphorylation in biological samples.

Figure 4.2: Probabilistic parameters of a hidden Markov model (example)

Figure 4.2 shows an example Hidden Markov Model where $X$ are the states, $y$ are the possible observations, $a$ are the state transition probabilities and $b$ are the output probabilities.

[Marin, 2006] discusses the main benefits of HMMs for data mining as being their solid stochastic-theoretical foundation, which underlies the known algorithms, and their efficiency. They are essential for bioinformatic uses and have useful applications in a number of other domains, such as matching sequences whose elements differ through insertions and deletions, with evaluation of the associated penalty. HMMs can be unsupervised and can work with sequences of varying lengths, or wherever alignment, data mining, classification, structural analysis or pattern recognition is required.

There are two specific limitations with Hidden Markov Models:

1. The number of parameters grows as the number of states, or the alphabet, grows. This results in reduced performance, and in accuracy being reduced because of local maxima/minima.

2. The emission and transition functions are independent, and both depend on the hidden state. The example given by [Marin, 2006] is of a correlation in which the emission of the symbol $X$ in the state $S_i$ is often followed by the emission of the symbol $Y$ in the state $S_j$, and in which $X'$ in the state $S_i$ is often followed by $Y'$ in the state $S_j$. This results in information being lost.


4.4 Neural Networks (NN)

There are numerous types of Neural Networks, as depicted in Figure ??. In its basic form a Neural Network operates by using neurons, which are given inputs that are multiplied by weights.

The development of a new, or the augmentation of an existing, type of Neural Network is not within the scope of this research, but Neural Networks are important here because they are the method against which the correlational aspect needs to be validated to determine whether accuracy or performance has been improved or degraded.

Neural Networks came about due to early work in the area by [McCulloch and Pitts, 1943] and are models whose layout functions in the way a biological neural network (the brain) works. Processing units (neurons) are connected with a number of weighted links in which "knowledge" is stored and through which information is passed. The performance and accuracy of these models is superior to statistical methods alone, according to work done in the air quality domain by [Grivas and Chaloulakou, 2006], [Elangasinghe et al., 2014] and [Rahman et al., 2015].

Neural Networks are a non-linear type of model, based on a statistical approach, and have seen success when tasked with capturing some of the more complex relationships that can be found between a given input and output, or with patterns in a dataset [Mitchell, 1999].

These Neural Network models are able to combine multiple training algorithms to detect complex, non-linear relationships between dependent and independent variables and all possible interactions between predictor variables. Neural Network models may also have multiple solutions associated with local minima and thus may not function as well over different sample sets [Song, 2014].

Figure 4.3: Simple Neural Network

Fig 4.3 demonstrates a simple neural network example with groups of cross-connected nodes. Time series prediction using an ANN seeks to compute $\hat{x}(t + \Delta t)$ by analysing $x$ within the period $t$. Suppose an ANN has $n$ composition functions. The ANN function $f(x)$ is defined in terms of other composition functions $g_i(x)$, so that $f(x)$ is built from $(g_1(x), g_2(x), \ldots, g_n(x))$. The most commonly used type of composition is the nonlinear weighted sum shown in equation 4.8:

$f(x) = K\left(\sum_{i} w_i g_i(x)\right), \quad (4.8)$

where $K$ is some predefined function, such as the hyperbolic tangent, and $w_i$ are the weights. It is convenient to refer to the collection of functions $g_i$ as simply a vector $g = (g_1, g_2, \ldots, g_n)$. Learning consists of giving the network a specific task to solve and a class of functions $F$, then using a set of observations to find $f^* \in F$ which solves the task in some optimal sense.
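As a concrete reading of equation 4.8, the fragment below evaluates $f(x)$ as a nonlinear weighted sum with $K = \tanh$ and the component functions $g_i$ taken as simple coordinate projections; both choices, and the weight values, are assumptions made purely for illustration.

```python
# f(x) = K( sum_i w_i * g_i(x) ), eq. 4.8, with K = tanh and g_i(x) = x[i].
# K, the g_i and the weights are illustrative assumptions.
import numpy as np

def f(x, w, K=np.tanh):
    g = [x[i] for i in range(len(w))]                   # g_i(x): projections
    return K(sum(w_i * g_i for w_i, g_i in zip(w, g)))  # nonlinear weighted sum

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.8, 0.1, -0.3])   # hypothetical weights
print(f(x, w))                   # single-neuron output
```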

The cost function $C: F \to \mathbb{R}$ is an important concept in learning, as it measures how far the solution in question is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space for a function that has the smallest possible cost. For the optimal solution $f^*$, $C(f^*) \le C(f)$ for all $f \in F$; no solution has a cost less than the cost of the optimal solution.

When used in applications where the solution depends on some observed data, the cost must by necessity be a function of the observations; otherwise, nothing related to the data could be modelled. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model $f$ which minimises $C = E[(f(x) - y)^2]$ for data pairs $(x, y)$ drawn from some distribution $D$. In practical situations we would only have $N$ samples from $D$ and thus, for the above example, would only minimise

$\hat{C} = \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)^2.$

Thus, the cost is minimised over a sample of the data rather than the entire distribution.
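The sketch below minimises the empirical cost $\hat{C}$ just described for a simple linear model class by gradient descent over $N$ samples; the synthetic data, the model class and the learning rate are assumptions chosen only to make the idea executable.

```python
# Minimise C_hat = (1/N) * sum_i (f(x_i) - y_i)^2 over the linear class
# f(x) = w * x by gradient descent; data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)   # N samples from distribution D

w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2.0 * np.mean((w * x - y) * x)   # gradient of C_hat with respect to w
    w -= lr * grad

print("estimated w:", w)   # approaches 3.0, the minimiser of C_hat on the sample
```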

Air quality prediction has seen a rise in the number of NN-based solutions in recent years, building on early work that showed promise, such as:

• A Multi-Layer Perceptron (MLP) model that predicted hourly $NO_x$ and $NO_2$ concentrations in London [Gardner and Dorling, 1999],

• A neural network that predicted hourly ground-level $O_3$ concentrations using $CH_4$, $NMHC$, $CO$, $CO_2$, $NO$, $NO_2$, $SO_2$ and $O_3$ [Abdul-Wahab and Al-Alawi, 2002],

• A neural network that predicted $SO_2$ [Chelani et al., 2002].

4.4.1 Perceptrons/Multi-layer Perceptrons

The perceptron [Rosenblatt, 1958] was the first type of neural network and set the groundwork for further research into a number of applications, including the prediction of air quality.

The original perceptron began with a single neuron and $n$ inputs. Each input was multiplied by a weight, and the weighted inputs were summed to produce one output.

This developed into the Multi-Layer Perceptron, which added complexity by implementing three layers:

1. Input layer - takes numerical input; data that is not originally numeric must first be converted.

2. Hidden layer(s) - contain neurons which manipulate the input so that weights and biases can be applied. The term "hidden" refers to the manner in which these layers function: unlike the inputs and outputs, they are not worked with directly by someone developing the model.

3. Output layer - produces the final result of the network, which may represent different things depending on how the network is configured.

Multi-Layer Perceptrons utilise non-linear activation functions such as the hyperbolic tangent or logistic functions.
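To illustrate the three-layer structure and non-linear activation just described, the sketch below runs a forward pass through a tiny MLP in NumPy; the layer sizes and random weights are assumptions for demonstration, not a trained model.

```python
# Forward pass through a minimal Multi-Layer Perceptron: input -> hidden -> output.
# Layer sizes and random weights are illustrative; no training is performed.
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 3 inputs -> 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer: 4 neurons -> 1 output

def mlp(x):
    h = np.tanh(W1 @ x + b1)   # hidden layer with non-linear activation
    return W2 @ h + b2         # output layer

x = np.array([0.2, -0.7, 1.5])   # numerical input (non-numeric data would be encoded)
print(mlp(x))
```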


4.5 Convolutional Neural Network (CNN)

Convolutional Neural Networks vary the functionality of a Multi-Layer Perceptron by adding one or more convolutional layers, which can be interconnected or pooled. Operations are performed on the input data and the results are passed to the next layer. Commonly used in image and video recognition, natural language processing, recommender systems, semantic parsing, paraphrase detection, signal processing and classification, they often yield good results.

[Kowalski et al., 2020] describe studies using Convolutional Neural Networks for air quality, specifically $PM_{10}$. Their research shows superiority over multi-layer perceptrons, although more modern methods give better results.
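As a rough illustration of how a convolutional layer can be applied to air quality data, the toy PyTorch model below convolves over a window of a $PM_{10}$-style time series; the architecture and the synthetic window are assumptions and not the networks studied by [Kowalski et al., 2020].

```python
# Toy 1-D CNN over a pollutant time-series window; purely illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3),  # convolutional layer
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),   # pool the feature maps
    nn.Flatten(),
    nn.Linear(8, 1),           # predict the next concentration value
)

window = torch.randn(1, 1, 24)   # e.g. 24 hourly readings (synthetic)
print(model(window))             # single predicted value
```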

4.6 Recurrent Neural Networks (RNN)/Long Short-Term Memory (LSTM)

A Recurrent Neural Network operates by passing the outputs of a particular layer back in as an input in order to predict the outcome. The first layer behaves like a feed-forward network, producing the product of the sum of weights and features; from subsequent layers onwards, the recurrent process begins. Each time-step retains some information from its previous step and thus acts like a memory.

The memory function acts by storing the last output in a state matrix; otherwise the network is like a normal Multi-Layer Perceptron. The use of previous outputs and states makes it useful in tasks like financial market prediction and other time-series types of data.

4.6.1 Long Short-Term Memory Neural Networks (LSTM)

Long Short-Term Memory networks build on the concept of state matrices by having two states - long term and short term. If one of the states persists in the model, it moves to long term and is weighted more highly when used with new data. The LSTM variant is good at finding patterns in continuous data sources.

[Livieris et al., 2020] describes an LSTM as having three gates attached to a memory cell ($C_t$) - input, forget and output. For each time step ($t$):

1. the input gate ($i_t$) determines which information is added to the cell state ($C_t$) memory,

2. the forget gate ($f_t$) decides which information is discarded, via the transformation function in the layer,

3. the output gate ($o_t$) decides which cell information will be the output.

This allows the LSTM to capture both long- and short-term correlation features within a time series.

In their review of RNN and LSTM cells and architectures, [Yu et al., 2019] explain that the simplest way to increase the depth and capacity of an LSTM network is to stack layers together so that the output of the $(L-1)$th layer at time $t$ is treated as the input of the $L$th layer.

The input-output connections are the only connections between the LSTM layers of the network. The structure of a stacked LSTM is as follows. Let $h_t^L$ and $h_t^{L-1}$ be the outputs of the $L$th and $(L-1)$th layers, with each layer $L$ producing a hidden state $h_t^L$ based on the current output of the previous layer, $h_t^{L-1}$, and its own previous output, $h_{t-1}^L$. The forget gate $f_t^L$ of layer $L$ calculates what is kept of the cell state $c_{t-1}^L$, as in:

$f_t^L = \sigma(W_f^L [h_{t-1}^L, h_t^{L-1}] + b_f^L), \quad (4.9)$

where $\sigma(\cdot)$ is a sigmoid function, and $W_f^L$ and $b_f^L$ are the weight matrix and bias vector of the forget gate in layer $L$. The input gate $i_t^L$ of layer $L$ computes the values to be added to the memory cell $c_t^L$ by

$i_t^L = \sigma(W_i^L [h_{t-1}^L, h_t^{L-1}] + b_i^L), \quad (4.10)$

where $W_i^L$ is the weight matrix of the input gate in layer $L$. The output gate $o_t^L$ of the $L$th layer filters the information and calculates the output value by:

$o_t^L = \sigma(W_o^L [h_{t-1}^L, h_t^{L-1}] + b_o^L), \quad (4.11)$

where $W_o^L$ and $b_o^L$ are the weight matrix and bias vector of the output gate in layer $L$, respectively. Finally, the output of the memory cell is computed by:

$h_t^L = o_t^L \cdot \tanh(c_t^L), \quad (4.12)$

where $\cdot$ denotes point-wise vector multiplication and $\tanh$ the hyperbolic tangent function, with

$c_t^L = f_t^L \cdot c_{t-1}^L + i_t^L \cdot \tilde{c}_t^L, \quad (4.13)$

and:

$\tilde{c}_t^L = \tanh(W_c^L [h_{t-1}^L, h_t^{L-1}] + b_c^L). \quad (4.14)$
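A single step of the gate equations 4.9-4.14 can be written directly in NumPy, as sketched below; the dimensions and random parameters are assumptions, and the concatenation $[h_{t-1}^L, h_t^{L-1}]$ appears as `z`.

```python
# One step of an LSTM cell implementing eqs. 4.9-4.14; sizes and parameters
# are illustrative. z corresponds to the concatenation [h_{t-1}^L, h_t^{L-1}].
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate, eq. 4.9
    i = sigmoid(W_i @ z + b_i)        # input gate, eq. 4.10
    o = sigmoid(W_o @ z + b_o)        # output gate, eq. 4.11
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate cell state, eq. 4.14
    c = f * c_prev + i * c_tilde      # cell state update, eq. 4.13
    h = o * np.tanh(c)                # hidden output, eq. 4.12
    return h, c

rng = np.random.default_rng(3)
n_h, n_x = 4, 3   # hidden and input sizes (arbitrary)
params = [rng.normal(size=(n_h, n_h + n_x)) if k % 2 == 0 else np.zeros(n_h)
          for k in range(8)]          # W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c
h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), *params)
print(h, c)
```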

4.6.2 Bi-directional Recurrent Neural Networks

Similar to LSTMs, but contrasting in that they have two hidden layers connected to the input and output, BRNNs were devised by [Schuster and Paliwal, 1997] and subsequently developed by [Graves and Schmidhuber, 2005] as a variant intended to overcome certain limitations evident in a standard RNN. Conceptually, the state of the neurons is split between the positive time direction (i.e. forward states) and the negative time direction (i.e. backward states). Outputs from forward states have no connection with the inputs of backward states. A BRNN can be visualised as in Figure 4.4.


Figure 4.4: BRNN internal connections [Yu et al., 2019]
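In practice, deep learning frameworks expose bidirectionality as a configuration flag rather than requiring the two state directions to be wired manually; the PyTorch sketch below is a minimal example, with all sizes chosen arbitrarily for illustration.

```python
# Minimal bidirectional recurrent network (an LSTM here) in PyTorch;
# the input/hidden sizes and random input are illustrative.
import torch
import torch.nn as nn

brnn = nn.LSTM(input_size=3, hidden_size=8, bidirectional=True, batch_first=True)

x = torch.randn(1, 24, 3)   # (batch, time steps, features)
out, _ = brnn(x)
print(out.shape)            # (1, 24, 16): forward and backward states concatenated
```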

4.7 Reinforcement Learning

As described by [Chang et al., 2019], Reinforcement Learning utilises self-adjustment and optimisation driven by feedback from the environment. As such, its architecture is based on automated agents, goal-oriented learning and decision making. The agent interacts only with the environment, over which it does not exercise control. In each step of the calculation ($t$), the agent receives the environment state. Reinforcement learning has applicability in many domains other than computer science, such as engineering, neuroscience, mathematics, psychology and economics [Van Hasselt et al., 2016].

[Zhong et al., 2020] discuss the need to infer fine-grained urban air quality from existing monitoring stations and identify the challenge of effectively selecting the most relevant stations for air quality inference. Their solution is a Reinforcement Learning model which comprises:

• Station selector - dynamic selection of most relevant monitoring stations for inference,

• Air quality regressor - inference from the selected stations through a neural network.

Experimentation with air quality data from the Microsoft Beijing dataset, further augmented with road data, shows that Reinforcement Learning deployed as in their research gave superior results when RMSE and accuracy were compared against Linear, Gaussian, kNN and ADAIN methods.

[Hammoudeh, 2018] outlines the fundamental components of a Reinforcement learning system:

1. Policy (π) - defines the manner in which the agent works in the environment, and ultimately the goal is to find the optimal policy as part of the reinforcement learning process.

2. Reward Signal (R) - a gauge of how good or bad an event is; it defines the goal of the problem for the agent as maximising the total reward. As a factor for policy updates, rewards can be immediate or delayed.

3. Value Function - a prediction of the total future rewards by evaluating states and selecting the appropriate actions.

As shown in Equation 4.15, when starting from state $s$, the expected return $G_t$ defines the state value function $V(s)$:

$V(s) = E[G_t \mid S_t = s] \quad (4.15)$
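Equation 4.15 can be estimated empirically by averaging the discounted returns observed from each state; the sketch below does so for a hypothetical episode log, with the episodes, rewards and discount factor invented for illustration.

```python
# Monte Carlo estimate of V(s) = E[G_t | S_t = s] (eq. 4.15) from episode logs.
# Episodes, rewards and the discount factor are hypothetical.
from collections import defaultdict

gamma = 0.9
episodes = [  # each episode: list of (state, reward) pairs
    [("A", 0.0), ("B", 1.0), ("C", 5.0)],
    [("A", 0.0), ("C", 5.0)],
]

returns = defaultdict(list)
for episode in episodes:
    G = 0.0
    for state, reward in reversed(episode):  # accumulate discounted return G_t
        G = reward + gamma * G
        returns[state].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # average observed return from each state
```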

4.8 Support Vector Machines (SVM)

Support Vector Machines (SVM), also known as 'Support Vector Networks', analyse data and identify patterns, and are used for classification and regression analysis. They are a supervised learning method: using a set of data, they categorise each input into one of two possible types. Having been shown to achieve better performance and accuracy, support vector machines are a strong option for time-series data analysis when it comes to determining the impact of pollutants on air quality [Vapnik, 1999], [Drucker et al., 1997].

Figure 4.5: SVM algorithms - (a) classification, (b) regression [Liang et al., 2020]

Figure 4.5 shows classification and regression for SVMs. [Liang et al., 2020] explain that (a) shows classification, where the data points that lie at the edge of an area closest to the hyperplanes are considered the support vectors, and the margin is the space between the two regions. The hyperplanes determine the number of classes present in the dataset, and the output for unseen data is predicted according to which class holds the most similarity with the new data. (b) illustrates regression: a hyperplane approximating a non-linear function is constructed at the maximal margin, with linear regression. The $\varepsilon$-insensitive parameter additionally tolerates some deviations that lie inside the $\varepsilon$ region tube.

[Bai et al., 2016] describe the steps to build an atmospheric pollutant concentration forecast model using a support vector machine (a minimal sketch follows the list):

1. Build an effective forecast factor,

2. Select the kernel function and parameter values,

3. Train on the samples to provide the SVM forecast model with optimised parameters, obtain the support vectors, and then determine the structure of the SVM,

4. Train the support vector predictor to forecast the test samples.
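The four steps above map naturally onto a library workflow. The sketch below follows them using scikit-learn's SVR with an RBF kernel; the synthetic series, the lagged-value forecast factors and the parameter values are assumptions standing in for real pollutant concentrations.

```python
# Sketch of the SVM forecasting steps of Bai et al. (2016) with scikit-learn's
# SVR; the synthetic data and chosen parameters are illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
series = np.sin(np.linspace(0, 20, 300)) + rng.normal(scale=0.1, size=300)

# Step 1: build a forecast factor - here, three lagged values per sample
X = np.column_stack([series[i:i - 3] for i in range(3)])
y = series[3:]

# Step 2: select the kernel function and parameter values
model = SVR(kernel="rbf", C=10.0, epsilon=0.01)

# Step 3: train on the samples, obtaining the support vectors / SVM structure
model.fit(X[:250], y[:250])

# Step 4: use the trained predictor to forecast the test samples
print(model.predict(X[250:255]))
```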