Air quality is a complex subject. A wide variety of factors affect it, and there is therefore an equally wide array of ways to measure them. It has been shown that, when forecasting or predicting air quality, the addition of variables such as meteorological and geographic information increases accuracy. There is room for further work using newer artificial intelligence methods such as deep learning and deep neural networks. What also remains to be explored is how these factors relate to each other and the significance of the inter-relationships between them.
Where work has been done in the past, it has been limited to a few common air pollutants such as CO2, NO2 and particulate matter. Air quality is a complex process, and there are many more measurements that can potentially affect a forecast.
There is a substantive gap in the current literature around the correlations that may exist between air pollutants and other impacting factors when less common parameters are taken into account. This gap relates to correlative study rather than prediction. It is, of course, known that certain atmospheric chemistry reactions take place and thereby affect the concentrations and dispersion of other compounds in the air, but what is clearly missing is an exploration, and an analytical architecture, to quantify and qualify a greater number of less commonly measured parameters.
Correlation analysis has shown its usefulness in other fields, for example finance and the stock market, but when we turn our attention to air quality, the data exists yet has not been explored to the extent that is possible.
5.1 Existing Research on Correlation
In research by [Levy et al., 2013] evaluating multi-pollutant exposure for urban air quality, air pollutant concentrations and their inter-relationships were examined at the intra-urban scale to obtain insight into the nature of the urban mixture of air pollutants. They took measurements of 23 air pollutant parameters in several Canadian cities in 2009 and observed variability in pollution levels, and in the statistical correlations between different pollutants, according to season and neighbourhood. Nitrogen oxide species (nitric oxide, NO2, nitrogen oxides, and total oxidised nitrogen species) had the highest overall spatial correlations in the dataset. Ultra-fine particles and hydrocarbon-like organic aerosol concentration, a derived measure used as a specific indicator of traffic particles, also had very high correlations.
The findings indicated that the multi-pollutant mix varies considerably throughout the city, both in time and in space, and that no single pollutant would have been a perfect measure of the overall mix under all circumstances. Nevertheless, based on the overall average spatial correlations between the pollutants measured, nitrogen oxide would appear to be the best available indicator of the spatial variation of exposure to outdoor urban air pollutants.
Figure 5.1 shows Pearson correlation coefficients for pairs of pollutants in the [Levy et al., 2013] study, for all measurement days combined (A; 34 measurement days) and for measurement days in the autumn (B; 6 days), summer (C; 17 days), and winter (D; 11 days). Non-significant correlations (p > 0.05) are indicated by a black dot, and the magnitude of each correlation is indicated on the colour bar to the right. The study also reports ratios of mean pollutant levels measured in the summer and winter compared with mean values over all measurement days combined, and ratios of mean pollutant levels in the winter compared with the summer.
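This kind of pairwise analysis can be sketched in a few lines of Python. The pollutant names, units and synthetic values below are illustrative assumptions, not the Levy et al. (2013) data:

```python
# Pairwise Pearson correlations with significance flags, mirroring the kind
# of analysis behind Figure 5.1. Data here are synthetic and illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days = 34                                   # as in panel A of Figure 5.1
no = rng.normal(30, 8, n_days)                # nitric oxide (illustrative)
no2 = 0.8 * no + rng.normal(0, 3, n_days)     # strongly tied to NO
o3 = rng.normal(40, 10, n_days)               # largely independent

pollutants = {"NO": no, "NO2": no2, "O3": o3}
names = list(pollutants)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r, p = stats.pearsonr(pollutants[a], pollutants[b])
        flag = "" if p <= 0.05 else "  (not significant, p > 0.05)"
        print(f"{a} vs {b}: r = {r:+.2f}, p = {p:.3f}{flag}")
```

The full correlation matrix of such pairwise coefficients, split by season, is exactly what each panel of Figure 5.1 visualises.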
With increasing public interest in air quality, and the greater availability of large amounts of environmental data made possible by the lower cost of sensor equipment, there are now vast amounts of data available for analysis. The data collected often have many parameters and many measurements, along with potentially varying degrees of inter-dependency. Due to this increasing complexity and size, traditional methods are no longer suitable for analysis, and new, more efficient methods need to be devised; data mining is the technique of choice. Data mining is a computing technique commonly used in the analysis of vast datasets to find patterns, extract specific information or make predictions.
Data mining algorithms can be broken down into the following categories, according to recent research on the use of data mining in air quality epidemiology by [Bellinger et al., 2017]:
• Prediction
• Knowledge Discovery
Within these, there are four sub-categories:
• Classification and regression
• Clustering
• Association rule mining
• Outlier/Anomaly Detection
In addition, other types of data analysis, such as spatial data mining and graph data mining, are starting to come to light as research continues.
Source apportionment, as mentioned in our exploration of methods for accurate air quality impact assessment, commonly relies on data mining techniques.
Table 5.1 summarises a group of source apportionment studies that used data mining techniques. The studies focus on the impact of chemical emissions and other airborne agents in conjunction with climatological factors [Chen et al., 2010], [Chen et al., 2015], [Singh et al., 2013]. The focus is on the allocation of particular airborne pollutants to their potential sources, such as industrial sites, regions and major intersections. These studies mainly concerned outdoor and urban air pollution, as it is the most widely known issue.
Figure 5.1: Pearson correlation of pollutant parameters across seasons, Levy et al. (2013)
In particular, principal component analysis (PCA) has been applied to identify correlations and the importance of particular meteorological parameters, traffic, fuel-fired equipment and industries in causing air pollution [Chen et al., 2010], [Chen et al., 2015], [Singh et al., 2013]. Alternative approaches have combined clustering-based solutions with correlation analysis to accomplish the task of source apportionment [Chen et al., 2010].
Author | Year | Environment | Pollutants/agents | Technique(s) | Objective
[Chen et al., 2010] | 2010 | Outdoor air pollution | Inorganic acids & basic air pollutants | Hierarchical clustering | Explore the relationship between climate and air pollutants
[Singh et al., 2013] | 2013 | Outdoor air pollution | AQI | PCA, SVM, DT | Predict air quality and identify air pollution sources
[Fernández-Camacho et al., 2015] | 2015 | Urban air and noise pollution by traffic | NOx, O3, SO2, black carbon | Fuzzy clustering | Find the relationship of noise to traffic emissions
[Chen et al., 2015] | 2015 | Outdoor air pollution | Multiple air pollutants | Clustering | Source apportionment for air pollutants
[Li et al., 2018] | 2017 | Outdoor air pollution | PM | Trajectory clustering | Use clustering to understand how seasonality and meteorology affect pollution sources for Beijing

Table 5.1: Summary of air pollution source apportionment studies using data mining techniques [Bellinger et al., 2017]
Table 5.2 summarises studies that have used data mining to understand the relationship between air pollutants and health conditions.
Author | Year | Environment | Pollutants/agents | Technique(s) | Objective
[Chen et al., 2010] | 2010 | Outdoor air pollution | Inorganic acids & basic air pollutants | Hierarchical clustering | Explore the relationship between climate and air pollutants
[Zhu et al., 2012] | 2012 | Urban outdoor air pollution | SO2, NO2, PM10, respiratory diseases | ARM, GMDH | Forecast the number of respiratory patients based on the seasonal effects of air pollution
[Pandey et al., 2013] | 2013 | Outdoor air pollution | UFP, PM | DT, RF | Test machine learning classifiers for predicting air quality and assess the impact of weather- and traffic-related variables on UFP and PM
[Payus et al., 2013] | 2013 | Outdoor air pollution | SO2, NO2, PM10, CO, O3 | ARM | Find associations between combinations of air pollutants and respiratory illness
[Bobb et al., 2014] | 2014 | Mixture of chemicals | Multiple chemicals, neurodevelopment, hemodynamics | Bayesian kernel machine regression (BKMR) | Identify mixtures (e.g., metals) and components responsible for various health effects (e.g., neurodevelopment)
[Gass et al., 2014] | 2014 | Outdoor air pollution | CO, NO2, O3, PM | Classification and regression trees | Apply classification and regression trees to generate hypotheses about exposure to mixtures of pollutants and health effects, working with children's asthma emergency visits
[Fernández-Camacho et al., 2015] | 2015 | Urban air and noise pollution by traffic | NOx, O3, SO2, black carbon | Fuzzy clustering | Find the relationship of noise to traffic emissions
[Bell and Edwards, 2015] | 2015 | General chemical exposure | 219 chemicals | ARM | Find relationships between chemicals and health bio-markers or diseases
[Qin et al., 2015] | 2015 | Outdoor air pollution | PM | ARM | Explore relationships of PM spatial-temporal variations and how cities influence each other
[Reid et al., 2016] | 2016 | Outdoor air quality with wildfire | PM2.5, respiratory diseases | Generalised estimating equation and generalised boosting model | Find how the wildfire-associated increment in PM2.5 affects people with respiratory diseases
[Toti et al., 2016] | 2016 | Outdoor air pollution, paediatric asthma | SO2, NO, PM, NO2 | ARM | Explore relationships of air pollution exposure and asthma
[Mirto et al., 2016] | 2016 | Outdoor air pollution & climate change | Generic | Spatial data mining, hot spot analysis | Find correlations between diseases (e.g. respiratory and cardiovascular diseases, cancer, male human infertility) and air pollution due to climatic factors
[Li, 2017] | 2017 | Outdoor air pollution | PM | Trajectory clustering | Apply clustering to identify transport pathways, sources and seasonal variations of particulate matter (PM2.5 and PM10) in Beijing for regulation purposes
[Stingone et al., 2017] | 2017 | Outdoor air pollution | National air toxics assessment | DT | Apply machine learning to identify air pollutant exposure profiles when exploring multiple pollutants (104 ambient air toxics) and then estimate the magnitude of each profile's effect on math scores in kindergarten children
[Ghanem et al., 2004] | 2004 | Outdoor air pollution | SO2, C6H6, NO, NO2, O3 | Hierarchical clustering | Monitor chemicals and outline challenges related to collection and processing

Table 5.2: Summary of studies using data mining to understand the relationship between air pollution and health conditions [Bellinger et al., 2017]
5.2 Overview of Correlation Techniques
In the field of multivariate data analysis, there are a number of techniques available that may be applicable to an equally wide range of use-cases and research needs. As depicted
Figure 5.2: Classification of correlation techniques [Hair et al., 2006]
in Figure 5.2, and adopted as the classification used in this thesis, [Hair et al., 2006] categorises these as either interdependence techniques or dependence techniques.
Interdependence Techniques
• Exploratory Factor Analysis - Principal Components and common factor analysis
• Cluster analysis

Dependence Techniques
• Multiple regression and multiple correlations
• Multivariate analysis of variance and covariance
• Multiple discriminant analysis
• Logistic regression
• Structural equation modelling and confirmatory composite analysis
• Partial least squares structural equation modelling and confirmatory component anal- ysis
• Canonical correlation analysis
• Conjoint analysis
• Perceptual mapping (also called multidimensional scaling)
• Correspondence analysis
5.2.1 Interdependence techniques
Interdependence techniques are those where the underlying data cannot be classified as either 'dependent' or 'independent'. All variables are therefore analysed at the same time: exploratory or confirmatory factor analysis is used to find a way in which the variables can be structured when the overall structure is to be analysed, and cluster analysis when the grouping is to be analysed.
Exploratory factor analysis (principal components, common factor analysis)
Exploratory factor analysis, incorporating both principal components and common factor analysis, is the statistical method of analysis used where there are a large number of variables and the purpose is to find their common underlying dimensions (or factors). The variables can then be reduced dimensionally into a smaller set without losing the underlying informational content. This empirical estimation method is a way to objectively create summated scales.
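A minimal sketch of this idea, using scikit-learn's `FactorAnalysis` on synthetic data: six observed measurements driven by two hidden dimensions are reduced to two factors. The "traffic" and "weather" factor names are invented for illustration.

```python
# Exploratory factor analysis: recover two latent dimensions from six
# correlated observed variables (all data synthetic and illustrative).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
n = 500
traffic = rng.normal(size=n)          # latent dimension 1
weather = rng.normal(size=n)          # latent dimension 2
# six observed variables, three loading on each latent dimension
X = np.column_stack(
    [traffic + 0.1 * rng.normal(size=n) for _ in range(3)]
    + [weather + 0.1 * rng.normal(size=n) for _ in range(3)]
)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_             # (2 factors, 6 variables)
scores = fa.transform(X)              # reduced representation, (500, 2)
print(np.round(loadings, 2))
```

The loadings show which observed variables contribute to which factor, which is the basis of the objective summated scales mentioned above.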
Principal Component Analysis (PCA)
PCA is a method of feature extraction commonly used in environmental data analysis. The method converts a set of observations of correlated variables into a set of values of linearly uncorrelated variables by an orthogonal transformation [Song, 2014].
PCA works on data X_{n×d} by producing a linear transformation W that maps the original d-dimensional space onto an l-dimensional feature subspace, where l < d. The reduced feature vectors are defined as x'_i = W^T x_i, i = 1, ..., n, x'_i ∈ R^l. Here, each column of W represents an eigenvector v_i, which can be obtained by solving the following eigendecomposition,

A v_i = λ_i v_i,    (5.1)

where A = X^T X represents the covariance matrix of the (centred) data and λ_i represents the eigenvalue associated with the eigenvector v_i.
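The eigendecomposition in Eq. (5.1) can be carried out directly with numpy; the data below are synthetic, and dividing A by n only rescales the eigenvalues without changing the eigenvectors.

```python
# PCA as the eigendecomposition of Eq. (5.1): centre X, form the covariance
# matrix A, keep the top-l eigenvectors as the columns of W, and project.
import numpy as np

rng = np.random.default_rng(1)
n, d, l = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated features

Xc = X - X.mean(axis=0)               # centre each column
A = (Xc.T @ Xc) / n                   # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(A)  # ascending order for symmetric A
order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
W = eigvecs[:, order[:l]]             # top-l eigenvectors as columns of W

X_reduced = Xc @ W                    # x'_i = W^T x_i for every observation
print(X_reduced.shape)                # (200, 2)
```

Because the columns of W are eigenvectors of the covariance matrix, the projected features are linearly uncorrelated, which is the defining property of the transformation.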
Cluster analysis
Cluster analysis techniques seek to identify meaningful subgroups ("clusters") by taking a sample set and classifying its members into smaller, mutually exclusive groups based on similarities. Whereas discriminant analysis defines the groups in advance, cluster analysis does not: the technique itself identifies the groups. There are usually three steps involved in cluster analysis:

1. measurement of some form of similarity, or identification of what the association is, in order to determine how many groups exist in the sample;

2. splitting the members of the set into the groups ("clusters");

3. analysis to identify how the variables are made up in each group, which may be achieved using discriminant analysis.
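The three steps above can be sketched with hierarchical (Ward) clustering from scipy. The two "regimes" in the synthetic data stand in for, say, clean versus polluted days; the labels are invented for illustration.

```python
# Hierarchical clustering following the three steps in the text
# (all data synthetic and illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# Step 1: choose a similarity measure (Euclidean distance over 2 features)
clean = rng.normal(loc=[10, 20], scale=2, size=(25, 2))
polluted = rng.normal(loc=[60, 80], scale=2, size=(25, 2))
X = np.vstack([clean, polluted])

# Step 2: split the sample into clusters (cut the dendrogram at 2 groups)
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Step 3: profile each cluster -- here, the mean of each variable per group
for k in (1, 2):
    print(k, np.round(X[labels == k].mean(axis=0), 1))
```

In practice step 3 would be a discriminant analysis on the cluster labels, as the text notes; the per-cluster means shown here are the simplest version of that profiling.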
5.2.2 Dependence techniques
These techniques are divided according to two characteristics: the number of dependent variables, and the type of measurement scale that can be used.
Multiple Regression Analysis
Multiple regression should be used when there is a single metric dependent variable which is assumed to be related to two or more independent variables, and the goal is to predict changes in the dependent variable in response to changes in the independent variables, usually using the "least squares" statistical rule.

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.2)

where Y is metric and X is either metric or non-metric.
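A least-squares fit of the form in Eq. (5.2) can be written directly with numpy; the coefficients and data below are synthetic and chosen purely for illustration.

```python
# Multiple regression by least squares: one metric dependent variable,
# three metric independents (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 4.0 + rng.normal(scale=0.1, size=n)  # intercept 4.0

X1 = np.column_stack([np.ones(n), X])     # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(beta_hat, 2))              # close to [4.0, 2.0, -1.0, 0.5]
```

The recovered coefficients approximate the true ones because the noise is small relative to the signal; with real air quality data the residual variance would of course be larger.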
Multivariate analysis of variance and covariance (MANOVA/MANCOVA)
Multivariate analysis of variance ("MANOVA") is an extension of univariate analysis of variance ("ANOVA") that explores the relationship between several categorical independent variables, referred to as "treatments", and two or more metric dependent variables. When used as multivariate analysis of covariance ("MANCOVA") after experiments have been conducted, the effect of any uncontrolled metric independent variables ("covariates") can be removed, similar to how a third variable can be removed using bivariate partial correlation. The technique is useful in situations where several non-metric treatment variables are used to test the variance in a group with two or more metric dependent variables.

Y_1 + Y_2 + Y_3 + ... + Y_n = X_1 + X_2 + X_3 + ... + X_n,    (5.3)

where Y is metric and X is non-metric, and, in the univariate (ANOVA) case,

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.4)

where Y is metric and the X series is non-metric.
Multiple Discriminant Analysis
The objective of multiple discriminant analysis is to understand group differences and to predict the likelihood that an observation belongs to a particular group based on several metric independent variables. Multiple discriminant analysis should be used if the single dependent variable is non-metric (i.e. dichotomous or multichotomous). Much like multiple regression, the independent variables are assumed to be metric. It is applicable where the total sample can be split into groups based on a non-metric dependent variable characterising several known classes [Hair et al., 2006].

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.5)

where Y is non-metric and X is metric.
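A sketch of Eq. (5.5) with scikit-learn's linear discriminant analysis: a non-metric dependent variable with three known classes is predicted from metric independents. The classes and features are synthetic.

```python
# Discriminant analysis: predict membership of three known classes
# from two metric variables (synthetic, well-separated data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in centres])
y = np.repeat([0, 1, 2], 40)              # the non-metric dependent variable

lda = LinearDiscriminantAnalysis().fit(X, y)
print(round(lda.score(X, y), 2))          # classification accuracy
```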
Logistic Regression Analysis
Logistic regression ("logit analysis") is a combination of multiple regression and multiple discriminant analysis. It is similar to multiple regression analysis in that one or more independent variables are used to predict a dependent variable, but is distinguished, as in discriminant analysis, by the dependent variable being non-metric. The non-metric scale of the dependent variable requires differences in how the model is estimated and in the assumptions about its distribution, which is what differentiates it from multiple regression. Once the dependent variable is specified correctly and the appropriate estimation techniques are used, the other basic factors involved in multiple regression likewise apply. The distinctions between logistic regression and discriminant analysis are that logistic regression applies only to binary dependent variables, can accommodate all types of independent variables (both metric and non-metric), and does not require the assumption of multivariate normality.

Logistic regression is therefore the appropriate technique when the dependent variable has only two levels; discriminant analysis is better suited when there are more.

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.6)

where Y is binary non-metric and X is metric or non-metric.
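A sketch of Eq. (5.6) with scikit-learn: a binary non-metric outcome (an invented "limit exceeded" flag) is modelled from two metric predictors, all synthetic.

```python
# Logistic regression: binary outcome from two metric predictors
# (synthetic data; the "exceedance" framing is illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 400
X = rng.normal(size=(n, 2))               # two standardised predictors
logit = 3.0 * X[:, 0] - 2.0 * X[:, 1]     # true (synthetic) log-odds
y = (logit + rng.logistic(size=n) > 0).astype(int)   # 1 = limit exceeded

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]        # estimated exceedance probability
print(round(clf.score(X, y), 2))          # classification accuracy
```

Note that no multivariate-normality assumption is needed here, which is the practical advantage over discriminant analysis mentioned above.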
Structural equation modelling (SEM) and confirmatory factor analysis
Structural equation modelling ("SEM"), also referred to as covariance-based SEM, allows separate relationships for each set of variables, and is based upon an analysis of only the common variance, beginning with calculation of the covariance matrix. Structural equation modelling is an appropriate and efficient technique when there is a series of separate multiple regression equations to be estimated simultaneously. The characterising aspects of a structural equation model are:

1. the structural model, and

2. the measurement model.

The structural model is the path model which relates independent and dependent variables; along with theory, prior experience or other general guidelines on the topic, it allows a distinction to be drawn as to which independent variables predict which dependent variables. Models that use multiple dependent variables, e.g. multivariate analysis of variance/covariance (MANOVA/MANCOVA) and canonical correlation, are not useful in this scenario because they can only accommodate a single relationship between the dependent and independent variables. The measurement model uses several variables ("indicators") for a single independent or dependent variable.
Y_1 = X_11 + X_12 + X_13 + ... + X_1n,
Y_2 = X_21 + X_22 + X_23 + ... + X_2n,
...
Y_m = X_m1 + X_m2 + X_m3 + ... + X_mn.
Confirmatory factor analysis enables the contribution of each scale variable to be assessed, along with how well the scale measures the concept ("reliability"). These scales can then be integrated into the estimation of the relationships between dependent and independent variables in the structural model. This is similar to the use of another technique, exploratory factor analysis, where the scale items are modelled and the resulting factor scores are then used in the regression.
Partial least squares structural equation modelling
An alternative to structural equation modelling, partial least squares structural equation modelling ("PLS-SEM") uses an analysis of total variance and consists of measurement and structural models. Again, theory, prior experience or general guidelines on the topic enable a researcher to make a distinction as to which independent variables predict the dependent ones. The first step is examining the measurement model, known as confirmatory composite analysis. As with confirmatory factor analysis, the contribution of each measured variable to its construct is assessed, as well as its reliability and/or validity for the measurement model. If the measurement model is determined to be sound, the structural model is analysed. Variance-based structural equation models focus primarily on prediction and explanation of the relationships, rather than on confirming an established theory as in covariance-based structural equation modelling.
Canonical correlation analysis

Canonical correlation analysis is explored fully in Section 6.1. It takes the form

Y_1 + Y_2 + Y_3 + ... + Y_n = X_1 + X_2 + X_3 + ... + X_n,    (5.7)

where Y is either metric or non-metric and X is either metric or non-metric.
Conjoint analysis
Conjoint analysis is a dependence technique useful in the evaluation of objects such as products, services or ideas, as it enables the evaluation of complexity while still maintaining a realistic decision context. It has application in market research for assessing the importance of attributes and the levels of those attributes.
In a review of conjoint analysis in environmental evaluation, [Alriksson and Öberg, 2008] were of the opinion that conjoint analysis could possibly be developed into a useful tool or method for environmental risk analysis and communication, both of which, in their opinion, are necessary for an efficient approach to risk governance. They concluded:
"Compared to marketing and transportation, the number of environmental conjoint studies is rather small but increasing, and the method has proven to work effectively in eliciting preferences on environmental issues. In environmental issues, experimenters often use choice experiments, especially concerning ecosystem management and environmental evaluations. When it comes to evaluating preferences concerning agriculture, forestry, energy and products, a more traditional approach of conjoint analysis is favoured."
Conjoint analysis takes the form:
Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.8)

where Y is either metric or non-metric and X is non-metric.