Air quality is a complex subject. A wide variety of factors affect it, and there is therefore an equally wide array of ways to measure them. It has been shown that, when forecasting or predicting air quality, the addition of variables such as meteorological and geographic information increases accuracy. There is room for further work using newer artificial intelligence methods such as deep learning and deep neural networks. What also remains to be explored is how these factors relate to each other and the significance of the inter-relationships between them.
Where work has been done in the past, it has been limited to a few common air pollutants such as CO2, NO2 and particulate matter. Air quality is a complex process, and there are many more measurements that can potentially affect a forecast.
There is a substantive gap in the current literature around the correlations that may exist between air pollutants and other impacting factors when less common parameters are taken into account. This gap relates to correlative study rather than prediction. It is, of course, known that certain atmospheric chemistry reactions take place and thereby affect the concentrations and dispersion of other compounds in the air, but what is clearly missing is an exploration, and an analytical architecture, to quantify and qualify a greater number of less commonly measured parameters.
Correlation analysis has shown its usefulness in other fields, for example finance and the stock market, but when we turn our attention to air quality, the data exists yet has not been explored to the extent that is possible.
5.1 Existing Research on Correlation
In research by [Levy et al., 2013] evaluating multi-pollutant exposure for urban air quality, air pollutant concentrations and their inter-relationships were examined at the intra-urban scale to obtain insight into the nature of the urban mixture of air pollutants. They took measurements of 23 air pollutant parameters in several Canadian cities in 2009 and observed variability in pollution levels, and in the statistical correlations between different pollutants, according to season and neighbourhood. Nitrogen oxide species (nitric oxide, NO2, nitrogen oxides, and total oxidised nitrogen species) had the highest overall spatial correlations in the dataset. Ultra-fine particles and hydrocarbon-like organic aerosol concentration, a derived measure used as a specific indicator of traffic particles, also had very high correlations.
The findings indicated that the multi-pollutant mix varies considerably throughout the city, both in time and in space, and that no single pollutant would have been a perfect measure of the overall mix under all circumstances. Nevertheless, based on the overall average spatial correlations between the pollutants measured, nitrogen oxide would appear to be the best available indicator of the spatial variation of exposure to outdoor urban air pollutants.
Figure 5.1 shows Pearson correlation coefficients for pairs of pollutants in the [Levy et al., 2013] study, for all measurement days combined (A; 34 measurement days) and for measurement days in the autumn (B; 6 days), summer (C; 17 days), and winter (D; 11 days). Non-significant correlations (p > 0.05) are indicated by a black dot, and the magnitude of each correlation is indicated on the colour bar to the right. The study also reports ratios of mean pollutant levels measured in the summer and winter compared with mean values over all measurement days combined, and ratios of mean pollutant levels in the winter compared with the summer.
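This kind of pairwise analysis can be sketched in a few lines of Python. The pollutant names, units and synthetic values below are illustrative assumptions, not the Levy et al. (2013) data:

```python
# Pairwise Pearson correlations with significance flags, mirroring the kind
# of analysis behind Figure 5.1. Data here are synthetic and illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days = 34                                   # as in panel A of Figure 5.1
no = rng.normal(30, 8, n_days)                # nitric oxide (illustrative)
no2 = 0.8 * no + rng.normal(0, 3, n_days)     # strongly tied to NO
o3 = rng.normal(40, 10, n_days)               # largely independent

pollutants = {"NO": no, "NO2": no2, "O3": o3}
names = list(pollutants)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r, p = stats.pearsonr(pollutants[a], pollutants[b])
        flag = "" if p <= 0.05 else "  (not significant, p > 0.05)"
        print(f"{a} vs {b}: r = {r:+.2f}, p = {p:.3f}{flag}")
```

The full correlation matrix of such pairwise coefficients, split by season, is exactly what each panel of Figure 5.1 visualises.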
With increasing public interest in air quality, and the greater availability of large amounts of environmental data made possible by the lower cost of sensor equipment, there are now vast amounts of data available for analysis. The data collected often have many parameters and many measurements, along with potentially varying degrees of inter-dependency. Due to this increasing complexity and size, traditional methods are no longer suitable for analysis, and new, more efficient methods need to be devised; data mining is the technique of choice. Data mining is a computing technique commonly used in the analysis of vast datasets to find patterns, extract specific information or make predictions.
Data mining algorithms can be broken down into the following categories, according to recent research on the use of data mining in air quality epidemiology by [Bellinger et al., 2017]:
• Prediction
• Knowledge Discovery
Within these, there are four sub-categories:
• Classification and regression
• Clustering
• Association rule mining
• Outlier/Anomaly Detection
In addition, other types of data analysis, such as spatial data mining and graph data mining, are starting to come to light as research continues.
Source apportionment, as mentioned in our exploration of methods for accurate air quality impact assessment, commonly relies on data mining techniques.
Table 5.1 summarises a group of source apportionment studies that used data mining techniques. The studies focus on the impact of chemical emissions and other airborne agents in conjunction with climatological factors [Chen et al., 2010], [Chen et al., 2015], [Singh et al., 2013]. The focus is on the allocation of particular airborne pollutants to their potential sources, such as industrial sites, regions and major intersections. These studies mainly concerned outdoor and urban air pollution, as it is the most widely known issue.
Figure 5.1: Pearson correlation of pollutant parameters across seasons, Levy et al. (2013)
In particular, principal component analysis (PCA) has been applied to identify correlations and the importance of particular meteorological parameters, traffic, fuel-fired equipment and industries in causing air pollution [Chen et al., 2010], [Chen et al., 2015], [Singh et al., 2013]. Alternative approaches have combined clustering-based solutions with correlation analysis to accomplish the task of source apportionment [Chen et al., 2010].
Author | Year | Environment | Pollutants/agents | Technique(s) | Objective
[Chen et al., 2010] | 2010 | Outdoor air pollution | Inorganic acids & basic air pollutants | Hierarchical clustering | Explore the relationship between climate and air pollutants
[Singh et al., 2013] | 2013 | Outdoor air pollution | AQI | PCA, SVM, DT | Predict air quality and identify air pollution sources
[Fernández-Camacho et al., 2015] | 2015 | Urban air and noise pollution by traffic | NOx, O3, SO2, black carbon | Fuzzy clustering | Find the relationship of noise to traffic emissions
[Chen et al., 2015] | 2015 | Outdoor air pollution | Multiple air pollutants | Clustering | Source apportionment for air pollutants
[Li et al., 2018] | 2017 | Outdoor air pollution | PM | Trajectory clustering | Use clustering to understand how seasonality and meteorology affect pollution sources for Beijing

Table 5.1: Summary of air pollution source apportionment studies using data mining techniques [Bellinger et al., 2017]
Table 5.2 summarises studies that have used data mining to understand the relationship between air pollutants and health conditions.
Author | Year | Environment | Pollutants/agents | Technique(s) | Objective
[Chen et al., 2010] | 2010 | Outdoor air pollution | Inorganic acids & basic air pollutants | Hierarchical clustering | Explore the relationship between climate and air pollutants
[Zhu et al., 2012] | 2012 | Urban outdoor air pollution | SO2, NO2, PM10, respiratory diseases | ARM, GMDH | Forecast the number of respiratory patients based on the seasonal effects of air pollution
[Pandey et al., 2013] | 2013 | Outdoor air pollution | UFP, PM | DT, RF | Test machine learning classifiers for predicting air quality and assess the impact of weather- and traffic-related variables on UFP and PM
[Payus et al., 2013] | 2013 | Outdoor air pollution | SO2, NO2, PM10, CO, O3 | ARM | Find associations between combinations of air pollutants and respiratory illness
[Bobb et al., 2014] | 2014 | Mixture of chemicals | Multiple chemicals, neurodevelopment, hemodynamics | Bayesian kernel machine regression (BKMR) | Identify mixtures (e.g., metals) and components responsible for various health effects (e.g., neurodevelopment)
[Gass et al., 2014] | 2014 | Outdoor air pollution | CO, NO2, O3, PM | Classification and regression trees | Apply classification and regression trees to generate hypotheses about exposure to mixtures of pollutants and health effects, working with children's asthma emergency visits
[Fernández-Camacho et al., 2015] | 2015 | Urban air and noise pollution by traffic | NOx, O3, SO2, black carbon | Fuzzy clustering | Find the relationship of noise to traffic emissions
[Bell and Edwards, 2015] | 2015 | General chemical exposure | 219 chemicals | ARM | Find relationships between chemicals and health bio-markers or diseases
[Qin et al., 2015] | 2015 | Outdoor air pollution | PM | ARM | Explore relationships of PM spatial-temporal variations and how cities influence each other
[Reid et al., 2016] | 2016 | Outdoor air quality with wildfire | PM2.5, respiratory diseases | Generalised estimating equation and generalised boosting model | Find how the wildfire-associated increment in PM2.5 affects people with respiratory diseases
[Toti et al., 2016] | 2016 | Outdoor air pollution, paediatric asthma | SO2, NO, PM, NO2 | ARM | Explore relationships of air pollution exposure and asthma
[Mirto et al., 2016] | 2016 | Outdoor air pollution & climate change | Generic | Spatial data mining, hot spot analysis | Find correlations between diseases (e.g. respiratory and cardiovascular diseases, cancer, male human infertility) and air pollution due to climatic factors
[Li, 2017] | 2017 | Outdoor air pollution | PM | Trajectory clustering | Apply clustering to identify transport pathways, sources and seasonal variations of particulate matter (PM2.5 and PM10) in Beijing for regulation purposes
[Stingone et al., 2017] | 2017 | Outdoor air pollution | National air toxics assessment | DT | Apply machine learning to identify air pollutant exposure profiles when exploring multiple pollutants (104 ambient air toxics) and then estimate the magnitude of each profile's effect on math scores in kindergarten children
[Ghanem et al., 2004] | 2004 | Outdoor air pollution | SO2, C6H6, NO, NO2, O3 | Hierarchical clustering | Monitor chemicals and outline challenges related to collection and processing

Table 5.2: Summary of studies using data mining to understand the relationship between air pollution and health conditions [Bellinger et al., 2017]
5.2 Overview of Correlation Techniques
In the field of multivariate data analysis, there are a number of techniques available that may be applicable to an equally wide range of use-cases and research needs. As depicted
Figure 5.2: Classification of correlation techniques [Hair et al., 2006]
in Figure 5.2, and adopted as the classification used in this thesis, [Hair et al., 2006] categorises these as either interdependence techniques or dependence techniques.
Interdependence Techniques
• Exploratory Factor Analysis - Principal Components and common factor analysis
• Cluster analysis

Dependence Techniques
• Multiple regression and multiple correlations
• Multivariate analysis of variance and covariance
• Multiple discriminant analysis
• Logistic regression
• Structural equation modelling and confirmatory composite analysis
• Partial least squares structural equation modelling and confirmatory component anal- ysis
• Canonical correlation analysis
• Conjoint analysis
• Perceptual mapping (also called multidimensional scaling)
• Correspondence analysis
5.2.1 Interdependence techniques
Interdependence techniques are those where the underlying data cannot be classified as either 'dependent' or 'independent'. All variables are therefore analysed at the same time: exploratory or confirmatory factor analysis is used to find a way in which the variables can be structured when the overall structure is to be analysed, and cluster analysis when the grouping is to be analysed.
Exploratory factor analysis (principal components, common factor analysis)
Exploratory factor analysis, incorporating both principal components and common factor analysis, is the statistical method of analysis used where there are a large number of variables and the purpose is to find their common underlying dimensions (or factors). The variables can then be reduced dimensionally into a smaller set without losing the underlying informational content. This empirical estimation method is a way to objectively create summated scales.
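A minimal sketch of this idea, using scikit-learn's `FactorAnalysis` on synthetic data: six observed measurements driven by two hidden dimensions are reduced to two factors. The "traffic" and "weather" factor names are invented for illustration.

```python
# Exploratory factor analysis: recover two latent dimensions from six
# correlated observed variables (all data synthetic and illustrative).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
n = 500
traffic = rng.normal(size=n)          # latent dimension 1
weather = rng.normal(size=n)          # latent dimension 2
# six observed variables, three loading on each latent dimension
X = np.column_stack(
    [traffic + 0.1 * rng.normal(size=n) for _ in range(3)]
    + [weather + 0.1 * rng.normal(size=n) for _ in range(3)]
)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_             # (2 factors, 6 variables)
scores = fa.transform(X)              # reduced representation, (500, 2)
print(np.round(loadings, 2))
```

The loadings show which observed variables contribute to which factor, which is the basis of the objective summated scales mentioned above.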
Principal Component Analysis (PCA)
PCA is a method of feature extraction commonly used in environmental data analysis. The method converts a set of observations of correlated variables into a set of values of linearly uncorrelated variables by an orthogonal transformation [Song, 2014].
PCA works on data X_{n×d} by producing a linear transformation W that maps the original d-dimensional space onto an l-dimensional feature subspace, where l < d. The reduced feature vectors are defined as x'_i = W^T x_i, i = 1, ..., n, x'_i ∈ R^l. Here, each column of W represents an eigenvector v_i, which can be obtained by solving the following eigendecomposition,

A v_i = λ_i v_i,    (5.1)

where A = X^T X represents the covariance matrix of the (centred) data and λ_i represents the eigenvalue associated with the eigenvector v_i.
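The eigendecomposition in Eq. (5.1) can be carried out directly with numpy; the data below are synthetic, and dividing A by n only rescales the eigenvalues without changing the eigenvectors.

```python
# PCA as the eigendecomposition of Eq. (5.1): centre X, form the covariance
# matrix A, keep the top-l eigenvectors as the columns of W, and project.
import numpy as np

rng = np.random.default_rng(1)
n, d, l = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated features

Xc = X - X.mean(axis=0)               # centre each column
A = (Xc.T @ Xc) / n                   # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(A)  # ascending order for symmetric A
order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
W = eigvecs[:, order[:l]]             # top-l eigenvectors as columns of W

X_reduced = Xc @ W                    # x'_i = W^T x_i for every observation
print(X_reduced.shape)                # (200, 2)
```

Because the columns of W are eigenvectors of the covariance matrix, the projected features are linearly uncorrelated, which is the defining property of the transformation.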
Cluster analysis
Cluster analysis techniques seek to identify meaningful subgroups ("clusters") by taking a sample set and classifying its members into smaller, mutually exclusive groups based on similarities. Whereas discriminant analysis defines the groups in advance, cluster analysis does not: the technique itself identifies the groups. There are usually three steps involved in cluster analysis:

1. measurement of some form of similarity, or identification of what the association is, in order to determine how many groups exist in the sample;

2. splitting the members of the set into the groups ("clusters");

3. analysis to identify how the variables are made up in each group, which may be achieved using discriminant analysis.
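The three steps above can be sketched with hierarchical (Ward) clustering from scipy. The two "regimes" in the synthetic data stand in for, say, clean versus polluted days; the labels are invented for illustration.

```python
# Hierarchical clustering following the three steps in the text
# (all data synthetic and illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# Step 1: choose a similarity measure (Euclidean distance over 2 features)
clean = rng.normal(loc=[10, 20], scale=2, size=(25, 2))
polluted = rng.normal(loc=[60, 80], scale=2, size=(25, 2))
X = np.vstack([clean, polluted])

# Step 2: split the sample into clusters (cut the dendrogram at 2 groups)
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Step 3: profile each cluster -- here, the mean of each variable per group
for k in (1, 2):
    print(k, np.round(X[labels == k].mean(axis=0), 1))
```

In practice step 3 would be a discriminant analysis on the cluster labels, as the text notes; the per-cluster means shown here are the simplest version of that profiling.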
5.2.2 Dependence techniques
These techniques are divided according to two characteristics: the number of dependent variables, and the type of measurement scale that can be used.
Multiple Regression Analysis
Multiple regression should be used when there is a single metric dependent variable which is assumed to be related to two or more independent variables, and the goal is to predict changes in the dependent variable in response to changes in the independent variables, usually using the "least squares" statistical rule.

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.2)

where Y is metric and X is either metric or non-metric.
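A least-squares fit of the form in Eq. (5.2) can be written directly with numpy; the coefficients and data below are synthetic and chosen purely for illustration.

```python
# Multiple regression by least squares: one metric dependent variable,
# three metric independents (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 4.0 + rng.normal(scale=0.1, size=n)  # intercept 4.0

X1 = np.column_stack([np.ones(n), X])     # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(beta_hat, 2))              # close to [4.0, 2.0, -1.0, 0.5]
```

The recovered coefficients approximate the true ones because the noise is small relative to the signal; with real air quality data the residual variance would of course be larger.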
Multivariate analysis of variance and covariance (MANOVA/MANCOVA)
Multivariate analysis of variance ("MANOVA") is an extension of univariate analysis of variance ("ANOVA") that explores the relationship between several categorical independent variables, referred to as "treatments", and two or more metric dependent variables. When used as multivariate analysis of covariance ("MANCOVA") after experiments have been conducted, the effect of any uncontrolled metric independent variables ("covariates") can be removed, similar to how a third variable can be removed using bivariate partial correlation. The technique is useful in situations where several non-metric treatment variables are used to test the variance in a group with two or more metric dependent variables.

Y_1 + Y_2 + Y_3 + ... + Y_n = X_1 + X_2 + X_3 + ... + X_n,    (5.3)

where Y is metric and X is non-metric, and, in the univariate (ANOVA) case,

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.4)

where Y is metric and the X series is non-metric.
Multiple Discriminant Analysis
The objective of multiple discriminant analysis is to understand group differences and to predict the likelihood that an observation belongs to a particular group based on several metric independent variables. Multiple discriminant analysis should be used if the single dependent variable is non-metric (i.e. dichotomous or multichotomous). Much like multiple regression, the independent variables are assumed to be metric. It is applicable where the total sample can be split into groups based on a non-metric dependent variable characterising several known classes [Hair et al., 2006].

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.5)

where Y is non-metric and X is metric.
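A sketch of Eq. (5.5) with scikit-learn's linear discriminant analysis: a non-metric dependent variable with three known classes is predicted from metric independents. The classes and features are synthetic.

```python
# Discriminant analysis: predict membership of three known classes
# from two metric variables (synthetic, well-separated data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in centres])
y = np.repeat([0, 1, 2], 40)              # the non-metric dependent variable

lda = LinearDiscriminantAnalysis().fit(X, y)
print(round(lda.score(X, y), 2))          # classification accuracy
```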
Logistic Regression Analysis
Logistic regression ("logit analysis") is a combination of multiple regression and multiple discriminant analysis. It is similar to multiple regression analysis in that one or more independent variables are used to predict a dependent variable, but is distinguished, as in discriminant analysis, by the dependent variable being non-metric. The non-metric scale of the dependent variable requires differences in how the model is estimated and in the assumptions about its distribution, which is what differentiates it from multiple regression. Once the dependent variable is specified correctly and the appropriate estimation techniques are used, the other basic factors involved in multiple regression likewise apply. The distinctions between logistic regression and discriminant analysis are that logistic regression applies only to binary dependent variables, can accommodate all types of independent variables (both metric and non-metric), and does not require the assumption of multivariate normality.

Logistic regression is therefore the appropriate technique when the dependent variable has only two levels; discriminant analysis is better suited when there are more.

Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.6)

where Y is binary non-metric and X is metric or non-metric.
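A sketch of Eq. (5.6) with scikit-learn: a binary non-metric outcome (an invented "limit exceeded" flag) is modelled from two metric predictors, all synthetic.

```python
# Logistic regression: binary outcome from two metric predictors
# (synthetic data; the "exceedance" framing is illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 400
X = rng.normal(size=(n, 2))               # two standardised predictors
logit = 3.0 * X[:, 0] - 2.0 * X[:, 1]     # true (synthetic) log-odds
y = (logit + rng.logistic(size=n) > 0).astype(int)   # 1 = limit exceeded

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]        # estimated exceedance probability
print(round(clf.score(X, y), 2))          # classification accuracy
```

Note that no multivariate-normality assumption is needed here, which is the practical advantage over discriminant analysis mentioned above.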
Structural equation modelling (SEM) and confirmatory factor analysis
Structural equation modelling ("SEM"), also referred to as covariance-based SEM, allows separate relationships for each set of variables, and is based upon an analysis of only the common variance, beginning with calculation of the covariance matrix. Structural equation modelling is an appropriate and efficient technique when there is a series of separate multiple regression equations to be estimated simultaneously. The characterising aspects of a structural equation model are:

1. the structural model, and

2. the measurement model.

The structural model is the path model which relates independent and dependent variables; along with theory, prior experience or other general guidelines on the topic, it allows a distinction to be drawn as to which independent variables predict which dependent variables. Models that use multiple dependent variables, e.g. multivariate analysis of variance/covariance (MANOVA/MANCOVA) and canonical correlation, are not useful in this scenario because they can only accommodate a single relationship between the dependent and independent variables. The measurement model uses several variables ("indicators") for a single independent or dependent variable.
Y_1 = X_11 + X_12 + X_13 + ... + X_1n,
Y_2 = X_21 + X_22 + X_23 + ... + X_2n,
...
Y_m = X_m1 + X_m2 + X_m3 + ... + X_mn.
Confirmatory factor analysis enables the contribution of each scale variable to be assessed, along with how well the scale measures the concept ("reliability"). These scales can then be integrated into the estimation of the relationships between dependent and independent variables in the structural model. This is similar to the use of another technique, exploratory factor analysis, where the scale items are modelled and the resulting factor scores are then used in the regression.
Partial least squares structural equation modelling
An alternative to structural equation modelling, partial least squares structural equation modelling ("PLS-SEM") uses an analysis of total variance and consists of measurement and structural models. Again, theory, prior experience or general guidelines on the topic enable a researcher to make a distinction as to which independent variables predict the dependent ones. The first step is examining the measurement model, known as confirmatory composite analysis. As with confirmatory factor analysis, the contribution of each measured variable to its construct is assessed, as well as its reliability and/or validity for the measurement model. If the measurement model is determined to be sound, the structural model is analysed. Variance-based structural equation models focus primarily on prediction and explanation of the relationships, rather than on confirming an established theory as in covariance-based structural equation modelling.
Canonical correlation analysis

Canonical correlation analysis is explored fully in Section 6.1. It takes the form

Y_1 + Y_2 + Y_3 + ... + Y_n = X_1 + X_2 + X_3 + ... + X_n,    (5.7)

where Y is either metric or non-metric and X is either metric or non-metric.
Conjoint analysis
Conjoint analysis is a dependence technique useful in the evaluation of objects such as products, services or ideas, as it enables the evaluation of complexity while still maintaining a realistic decision context. It has application in market research for assessing the importance of attributes and the levels of those attributes.
In a review of conjoint analysis in environmental evaluation, [Alriksson and Öberg, 2008] were of the opinion that conjoint analysis could possibly be developed into a useful tool or method for environmental risk analysis and communication, both of which, in their opinion, are necessary for an efficient approach to risk governance. They concluded:
"Compared to marketing and transportation, the number of environmental conjoint studies is rather small but increasing, and the method has proven to work effectively in eliciting preferences on environmental issues. In environmental issues, experimenters often use choice experiments, especially concerning ecosystem management and environmental evaluations. When it comes to evaluating preferences concerning agriculture, forestry, energy and products, a more traditional approach of conjoint analysis is favoured."
Conjoint analysis takes the form:
Y_1 = X_1 + X_2 + X_3 + ... + X_n,    (5.8)

where Y is either metric or non-metric and X is non-metric.