Deep correlation learning for urban air quality
Analysis and prediction in New Zealand
Regan Lee Andrews
A thesis presented in partial fulfilment for the degree of
Doctor of Computing
Unitec Institute of Technology Auckland, New Zealand
August, 2021
Primary Supervisor: Dr Guillermo Ramirez-Prado
Associate Supervisor: Assoc. Prof. Hamid Sharifzadeh
Abstract
Air pollution and air quality are undoubtedly a very high priority for governments, organisations and people of all countries, with developing countries such as India, China and Brazil being especially affected by high levels of pollutants. A large number of studies have shown that the quality of air has a direct impact on public health and preventable diseases. For example, particulate matter in the air (microscopic particles with a diameter of less than 2.5 µm) can be inhaled into the lungs, increasing the risk of respiratory and general health disorders. High levels of particulate matter increase a person’s risk of developing a number of serious long-term disorders and ultimately fatal diseases.
There are many ways in which air quality has been predicted over the years, with methods falling into the two categories of statistical models and computational models. With technological advances allowing us to measure more, generate more data and process more, we are now in an era where Artificial Intelligence has become the gold standard for generating efficient and accurate predictions. For these predictions to be accurate, we require not only accurate data but also the correct data to be used in the first place. Many methods have been devised and used in environmental studies to discover the associations and correlations between the different types of variables that can be found.
Correlation analysis is the focus of our research, so that we may study the influential factors involved in air quality movements, use the ones which have an impact and ignore the ones that do not. This is hypothesised to increase not only the accuracy but also the efficiency with which air quality can be computed. We choose Canonical Correlation Analysis (CCA) because of its relative obscurity in environmental scientific studies but increasingly common usage in other fields such as the biological sciences and social sciences such as psychology.
The most central statistic in a CCA is the canonical correlation between the two synthetic variables, and this statistic is, in effect, a Pearson r. The computations involved in a CCA are done with the goal of maximising this simple and common correlation [Sherry and Henson, 2005].
The solution in this research was implemented in Python using minor alterations of an established and published deep-learning-based in-depth canonical correlation analysis solution [Andrew et al., 2013] to good effect. The best correlations were defined as those greater than +0.50 or less than -0.50, which reduced the dataset of 73 variables, spread over 12 monitoring regions, to a more manageable number of variables that were indicative of changes really occurring in the air mixture.
Using the commonly used performance measures - MSE, RMSE, MAE and R² - the new correlative findings were run using five multi-layer neural networks of different sizes, linear regression, XGBoost, AdaBoost, SVM and kNN to understand whether this results in an accuracy that is better or worse than using all 73 variables in the original dataset. In almost all instances, the performance is considerably better. Using the MAE as the indicator here, it appears that an AdaBoost algorithm is best in accuracy, with a 97.142% decrease over the original, followed by an MLP neural network with hidden layers of 1024, 1024 and 1024 neurons.
The time taken for learning is considerable even on very high-end computing hardware, and the comparison against a baseline with quite a low level of sophistication may explain this type of increase.
To this end, Deep Canonical Correlation Analysis (DCCA) can be seen to increase both the efficiency and accuracy of existing air quality prediction activities under most use-cases.
Keywords: Auckland, New Zealand, air pollution analysis, air quality, prediction, forecasting, correlation, in-depth canonical correlation analysis, deep canonical correlation analysis
List of Publications
Submitted:
Andrews, Regan L. (2020). A technical review of air quality impact analysis methods. Unitec ePress
Contents
List of Publications
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Background and context
1.1.1 Health effects of air pollution
1.1.2 New Zealand
1.1.3 Māori perspective
1.1.4 History
1.2 Motivation and Gaps
1.3 Research Challenges
1.4 Research Questions
1.5 Contributions
1.6 Thesis organisation
1.6.1 Definitions
2 The Environment and Data
2.1 Air composition of the atmosphere
2.1.1 Stratification
2.2 Data Collection
2.3 Problems and challenges with data analysis
2.4 Air quality data in Auckland
2.4.1 Particulate Matter - PM2.5
2.4.2 Particulate Matter - PM10
2.4.3 Carbon Oxides - CO, CO2
2.4.4 Nitrates - NOx, N2, N2O, NO, NO2
2.4.5 Sulphur Dioxide - SO2
3 Air quality impact analysis and modelling
3.1 Introduction
3.2 Pressure, state and impacts on air quality
3.3 Anthropogenic impacts
3.3.1 Vehicles
3.3.2 Aviation
3.3.3 Ocean
3.3.4 Industrial
3.3.5 Construction dust (fugitive emissions)
3.4 Natural Impacts
3.5 Determination of impact
3.5.1 Emissions inventories
3.5.2 Air-quality models
4 Air quality prediction computational methods
4.1 Introduction
4.2 Historical context
4.3 Hidden Markov Models (HMM)
4.4 Neural Networks (NN)
4.4.1 Perceptrons/Multi-layer Perceptrons
4.5 Convolutional Neural Network (CNN)
4.6 Recurrent Neural Networks (RNN)/Long-Short Term Memory (LSTM)
4.6.1 Long Short-Term Memory Neural Networks (LSTM)
4.6.2 Bi-directional Recurrent Neural Networks
4.7 Reinforcement Learning
4.8 Support Vector Machines (SVM)
4.9 Random Forests
4.10 Bagging
4.11 Boosting
4.12 Stacking
4.13 Conclusion
5 Data Correlation
5.1 Existing research on Correlation
5.2 Overview of Correlation Techniques
5.2.1 Interdependence techniques
5.2.2 Dependence techniques
5.2.3 Causality and other techniques
6 Canonical correlation analysis and deep learning
6.1 Introduction
6.2 Canonical correlation analysis in the environmental domain
6.3 Canonical correlation analysis
6.4 Variations of Canonical Correlation Analysis
6.5 Differentiation of CCA techniques
6.5.1 Relationship between CCA and other multivariate and univariate techniques
6.5.2 CCA advantages
6.5.3 CCA use-cases
6.5.4 CCA related methods
6.5.5 CCA extensions
6.5.6 CCA limitations
6.6 Deep Canonical Correlation Analysis
6.6.1 DCCA methodology
6.6.2 Non-saturating non-linearity
7 Research design and implementation
7.1 Introduction
7.1.1 Experimental environment
7.2 Data attributes and pre-processing tasks
7.2.1 Data attributes and sample size
7.2.2 Geographical locations of data
7.2.3 Data pre-processing
7.3 Prediction algorithms
7.3.1 Deep Neural Network
7.3.2 Support Vector Machine (SVM)
7.3.3 AdaBoost
7.3.4 XGBoost
7.3.5 kNN
7.3.6 Linear Regression
7.4 Canonical correlation analysis
7.5 Deep canonical correlation analysis
7.5.1 Hyperparameters
7.5.2 DCCA gradient
7.5.3 Python implementation
8 Results analysis and discussion
8.1 Standard Models
8.1.1 Pearson correlation coefficient (ρ)
8.1.2 Spearman rank correlation coefficient (ρ, rs)
8.1.3 Kendall’s tau-b (τb) correlation coefficient
8.1.4 Limitations of standard correlation techniques
8.2 Performance Evaluation
8.2.1 Selection of performance evaluation methods
8.2.2 Mean squared error (MSE)
8.2.3 Root mean square error (RMSE)
8.2.4 Mean absolute error (MAE)
8.2.5 R-squared (R²)
8.2.6 Conclusion
9 Conclusion and Future Works
9.1 Research Contributions
9.2 Research Limitations
9.3 Conclusion
9.4 Future Work
Appendices
A Data descriptive statistics
A.1 Min, Max, Mean, Standard Deviation
B DCCA Full Correlations (Pearson)
C Python DCCA Code
Bibliography
List of Figures
1.1 Health effects of pollution (ISEE)
2.1 PM2.5 emissions by sector for 2015, [MfE, 2015]
2.2 Breakdown of PM2.5 emissions from energy related activities, [MfE, 2015]
2.3 PM2.5 emissions by source for 2015 (t/yr), [MfE, 2015]
2.4 PM1.0 emissions by sector, [MfE, 2015]
2.5 Breakdown of PM10 emissions from energy related activities, [MfE, 2015]
2.6 PM10 emissions by source for 2015, [MfE, 2015]
2.7 CO emissions by sector for 2015, [MfE, 2015]
2.8 Breakdown of CO emissions from energy related activities, [MfE, 2015]
2.9 CO emissions by source for 2015, [MfE, 2015]
2.10 NOx emissions by sector for 2015, [MfE, 2015]
2.11 Breakdown of NOx emissions from energy related activities, [MfE, 2015]
2.12 NOx emissions by source for 2015, [MfE, 2015]
2.13 SO2 emissions by sector for 2015, [MfE, 2015]
2.14 Breakdown of SO2 emissions from energy related activities, [MfE, 2015]
2.15 Breakdown of SO2 emissions from energy related activities, [MfE, 2015]
3.1 Categorisation of impacts
3.2 Size range of airborne dust [WHO, 2006]
3.3 Phases of aircraft flight
3.4 Wood burning haze in suburban Auckland, New Zealand
3.5 Deterministic model
3.6 Gaussian plume
4.1 Hidden Markov Model
4.2 Probabilistic parameters of a hidden Markov model (example)
4.3 Simple Neural Network
4.4 BRNN internal connections [Yu et al., 2019]
4.5 SVM algorithms - (a) classification, (b) regression [Liang et al., 2020]
4.6 Random Forest with m number of trees [Liang et al., 2020]
5.1 Pearson correlation of pollutant parameters across seasons
5.2 Classification of correlation techniques [Hair et al., 2006]
6.2 Canonical Functions
6.1 Technical details of CCA and relationship between CCA and its variants [Zhuang et al., 2020]
6.3 Depiction of CCA network [Wang et al., 2018]
6.4 CCA selections [Zhuang et al., 2020]
6.5 Choosing a CCA method [Wang et al., 2018]
7.1 Overall Architecture
7.2 Individual location DCCA-selected architecture
7.3 Individual location baseline architecture
7.4 Composition and splitting of datasets
7.6 Air monitoring stations in Auckland region
7.7 Lifecycle of research architecture
7.8 Multi-layer Perceptron
7.5 Dataset variables - View 1 View 2
7.9 XGBoost Feature importance
8.1 Correlation chart, Pearson, Penrose
8.2 Correlation chart - DCCA, Pearson, Penrose
8.3 Pearson correlation of complete baseline data
8.4 Deep canonical correlation of baseline data
8.5 Correlation chart, Spearman, Penrose
8.6 Correlation chart - DCCA, Spearman, Penrose
8.7 Correlation chart, Kendall, Penrose
8.8 Correlation chart - DCCA, Kendall, Penrose
8.9 Example situations in which linear correlation should not be used [Aggarwal and Ranganathan, 2016]
8.10 MSE - Base vs. DCCA
8.11 RMSE - Base vs. DCCA
8.12 MAE - Base vs. DCCA
8.13 R² - Base vs. DCCA
9.1 Future Architecture
9.2 Ensemble deep correlation learning architecture
List of Tables
2.1 Composition of gases in Earth’s atmosphere ([Cox, 2015], [Lide, 2004])
3.1 Common sources of vehicle emissions, [Litman, 2003]
3.2 Calculated emission factors, [MfE, 2015]
3.3 NZ Aircraft LTOs and emission factors [MfE, 2015]
3.4 Meteorological air-quality models comparison
4.1 SVM Kernels
5.1 Summary of air pollution source apportionment studies using data mining techniques [Bellinger et al., 2017]
5.2 Summary of studies using data mining to understand the relationship between air pollution and health conditions [Bellinger et al., 2017]
6.1 Advantages and limitations of each CCA-related technique [Zhuang et al., 2020]
7.1 Links to data held externally
7.2 Name and map co-ordinates for air monitoring stations in Auckland region in scope for this research
7.3 Deep neural network neurons per hidden layer
7.4 Canonical correlation analysis set definitions
7.5 Canonical correlation analysis variable key
7.6 Canonical correlation analysis on air data
7.7 Multivariate Tests of Significance (S = 14, M = 21, N = 83223 1/2)
7.8 Eigenvalues and Canonical Correlations
7.9 Dimension Reduction Analysis
7.10 Variance in dependent variables explained by canonical variables
7.11 Variance in covariates explained by canonical variables
7.12 Set 1 - standardised canonical correlation coefficients
7.13 Set 2 - standardised canonical correlation coefficients
7.14 Set 1 - unstandardised canonical correlation coefficients
7.15 Set 2 - unstandardised canonical correlation coefficients
7.16 Set 1 - canonical loadings
7.17 Set 2 - canonical loadings
7.18 Set 1 - cross loadings
7.19 Set 2 - cross loadings
7.20 Canonical correlation analysis proportion of variance explained
8.1 Experiments with base dataset
8.2 Experiments with DCCA dataset
8.3 DCCA MSE decrease (%)
8.4 DCCA RMSE decrease (%)
8.5 DCCA MAE decrease (%)
8.6 DCCA R² increase (%)
9.1 Additional data types to include in future architecture
A.1 Data descriptive statistics
B.1 Full DCCA Correlations
C.1 linear_cca _init_
C.2 linear_cca fit parameters
C.3 linear_cca _get_result parameters
C.4 linear_cca transform parameters
C.5 cca_loss _init_ parameters and attributes
C.6 cca_loss loss parameters
C.7 DeepPairedNetworks(nn.Module) Parameters and attributes
C.8 DeepPairedNetworks(nn.Module) forward parameters
C.9 MlpNet(nn.Module) init parameters and attributes
C.10 class DCCA parameters
C.11 class DCCA attributes
C.12 View shapes
List of Abbreviations
AI Artificial Intelligence
ANN Artificial Neural Network
API Air Pollution Index
AQI Air Quality Index
ARC Auckland Regional Council
ARIMA Autoregressive integrated moving average model
ARMA Autoregressive moving-average model
CCA Canonical Correlation Analysis
DCCA Deep Canonical Correlation Analysis
EPA Environmental Protection Agency
GA Genetic Algorithm(s)
GLM General Linear Model
HAPINZ Health and Air Pollution in New Zealand
HMM Hidden Markov Models
KCCA Kernel Canonical Correlation Analysis
KNN k-Nearest Neighbours
LDA Linear Discriminant Analysis
LSTM Long Short-Term Memory
MA Moving Average
MAE Mean Absolute Error
MLP Multi-layer Perceptron
NIWA National Institute of Water and Atmospheric Research
PCA Principal Components Analysis
PSI Pollutant Standard Index
RMSE Root Mean Squared Error
RNN Recurrent Neural Network
SVM Support Vector Machine
SVR Support Vector Regression
Chapter 1
Introduction
1.1 Background and context
Air pollution and air quality issues are undoubtedly a very high priority for governments, organisations and people of all countries, with developing countries such as India, China and Brazil being especially affected by a high level of pollution in the air. A large number of studies have shown that the quality of air has a direct impact on public health and preventable diseases. For example, particulate matter in the air (microscopic particles with a diameter of less than 2.5 µm) can be inhaled into the lungs, increasing the risk of respiratory and general health disorders. High levels of particulate matter increase a person’s risk of developing a number of serious long-term disorders and ultimately fatal diseases.
A large number of developing countries suffer from heavy air pollution. For instance, extreme air pollution events have been occurring with higher frequency in China in recent years, especially in the Beijing, Tianjin and Hebei region. According to reports on the state of the environment in China as recently as 2015, among 338 monitored cities, 265 (78.4%) were below the national healthy air-quality standard, and the percentage of days below the standard reached 23.3% on average [Li et al., 2016].
1.1.1 Health effects of air pollution
The World Health Organisation, a specialised agency of the United Nations concerned with international public health, named air pollution as the number one issue in its "Ten threats to Global Health in 2019" [WHO, 2019]. Notwithstanding the COVID-19 pandemic that arose in 2020, it is still the top threat.
Accordingly, an increasing amount of research has been dedicated to better predicting the air quality of a city or an entire country through measurement of the pollutants in the air and the identification, quantification and assimilation of these measurements, to provide both better current air quality metrics and forecasts or predictions of future metrics.
Air pollutants - particularly particulate matter, ozone and nitrogen dioxide - are well studied and are all known to affect the lung in a number of ways. Particulate matter can be inhaled directly into the lung through the airways. Particles smaller than 2.5 µm (PM2.5) are able to get into the smallest of airways and the pulmonary alveoli. Ultra-fine particles under 100 nanometres in size are also able to enter the blood circulation and thereby reach other organs in turn. The damaging health effects of these particles are mainly attributable to their diverse range of chemical and physical properties, which are known to cause oxidative stress and inflammatory reactions throughout the body. The particles that are inhaled are always a mixture from many local sources, and can also contain those from further afield. Experimental research studies performed over many decades have been able to identify many of these particles, and of those, particles from combustion processes have been demonstrated to be particularly hazardous to human health [Cassee et al., 2013], [ISEE/ERS, 2019].
Ozone (O3) and nitrogen dioxide (NO2) are classified as respiratory irritant gases, and they have been shown to have oxidation properties. These gases are able to penetrate the lung deeply, causing oxidative stress [Halliwell et al., 1992], provoking inflammatory reactions and interacting with the structure of the pulmonary wall. In contrast, nitrogen monoxide (NO), which is produced by the human body itself, is relatively harmless.
Research performed on human cells with controlled exposure to pollutants, as well as epidemiological studies, contributes to the overall scientific evidence on the health effects of air pollutants. In a preliminary literature review, more than 71,000 contemporary studies were located on the subject. The objective of most studies is an examination of the health impact(s) of air pollution chemicals with a focus on understanding the precise mechanisms by which they interact with cells and organs. Human exposure studies are often limited to short-term effects. Some meta-studies show that large epidemiological observational studies are predominantly the method used to assess the long-term effects of air pollution on public health. Large-scale studies are useful as they often include children or participants with pre-existing health conditions.
Air pollution is responsible for a number of serious diseases. Particulate matter is known to cause cardiovascular disorders and in particular respiratory disorders, which can result in a lower overall life expectancy [Boldo et al., 2006], [Hoek et al., 2013]. Such effects are said to range all the way from short-term health effects such as hospital admissions to death. Effects such as these may occur quickly, in response to high levels of particulate matter. Studies have produced evidence for lung cancer and cardiovascular diseases as a result of air pollution, and these are now recognised as causal. Evidence of air pollution for other respiratory diseases is said to be probably causal [EPA, 2009]. In addition, it is probable that air pollution exposure has effects on the whole body [Thurston et al., 2017], especially on fetal development during pregnancy [Stieb et al., 2012], [Sun et al., 2016], lung and brain development in children [Schultz et al., 2017], [Clifford et al., 2016], diabetes [He et al., 2017] and dementia [Power et al., 2016].
Ozone leads, in the short term, to increased mortality due to respiratory diseases and an increase in the number of respiratory-related emergency treatments and hospitalisations. These effects are classified as "causal" [EPA, 2013]. Short-term ozone exposure is classified as "probably causal" with respect to the increase in overall mortality and cardiovascular mortality. Long-term exposure to ozone pollution is correlated with increased respiratory-related mortality, as well as an increased number of asthma cases and acute or worsened respiratory distress among asthmatic patients [EPA, 2013].
Nitrogen dioxide leads to exacerbation of health problems among asthmatics, an effect that is classified as causal [Möhner, 2018]. It has been classified as “probably causal” for the occurrence of other respiratory diseases [Möhner, 2018]. Recent studies [Turner et al., 2016], [Beelen et al., 2014], [Brunekreef et al., 2009], [Cesaroni et al., 2013], [Carey et al., 2013] and a systematic review [Schneider et al., 2018] also indicated an association between nitrogen dioxide and cardiovascular diseases, and the systematic review additionally indicated an association with diabetes [Schneider et al., 2018]. Figure 1.1 gives an overview of the effects on the whole body observed in population-based studies.
Figure 1.1: Health effects of pollution (ISEE)
1.1.2 New Zealand
Air pollution in New Zealand is most often researched in relation to its known public health effects, as it is in other countries. To this end, most New Zealand central government departments, agencies and affiliated local government bodies such as regional councils and district health boards have for many years regularly commissioned studies and funded research to better understand and further promote greener air quality practices.
The health effects of air pollution as an area of study focusing on New Zealand saw the first major concerted effort in 2007 by way of a comprehensive study undertaken by [Fisher et al., 2007]. Their study, known as the Health and Air Pollution in New Zealand study (HAPINZ), identified and quantified the health risks in New Zealand due to people’s exposure to air pollution. By explicitly identifying the effects of air pollution throughout New Zealand and linking these effects to the various sources of air pollution, the study provided information that helped in formulating effective policy options with the end goal of real and measurable improvements in the health of New Zealanders. Covering 67 urban areas, the study was presented in 2004 based on data from 2001 and was subsequently revised in 2012 with data from 2006, largely to account for new developments in the monitoring of PM10 required as a national standard.
1.1.3 Māori perspective
New Zealand’s population is around 5 million people, with the indigenous Māori people making up around 14.9% of the total population. New Zealand takes its cultural heritage very seriously, and issues for Māori are given strong consideration in almost every area of government.
Research by [Jones et al., 2014] postulates that, as with many other indigenous peoples worldwide, the impact of colonisation has led to dispossession of land and destabilisation of cultural foundations for Māori, as well as a number of social, economic and political marginalisation impacts. Colonial systems and structures are said to be responsible for maintaining an inequitable distribution of the determinants of health for Māori people. Māori are therefore said to be disproportionately exposed to adverse social and economic conditions, with consequent higher morbidity and mortality. Life expectancy for Māori is on average seven years lower than for non-Māori people, and there are significantly higher rates of most major diseases such as cardiovascular disease, cancers, chronic respiratory disease and diabetes. This confluence of factors increases the overall vulnerability of Māori in New Zealand to climate change (air quality related) health impacts.
The Waikato Regional Council (New Zealand) shows great cultural awareness in this area, and its position on air quality monitoring lists a number of climate change-related impacts on the Māori people. It explains that Māori are concerned about the effects air pollution has on:
• Customary resources - plants and animals require clear and pure airways. Native plants provide food and natural medicines. In the past, the only thing that disturbed the natural quality of the air was the cooking fires of tupuna.
• Visibility - the moon, stars and rainbows are important in Māori mythology. The stars are particularly important as they represent the generations that have passed into the night. The movements of the moon and the lunar calendar help tell the time of year for sowing and harvesting.
• The spiritual values of waahi tapu and other taonga.
1.1.4 History
Early work in the air quality domain first appeared in the 1930s, when [Bosanquet and Pearson, 1936] described the dispersion of smoke from chimneys. [Sutton, 1947] went on to describe in more detail the dispersion of smoke pollutants from factory chimneys, taking into account the Gaussian distribution of the vertical and crosswind components together with the effect of the plume reflecting from the ground. With increasing regulation of environmental pollution in the 1960s there was a leap forward in the amount of research done on air pollutant plume dispersion modelling.
The first method, the Pollutant Standard Index (PSI), was developed in 1976 by the United States Environmental Protection Agency. This was later revised to become the Air Quality Index (AQI). The PSI accounted for five air pollutants: sulphur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3) and particulate matter (PM10). The AQI added PM2.5 and an 8-hour ozone (O3) concentration average. Though used by a number of countries, the AQI standards adopted were not all the same [Cheng et al., 2007]. China utilised its own air quality standards, GB3095-2012, and previously GB3095-1996, which adjusted the acceptable concentration limit values of PM2.5 and 8-hour O3. The AQI has limitations, and other air quality models have been subsequently proposed. Air quality forecasting was another way for countries to determine what measurements may be in the future, usually for early warning purposes.
1.2 Motivation and Gaps
The intent of this research is to improve forecasting accuracy and performance for common air quality pollutant variables that are important to human health by seeking to understand how pollutants relate to one another. The study is limited in scope to Auckland, New Zealand, as it is the country’s most populous city and therefore much more data is available.
An in-depth review of current literature on the application of deep learning and correlation analysis in the domain of air quality has shown that there are areas yet to be investigated for accuracy and performance gains, in particular the use of newer deep learning variants of statistical methods. In addition, there is a current trend in academic research of using historical data only, as opposed to pre-correlating a large number of contributing factors to show what is relevant and what is not.
While in a computing problem we may select a dataset of any type with the aim of predicting the movements of these figures at a future time without necessarily concerning ourselves with the relationships, there is interest in the environmental science domain in understanding the relative movements of multiple pollutants [Bai et al., 2020]. In this sense, the type of data and its interrelationships are an important consideration and outcome. This better understanding of the cause and effect relationship informs policy making, enforcement activities and safety guidelines that ensure certain interactions are minimised or take place in certain regions to reduce exposure levels and health risk, and it supports the ability to understand the complex chemical mixture of the atmosphere [Ma et al., 2020a], [Núñez-Alonso et al., 2019].
Consider the Pearson correlation matrix, a tool which shows how certain parameters may depend on one another. It represents a basic and common starting point; it is included as Figure 8.3 and used on our dataset as part of a baseline investigation. Optimising a model by using correlations, together with the addition of a wider spread of relevant data types, should therefore show us what is more likely to influence the overall air quality in a geographical area.
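As a rough illustration of the Pearson correlation matrix described above, the following Python sketch computes one with NumPy and keeps only the variables whose absolute correlation with a target exceeds a cut-off. The 0.50 cut-off mirrors the threshold quoted in the abstract; the data are synthetic, and the variable names, the shared "traffic" driver and the helper functions `pearson_matrix` and `select_by_threshold` are assumptions made for this example, not part of the thesis code or dataset.

```python
import numpy as np

def pearson_matrix(X):
    """Pearson correlation matrix of the columns of X (rows are samples)."""
    return np.corrcoef(X, rowvar=False)

def select_by_threshold(X, names, target, threshold=0.5):
    """Names of variables whose |r| with `target` exceeds `threshold`."""
    r = pearson_matrix(X)
    t = names.index(target)
    return [name for j, name in enumerate(names)
            if j != t and abs(r[t, j]) > threshold]

# Synthetic stand-in for monitoring data (NOT the thesis dataset).
rng = np.random.default_rng(0)
n = 500
traffic = rng.normal(size=n)  # hypothetical shared driver of several pollutants
names = ["NO2", "CO", "wind_speed", "PM2.5"]
X = np.column_stack([
    0.8 * traffic + 0.2 * rng.normal(size=n),  # NO2: strongly tied to the driver
    0.7 * traffic + 0.3 * rng.normal(size=n),  # CO: strongly tied to the driver
    rng.normal(size=n),                        # wind_speed: unrelated here
    0.6 * traffic + 0.4 * rng.normal(size=n),  # PM2.5: the prediction target
])

corr = pearson_matrix(X)
selected = select_by_threshold(X, names, target="PM2.5")
print(np.round(corr, 2))
print(selected)
```

In this toy setting the unrelated variable is dropped and the two variables sharing the common driver are retained, which is the kind of filtering the baseline investigation relies on.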
Present "classical" statistical tools can quickly become overwhelmed with the sheer quantity and velocity of data that is now available [Bzdok, 2017], [Bzdok and Yeo, 2017], [Smith and Nichols, 2018]. Whether for commercial application or academic research purposes, we are now at the point where both the developments in technology and the availability of massively rich datasets provide the means for us to build new toolboxes and uncover new insights using these new methods of analysis.
In the area of health, and particularly in the neurological speciality of cognitive neuroscience, canonical correlation analysis has recently provided a much more robust understanding of the subtleties that population variations introduce to MRI results by simultaneously identifying these sources across two datasets [Wang et al., 2018]. To this end, using deep learning with this technique, in line with recent developments, should theoretically be applicable to air quality.
[Di Leonardo et al., 2014] explain their choice of canonical correlation analysis for their environmental research problem by noting that, in a natural system such as ours that is influenced by human activities, some variables can be considered independent while others depend on them. They state that CCA held an advantage over multiple linear regression because it provides estimates with the best possible correlation.
The choice of canonical correlation analysis, performed using newer deep learning computing advances, allows the relationships between the air quality variables to be reduced to a smaller number of statistics while still retaining the important facets of that data.
In some ways, the choice of deep canonical correlation analysis is motivated by the selection of other similar techniques by researchers - it is, quite simply put, another dimension reduction technique.
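For orientation, the objective that CCA optimises can be written compactly. The notation below is an assumption of this sketch rather than notation fixed by the text so far: X and Y are the two sets of variables, a and b are weight vectors, the Σ terms are the corresponding covariance and cross-covariance matrices, and f and g stand for the two neural networks (with parameters θ_f and θ_g) that replace the linear projections in the deep variant, shown here in its simplest one-dimensional form.

```latex
\rho_{\mathrm{CCA}}
  = \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top}X,\ b^{\top}Y\right)
  = \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
                       {\sqrt{a^{\top}\Sigma_{XX}\,a}\ \sqrt{b^{\top}\Sigma_{YY}\,b}},
\qquad
\rho_{\mathrm{DCCA}}
  = \max_{\theta_f,\,\theta_g}\ \operatorname{corr}\!\left(f(X;\theta_f),\ g(Y;\theta_g)\right)
```

In both cases many observed variables are summarised by a small number of maximally correlated projections, which is the sense in which the technique reduces dimension.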
[Bai et al., 2020] state that pollutant analysis and pollution source tracing are critical issues in air quality management, in which correlation analysis is important for pollutant relation modelling. They state that there is an urgent demand to explore the influence relation of pollutants and the potential pollution sources. In their literature review, they note that the mainstream correlation analysis methods are partial correlation ([Baba et al., 2004], [Li et al., 2017]), principal component analysis ([Wold et al., 1987], [Rahmani and Atia, 2017]) and grey correlation analysis ([Kuo et al., 2008], [Tang et al., 2019]), all of which have been applied widely in different fields. Deficiencies exist in these methods, largely on account of their being unable to handle real-time data.
1.3 Research Challenges
This research’s objective is to improve the accuracy and performance of air quality predictions by leveraging advances in computing. To make a prediction, we consider the currently missing relationships between different air quality variables that are strongly correlated, rather than relying solely on historical air quality data. Using all available data sources allows our correlations (and therefore predictions) to feature increased accuracy.
In order to do this, we use deep learning, which has been successfully applied to a number of complex workloads (e.g. image classification, natural language processing, prediction tasks, object detection). A large number of university researchers and corporations such as Facebook and Google have demonstrated empirical successes with these types of tasks in a diverse number of ground-breaking applications ([Najafabadi et al., 2015], [Fradkov, 2020]).
There have, however, been comparatively few studies applying deep correlation analysis techniques to air quality prediction, owing to their recent development. Those studies that have used them in other domains have already shown superior performance over traditional machine learning (i.e. non-multiview data based) and less complex neural network methods ([Zhao et al., 2017], [Rybarczyk and Zalakeviciute, 2018], [Iskandaryan et al., 2020], [Fradkov, 2020]).
Using air quality data presents some particular challenges, which specifically mean that the outcome of correlations and predictions will only ever be as accurate as the quality of the data they rely on. Section 3 looks in depth at the way that air pollution is measured in a contemporary setting, both in New Zealand and internationally.
Section 2.3 outlines the problems and challenges with data analysis in detail in the context of environmental data.
1.4 Research Questions
To develop a solution that offers both accurate predictions and efficiency, we need to look not just at a single variable, such as the commonly used PM2.5, or a couple of other common ones, but at all variables being collected in Auckland that impact human health, as these are all of interest to authorities, organisations and people alike. Including a method to discriminate between what is useful and what is not makes the solution inherently more useful and much more compelling for academia and industry to utilise. A solution that ignores all but one type of variable is very limited in the impact it can have on decision making.
There are a number of questions that existing deep air quality analysis research worldwide has been working toward answering in recent years. These questions form the basis of this research:
• RQ1: Can we relate all the variables involved in air-quality monitoring to establish which ones have dependencies on the others?
• RQ2: Can we further model this inter-relationship and automate this by using the available data?
• RQ3: Can we use this inter-relationship to further increase the accuracy of air quality prediction for PM2.5?
• RQ4: In the event that RQ3 is successful, can we apply the deep correlation architecture to urban air quality monitoring?
The solution to these research questions must meet the following requirements, or success criteria, in order to be considered a success:
• The approach should be able to improve the accuracy of air quality predictions.
• The approach should be able to improve the efficiency of air quality predictions.
• Accuracy and efficiency increases resulting from the approach should be able to be measured and verified quantitatively.
• The approach should have efficient operation time-frames so as to ensure that any accuracy and efficiency gains are not negated by increasing work needing to be done for preparation.
• The approach should reflect real-world best practices, and result in system attributes of commercial quality such as ease-of-use, automation, resilience and ability to alter operating parameters or variables without significant re-work.
1.5 Contributions
This thesis presents an architecture utilising Deep Canonical Correlation Analysis, which itself builds upon (standard) Canonical Correlation Analysis - a tool that has a number of use cases in fields such as biomedical research.
Presented here is a road-map for how Deep Canonical Correlation Analysis can be used to effectively and efficiently understand and gain insight from dense air quality data sets, together with comparative experiments on a data set from Auckland, New Zealand showing its practical application.
Limitations of other techniques, as well as limitations in the way in which the air pollutant chemicals are collected, measured or estimated, are outlined as part of the process of engineering an optimal solution.
1.6 Thesis organisation
Chapter 1 provides an introductory look at the problem of air pollution, including the health effects and historical context. Through this, motivation to approach the issue from another angle is found and a gap in research to date is identified. Four research questions are formulated and serve as the guiding principles on which the rest of the thesis builds.
Chapter 2 represents an exploration of the environment and the data available. Areas such as the composition of Earth’s atmosphere are explored in order to understand in depth what it is we are, in fact, measuring and how it is commonly represented. Some elements of atmospheric and air data are explored so that common problems and challenges are captured at the start and overcome where possible. Auckland’s air data is investigated and outlined in depth for the most common pollutants.
Chapter 3 outlines all the different types of pollution that are generally understood, placing special attention on the New Zealand context to ensure that any specifics that may be encountered aren’t overlooked or under/over-estimated. The ways that these pollutant types are determined (i.e. measured, estimated or otherwise inventoried) are reviewed, since we may experience less accurate results because of the assumptions made by the mathematical models involved. The accuracy of a deep learning-based system, which draws on a huge amount of historical data to increase accuracy, will itself be limited by the way in which these measurements are produced.
Chapter 4 begins by looking at the computing field’s contributions to the field of air quality prediction, from early to more modern. Common methodologies in the air quality prediction field are explored.
Chapter 5 looks in great depth at the science of data correlation as a technique by studying every way in which we may currently hope to make linkages between different variables, explaining these with examples where appropriate.
Chapter 6 shows the selected methodology - canonical correlation analysis - in the basic form and the many different variants, guidelines/use-cases, advantages and limitations.
Chapter 7 implements the solution for this research problem, using current best-practices and tools.
Chapter 8 analyses and discusses the results of implementation in Chapter 7 by using baseline and DCCA-modified datasets and running a number of different algorithms and performance measures. By weighing these and discussing them, a case can be made that DCCA represents increases in accuracy when used on pre-processed Auckland multi-site air quality data.
Chapter 9 closes the thesis with a summary of the contribution, a discussion of important limitations and a look at potential future work to further develop this work into other areas.
1.6.1 Definitions
Throughout this thesis, a number of very specific computer science, statistical and environmental terms and abbreviations are used. Presented in this subsection are descriptions of these terms. Canonical correlation analysis terminologies are described separately from the more generic terms to highlight the extra distinctions and differences.
General terminology
• Air Quality - refers to the overall composition of the air with respect to the generally accepted pollutants recognised as being harmful. Examples of these are PM2.5, PM10, NO2, SO2, CO2 and NOx.
• Factor/Impacting Factor - refers to something that has an effect on air quality variables. For example, Meteorology or Traffic.
• Variable/Air Quality Variable - in the context of this work refers to the pollutants being measured and/or predicted by work in this research or other related research. For example, PM2.5.
Canonical correlation analysis terminology
This specific Canonical Correlation Analysis terminology in this sub-section is adopted from [Hair et al., 2006] and [Sherry and Henson, 2005].
• Canonical correlation coefficient - measures the strength of the overall relationship between two linear composites ("canonical variates"), one variate for the independent variables and one for the dependent variables. It is a representation of the bivariate correlation between two canonical variates in a canonical function.
• Canonical cross-loadings - correlation of each observed independent or dependent variable with the opposite canonical variate.
• Canonical function - relationship (correlational) between two linear composites ("canonical variates"). Each canonical function has two canonical variates - one for the set of dependent variables and one for the set of independent variables. The strength of the relationship is given by the canonical correlation coefficient.
• Canonical loading - simple linear correlation between the independent variables and their respective canonical variates. These can be interpreted like factor loadings - they are also known as canonical structure correlations. Each independent variable has different canonical loadings for each canonical function.
• Canonical roots - squared canonical correlation coefficients providing an estimate of the amount of shared variance between the respective canonical variates of dependent and independent variables. Canonical roots are also known as eigenvalues.
• Canonical variates - linear combinations representative of the optimal weighted sum of two or more variables. Formed for both dependent and independent variables in each canonical function. Also known as linear composites, linear compounds or linear combinations.
• Canonical correlation coefficient (R_c) is the Pearson r relationship between the two synthetic variables on a given canonical function. Because of the scaling created by the standardised weights in the linear equations, this value cannot be negative and only ranges from 0 to 1. The R_c is directly analogous to the multiple R in regression.
• Squared canonical correlation (R_c^2) is the simple square of the canonical correlation. It represents the proportion of variance (i.e., variance-accounted-for effect size) shared by the two synthetic variables. Because the synthetic variables represent the observed predictor and criterion variables, the R_c^2 indicates the amount of shared variance between the variable sets. It is directly analogous to the R^2 effect in multiple regression.
• Canonical function (or variate) is a set of standardised canonical function coefficients (from two linear equations) for the observed predictor and criterion variable sets. There will be as many functions as there are variables in the smaller variable set. Each function is orthogonal to every other function, which means that each set of synthetic predictor and criterion variables will be perfectly uncorrelated with all other synthetic predictor and criterion variables from other functions. Because of this orthogonality, the functions are analogous to components in principal component analysis. A single function would be comparable to the set of standardised weights found in multiple regression (albeit only for the predictor variables). This orthogonality is convenient because it allows one to separately interpret each function.
• Standardised canonical function coefficients are the standardised coefficients used in the linear equations discussed previously to combine the observed predictor and criterion variables into two respective synthetic variables. These weights are applied to the observed scores in Z-score form (thus the standardised name) to yield the synthetic scores, which are then in turn correlated to yield the canonical correlation. The weights are derived to maximise this canonical correlation, and they are directly analogous to beta weights in regression.
• Structure coefficient (r_s) is the bivariate correlation between an observed variable and a synthetic variable. In CCA, it is the Pearson r between an observed variable (e.g., a predictor variable) and the canonical function scores for the variable’s set (e.g., the synthetic variable created from all the predictor variables via the linear equation). Because structure coefficients are simply Pearson r statistics, they may range from -1 to +1, inclusive. They inform interpretation by helping to define the structure of the synthetic variable, that is, what observed variables can be useful in creating the synthetic variable and therefore may be useful in the model. These coefficients are analogous to those structure coefficients found in a factor analysis structure matrix or in a multiple regression as the correlation between a predictor and the predicted Y' scores [Courville and Thompson, 2001], [Henson, 2002].
• Squared canonical structure coefficients (r_s^2) are the square of the structure coefficients. This statistic is analogous to any other r^2-type effect size and indicates the proportion of variance an observed variable linearly shares with the synthetic variable generated from the observed variable’s set.
• Canonical communality coefficient (h^2) is the proportion of variance in each variable that is explained by the complete canonical solution, or at least across all the canonical functions that are interpreted. This statistic indicates how useful the observed variable was for the entire analysis.
• Redundancy Index - amount of variance in a dependent or independent canonical variate that is explained by the other canonical variate in the canonical function. It can be calculated for both the dependent and independent canonical variates in each canonical function.
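Several of the quantities defined above can be made concrete with a small numerical sketch. The code below is an illustration only, not the implementation used later in this thesis: it assumes classical (linear) CCA computed by a QR decomposition followed by an SVD, uses synthetic data in place of real predictor and criterion sets, and all function and variable names are hypothetical.

```python
import numpy as np


def column_correlations(a, b):
    """Pearson r between every column of `a` and every column of `b`."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return a.T @ b / a.shape[0]


def linear_cca(x, y):
    """Classical CCA: canonical correlations and canonical variates."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two centred sets.
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)
    # Singular values of Qx' Qy are the canonical correlations (R_c).
    u, s, vt = np.linalg.svd(qx.T @ qy, full_matrices=False)
    r_c = np.clip(s, 0.0, 1.0)
    return r_c, qx @ u, qy @ vt.T


# Synthetic data (assumption): the first variable in each set shares a signal.
rng = np.random.default_rng(42)
n = 1000
shared = rng.normal(size=n)
x = np.column_stack([shared + 0.5 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
y = np.column_stack([shared + 0.5 * rng.normal(size=n),
                     rng.normal(size=n)])

r_c, x_variates, y_variates = linear_cca(x, y)
r_c_squared = r_c ** 2                                # canonical roots
loadings = column_correlations(x, x_variates)         # structure coefficients
cross_loadings = column_correlations(x, y_variates)   # canonical cross-loadings
communality = (loadings ** 2).sum(axis=1)             # h^2 per observed variable
# Redundancy of the x set given each y variate: mean squared loading times R_c^2.
redundancy = (loadings ** 2).mean(axis=0) * r_c_squared

print("R_c:", np.round(r_c, 3))
print("structure coefficients:\n", np.round(loadings, 3))
```

With three variables in one set and two in the other, two canonical functions are produced, matching the statement that there are as many functions as variables in the smaller set. In this linear setting each cross-loading equals the corresponding structure coefficient multiplied by R_c.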
Chapter 2
The Environment and Data
Environmental data refers to the information collected from monitoring and is reflective of the variations that occur in the measurement area. It is gathered for analysis and allows statistical or other methods to be used to gain an understanding of what environmental issues exist, as well as other city planning matters of importance such as long-term trend identification. Monitoring equipment is often set up in distributed, networked stations and provides a constant data stream on several variables.
Environmental issues are conditions which can potentially be harmful to humans. As described in Chapter 3, there are not only many factors that can impact the quality of air, but measuring and calculating these can have varying levels of accuracy and precision.
From a New Zealand standpoint, the best available measurement currently being taken is PM10. On a regional basis, this is the air pollutant that is the most extensively recorded.
2.1 Air composition of the atmosphere
Earth’s atmosphere is a mix of gases that surrounds the planet, held in place by gravity, which we commonly know as ’air’. This atmosphere exerts pressure on the surface of the planet, which allows liquid water to exist, thereby sustaining life. It also absorbs solar radiation (UV), which creates a warming effect not unlike a greenhouse - resulting in the common name "the greenhouse effect". This gaseous layer likewise reduces the temperature extremes that could otherwise occur between day and night, known as the "diurnal temperature variation".
From a chemical composition perspective, dry air by volume is comprised of the gases shown in Table 2.1.
Depending on location, water vapour is also usually present - generally, about 0.4%, increasing to around 1% at sea level. The composition varies with altitude, resulting in differing temperatures and pressure, and natural air suitable to humans and animals is only found in the Earth’s troposphere. Other layers exist, divided by temperature, composition and other factors.
By and large, the atmosphere of Earth is made up of three main gases - nitrogen, oxygen and argon. Water vapour - a greenhouse gas - is present at around 0.25% and may vary in its concentration due to heat, cold, humidity and other mixes of gases (known as trace gases) [Wallace and Hobbs, 2006], [Hansen et al., 2007]. Other gases commonly present include the greenhouse gases CO2, CH4 and O3, and noble gases such as Ar, Ne, He, Kr and Xe. There are also potentially traces of other gases dependent on the regional characteristics. We look at the detail of this in Chapter 3. The concentration of gases remains generally constant until around 10 000 m.
Gas     Percentage
N2      78.084%
O2      20.946%
Ar      0.9340%
CO2     0.0407%
Ne      0.001818%
He      0.000524%
CH4     0.00018%
H2      0.000055%
Kr      0.000114%
Table 2.1: Composition of gases in Earth’s atmosphere ([Cox, 2015], [Lide, 2004])
2.1.1 Stratification
Air in the atmosphere is separated - or stratified - into five distinct layers. The composition of the air mixture is influenced by air pressure and density, and heavily by temperature. A complete treatment of all layers of Earth’s atmosphere is neither required for this research nor feasible, so we concentrate here on the troposphere, which is closest to the Earth’s surface and is the layer where the activity of interest to us is found. From highest to lowest, the five layers are:
• Exosphere: 700-10,000 km
• Thermosphere: 80-700 km
• Mesosphere: 50-80 km
• Stratosphere: 12-50 km
• Troposphere: 0-12 km
Troposphere
The troposphere begins at Earth’s surface and reaches upwards to about 12 km. This altitude varies, from about 9 km at the geographic poles to about 17 km at the Equator, and can also be affected by weather. At the top of the troposphere, a boundary called the tropopause marks the beginning of the next layer. This boundary may be distinguished by a temperature inversion in some places, and by an isothermal layer in others.
Despite variations, temperature generally declines with increasing altitude in the troposphere, largely because of energy transfer from the Earth’s surface. This means that the lowest part of the troposphere - at the Earth’s surface - is typically the warmest. About 80% of the mass of the atmosphere is contained in the troposphere. This density can be attributed to the immense weight of the atmosphere that sits above it, leading to compression. 50% of this total atmospheric weight is contained within the lower 5.6 km of the troposphere.
Most of the weather is generated in the troposphere because nearly all of the atmospheric water vapour is located there. All of the weather-associated cloud types are there, and most conventional aviation activities take place within it - especially for propeller aircraft.
Other layers
Within the five principal layers above, distinguished by temperature, other properties identify several secondary layers.
The Ozone layer is in the stratosphere. The concentrations of Ozone (O3) here vary from 2-8 parts per million, considerably higher than in the lower atmosphere. Nevertheless, this is quite small in comparison to the main components of the atmosphere. The Ozone layer is mostly located in the lower portion of the stratosphere, at around 15–35 km. Its thickness can vary seasonally and with geographic location. Around 90% of the Ozone contained in the atmosphere is found within the stratosphere.
The ionosphere is a region ionised by solar radiation, and is where auroras can be seen. During daylight it extends from 50 km to 1,000 km and includes the mesosphere, thermosphere and a portion of the exosphere. Ionisation in the mesosphere is usually absent at night, meaning that auroras, when present, are seen only in the thermosphere and the lower exosphere. The ionosphere forms the inner edge of the magnetosphere, and is of importance because it influences things such as radio signals.
The homosphere and heterosphere are differentiated by whether the atmospheric gases are well mixed. The homosphere is a surface-based layer that extends up to and includes the lowest portion of the thermosphere. Its chemical composition does not depend on the molecular weights of the gases, because they are mixed by air turbulence. The homosphere reaches up to the turbopause, at about 100 km (roughly the edge of space), around 20 km above the mesopause.
The heterosphere is made up of the exosphere and the remainder of the thermosphere. In this layer, the chemical make-up varies with altitude, because the distance that particles are able to move before colliding with one another is large compared with the scale of the motions that cause mixing. This allows the gases to become stratified by molecular weight. Heavier particles (e.g. O2, NO2) are found only at the bottom and lighter particles (e.g. H) at the top.
The planetary boundary layer is the part of the troposphere which is, as the name might suggest, closest to the surface of the Earth and is thus impacted the most by it. In daytime the gas is well mixed, but at night it becomes stably stratified, with only minor mixing occurring. This layer’s depth can be quite variable as a result - ranging from about 100 m on a clear night to 3,000 m or more in a dry region in the afternoon. It has an average temperature in the range of 14-15 degrees Celsius.
2.2 Data Collection
In New Zealand, air quality data and measurements are captured using ’airsheds’, which are formally ’gazetted’ (published) by the government. For management purposes, airsheds are considered to mean ’air quality management areas’, each designated as the responsibility of the regional council it falls under.
Councils in New Zealand have identified populated areas in the country which are known to exceed air quality guidelines. Some airsheds have also been focused on for reasons relating to the population within the airshed zone, the uniqueness of the weather patterns in their region, or because emissions from local industrial activity need to be managed closely. The airsheds cover about two-thirds of where the New Zealand population resides, which is about 1.5% of land area.
When the National Environmental Standards were introduced by the Ministry for Environment (MfE), 71 airsheds were identified. 38 of these were actively being monitored as of 2012. By taking an inventory of the airshed characteristics, MfE is able to build up a profile of some of the factors that are important in making air quality management decisions. Where there is a breach of the PM10 standard, for instance, the regional council responsible needs to develop a plan to improve the air quality [Kuschel et al., 2012].
2.3 Problems and challenges with data analysis
There are several problems and challenges associated with utilising environmental data, as several factors inform the analytics process. Some issues are specific to air quality analysis, while others are more general problems that confront data analysis in any subject area. This section explores these challenges.
Where air quality data is collected, sensors are involved to make that process possible. By its very nature, a sensor of this type is exposed to potentially volatile environmental conditions, and the data collected may therefore be noisy. Conditions such as lighting, wind factors such as direction and mixing, equipment limitations such as failure or unreliability of components, and sampling procedures or measurement processes particular to different sensors can all contribute towards spurious data being captured [Celik, 2009], [Scrimger, 1985]. The noise experienced in air quality data, and in other environmental data alike, requires care in analysis to ensure predictions aren’t distorted by these factors [McCune, 1997].
The collection of environmental data takes place across time with additional spatial considerations. This spatio-temporal data may be collected by monitoring stations with this distinction identified. With spatial dimensions already identified, Big Data such as this is now a much more popular way to uncover insights on the state of air quality in an area. When considering the impact of this massive new data stream from a computational analysis perspective, there are other factors to account for, such as the dependence one monitoring station may have on another. Further challenges include differing time frames of observations, missing data, weather conditions, failures of equipment or components, and maintenance of the site during the observation period.
Continuous (or endless) data streams are an inevitable part of collecting ongoing environmental information, and therefore a factor that needs consideration. As environmental data is most often collected on an hourly basis, continuously and possibly without end, its handling needs to be factored into a solution from a number of different perspectives. Generally speaking, SO2, NO2, O3 and PMx readings are captured hourly and are available in these types of data streams [Babcock et al., 2002]. Continuous streams raise database considerations that are out of the scope of this research, though the transformation of this data to allow pre-processing is important, as the overhead of transactions on this sort of data can be intensive from a hardware perspective [Terry et al., 1992].
For instance, errors in rapidly arriving data produce uncertainties, and minimising these uncertainties in real time is a challenge that is also being approached in other areas of research [Titov et al., 2005]. Concepts such as the overall accuracy of the data, the speed at which the process is managed and the total robustness of the solution are relevant.
Data sets with missing or incomplete entries are a common occurrence, brought about by a number of possible causes: data acquisition errors, insufficient or incomplete sampling resulting in sets that are no longer fit for purpose, and errors in measurements caused by equipment failure or by strange environmental phenomena that temporarily produce outlier results [Junninen et al., 2004]. This type of incomplete dataset is reported often in research, and leads to different results than would have been found had the set been complete [Elliott and Hawthorne, 2005]. Because the intended methodologies, procedures and algorithms can no longer be applied with as much confidence, there is an interruption in trend reporting or model training [Xue et al., 2005]. For the research in this project, missing entries would theoretically have an impact on the integrity of any correlation calculations, skewing the results [Li et al., 2005]. While there is a wide range of methods that can be used to handle these types of scenarios, agreement on which is best is mixed [Graham, 2009].
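The effect of missing entries on a correlation estimate can be illustrated with a short sketch. It assumes synthetic data with values missing completely at random, and compares two common (but by no means the only) treatments: deleting incomplete pairs and replacing gaps with the mean. The series names and the 30% missing rate are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic, correlated readings (assumption, not thesis data).
rng = np.random.default_rng(1)
n = 2000
pm25 = rng.normal(size=n)
no2 = 0.7 * pm25 + rng.normal(scale=0.5, size=n)

# Remove roughly 30% of the NO2 readings completely at random.
no2_missing = no2.copy()
no2_missing[rng.random(n) < 0.3] = np.nan
observed = ~np.isnan(no2_missing)

# Correlation on the complete data, for reference.
r_full = np.corrcoef(pm25, no2)[0, 1]

# Treatment 1: drop every pair with a missing value.
r_deleted = np.corrcoef(pm25[observed], no2_missing[observed])[0, 1]

# Treatment 2: fill the gaps with the mean of the observed values.
no2_imputed = np.where(observed, no2_missing, np.nanmean(no2_missing))
r_imputed = np.corrcoef(pm25, no2_imputed)[0, 1]

print(f"complete: {r_full:.3f}  deleted: {r_deleted:.3f}  imputed: {r_imputed:.3f}")
```

In this setting deletion leaves the estimate close to the complete-data value, while mean imputation pulls it towards zero, because the filled-in constant carries no information about the other variable. Real gaps are rarely completely at random, so the behaviour on monitoring data may differ.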
There are several other issues with data analysis discussed by [Hair et al., 2006]:
• Volume - due to the rise of ’big data’, more pieces of data are being collected than ever before. Quantifying this data is sometimes a matter of speculation, and there may be data that has never historically been collected before.
• Variety - there is much more variety in the types and locations of data than previously. With the addition of potentially vast numbers of variables into the data analysis phase, there is a question of which variables to include. Non-metric data may also be available, and how it can be adapted can pose a challenge.
• Velocity - with larger amounts of data available, and the different attributes of this data, much more data is produced much more quickly than historically. How to use this information effectively becomes vital to both performance and accuracy.
• Veracity - with this large amount of data arriving at more frequent intervals, ensuring that the data is complete and correct becomes a problem. There may be several errors in the data and the collection process, such as missing values, measurement error and one-off events, which can skew prediction and forecast results.
• Variability/Value - the flow or availability of information that is being recorded may be affected, and this timeliness of information can affect the subsequent analytical or predictive activities.
• Validity - the degree to which the data is representative of what it is supposed to be measuring. For example, asking the wrong questions will yield technically correct data that is of no use for the particular problem being investigated.
Additionally, [Harford, 2014] identifies four areas that may cause problems:
1. Over-rating accuracy when ignoring false positives
2. Replacement of causation with correlation
3. Sampling bias not being recognised
4. "Letting the data speak" (i.e. ignoring the presence of spurious correlations)
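The last of these points, spurious correlation, is straightforward to demonstrate. The sketch below is an illustration under stated assumptions: a short target series and a large pool of candidate variables are all generated as independent random noise, so any correlation found between them arises by chance alone. The sizes chosen (20 observations, 1,000 candidates) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_candidates = 20, 1000

# Independent noise: by construction no candidate is related to the target.
target = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_obs, n_candidates))

# Pearson r between the target and every candidate column.
t = (target - target.mean()) / target.std()
c = (candidates - candidates.mean(axis=0)) / candidates.std(axis=0)
r = t @ c / n_obs

strongest = np.abs(r).max()
print(f"strongest chance correlation: {strongest:.2f}")
print(f"candidates with |r| > 0.5: {(np.abs(r) > 0.5).sum()}")
```

When many variables are screened against few observations, some will appear strongly correlated without any underlying relationship, which is why correlations found by screening need to be checked on data not used to find them.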
2.4 Air quality data in Auckland
Air quality in Auckland, and New Zealand as a whole, is affected by emissions from different sources. The Ministry for Environment identified domestic heating as a particular source of note, with a number of health impacts identified around the country from a number of other sources too. Identification of these sources was part of a unitary plan, and groups throughout New Zealand were asked to study them through data from their airsheds and to provide emission inventories which aid in identifying them. 15 regional councils provided feedback in 2012 indicating that the major emission sources were domestic heating, transport, industrial/commercial activities and outdoor burning.
The extent to which these sources are responsible for air quality degradation varies, especially in winter, when domestic heating from fuel combustion is higher. The estimate indicates that about 72% of Auckland emissions were generated from domestic heating in winter, versus 4% in summer.
Descriptive statistics for the air data collected for Auckland can be found in Appendix A.
2.4.1 Particulate Matter - PM2.5
New Zealand's PM2.5 emissions from anthropogenic sources were estimated during the National Inventory as approximately 34,000t. These are visualised by sector in Figure 2.1, and Figure 2.2 shows specific emissions from energy-related activities, as PM2.5 is primarily (68%) from this type of activity. The next largest source is outdoor burning of agricultural content (23%). Heating of residential homes is shown in Figure 2.3 as contributing 33% of PM2.5 in total, a significant amount.
2.4.2 Particulate Matter - PM10
A notable quantity of PM10 emissions is generated by natural means, quite separate from any anthropogenic influence. These natural sources are reported to include sea salt, windblown dust, bush fires, volcanoes, pollens and other biogenic material.
Total PM10 emissions from anthropogenic sources in New Zealand were estimated as 46,000t per annum in 2015, and are visualised by sector in Figure 2.4. Much like PM2.5, PM10 emissions are produced largely by the energy sector (54%). Outdoor burning is also a large contributor (22%), comprising burning activities from:
• Fields and agricultural waste (3%)
• Bio-mass (15%)
• Garden waste (4%)
Other types of emission make up a large portion of the PM10 being generated, specifically:
• Road dust (13%)
• Industrial processes (5%)
• Animal housing (1%)
Figure 2.1: PM2.5 emissions by sector for 2015, [MfE, 2015]
Figure 2.2: Breakdown of PM2.5 emissions from energy related activities, [MfE, 2015]
Figure 2.3: PM2.5 emissions by source for 2015 (t/yr), [MfE, 2015]
Figure 2.4: PM10 emissions by sector, [MfE, 2015]
Figure 2.5: Breakdown of PM10 emissions from energy related activities, [MfE, 2015]
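The PM10 sector shares quoted above can be converted into approximate tonnages from the 46,000t annual total. The sketch below uses only the percentages and total stated in this section; the variable names are illustrative, and because the shares are rounded, the resulting tonnages should be read as approximations.

```python
# Total anthropogenic PM10 for 2015, in tonnes per annum, as quoted in the text.
TOTAL_PM10_TONNES = 46_000

# Sector shares (percent) as quoted in the text.
sector_share_pct = {
    "Energy": 54,
    "Outdoor burning": 22,
    "Road dust": 13,
    "Industrial processes": 5,
    "Animal housing": 1,
}

# Components of outdoor burning (percent of the PM10 total) as quoted in the text.
outdoor_burning_pct = {
    "Fields and agricultural waste": 3,
    "Bio-mass": 15,
    "Garden waste": 4,
}

# Approximate tonnes per sector, derived from the rounded shares.
sector_tonnes = {
    name: TOTAL_PM10_TONNES * pct / 100 for name, pct in sector_share_pct.items()
}

listed_total_pct = sum(sector_share_pct.values())
unlisted_pct = 100 - listed_total_pct

for name, tonnes in sector_tonnes.items():
    print(f"{name}: about {tonnes:,.0f} t")
print(f"listed shares sum to {listed_total_pct}%, leaving {unlisted_pct}% not itemised")
```

The three outdoor burning components sum to the 22% stated for that sector. The listed sector shares sum to 95%, so the remainder may reflect rounding or sectors not itemised in this section.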
PM10 emissions are broken down by source in Figure 2.6, which shows some large sources of interest:
• Residential heating (25%)
• Transport (9%)
• Other sectors (21%)
2.4.3 Carbon Oxides - CO, CO2
Carbon oxides are a group of pollutants present in air, predominantly CO and CO2. Carbon Monoxide (CO) is a colourless, tasteless and odour-free gas emitted from a number of anthropogenic and natural sources, and from the oxidation of other pollutants. Incomplete burning of fuels due to insufficient O2 produces a significant amount. Carbon Dioxide (CO2) makes up around 0.0407% of atmospheric air, and this level has varied over the course of human history, with the interchange between the earth and life and absorption into the sea being factors [Godish et al., 2014].
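Atmospheric CO2 is more commonly reported in parts per million (ppm) than as a percentage. As a small worked conversion, assuming the 0.0407% figure above is a fraction of air by volume, one percent corresponds to 10,000 ppm, so the quoted level is about 407 ppm. The helper name `percent_to_ppm` is illustrative.

```python
def percent_to_ppm(percent):
    """Convert a percentage to parts per million (1% = 10,000 ppm)."""
    return percent * 10_000


co2_percent = 0.0407  # figure quoted in the text
co2_ppm = percent_to_ppm(co2_percent)
print(f"{co2_percent}% is about {co2_ppm:.0f} ppm")
```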
[MfE, 2015] estimates total CO emissions from anthropogenic sources in New Zealand as 531,500t per annum in 2015. Figure 2.7 visualises estimated CO emissions sorted by sector, and Figure 2.8 shows a breakdown of energy sector-specific emissions.
In New Zealand, air pollution is particularly bad in winter due to the burning of wood fires for home heating, which generates abundant CO emissions. The energy sector accounts for 79% of emissions in the 2015 inventory. Within this inventory, there were a number of larger sources of note:
• Agricultural, bio-mass and outdoor burning of general waste (13%)
• Manufacture of Iron/Aluminium in Industry (8%)
Figure 2.6: PM10 emissions by source for 2015, [MfE, 2015]
Figure 2.7: CO emissions by sector for 2015, [MfE, 2015]
Figure 2.8: Breakdown of CO emissions from energy related activities, [MfE, 2015]
Overall emissions by source are visualised in Figure 2.9, showing large contributions by residential heating (38%) and traffic (28%).
2.4.4 Nitrates - NOx, N2, N2O, NO, NO2
[Godish et al., 2014] describes a number of nitrogen-based compounds that are present in air, due in part to nitrogen being the most abundant gas present. Ordinarily it remains relatively unreactive and doesn't play any particular role in the atmospheric chemical mixture. It does, however, play a part in some combustion and biological activities as a precursor in producing NO and NO2, themselves significant in other parts of the atmosphere.
Nitrous Oxide (N2O) gas - commonly known as "laughing gas" - has steadily increased in the atmosphere since the pre-industrial age and 70% is produced by (de)nitrification processes in undisturbed environments and the ocean. There are two is