7.1 Introduction
In this chapter, the proposed deep learning-based canonical correlation analysis architecture is implemented for experimental verification. These experiments were designed to establish whether deep canonical correlation analysis works as a solution and to explore what this means for answering the research questions.
The architecture in this research can be visualised as in Figure 7.1. Phase 1 involves establishing the strongest variables within the target location dataset, in this case "Penrose".
Phase 2 compares the target location - Penrose - against all other locations by splitting the data into two views - Penrose and All Other, respectively. In this phase, DCCA is run to establish the variables in view 2 that most strongly influence the variables in view 1. Phase 3 uses these reduced variables in a number of different predictive algorithms. After this, performance of these algorithms is evaluated to see which is the most accurate and efficient.
Once air data is collected - in this instance, 73 different variables - data mining tasks are performed. A canonical correlation analysis matrix is then built to ascertain the inter-relationships that affect the overall air quality balance. Once this is established, a standard deep learning step takes place on the newly correlated data, which in turn yields an improved forecast with much more efficiency than may previously have been experienced.
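As a simplified illustration of this pipeline (not the implementation itself), the sketch below splits a hypothetical combined dataframe into a Penrose view and an all-other-stations view and fits a linear CCA from scikit-learn as a stand-in for the deep variant; the column names, data values and component count are assumptions for illustration only.

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Hypothetical hourly readings: columns prefixed by station name (illustrative only).
air = pd.DataFrame(rng.normal(size=(500, 6)),
                   columns=["Penrose_NO2", "Penrose_PM10",
                            "Takapuna_NO2", "Takapuna_PM10",
                            "QueenSt_NO2", "QueenSt_PM10"])

# View 1: variables measured at the target station (Penrose).
view_penrose = air.filter(like="Penrose_")
# View 2: variables measured at all other stations.
view_other = air.drop(columns=view_penrose.columns)

# Linear CCA as a simplified stand-in for the DCCA step of Phase 2: it finds
# paired projections of the two views with maximal correlation; the reduced,
# correlation-ranked components then feed the Phase 3 predictors.
cca = CCA(n_components=2)
penrose_c, other_c = cca.fit_transform(view_penrose, view_other)
print(penrose_c.shape, other_c.shape)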
Figure 7.1: Overall Architecture
This has a number of benefits: not only does it identify specific pollutants in an area with greater accuracy, which may be important from an environmental science perspective or for local government policy and enforcement, but it can also translate to a performance gain from a computing perspective [Bai et al., 2020]. While such a performance gain can mean quicker run times, which may in turn pass on savings on cloud computing usage charges, from a more practical perspective it may allow the model to be re-fitted at shorter intervals to account for changes in the correlations that can occur over time or in response to special events. In an era of big data, where these challenges may only increase in future, such benefits could be quite attractive.
Current approaches in the air quality domain typically take air quality data variables and run them through an algorithm to generate predictions. This research makes new changes to that type of model to increase accuracy and performance.
1. A correlative algorithm is introduced before the prediction stage to relate the included variables to one another in a meaningful way first.
2. Factors surrounding and impacting air quality are added as variables, rather than solely the air pollutants typically found in standard forecasts, to mitigate current data sparsity issues.
For this research, the traditional air quality impacting factors, that is, chemical and particulate pollutants, are correlated with meteorological factors and data from the wider geographical area to address the data sparsity that is typical of air quality data. No research applying deep learning to establish the relevancy of this larger air dataset has taken place to date. It is hypothesised that using a correlation-driven deep learning model for this will yield better results in both accuracy and performance, as we might now expect to see in other research fields [Wang et al., 2018].
7.1.1 Experimental environment
The primary software environment used in the research implementation consists of Python 3.8.3, with the libraries Torch 1.5.1, Tensorflow-GPU 2.2.0 and Keras-GPU 2.4.3, and GPU support with CUDA 10.2 & CuDNN.
Figure 7.2: Individual location DCCA-selected architecture
Experiments in this research were performed on IBM System x3850 X5 hardware with the following attributes: 24 Intel Xeon E7-4807 processors @ 1.87GHz (48 logical processors), 512GB RAM, nVidia GeForce RTX 2080 Ti GPU (11GB RAM, 4352 CUDA cores).
7.2 Data attributes and pre-processing tasks
7.2.1 Data attributes and sample size
The data used for all experiments in this research was sourced from Auckland Council’s collection of Auckland-related published research, information, analysis and data about its communities, economy and environment. Data from this website, KnowledgeAuckland, is managed by Auckland Council’s Research and Evaluation Unit (RIMU), who were kind enough to assist in this research by providing a bulk export of air quality data for the locations and time frame in scope.
In regard to the sample sizes typically recommended or used in canonical correlation analysis, [Barcikowski and Stevens, 1975] posit that canonical correlation analysis may detect a strong correlation with samples as small as n=50 and that weaker correlations require samples of n>200 to be detected. However, [Stevens, 1996] considered that interpreting the most important canonical function’s loadings requires a sample size of at least 20 times the number of variables in the analysis, and for two canonical functions at least 40-60 times the number of variables should be used. Accordingly, [Dattalo, 2014] suggests that researchers use both perspectives in order to triangulate a minimally sufficient size.
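As a small worked check, assuming all 73 variables in this research enter the analysis, these rules of thumb translate into the following minimum observation counts:

# Rough minimum sample sizes implied by the cited rules of thumb,
# assuming all 73 variables enter the canonical correlation analysis.
n_variables = 73

stevens_one_function = 20 * n_variables        # 1,460 observations
stevens_two_functions = (40 * n_variables,     # 2,920 observations ...
                         60 * n_variables)     # ... to 4,380 observations

print(stevens_one_function, stevens_two_functions)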
The finalised dataset comprises measurement integers from 12 monitoring stations located in the Auckland region, capturing data on CO, Humidity, NO, NO2, NOx, O3, PM10, PM2.5, PM1, Rain, SO2, Solar Radiation, Temperature, Wind and Windspeed. Not every variable is recorded at every station, nor at every date in the dataset. The dataset begins at 20/01/1997 03:00 and ends at 1/01/2017 00:00, and some stations did not capture information on a variable until a later date. For accuracy these were removed from the dataset, as was any measurement that showed only a "yes" or "no" record of whether it was captured as the only reading.

Figure 7.3: Individual location baseline architecture
While these could be included using one-hot encoding, in the interests of keeping the DCCA algorithm working on pure integers, they were excluded. The readings for each variable, at each station, are captured hourly and represent an average measurement over that hour. Although an Air Quality Index (AQI) measurement is recorded at several locations, it is itself a derived calculation rather than a usual metric to include, so it was not included. Descriptive statistics about the data - min, max, mean, standard deviation - may be found in Table A.1 in Appendix A. A large amount of metadata has also been derived from the dataset during exploration and is published externally; see Table 7.1. The means of capturing these measurements from a hardware perspective does not form part of this research, so any impacts arising from hardware sensor or device calibration are not considered.
Data                          External Link
P-Plot                        http://data.reganandrews.com/PPlot/PPlot.htm
Q-Plot                        http://data.reganandrews.com/QPlot/QPlot.htm
Bayesian correlation          http://data.reganandrews.com/BayesianCorrelation.htm
Correlations                  http://data.reganandrews.com/Correlation.htm
Non-parametric correlations   http://data.reganandrews.com/NonParametricCorrelation.htm
Partial correlations          http://data.reganandrews.com/PartialCorrelation.htm
Automatic linear modelling    http://data.reganandrews.com/ALM.htm

Table 7.1: Links to data held externally
Figure 7.4: Composition and splitting of datasets
7.2.2 Geographical locations of data
Table 7.2 displays the name and map co-ordinates of the air monitoring stations in scope for this research, and these are further visualised on the map of New Zealand in Figure 7.6 to show the relative locations between stations. This may give context to any correlations experienced between readings; however, distance measurements are not included in any part of the calculations. Penrose is chosen as the location of interest due to its relative position in the centre of the other stations. As we are interested in determining which other stations' measurements can be used to predict measurements at the Penrose station, this position gives us the most potential correlations.
Site                Latitude       Longitude
Henderson           -36.86803      174.62838
Glen Eden           -36.92223      174.65205
Patumahoe           -37.2045       174.86402
Papatoetoe          -36.9706       174.83834
Penrose             -36.90461      174.81559
Queen St            -36.84743      174.76573
Takapuna            -36.78031      174.7489
Helensville         -36.70302      174.3777
Pakuranga           -36.90549      174.89037
Ports of Auckland   -36.8465551    174.7852229
Musick Point        -36.84729      174.90148
Whangarei           -35.7251       174.3237
Table 7.2: Name and map co-ordinates for air monitoring stations in Auckland region in scope for this research
Figure 7.6: Air monitoring stations in Auckland region in scope for this research
7.2.3 Data pre-processing
Some degree of data pre-processing is necessary for most computational methods and can be described as a technique that helps to convert raw data into a useful format. In order to produce the highest quality work, adherence to some generally accepted design principles when designing and developing this research is necessary. Those principles are:
1. High-quality data input leads to a high-quality output when training a model,
2. Thorough data exploration and pre-processing are essential to ensure the most accurate model.
Nevertheless, while the growing list of software applications, languages and libraries makes performing canonical correlation analysis more accessible than previously, a lack of data reconciliation can threaten the accuracy of any subsequent model relying on it.
These data tasks are needed to ensure the highest quality predictions are made with full confidence in the results.
Figure 7.7: Lifecycle of research architecture
Our first step in data pre-processing is identifying the data type and category of the variables, e.g. a measurement providing ozone is an attribute of 'numerical' data type. The conversion of categorical variables into numerical variables is known as 'encoding'.
When handling missing or 'null' values, we can consider imputing missing values, depending on the nature of the attribute. For a categorical attribute, we can choose the most frequent value, and for a numerical attribute, we may use the mean or median - e.g. if a null value appears for NO2 we can compute the median value and replace these values. [Wang et al., 2018] recommend excluding these readings if a median or mean cannot readily be ascertained, so we have done so.
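For instance, a minimal sketch of the median imputation described above, using hypothetical NO2 readings:

import numpy as np
import pandas as pd

# Hypothetical hourly NO2 readings with a missing value.
no2 = pd.Series([12.0, 15.0, np.nan, 14.0, 13.0], name="NO2")

# For a numerical attribute, replace nulls with the median of the column;
# a categorical attribute would instead use the most frequent value.
no2_imputed = no2.fillna(no2.median())
print(no2_imputed)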
Outliers are often an essential consideration in data that is generated from monitoring equipment. These are values that are distant from other observations. [Gelman and Hill, 2007] recommend these be removed from the data for more accurate analysis and modelling. In the case of the current dataset, one monitoring station shows SO2 levels lying between 400 units and 800 units for 98% of the observations. The other 2% of observations show an SO2 level of 120 units, so in this case, for better analytics, those 2% of observations have been removed from the training dataset. There are many readily available tools featuring heuristics which identify unlikely values; for example, a technique called 'winsorising' can replace these values with the 5th or 95th percentiles of the rest of the readings.
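A minimal sketch of percentile-based winsorising on hypothetical SO2 readings (the values and thresholds are illustrative, not the dataset's):

import numpy as np

# Hypothetical SO2 readings where a small fraction of values fall far
# outside the range observed for the bulk of the data.
so2 = np.array([410.0, 455.0, 620.0, 790.0, 120.0, 515.0, 780.0])

# Winsorising: clip anything below the 5th or above the 95th percentile
# to those percentile values, rather than discarding the observations.
low, high = np.percentile(so2, [5, 95])
so2_winsorised = np.clip(so2, low, high)
print(so2_winsorised)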
Consideration of potential 'nuisance' values is also an area the data needs to be reviewed for, a step known as 'deconfounding'. This is a pre-processing step which reduces the risk of non-meaningful modes of variation. These sorts of steps are often applied in linear regression analysis, though as canonical correlation analysis has no 'noise' component, deconfounding might only be needed if the data is to be used with other techniques. Deconfounding works by creating a regression model which estimates the variance of the original data explained by the confounder. The resulting residual values are the data with the potential confounder removed.
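A minimal sketch of this residual-based deconfounding, with a synthetic confounder standing in for an actual nuisance variable:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical setup: 'confounder' is a nuisance variable (e.g. a seasonal
# index) and 'x' is a pollutant measurement partly driven by it.
confounder = rng.normal(size=(500, 1))
x = 0.8 * confounder[:, 0] + rng.normal(scale=0.5, size=500)

# Regress the measurement on the confounder and keep the residuals:
# the data with the confounder's explained variance removed.
model = LinearRegression().fit(confounder, x)
x_deconfounded = x - model.predict(confounder)
print(x_deconfounded[:5])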
Importantly, as this research hinges on canonical correlation analysis, it is relevant to note that it is scale-invariant: whether or not we apply a standardisation technique, we should not see a change in the resulting canonical correlations [Wang et al., 2018]. This property is inherited from Pearson's correlation, defined by the degree of simultaneous unit change between two variables, with implicit standardisation of the data. z-scoring of each variable in the measurement sets is nonetheless recommended by Wang et al. (2018) to ease the model estimation process and is said to enhance the ability to interpret it.
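A minimal z-scoring sketch using scikit-learn's StandardScaler on a hypothetical measurement matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurement matrix: rows are hourly readings, columns are variables.
X = np.array([[12.0, 400.0],
              [15.0, 620.0],
              [14.0, 515.0]])

# z-scoring: subtract each column's mean and divide by its standard deviation.
# The canonical correlations themselves are unchanged by this rescaling, but
# it eases estimation and interpretation as noted above.
X_z = StandardScaler().fit_transform(X)
print(X_z)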
7.3 Prediction algorithms
7.3.1 Deep Neural Network
The deep neural network used to test and predict in this chapter is based on the Python library scikit-learn [scikit learn, 2021] Multi-layer Perceptron (MLP). The Multi-layer Perceptron is a supervised learning algorithm that learns a function by training on a dataset where there are many input and output dimensions.
Given a set of features and a target, for either classification or regression, it learns a non-linear function approximator. Differing from logistic regression, there can be several non-linear layers between the input and output layer known as hidden layers. Figure 7.8 shows a one hidden layer MLP with scalar output [scikit learn, 2021].
Figure 7.8: Multi-layer Perceptron
Figure 7.8 depicts an MLP, where there is an input layer which contains neurons, a hidden layer in the middle which contains neurons that transform the values of the input layer using a weighted linear summation followed by a non-linear activation function, and an output layer which transforms the previous layer's values into the output.
Multi-layer perceptrons have many advantages [Ramchoun et al., 2016]:
• Capability to learn non-linear models
• Capability to learn models in real-time (on-line learning) using partial_fit

There are also disadvantages, which [scikit learn, 2021] cites, such as:
• MLP with hidden layers has a non-convex loss function where there exists more than one local minimum. Therefore different random weight initialisations can lead to different validation accuracy
• MLP requires tuning many hyperparameters, such as the number of hidden neurons, layers, and iterations.
• MLP is sensitive to feature scaling.
A Multi-layer Perceptron is described and operates in the [scikit learn, 2021] implementation as follows. Given a set of training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $x_i \in \mathbb{R}^n$ and $y_i \in \{0, 1\}$, a one-hidden-layer, one-hidden-neuron MLP learns the function $f(x) = W_2\, g(W_1^T x + b_1) + b_2$, where $W_1 \in \mathbb{R}^m$ and $W_2, b_1, b_2 \in \mathbb{R}$ are model parameters. $W_1, W_2$ represent the weights of the input layer and hidden layer, respectively; and $b_1, b_2$ represent the bias added to the hidden layer and the output layer, respectively. $g(\cdot) : \mathbb{R} \rightarrow \mathbb{R}$ is the activation function, set by default as the hyperbolic tangent. It is given as,

g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}    (7.1)

For binary classification, $f(x)$ passes through the logistic function $g(z) = 1/(1 + e^{-z})$ to obtain output values between zero and one. A threshold, set to 0.5, would assign samples with outputs larger than or equal to 0.5 to the positive class, and the rest to the negative class.
If there are more than two classes, $f(x)$ itself would be a vector of size (n_classes,). Instead of passing through the logistic function, it passes through the softmax function, which is written as,

\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^{K} \exp(z_l)}    (7.2)

where $z_i$ represents the $i$-th element of the input to softmax, which corresponds to class $i$, and $K$ is the number of classes. The result is a vector containing the probabilities that sample $x$ belongs to each class. The output is the class with the highest probability.
In regression, the output remains as $f(x)$; therefore, the output activation function is just the identity function. MLP uses different loss functions depending on the problem type.
The loss function for classification is Cross-Entropy, which in the binary case is given as,

\mathrm{Loss}(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y)\ln(1 - \hat{y}) + \alpha \|W\|_2^2    (7.3)

where $\alpha \|W\|_2^2$ is an L2-regularisation term (aka penalty) that penalises complex models, and $\alpha > 0$ is a non-negative hyperparameter that controls the magnitude of the penalty. For regression, MLP uses the Square Error loss function, written as,

\mathrm{Loss}(\hat{y}, y, W) = \frac{1}{2}\|\hat{y} - y\|_2^2 + \frac{\alpha}{2}\|W\|_2^2    (7.4)
Starting from initial random weights, the multi-layer perceptron (MLP) minimises the loss function by repeatedly updating these weights. After computing the loss, a backward pass propagates it from the output layer to the previous layers, providing each weight parameter with an update value meant to decrease the loss. In gradient descent, the gradient $\nabla \mathrm{Loss}_W$ of the loss with respect to the weights is computed and deducted from $W$. More formally, this is expressed as,

W^{i+1} = W^{i} - \epsilon \nabla \mathrm{Loss}_W^{i}    (7.5)

where $i$ is the iteration step, and $\epsilon$ is the learning rate with a value larger than 0.
The algorithm stops when it reaches a preset maximum number of iterations; or when the improvement in loss is below a certain, small number.
Parameters
Neurons per hidden layer are defined such that the $i$-th element represents the number of neurons in the $i$-th hidden layer. In the implementation for this research, the number of neurons per hidden layer is set at five different values, each run twice - once with the baseline dataset and then again using the DCCA-modified dataset. Table 7.3 shows these parameter settings. As the objective is to benchmark DCCA against a baseline, these represent a spread of default values rather than amounts optimised to find the very best performance for one or other of the dataset types.
The activation function for the hidden layers is ReLU, the rectified linear unit function. The activation function in a neural network transforms the node's summed weighted input into the activation (or output) for that node [Goodfellow et al., 2016].
The solver for weight optimisation is L-BFGS-B, an optimiser in the family of quasi-Newton methods. L-BFGS is a solver that approximates the Hessian matrix, which represents the second-order partial derivatives of a function. It approximates the inverse of the Hessian matrix to perform parameter updates [scikit learn, 2021].
The maximum number of iterations was set to 2,000 to ensure enough iterations are performed, thus preventing early termination with no result.
All other parameters are set to the sklearn library defaults.
Neurons per hidden layer
10, 10, 10
35, 35, 35
73, 73, 73
100, 100, 100
1024, 1024, 1024

Table 7.3: Deep neural network neurons per hidden layer
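A minimal sketch of how these settings map onto the scikit-learn estimator, assuming a regression target; the training arrays are placeholders rather than the actual baseline or DCCA-modified datasets.

from sklearn.neural_network import MLPRegressor

# Hidden-layer widths from Table 7.3; each configuration is run once on the
# baseline dataset and once on the DCCA-reduced dataset.
layer_configurations = [(10, 10, 10), (35, 35, 35), (73, 73, 73),
                        (100, 100, 100), (1024, 1024, 1024)]

models = [
    MLPRegressor(hidden_layer_sizes=layers,
                 activation="relu",   # rectified linear unit for hidden layers
                 solver="lbfgs",      # quasi-Newton weight optimiser
                 max_iter=2000)       # generous cap to avoid early termination
    for layers in layer_configurations
]

# Placeholder arrays would stand in for the baseline or DCCA-modified datasets:
# for model in models:
#     model.fit(X_train, y_train)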
7.3.2 Support Vector Machine (SVM)
Support Vector Machines (SVM) map inputs to higher-dimensional feature spaces and, using a hyperplane, separate the attribute space in order to maximise the margin between the different classes or class values. SVMs can yield excellent predictive performance. The experiments in this research implement the LIBSVM package. SVMs use an ε-insensitive loss in order to perform linear regression. The accuracy is determined by the C and ε external parameters used.
The parameters are set as follows:
• Cost - the penalty term for loss, which applies for classification and regression tasks, is set to 1.0
• ε - defines the distance from true values within which no penalty is associated with predicted values, and is set to 0.10.
The kernel is RBF, with the constants set as follows:
• The gamma constant in the kernel function has the recommended value of 1/k, where k is the number of attributes,
• The constant c0 in the kernel function remains at the default of 0,
• The degree of the kernel remains at the default of 3.
The permitted deviation from the expected value has a tolerance of 0.0010, and the iteration limit is 2,000.
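A minimal sketch of an equivalent configuration using scikit-learn's SVR wrapper around LIBSVM; the training arrays are placeholders.

from sklearn.svm import SVR  # scikit-learn's SVR wraps the LIBSVM library

# Epsilon-SVR with the parameter values listed above; gamma="auto" gives the
# recommended 1/k, where k is the number of attributes.
svr = SVR(kernel="rbf",
          C=1.0,          # cost / penalty term
          epsilon=0.10,   # no penalty within this distance of the true value
          gamma="auto",   # 1 / number of attributes
          coef0=0.0,      # kernel constant c0
          degree=3,       # kernel degree (unused by RBF, kept at the default)
          tol=0.0010,     # permitted deviation tolerance
          max_iter=2000)  # iteration limit

# Placeholder arrays would stand in for the baseline or DCCA-modified datasets:
# svr.fit(X_train, y_train)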
7.3.3 AdaBoost
AdaBoost is a classification and regression algorithm using weak learners, adapting to the 'hardness' of each training sample. Used in conjunction with other learning algorithms, it boosts their performance by re-weighting training samples towards those that previous learners handled poorly.
The base estimator is a Tree, with the parameters set as follows (a configuration sketch follows this list):
• Number of estimators - is set to 50.
• Learning rate - determines to what extent the newly acquired information will override the old information, and is set at 1.00000.
• Boosting method - the classification algorithm is set to SAMME, which updates the base estimator's weights with classification results.
• Regression loss function - is set as Linear.
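A minimal sketch of this configuration using the scikit-learn AdaBoost estimators, whose default base estimator is a decision tree; the training arrays are placeholders.

from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# Classification variant: the default base estimator is a decision tree;
# SAMME updates the base estimators' weights from classification results.
ada_clf = AdaBoostClassifier(n_estimators=50,
                             learning_rate=1.0,
                             algorithm="SAMME")

# Regression variant with the linear loss function.
ada_reg = AdaBoostRegressor(n_estimators=50,
                            learning_rate=1.0,
                            loss="linear")

# Placeholder arrays would stand in for the baseline or DCCA-modified datasets:
# ada_clf.fit(X_train, y_class)
# ada_reg.fit(X_train, y_value)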
7.3.4 XGBoost
Extreme Gradient Boosting - XGBoost - is an algorithm that provides a scalable method of tree boosting. It can be used for regression and classification by generating weak learners whose outputs are accumulated additively into the overall model [Pan, 2018].
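Before the formal notation below, a minimal usage-level sketch with the xgboost library; the parameter values are illustrative only, not the settings used in the experiments.

from xgboost import XGBRegressor

# Additive tree ensemble: each boosting round fits a new regression tree
# whose output is summed into the running model.
xgb_model = XGBRegressor(n_estimators=100,                # number of trees (illustrative)
                         learning_rate=0.1,               # shrinkage on each tree's contribution
                         objective="reg:squarederror")    # squared-error loss

# Placeholder arrays would stand in for the baseline or DCCA-modified datasets:
# xgb_model.fit(X_train, y_train)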
Consider the dataset $D = \{(x_i, y_i) : i = 1 \ldots n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}\}$ with $n$ observations, $m$ features and a target variable $y$.

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i)    (7.6)

Let $\hat{y}_i$ be the result of the generalised ensemble model in equation 7.6, $f_k$ be a regression tree and $f_k(x_i)$ be the score contributed by the $k$-th tree to the $i$-th observation. The function $f_k$ requires a minimised, regularised objective function as per the following equation, with $l$ as the loss function.