6.1 Introduction
Canonical correlation analysis is the focus of this research, and an understanding of what it is, its strengths and weaknesses, and its use cases is important to implementing a deep learning version of it.
The types and variations of canonical correlation analysis are quite diverse, with a number of variations becoming available over the years. Figure 6.1 displays a number of these detailed CCA equations (red box) and their variations: constrained CCA in yellow, non-linear CCA in grey, multi-set CCA in orange, and other types in light and dark green [Zhu et al., 2012].
Canonical correlation analysis is a group of statistical methods used to find the linear interrelationships between two sets of variables [Hair et al., 2013]. With more data available than ever before, many related methods and extensions to canonical correlation analysis have been developed to take advantage of much more powerful computational resources [Wang et al., 2018]. In a canonical correlation analysis the first and second sets of variables are known as the independent and dependent variables respectively [Lattin et al., 2003], [Hair et al., 2013]. For each of these sets, a canonical variate is formed. The method's purpose is to develop a canonical function that maximises the canonical correlation coefficient between these canonical variates. Each of the two canonical variates is interpreted using canonical loadings, the correlations of the individual variables with their respective variates [Hair et al., 2006] - nearly equivalent to estimating a different factor for each set of variables and maximising the correlations between the factors. This can be visualised as in Figure 6.2.
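As a concrete illustration of these ideas, the following minimal numpy sketch (synthetic data; all names are this sketch's own, not from the cited works) forms the first pair of canonical variates and then interprets them through canonical loadings - the Pearson correlation of each original variable with its own set's variate:

```python
import numpy as np

# Synthetic example: two variable sets sharing one latent signal.
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 1))             # shared signal
X = Z + 0.8 * rng.normal(size=(300, 3))   # "independent" variable set
Y = Z + 0.8 * rng.normal(size=(300, 2))   # "dependent" variable set

# First canonical function via QR + SVD (a standard way to solve CCA).
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
qx, _ = np.linalg.qr(Xc)
qy, _ = np.linalg.qr(Yc)
U, rho, Vt = np.linalg.svd(qx.T @ qy)
variate_x = qx @ U[:, 0]                  # canonical variate for set X
variate_y = qy @ Vt[0]                    # canonical variate for set Y

# Canonical loadings: correlation of each variable with its own variate.
loadings_x = [np.corrcoef(Xc[:, j], variate_x)[0, 1] for j in range(X.shape[1])]
loadings_y = [np.corrcoef(Yc[:, j], variate_y)[0, 1] for j in range(Y.shape[1])]
print(rho[0], loadings_x, loadings_y)
```

The first canonical correlation coefficient is exactly the Pearson correlation between the two variates, and the loadings show how strongly each observed variable contributes to its variate.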
Figure 6.2: Canonical Functions - relationship of variables and canonical loadings with the canonical variates in the canonical function
Canonical correlation analysis represents the highest level of the generalised linear model (GLM) and can be understood as closely related to the Pearson r correlation coefficient; ultimately, the correlation of air quality variables is the focus of this research.
GLMs provide a good framework for understanding these sorts of classic analyses in terms of the Pearson correlation coefficient (r) and can be thought of as a hierarchy with CCA as the top-level analysis in the overall group. CCA encompasses both univariate and multivariate methods, contrary to the views generally held by researchers ( [Fan, 1996],
Figure 6.1: Technical details of CCA and relationship between CCA and its variants [Zhuang et al., 2020].
[Fan, 1997], [Henson, 2000], [Thompson, 2005]). Although Structural Equation Modelling (SEM) is the highest level of GLM, it explicitly includes measurement error in the analysis, in contrast to other methods [Fish, 1988].
CCA in some form has been around since the 1930s, with the framework of the method developed by Hotelling ( [Hotelling, 1935], [Hotelling, 1936]). As CCA and its related techniques have become available in software form, their use has increased, but they still see less use than univariate methods such as multiple regression and ANOVA, even though such methods may be less accurate than CCA [Sherry and Henson, 2005].
6.2 Canonical correlation analysis in the environmental domain
Early work on canonical correlation analysis in the environmental domain appears in a study by [Laessig and Duckett, 1979] on its use for environmental health planning. The motivation was that such indices may be useful to identify associations among groups of variables, such as a specific geographic area. The indices may also provide insights into environmental health relationships which are worthy of further epidemiological investigation. Interestingly, they conclude that "Although one can rightfully question whether any substantive conclusions can be drawn from [their] Philadelphia example, the example does demonstrate some of the methodological features of canonical correlation analysis. The canonical correlations and loadings suggest associations both within each set of variables and between sets of predictor and criterion variables." They also list a number of negatives of choosing canonical correlation analysis:
• Requires considerable time and skill in the collection of data and the analysis of results,
• Sample sizes must be large,
• Data should cover a wide range of environmental and health conditions,
• Data should be of high quality,
• Checks on weight (loading) stability are required,
• Loadings and canonical correlations may need to be reviewed a number of times before the "meaning" of each correlation emerges, and even then it may be nothing more than a statistical anomaly.
The advent of modern computing hardware, software and techniques now enables most of these negatives to be removed or minimised.
[Cherry, 1996] notes that canonical correlation analysis has a long history of abuse in statistics, being used in inappropriate circumstances. Nevertheless, his research applying it to two geophysical fields shows good results, and he notes that canonical correlation analysis gave better results than singular value decomposition.
[Yu et al., 1997] studied two statistical models - canonical correlation analysis and multivariate principal component regression - in order to forecast rainfall variations at 10 stations. Sea surface temperatures in the Pacific Ocean are used as predictors for both models.
The results showed that both models are potentially useful in predicting seasonal rainfall variations. They conclude that canonical correlation analysis and principal component regression show similar results.
[Statheropoulos et al., 1998] conducted a study on 5 common chemical air pollutants in Greece and, after using principal component analysis, applied canonical correlation analysis to study the relationships between these 5 chemicals as variable set 1 and meteorological data as variable set 2. This study used an in-house developed tool to facilitate PCA and CCA, and approached missing data values with averages over the 5 years of data available.
Research was undertaken by [Di Leonardo et al., 2014] on the use of CCA in environmental contamination assessment, a use-case not dissimilar to air quality given its intent to characterise the relationships of different trace amounts of pollutants to each other.
[Di Leonardo et al., 2014] applied CCA in order to identify a lower-dimensional set of closely correlated variables in order to make distinctions between different geochemicals, thereby drawing a distinction between what is an original material and what is subsequently generated. Their conclusion was that CCA was able to discriminate between major and trace amounts of chemicals and their natural or anthropogenic origins. Accordingly, CCA and frequency distribution analysis techniques "constitute a powerful and economic tool ... and can be adequately applied to other similar environments".
A large number of research studies focus on the specific results of meteorological activity in very specific regions and present the results of a CCA, rather than commentary on the merits of using canonical correlation analysis itself or accuracy concerns - [Young and Matthews, 1981], [Landman and Mason, 1999], [Livezey and Smith, 1999], [Xoplaki et al., 2003], [Juneng and Tangang, 2008], [Cannon and Hsieh, 2008], [Tivy et al., 2011], [Rana et al., 2018].
6.3 Canonical correlation analysis
[Zhuang et al., 2020] explains CCA as designed to maximise the correlation between two latent variables $y_1 \in \mathbb{R}^{p_1 \times 1}$ and $y_2 \in \mathbb{R}^{p_2 \times 1}$ (modalities). $Y_k \in \mathbb{R}^{N \times p_k}, k = 1, 2$ are samples of the two variables involved, with $N$ representing the number of observations and $p_k, k = 1, 2$ representing the number of features in each variable. CCA determines the canonical coefficients $u_1 \in \mathbb{R}^{p_1 \times 1}$ and $u_2 \in \mathbb{R}^{p_2 \times 1}$ for $Y_1$ and $Y_2$, respectively, by maximising the correlation between $Y_1 u_1$ and $Y_2 u_2$, shown in Equation 6.1 [Zhuang et al., 2020]:
\[
\mathrm{CCA}: \max \rho = \operatorname{corr}(Y_1 u_1, Y_2 u_2) = \frac{u_1^T \Sigma_{12} u_2}{\sqrt{u_1^T \Sigma_{11} u_1}\,\sqrt{u_2^T \Sigma_{22} u_2}}. \tag{6.1}
\]
$\Sigma_{11}$ and $\Sigma_{22}$ are the within-set covariance matrices and $\Sigma_{12}$ is the between-set covariance matrix. The denominator normalises the within-set covariances, thereby ensuring that the solution is invariant to the scaling of the coefficients. The canonical coefficients $u_1$ and $u_2$ are calculated by setting the partial derivatives of the objective function in $u_1$ and $u_2$ to zero, giving Equations 6.2 and 6.3 if $\Sigma_{kk}$ is invertible [Zhuang et al., 2020]:
\[
\Sigma_{12} u_2 = \rho \Sigma_{11} u_1 \quad \text{and} \quad \Sigma_{21} u_1 = \rho \Sigma_{22} u_2. \tag{6.2}
\]
\[
\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} u_1 = \rho^2 u_1, \qquad \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} u_2 = \rho^2 u_2. \tag{6.3}
\]
Each pair of canonical coefficients $(u_1, u_2)$ are eigenvectors of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ and $\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, respectively, with the same eigenvalue $\rho^2$. Following Equation 6.3, up to $M = \min(p_1, p_2)$ pairs of canonical coefficients can be computed through singular value decomposition (SVD), and every pair of canonical variables $Y_1 u_1^{(m)}, Y_2 u_2^{(m)}, m = 1, 2, \ldots, M$, is uncorrelated with any other pair of canonical variables. The corresponding $M$ canonical correlation values are ordered in descending order as $\rho^{(1)} > \rho^{(2)} > \ldots > \rho^{(M)}$. One of the requirements for solving the CCA problem (Equation 6.1) through this eigenvalue problem (Equation 6.3) is that the within-set covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ have to be invertible.
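The SVD route above can be sketched directly in numpy. The following hypothetical implementation (function and variable names are this sketch's own) whitens each set with a Cholesky factor of its within-set covariance, so the singular values of the whitened cross-covariance are the canonical correlations $\rho^{(1)} \geq \ldots \geq \rho^{(M)}$:

```python
import numpy as np

def cca_svd(Y1, Y2):
    # Centre both sets and form Sigma_11, Sigma_22, Sigma_12 (Equation 6.1).
    N = Y1.shape[0]
    Y1c = Y1 - Y1.mean(axis=0)
    Y2c = Y2 - Y2.mean(axis=0)
    S11 = Y1c.T @ Y1c / (N - 1)
    S22 = Y2c.T @ Y2c / (N - 1)
    S12 = Y1c.T @ Y2c / (N - 1)
    # Whiten with Cholesky factors (requires S11, S22 invertible); the
    # singular values of L1^{-1} S12 L2^{-T} are the canonical correlations.
    L1 = np.linalg.cholesky(S11)
    L2 = np.linalg.cholesky(S22)
    K = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T
    U, rho, Vt = np.linalg.svd(K)
    U1 = np.linalg.solve(L1.T, U)      # canonical coefficients u1^(m)
    U2 = np.linalg.solve(L2.T, Vt.T)   # canonical coefficients u2^(m)
    return rho, U1, U2

rng = np.random.default_rng(0)
N, p1, p2 = 500, 4, 3
Z = rng.normal(size=(N, 2))                                  # shared latent signal
Y1 = Z @ rng.normal(size=(2, p1)) + 0.5 * rng.normal(size=(N, p1))
Y2 = Z @ rng.normal(size=(2, p2)) + 0.5 * rng.normal(size=(N, p2))
rho, U1, U2 = cca_svd(Y1, Y2)
print(rho)    # M = min(p1, p2) = 3 values, in descending order
```

By construction each $u_1^{(m)\,T} \Sigma_{11} u_1^{(m)} = 1$, so the correlation of the first pair of canonical variables equals the first singular value, and successive pairs are mutually uncorrelated.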
To satisfy this invertibility requirement, the number of observations in $Y_1$ and $Y_2$ needs to be greater than the number of features - that is, $N > p_k, k = 1, 2$. In addition, since the squares of the canonical correlation values ($\rho^2$) are the eigenvalues of the matrices $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ and $\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, both matrices are required to be positive definite.
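A quick numpy check on synthetic data shows why this matters: with fewer observations than features, the sample covariance matrix cannot be inverted, since centring alone already caps its rank at $N - 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 10                        # deliberately N < p
Y = rng.normal(size=(N, p))
Yc = Y - Y.mean(axis=0)
S = Yc.T @ Yc / (N - 1)             # sample within-set covariance
print(np.linalg.matrix_rank(S))     # rank is at most N - 1, far below p
```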
For statistical inference, parametric tests exist for CCA if both variables adhere to a Gaussian distribution. The null hypothesis is that no canonical correlation exists between $Y_1$ and $Y_2$, that is, $\rho^{(1)} = \rho^{(2)} = \ldots = \rho^{(M)} = 0$; the alternative is that at least one of the canonical correlation values is not zero. The test based on Wilks' $\Lambda$ is [Bartlett, 1939]:
\[
\Lambda = -\left(N - \frac{p_1 + p_2 + 3}{2}\right) \log \prod_{i=1}^{M} \left(1 - \rho^{(i)2}\right), \tag{6.4}
\]
which follows a chi-square distribution $\chi^2_{p_1 \times p_2}$ with $p_1 \times p_2$ degrees of freedom. We should also test whether a specific canonical correlation value $(\rho^{(m)}, 1 \leq m \leq M)$ could be different from zero. In this case the test statistic in Equation 6.4 becomes:
\[
\Lambda^{(m)} = -\left(N - \frac{p_1 + p_2 + 3}{2}\right) \log \prod_{i=m+1}^{M} \left(1 - \rho^{(i)2}\right), \tag{6.5}
\]
which follows $\chi^2_{(p_1 - m)(p_2 - m)}$.
However, in practice, these tests are not commonly used because they strictly require the variables to follow a Gaussian distribution and are quite sensitive to outliers [Bartlett, 1939]. As a workaround, permutation-based non-parametric statistics have become commonly used when applying CCA. The observations of one of the variables are randomly shuffled ($Y_1$ becomes $\hat{Y}_1$) while the observations of the other variable are kept intact ($Y_2$ remains), after which a new set of canonical correlation values can be calculated for $\hat{Y}_1$ and $Y_2$ following Equation 6.3. The random shuffling procedure is repeated multiple times, generating a null distribution of canonical correlation values. Statistical significance (p-values) for the observed canonical correlation values can then be obtained from this null distribution.
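The permutation procedure can be sketched in a few lines of numpy (synthetic data; function names are this sketch's own). Rows of $Y_1$ are shuffled while $Y_2$ stays intact, and the observed largest canonical correlation is compared against the resulting null distribution:

```python
import numpy as np

def canonical_corrs(Y1, Y2):
    # Canonical correlation values as singular values of Q1^T Q2,
    # where Q1, Q2 come from QR decompositions of the centred sets.
    Y1 = Y1 - Y1.mean(axis=0)
    Y2 = Y2 - Y2.mean(axis=0)
    q1, _ = np.linalg.qr(Y1)
    q2, _ = np.linalg.qr(Y2)
    return np.linalg.svd(q1.T @ q2, compute_uv=False)

def cca_permutation_pvalue(Y1, Y2, n_perm=1000, seed=0):
    # Shuffle the rows of Y1 (Y1 -> Y1_hat) while Y2 stays intact,
    # recompute the largest canonical correlation each time, and compare
    # the observed value against the resulting null distribution.
    rng = np.random.default_rng(seed)
    observed = canonical_corrs(Y1, Y2)[0]
    null = np.empty(n_perm)
    for b in range(n_perm):
        Y1_hat = Y1[rng.permutation(len(Y1))]
        null[b] = canonical_corrs(Y1_hat, Y2)[0]
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 1))
Y1 = Z + 0.5 * rng.normal(size=(200, 3))
Y2 = Z + 0.5 * rng.normal(size=(200, 2))
print(cca_permutation_pvalue(Y1, Y2, n_perm=500))
```

With genuinely related sets the observed value sits far in the tail of the null distribution, yielding a small p-value; with unrelated sets the p-value is unremarkable.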
6.4 Variations of Canonical Correlation Analysis
Standard canonical correlation analysis has a number of variations that have been developed for different uses:
• Constrained CCA adds penalties in order to re-frame CCA as a constrained optimisation problem ( [Yang et al., 2018], [Zhuang et al., 2017]),
• Sparse CCA uses the L1-norm penalty [Zhuang et al., 2017],
• Structured Sparse CCA/Discriminant CCA is where Sparse CCA has known features beforehand ( [Lin et al., 2014], [Kim et al., 2019], [Wang et al., 2019])
• Kernel CCA and temporal kernel CCA map the original feature spaces onto new feature spaces to reveal the non-linear relationships between the two variable sets [Hardoon et al., 2007],
• Multiset CCA maximises correlations using SUMCOR ( [Kettenring, 1971], [Drud, 1985]), and multiset CCA with constraints adds penalty terms to each $u_i$ [Sui et al., 2018].
• Deep CCA [Andrew et al., 2013], implemented in this research.
[Zhuang et al., 2020] note that non-parametric permutation tests have been widely performed in CCA variant techniques to determine the statistical significance of each canonical correlation value and the corresponding canonical coefficients. Observations of one of the variables are randomly shuffled, whereby $Y_1$ becomes $\hat{Y}_1$, and observations of the other variable are kept intact ($Y_2$ remains). After a number of random shufflings (often 5,000) and application of the CCA technique, the obtained canonical correlation values form the null distribution. p-values of the true canonical correlation values are determined by comparing those true values to the null distribution. Another technique is to provide null data input to the CCA variant techniques themselves; this null data is often created from the physical properties of the input variables.
6.5 Differentiation of CCA techniques
Canonical correlation analysis variations can be categorised into three groups: standard (conventional) canonical correlation analysis, non-linear canonical correlation analysis and multi-set canonical correlation analysis. All three of these methods can be solved by closed-form analytical methods, obtained using the partial derivatives of the objective function with respect to each unknown separately [Zhuang et al., 2020].
6.5.1 Relationship between CCA and other multivariate and univariate techniques
[Zhuang et al., 2020] describes canonical correlation analysis's relationships with other univariate and multivariate techniques in the following ways:
Relationship with other multivariate techniques
CCA can be directly rewritten in terms of the multivariate multiple regression (MVMR) model:
\[
Y_1 u_1 = Y_2 u_2 + e, \tag{6.6}
\]
where $u_1$ and $u_2$ are obtained by minimising the residual term $e \in \mathbb{R}^{N \times 1}$. Since CCA is scale-invariant, a solution to Equation 6.6 is also a solution of Equation 6.1. Furthermore,
with normalisation terms $u_1^T \Sigma_{11} u_1 = 1$ and $u_2^T \Sigma_{22} u_2 = 1$, the MVMR model is exactly equivalent to CCA; that is, maximising the canonical correlation between $Y_1$ and $Y_2$ is equivalent to minimising the residual term:
\[
\max_{u_1, u_2} \operatorname{corr}(Y_1 u_1, Y_2 u_2) \;\Leftrightarrow\; \max_{u_1, u_2} u_1^T \Sigma_{12} u_2 \;\Leftrightarrow\; \min_{u_1, u_2} -u_1^T \Sigma_{12} u_2 \;\Leftrightarrow\; \min_{u_1, u_2} \|Y_1 u_1 - Y_2 u_2\|_2^2. \tag{6.7}
\]
By replacing the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ in the denominator of Equation 6.1 with the identity matrix $I$, conventional CCA is converted to partial least squares (PLS), which maximises the covariance between latent variables. If $Y_1$ is the same as $Y_2$, PLS will maximise the variance within a single variable, which is equivalent to PCA.
Relationship with univariate techniques
If one variable in CCA, for example $Y_1$, has only a single feature, that is, $y \in \mathbb{R}^{N \times 1}$, $u_1$ can then be set to $1$ and CCA becomes a linear regression problem:
\[
y = X\beta + \epsilon, \tag{6.8}
\]
where $Y_1$ is renamed as $y$ and $Y_2$ is renamed as $X$ to follow conventional notation.
$\epsilon \in \mathbb{R}^{N \times 1}$ denotes the residual term. If both variables $Y_1$ and $Y_2$ contain only one feature, the canonical correlation between $Y_1$ and $Y_2$ becomes the Pearson correlation between $Y_1$ and $Y_2$ as in univariate analysis.
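This reduction is easy to verify numerically. In the following sketch (synthetic data), the single canonical correlation equals the absolute value of the Pearson correlation, since canonical correlations are non-negative by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# CCA with a single feature per set, via the same QR/SVD route used for
# the multivariate case.
xc = (x - x.mean())[:, None]
yc = (y - y.mean())[:, None]
qx, _ = np.linalg.qr(xc)
qy, _ = np.linalg.qr(yc)
rho = np.linalg.svd(qx.T @ qy, compute_uv=False)[0]

r = np.corrcoef(x, y)[0, 1]         # ordinary Pearson correlation
print(rho, r)
```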
6.5.2 CCA advantages
CCA has several advantages over univariate techniques, primarily due to its multivariate nature. Multivariate techniques reduce the likelihood that a Type 1 error will occur. A Type 1 error - also known as a false positive - occurs when a researcher incorrectly rejects a true null hypothesis. This means the findings appear to be significant when, in actuality, they have occurred by chance. This type of error can happen when too many tests are run on the dataset, each test carrying its own risk of a Type 1 error. CCA, like other multivariate techniques, reduces this risk because the comparisons are run at the same time [Sherry and Henson, 2005].
A significant advantage is that multivariate techniques tend towards realistic interpretations, given that real-world scenarios can have several causes and effects. If causes and effects are analysed individually and independently, they may be distorted; this is especially important in fields such as psychology, where behaviour and cognition are the complex reality of humans.
Multivariate techniques can therefore analyse the data in congruence with the purpose of the research. Examples of relationships missed when using univariate methods are illustrated in [Fish, 1988].
CCA is a comprehensive method when compared to other parametric methods [Henson, 2000], [Knapp, 1978], [Thompson, 1991]. [Sherry and Henson, 2005] describe a number of special-case methods that can be subsumed within CCA:
• ANOVA
• MANOVA
• Multiple regression
• Pearson correlation
• t test
• point-bi-serial correlation
• discriminant analysis
There is, of course, the risk of CCA being used on simple datasets or analysis cases where a less lengthy and tedious method could be used instead, which is a common theme in a number of studies.
CCA:
• CCA - Advantages: 1) has a closed-form analytical solution; 2) easy to apply; 3) invariant to scaling. Limitations: 1) requires N > pk, k = 1, 2; 2) signs of canonical correlations are indeterminate.

Constrained CCA:
• Sparse CCA - Advantages: 1) removes non-informative features and solves the N < pk case; 2) performs reasonably with high-dimensional co-linear data. Limitations: requires optimisation expertise.
• Structured sparse CCA - Advantages: 1) removes non-informative features, solving N < pk with prior information about the data; 2) improves the effectiveness of sparse CCA; 3) produces biologically meaningful results. Limitations: 1) requires optimisation expertise; 2) requires prior knowledge about the data.
• Discriminant sparse CCA - Advantages: discovers group discriminant features.
• Generalised constrained CCA - Advantages: 1) reduces false positives; 2) maintains most of the variance in a stable model. Limitations: 1) requires optimisation expertise; 2) requires predefined constraints.

Non-linear CCA:
• Kernel CCA - Advantages: 1) finds non-linear relationships among modalities; 2) has an analytical solution. Limitations: 1) requires predefined kernel functions; 2) difficult to project from kernel space back to the original feature space, leading to difficulties in interpretation; 3) only a linear kernel space can be projected back to the original feature space.
• Temporal kernel CCA - Advantages: most appropriate for data collected simultaneously from two modalities with a time delay.
• Deep CCA - Advantages: 1) finds unknown non-linear relationships; 2) purely data-driven. Limitations: 1) requires deep learning expertise; 2) requires a large number of training samples (in the tens of thousands).

Multiset CCA:
• Multiset CCA - Advantages: 1) good for more than two modalities; 2) good for group analysis. Limitations: 1) requires predefined objective functions; 2) the number of final canonical components does not represent the intersected common patterns across all modalities.
• Sparse multiset CCA - Advantages: 1) good for more than two modalities; 2) removes non-informative features and solves the N < pk case.
• Multiset CCA with reference - Advantages: supervised fusion technique linking common patterns with an a priori known variable.

Table 6.1: Advantages and limitations of each CCA-related technique [Zhuang et al., 2020]
6.5.3 CCA use-cases
CCA can be the correct choice when two datasets need to be evaluated for their relationship to one another. An indication of whether it is the appropriate method is the rationale for why the variables are in separate sets but still need to be considered together.
The rationale in this research is that there are many variables representing measurements of different chemical compounds in the air, and these are divided into different geographic regions. Thus we can consider 11 variables (e.g. O2, CO2, O3, ...) in Penrose against the same 11 variables in Glen Eden or Auckland City. It would not make sense to combine these all together when we want to compare and correlate the dispersion of chemicals in one region against another.
While one dataset is often called the predictor set and the other the criterion set, these labels are for convenience and ultimately do not have any effect on the outcome. Broadly, the method examines the correlation between a synthetic predictor and a synthetic criterion, each weighted according to the relationships between the variables in its set; it can thus be viewed conceptually as a Pearson r between two synthetic variables.
Figure 6.3: Depiction of CCA network [Wang et al., 2018]
Figure 6.3 shows a representation of the relationships between datasets and variables.
Evaluating the correlational relationships requires that each set is somehow combined into one synthetic variable, referred to as the latent or unobserved variable. CCA creates this synthetic variable by using a linear equation, which results in a single predictor variable and a single criterion variable. Using a linear equation is comparable to multiple regression, where beta ($\beta$) weights are multiplied with observed scores (in Z-score form) and then summed to yield synthetic predicted scores (i.e., $\hat{Y} = \beta_1 X_1 + \beta_2 X_2$) [Sherry and Henson, 2005]. Using standardised weights (comparable to beta weights) results in two linear equations, one each for predictor and criterion, thus resulting in two synthetic variables.
The two resultant equations produce the greatest possible correlation between these two synthetic variables.
The output of a canonical correlation from the linear equations is, in actuality, a Pearson r, and from a high-level view all activities in a CCA procedure are designed to maximise this simple correlation [Henson, 2000], [Thompson, 1984]. There are as many canonical functions as there are variables in the smaller dataset. There is ordinarily residual variance left over in the variable sets from the first function that cannot be explained unless the canonical correlation is 1.00, which is impractical in reality. Two more synthetic variables are created from the second function, which are as strongly correlated as possible in light of the residual variance resulting from the first function, with the condition that the new synthetic variables are as uncorrelated with the first function's two synthetic variables as is possible [Sherry and Henson, 2005].
[Sherry and Henson, 2005] explain that when two more maximally correlated synthetic variables are produced from the second function, and these are perfectly uncorrelated with the synthetic variables from the first function, the functions are usually described as having double orthogonality. Comparably to principal component analysis, this is repeated until the variance of the original variables is explained or until there are as many functions as there are variables in the smaller dataset. Only the functions that define the relationship to the original dataset are considered for analysis. Whether CCA is appropriate depends on some assumptions, such as multivariate normality [Tabachnick and Fidell, 1996] and [Thompson, 1984]. Briefly, this assumes that all variables in the dataset and all linear combinations of these variables are normally distributed, which may, in some cases, be difficult to satisfy. Approaches to this have been devised by [Mardia, 1985] and [Henson, 1999].