6.1 Introduction
Canonical correlation analysis is the focus of this research, and an understanding of what it is, its strengths and weaknesses, and its use cases is important to implementing a deep learning version of it.
The types and variations of canonical correlation analysis are quite diverse, with a number of variations becoming available over the years. Figure 6.1 displays a number of these detailed CCA equations (red box) and their variations: constrained CCA in yellow, non-linear CCA in grey, multi-set CCA in orange, and other types in light and dark green [Zhu et al., 2012].
Canonical correlation analysis is a group of statistical methods used to find the linear interrelationships between two sets of variables [Hair et al., 2013]. With more data available than ever before, many related methods and extensions to canonical correlation analysis have been developed to take advantage of much more powerful computational resources [Wang et al., 2018]. In a canonical correlation analysis the first and second sets of variables are known as the independent and dependent variables respectively [Lattin et al., 2003], [Hair et al., 2013]. For each of these sets, a canonical variate is formed. The method's purpose is to develop a canonical function that maximises the canonical correlation coefficient between these canonical variates. Each of the two canonical variates is interpreted using canonical loadings, the correlations of the individual variables with their respective variates [Hair et al., 2006] - nearly equivalent to estimating a different factor for each set of variables and maximising the correlations between the factors. This can be visualised as in Figure 6.2.
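As a concrete illustration of these ideas, the following minimal numpy sketch (synthetic data; all names are this sketch's own, not from the cited works) forms the first pair of canonical variates and then interprets them through canonical loadings - the Pearson correlation of each original variable with its own set's variate:

```python
import numpy as np

# Synthetic example: two variable sets sharing one latent signal.
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 1))             # shared signal
X = Z + 0.8 * rng.normal(size=(300, 3))   # "independent" variable set
Y = Z + 0.8 * rng.normal(size=(300, 2))   # "dependent" variable set

# First canonical function via QR + SVD (a standard way to solve CCA).
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
qx, _ = np.linalg.qr(Xc)
qy, _ = np.linalg.qr(Yc)
U, rho, Vt = np.linalg.svd(qx.T @ qy)
variate_x = qx @ U[:, 0]                  # canonical variate for set X
variate_y = qy @ Vt[0]                    # canonical variate for set Y

# Canonical loadings: correlation of each variable with its own variate.
loadings_x = [np.corrcoef(Xc[:, j], variate_x)[0, 1] for j in range(X.shape[1])]
loadings_y = [np.corrcoef(Yc[:, j], variate_y)[0, 1] for j in range(Y.shape[1])]
print(rho[0], loadings_x, loadings_y)
```

The first canonical correlation coefficient is exactly the Pearson correlation between the two variates, and the loadings show how strongly each observed variable contributes to its variate.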
Figure 6.2: Canonical Functions - relationship of variables and canonical loadings with the canonical variates in the canonical function
Canonical correlation analysis represents the highest level of the generalised linear model (GLM) and can be understood as closely related to the Pearson r correlation coefficient; ultimately, the correlation of air quality variables is the focus of this research.
GLMs provide a good framework for understanding these sorts of classic analyses in terms of the Pearson correlation coefficient (r) and can be thought of as a hierarchy with CCA as the top-level analysis in the overall group. CCA encompasses both univariate and multivariate methods, contrary to the views generally held by researchers ( [Fan, 1996],
Figure 6.1: Technical details of CCA and relationship between CCA and its variants [Zhuang et al., 2020].
[Fan, 1997], [Henson, 2000], [Thompson, 2005]). Although Structural Equation Modelling (SEM) is the highest level of GLM, it explicitly includes measurement error in the analysis, in contrast to other methods [Fish, 1988].
CCA in some form has been around since the 1930s, with the framework of the method developed by Hotelling ( [Hotelling, 1935], [Hotelling, 1936]). As CCA and its related techniques have become available in software form, their use has increased, but they still see less use than univariate methods such as multiple regression and ANOVA, even though such methods may be less accurate than CCA [Sherry and Henson, 2005].
6.2 Canonical correlation analysis in the environmental domain
Early work on canonical correlation analysis in the environmental domain appears in a study by [Laessig and Duckett, 1979] on its use for environmental health planning. The motivation was that such indices may be useful to identify associations among groups of variables, such as a specific geographic area. The indices may also provide insights into environmental health relationships which are worthy of further epidemiological investigation. Interestingly, they conclude that "Although one can rightfully question whether any substantive conclusions can be drawn from [their] Philadelphia example, the example does demonstrate some of the methodological features of canonical correlation analysis. The canonical correlations and loadings suggest associations both within each set of variables and between sets of predictor and criterion variables." They also list a number of negatives of choosing canonical correlation analysis:
• Requires considerable time and skill in the collection of data and the analysis of results,
• Sample sizes must be large,
• Data should cover a wide range of environmental and health conditions,
• Data should be of high quality,
• Checks on weight (loading) stability are required,
• Loadings and canonical correlations may need to be reviewed a number of times before the "meaning" of each correlation emerges, and even then it may be nothing more than a statistical anomaly.
The advent of modern computing hardware, software and techniques now enables most of these negatives to be removed or minimised.
[Cherry, 1996] notes that canonical correlation analysis has a long history of abuse in statistics, being used in inappropriate circumstances. Nevertheless, his research applying it to two geophysical fields shows good results, and he notes that canonical correlation analysis gave better results than singular value decomposition.
[Yu et al., 1997] studied two statistical models - canonical correlation analysis and multivariate principal component regression - in order to forecast rainfall variations at 10 stations. Sea surface temperatures in the Pacific Ocean are used as predictors for both models.
The results showed that both models are potentially useful in predicting seasonal rainfall variations. They conclude that canonical correlation analysis and principal component regression show similar results.
[Statheropoulos et al., 1998] conducted a study on 5 common chemical air pollutants in Greece and, after using principal component analysis, applied canonical correlation analysis to study the relationships between these 5 chemicals as variable set 1 and meteorological data as variable set 2. This study used an in-house developed tool to facilitate PCA and CCA, and approached missing data values with averages over the 5 years of data available.
Research was undertaken by [Di Leonardo et al., 2014] on the use of CCA in environmental contamination assessment, a use-case not dissimilar to air quality given its intent to characterise the relationships of different trace amounts of pollutants to each other.
[Di Leonardo et al., 2014] applied CCA in order to identify a lower-dimensional set of closely correlated variables in order to make distinctions between different geochemicals, thereby drawing a distinction between what is an original material and what is subsequently generated. Their conclusion was that CCA was able to discriminate between major and trace amounts of chemicals and their natural or anthropogenic origins. Accordingly, CCA and frequency distribution analysis techniques "constitute a powerful and economic tool ... and can be adequately applied to other similar environments".
A large number of research studies focus on the specific results of meteorological activity in very specific regions and present the results of a CCA, rather than commentary on the merits of using canonical correlation analysis itself or accuracy concerns - [Young and Matthews, 1981], [Landman and Mason, 1999], [Livezey and Smith, 1999], [Xoplaki et al., 2003], [Juneng and Tangang, 2008], [Cannon and Hsieh, 2008], [Tivy et al., 2011], [Rana et al., 2018].
6.3 Canonical correlation analysis
[Zhuang et al., 2020] explains CCA as designed to maximise the correlation between two latent variables $y_1 \in \mathbb{R}^{p_1 \times 1}$ and $y_2 \in \mathbb{R}^{p_2 \times 1}$ (modalities). $Y_k \in \mathbb{R}^{N \times p_k}, k = 1, 2$ are samples of the two variables involved, with $N$ representing the number of observations and $p_k, k = 1, 2$ representing the number of features in each variable. CCA determines the canonical coefficients $u_1 \in \mathbb{R}^{p_1 \times 1}$ and $u_2 \in \mathbb{R}^{p_2 \times 1}$ for $Y_1$ and $Y_2$, respectively, by maximising the correlation between $Y_1 u_1$ and $Y_2 u_2$, shown in Equation 6.1 [Zhuang et al., 2020]:
\[
\mathrm{CCA}: \max \rho = \operatorname{corr}(Y_1 u_1, Y_2 u_2) = \frac{u_1^T \Sigma_{12} u_2}{\sqrt{u_1^T \Sigma_{11} u_1}\,\sqrt{u_2^T \Sigma_{22} u_2}}. \tag{6.1}
\]
$\Sigma_{11}$ and $\Sigma_{22}$ are the within-set covariance matrices and $\Sigma_{12}$ is the between-set covariance matrix. The denominator normalises the within-set covariances, thereby ensuring that the solution is invariant to the scaling of the coefficients. The canonical coefficients $u_1$ and $u_2$ are calculated by setting the partial derivatives of the objective function in $u_1$ and $u_2$ to zero, giving Equations 6.2 and 6.3 if $\Sigma_{kk}$ is invertible [Zhuang et al., 2020]:
\[
\Sigma_{12} u_2 = \rho \Sigma_{11} u_1 \quad \text{and} \quad \Sigma_{21} u_1 = \rho \Sigma_{22} u_2. \tag{6.2}
\]
\[
\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} u_1 = \rho^2 u_1, \qquad \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} u_2 = \rho^2 u_2. \tag{6.3}
\]
Each pair of canonical coefficients $(u_1, u_2)$ are eigenvectors of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ and $\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, respectively, with the same eigenvalue $\rho^2$. Following Equation 6.3, up to $M = \min(p_1, p_2)$ pairs of canonical coefficients can be computed through singular value decomposition (SVD), and every pair of canonical variables $Y_1 u_1^{(m)}, Y_2 u_2^{(m)}, m = 1, 2, \ldots, M$, is uncorrelated with any other pair of canonical variables. The corresponding $M$ canonical correlation values are ordered in descending order as $\rho^{(1)} > \rho^{(2)} > \ldots > \rho^{(M)}$. One of the requirements for solving the CCA problem (Equation 6.1) through this eigenvalue problem (Equation 6.3) is that the within-set covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ have to be invertible.
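The SVD route above can be sketched directly in numpy. The following hypothetical implementation (function and variable names are this sketch's own) whitens each set with a Cholesky factor of its within-set covariance, so the singular values of the whitened cross-covariance are the canonical correlations $\rho^{(1)} \geq \ldots \geq \rho^{(M)}$:

```python
import numpy as np

def cca_svd(Y1, Y2):
    # Centre both sets and form Sigma_11, Sigma_22, Sigma_12 (Equation 6.1).
    N = Y1.shape[0]
    Y1c = Y1 - Y1.mean(axis=0)
    Y2c = Y2 - Y2.mean(axis=0)
    S11 = Y1c.T @ Y1c / (N - 1)
    S22 = Y2c.T @ Y2c / (N - 1)
    S12 = Y1c.T @ Y2c / (N - 1)
    # Whiten with Cholesky factors (requires S11, S22 invertible); the
    # singular values of L1^{-1} S12 L2^{-T} are the canonical correlations.
    L1 = np.linalg.cholesky(S11)
    L2 = np.linalg.cholesky(S22)
    K = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T
    U, rho, Vt = np.linalg.svd(K)
    U1 = np.linalg.solve(L1.T, U)      # canonical coefficients u1^(m)
    U2 = np.linalg.solve(L2.T, Vt.T)   # canonical coefficients u2^(m)
    return rho, U1, U2

rng = np.random.default_rng(0)
N, p1, p2 = 500, 4, 3
Z = rng.normal(size=(N, 2))                                  # shared latent signal
Y1 = Z @ rng.normal(size=(2, p1)) + 0.5 * rng.normal(size=(N, p1))
Y2 = Z @ rng.normal(size=(2, p2)) + 0.5 * rng.normal(size=(N, p2))
rho, U1, U2 = cca_svd(Y1, Y2)
print(rho)    # M = min(p1, p2) = 3 values, in descending order
```

By construction each $u_1^{(m)\,T} \Sigma_{11} u_1^{(m)} = 1$, so the correlation of the first pair of canonical variables equals the first singular value, and successive pairs are mutually uncorrelated.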
To satisfy this invertibility requirement, the number of observations in $Y_1$ and $Y_2$ needs to be greater than the number of features - that is, $N > p_k, k = 1, 2$. In addition, since the squares of the canonical correlation values ($\rho^2$) are the eigenvalues of the matrices $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ and $\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, both matrices are required to be positive definite.
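A quick numpy check on synthetic data shows why this matters: with fewer observations than features, the sample covariance matrix cannot be inverted, since centring alone already caps its rank at $N - 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 10                        # deliberately N < p
Y = rng.normal(size=(N, p))
Yc = Y - Y.mean(axis=0)
S = Yc.T @ Yc / (N - 1)             # sample within-set covariance
print(np.linalg.matrix_rank(S))     # rank is at most N - 1, far below p
```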
For statistical inference, parametric tests exist for CCA if both variables adhere to a Gaussian distribution. The null hypothesis is that no canonical correlation exists between $Y_1$ and $Y_2$, that is, $\rho^{(1)} = \rho^{(2)} = \ldots = \rho^{(M)} = 0$; the alternative is that at least one of the canonical correlation values is not zero. The test based on Wilks' $\Lambda$ is [Bartlett, 1939]:
\[
\Lambda = -\left(N - \frac{p_1 + p_2 + 3}{2}\right) \log \prod_{i=1}^{M} \left(1 - \rho^{(i)2}\right), \tag{6.4}
\]
which follows a chi-square distribution $\chi^2_{p_1 \times p_2}$ with $p_1 \times p_2$ degrees of freedom. We should also test whether a specific canonical correlation value $(\rho^{(m)}, 1 \leq m \leq M)$ could be different from zero. In this case the test statistic in Equation 6.4 becomes:
\[
\Lambda^{(m)} = -\left(N - \frac{p_1 + p_2 + 3}{2}\right) \log \prod_{i=m+1}^{M} \left(1 - \rho^{(i)2}\right), \tag{6.5}
\]
which follows $\chi^2_{(p_1 - m)(p_2 - m)}$.
However, in practice, these tests are not commonly used because they strictly require the variables to follow a Gaussian distribution and are quite sensitive to outliers [Bartlett, 1939]. As a workaround, permutation-based non-parametric statistics have become commonly used when applying CCA. The observations of one of the variables are randomly shuffled ($Y_1$ becomes $\hat{Y}_1$) while the observations of the other variable are kept intact ($Y_2$ remains), after which a new set of canonical correlation values can be calculated for $\hat{Y}_1$ and $Y_2$ following Equation 6.3. The random shuffling procedure is repeated multiple times, generating a null distribution of canonical correlation values. Statistical significance (p-values) for the observed canonical correlation values can then be obtained from this null distribution.
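The permutation procedure can be sketched in a few lines of numpy (synthetic data; function names are this sketch's own). Rows of $Y_1$ are shuffled while $Y_2$ stays intact, and the observed largest canonical correlation is compared against the resulting null distribution:

```python
import numpy as np

def canonical_corrs(Y1, Y2):
    # Canonical correlation values as singular values of Q1^T Q2,
    # where Q1, Q2 come from QR decompositions of the centred sets.
    Y1 = Y1 - Y1.mean(axis=0)
    Y2 = Y2 - Y2.mean(axis=0)
    q1, _ = np.linalg.qr(Y1)
    q2, _ = np.linalg.qr(Y2)
    return np.linalg.svd(q1.T @ q2, compute_uv=False)

def cca_permutation_pvalue(Y1, Y2, n_perm=1000, seed=0):
    # Shuffle the rows of Y1 (Y1 -> Y1_hat) while Y2 stays intact,
    # recompute the largest canonical correlation each time, and compare
    # the observed value against the resulting null distribution.
    rng = np.random.default_rng(seed)
    observed = canonical_corrs(Y1, Y2)[0]
    null = np.empty(n_perm)
    for b in range(n_perm):
        Y1_hat = Y1[rng.permutation(len(Y1))]
        null[b] = canonical_corrs(Y1_hat, Y2)[0]
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 1))
Y1 = Z + 0.5 * rng.normal(size=(200, 3))
Y2 = Z + 0.5 * rng.normal(size=(200, 2))
print(cca_permutation_pvalue(Y1, Y2, n_perm=500))
```

With genuinely related sets the observed value sits far in the tail of the null distribution, yielding a small p-value; with unrelated sets the p-value is unremarkable.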
6.4 Variations of Canonical Correlation Analysis
Standard canonical correlation analysis has a number of variations that have been developed for different uses:
• Constrained CCA adds penalties in order to re-frame CCA as a constrained optimisation problem ( [Yang et al., 2018], [Zhuang et al., 2017]),
• Sparse CCA uses the L1-norm penalty [Zhuang et al., 2017],
• Structured Sparse CCA/Discriminant CCA is where Sparse CCA has known features beforehand ( [Lin et al., 2014], [Kim et al., 2019], [Wang et al., 2019])
• Kernel CCA and temporal kernel CCA map the original feature spaces onto new feature spaces to reveal the non-linear relationships between the two variable sets [Hardoon et al., 2007],
• Multiset CCA maximises correlations using SUMCOR ( [Kettenring, 1971], [Drud, 1985]), and multiset CCA with constraints adds penalty terms to each $u_i$ [Sui et al., 2018].
• Deep CCA [Andrew et al., 2013], implemented in this research.
[Zhuang et al., 2020] note that non-parametric permutation tests have been widely performed in CCA variant techniques to determine the statistical significance of each canonical correlation value and the corresponding canonical coefficients. Observations of one of the variables are randomly shuffled, whereby $Y_1$ becomes $\hat{Y}_1$, and observations of the other variable are kept intact ($Y_2$ remains). After a number of random shufflings (often 5,000) and application of the CCA technique, the obtained canonical correlation values form the null distribution. p-values of the true canonical correlation values are determined by comparing those true values to the null distribution. Another technique is to provide null data input to the CCA variant techniques themselves; this null data is often created from the physical properties of the input variables.
6.5 Differentiation of CCA techniques
Canonical correlation analysis variations can be categorised into three groups: standard (conventional) canonical correlation analysis, non-linear canonical correlation analysis and multi-set canonical correlation analysis. All three of these methods can be solved by closed-form analytical methods, obtained using the partial derivatives of the objective function with respect to each unknown separately [Zhuang et al., 2020].
6.5.1 Relationship between CCA and other multivariate and univariate techniques
[Zhuang et al., 2020] describes canonical correlation analysis's relationships with other univariate and multivariate techniques in the following ways:
Relationship with other multivariate techniques
CCA can be directly rewritten in terms of the multivariate multiple regression (MVMR) model:
\[
Y_1 u_1 = Y_2 u_2 + e, \tag{6.6}
\]
where $u_1$ and $u_2$ are obtained by minimising the residual term $e \in \mathbb{R}^{N \times 1}$. Since CCA is scale-invariant, a solution to Equation 6.6 is also a solution of Equation 6.1. Furthermore,
with normalisation terms $u_1^T \Sigma_{11} u_1 = 1$ and $u_2^T \Sigma_{22} u_2 = 1$, the MVMR model is exactly equivalent to CCA; that is, maximising the canonical correlation between $Y_1$ and $Y_2$ is equivalent to minimising the residual term:
\[
\max_{u_1, u_2} \operatorname{corr}(Y_1 u_1, Y_2 u_2) \;\Leftrightarrow\; \max_{u_1, u_2} u_1^T \Sigma_{12} u_2 \;\Leftrightarrow\; \min_{u_1, u_2} -u_1^T \Sigma_{12} u_2 \;\Leftrightarrow\; \min_{u_1, u_2} \|Y_1 u_1 - Y_2 u_2\|_2^2. \tag{6.7}
\]
By replacing the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ in the denominator of Equation 6.1 with the identity matrix $I$, conventional CCA is converted to partial least squares (PLS), which maximises the covariance between latent variables. If $Y_1$ is the same as $Y_2$, PLS will maximise the variance within a single variable, which is equivalent to PCA.
Relationship with univariate techniques
If one variable in CCA, for example $Y_1$, has only a single feature, that is, $y \in \mathbb{R}^{N \times 1}$, $u_1$ can then be set to $1$ and CCA becomes a linear regression problem:
\[
y = X\beta + \epsilon, \tag{6.8}
\]
where $Y_1$ is renamed as $y$ and $Y_2$ is renamed as $X$ to follow conventional notation.
$\epsilon \in \mathbb{R}^{N \times 1}$ denotes the residual term. If both variables $Y_1$ and $Y_2$ contain only one feature, the canonical correlation between $Y_1$ and $Y_2$ becomes the Pearson correlation between $Y_1$ and $Y_2$ as in univariate analysis.
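This reduction is easy to verify numerically. In the following sketch (synthetic data), the single canonical correlation equals the absolute value of the Pearson correlation, since canonical correlations are non-negative by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# CCA with a single feature per set, via the same QR/SVD route used for
# the multivariate case.
xc = (x - x.mean())[:, None]
yc = (y - y.mean())[:, None]
qx, _ = np.linalg.qr(xc)
qy, _ = np.linalg.qr(yc)
rho = np.linalg.svd(qx.T @ qy, compute_uv=False)[0]

r = np.corrcoef(x, y)[0, 1]         # ordinary Pearson correlation
print(rho, r)
```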
6.5.2 CCA advantages
CCA has several advantages over univariate techniques, primarily due to its multivariate nature. Multivariate techniques reduce the likelihood that a Type 1 error will occur. A Type 1 error - also known as a false positive - occurs when a researcher incorrectly rejects a true null hypothesis. This means the findings appear to be significant when, in actuality, they have occurred by chance. This type of error can happen when too many tests are run on the dataset, each test carrying its own risk of a Type 1 error. CCA, like other multivariate techniques, reduces this risk because the comparisons are run at the same time [Sherry and Henson, 2005].
A significant advantage is that multivariate techniques tend towards realistic interpretations, given that real-world scenarios can have several causes and effects. If causes and effects are analysed individually and independently, they may be distorted; this is especially important in fields such as psychology, where behaviour and cognition are the complex reality of humans.
Multivariate techniques can therefore analyse the data in congruence with the purpose of the research. Examples of relationships missed when using univariate methods are illustrated in [Fish, 1988].
CCA is a comprehensive method when compared to other parametric methods [Henson, 2000], [Knapp, 1978], [Thompson, 1991]. [Sherry and Henson, 2005] describe a number of special-case methods that can be subsumed within CCA:
• ANOVA
• MANOVA
• Multiple regression
• Pearson correlation
• t test
• point-bi-serial correlation
• discriminant analysis
There is, of course, the risk of CCA being used on simple datasets or analysis cases where a less lengthy and tedious method could be used instead, which is a common theme in a number of studies.
CCA:
• CCA - Advantages: 1) has a closed-form analytical solution; 2) easy to apply; 3) invariant to scaling. Limitations: 1) requires N > pk, k = 1, 2; 2) signs of canonical correlations are indeterminate.

Constrained CCA:
• Sparse CCA - Advantages: 1) removes non-informative features and solves the N < pk case; 2) performs reasonably with high-dimensional co-linear data. Limitations: requires optimisation expertise.
• Structured sparse CCA - Advantages: 1) removes non-informative features, solving N < pk with prior information about the data; 2) improves the effectiveness of sparse CCA; 3) produces biologically meaningful results. Limitations: 1) requires optimisation expertise; 2) requires prior knowledge about the data.
• Discriminant sparse CCA - Advantages: discovers group discriminant features.
• Generalised constrained CCA - Advantages: 1) reduces false positives; 2) maintains most of the variance in a stable model. Limitations: 1) requires optimisation expertise; 2) requires predefined constraints.

Non-linear CCA:
• Kernel CCA - Advantages: 1) finds non-linear relationships among modalities; 2) has an analytical solution. Limitations: 1) requires predefined kernel functions; 2) difficult to project from kernel space back to the original feature space, leading to difficulties in interpretation; 3) only a linear kernel space can be projected back to the original feature space.
• Temporal kernel CCA - Advantages: most appropriate for data collected simultaneously from two modalities with a time delay.
• Deep CCA - Advantages: 1) finds unknown non-linear relationships; 2) purely data-driven. Limitations: 1) requires deep learning expertise; 2) requires a large number of training samples (in the tens of thousands).

Multiset CCA:
• Multiset CCA - Advantages: 1) good for more than two modalities; 2) good for group analysis. Limitations: 1) requires predefined objective functions; 2) the number of final canonical components does not represent the intersected common patterns across all modalities.
• Sparse multiset CCA - Advantages: 1) good for more than two modalities; 2) removes non-informative features and solves the N < pk case.
• Multiset CCA with reference - Advantages: supervised fusion technique linking common patterns with an a priori known variable.

Table 6.1: Advantages and limitations of each CCA-related technique [Zhuang et al., 2020]
6.5.3 CCA use-cases
CCA can be the correct choice when two datasets need to be evaluated for their relationship to one another. An indication of whether it is the appropriate method is the rationale for why the variables are in separate sets but still need to be considered together.
The rationale in this research is that there are many variables representing measurements of different chemical compounds in the air, and these are divided into different geographic regions. Thus we can consider 11 variables (e.g. O2, CO2, O3, ...) in Penrose against the same 11 variables in Glen Eden or Auckland City. It would not make sense to combine these all together when we want to compare and correlate the dispersion of chemicals in one region against another.
While one dataset is often called the predictor set and the other the criterion set, these labels are for convenience and ultimately do not have any effect on the outcome. Broadly, the method examines the correlation between a synthetic predictor and a synthetic criterion, each weighted according to the relationships between the variables in its set; it can thus be viewed conceptually as a Pearson r between two synthetic variables.
Figure 6.3: Depiction of CCA network [Wang et al., 2018]
Figure 6.3 shows a representation of the relationships between datasets and variables.
Evaluating the correlational relationships requires that each set is somehow combined into one synthetic variable, referred to as the latent or unobserved variable. CCA creates this synthetic variable by using a linear equation, which results in a single predictor variable and a single criterion variable. Using a linear equation is comparable to multiple regression, where beta ($\beta$) weights are multiplied with observed scores (in Z-score form) and then summed to yield synthetic predicted scores (i.e., $\hat{Y} = \beta_1 X_1 + \beta_2 X_2$) [Sherry and Henson, 2005]. Using standardised weights (comparable to beta weights) results in two linear equations, one each for predictor and criterion, thus resulting in two synthetic variables.
The two resultant equations produce the greatest possible correlation between these two synthetic variables.
The output of a canonical correlation from the linear equations is, in actuality, a Pearson r, and from a high-level view all activities in a CCA procedure are designed to maximise this simple correlation [Henson, 2000], [Thompson, 1984]. There are as many canonical functions as there are variables in the smaller dataset. There is ordinarily residual variance left over in the variable sets from the first function that cannot be explained unless the canonical correlation is 1.00, which is impractical in reality. Two more synthetic variables are created from the second function, which are as strongly correlated as possible in light of the residual variance resulting from the first function, with the condition that the new synthetic variables are as uncorrelated with the first function's two synthetic variables as is possible [Sherry and Henson, 2005].
[Sherry and Henson, 2005] explain that when two more maximally correlated synthetic variables are produced from the second function, and these are perfectly uncorrelated with the synthetic variables from the first function, the functions are usually described as having double orthogonality. Comparably to principal component analysis, this is repeated until the variance of the original variables is explained or until there are as many functions as there are variables in the smaller dataset. Only the functions that define the relationship to the original dataset are considered for analysis. Whether CCA is appropriate depends on some assumptions, such as multivariate normality [Tabachnick and Fidell, 1996] and [Thompson, 1984]. Briefly, this assumes that all variables in the dataset and all linear combinations of these variables are normally distributed, which may, in some cases, be difficult to satisfy. Approaches to this have been devised by [Mardia, 1985] and [Henson, 1999].