Chapter 4: Correspondence Analysis
4.1 Introduction to Correspondence Analysis
Correspondence analysis (CA) is a technique for displaying the rows and columns of a data matrix (primarily, a two-way contingency table) as points in dual low-dimensional vector spaces (Greenacre, 1984). This exploratory multivariate technique is the brainchild of Jean-Paul Benzécri and originated in France in the early 1960s; it is used for the graphical and numerical analysis of almost any data matrix with nonnegative entries, but primarily involves tables of counts. Greenacre & Blasius (2006) have reported that CA can be extended to analyze presence/absence data, rankings and preferences, paired comparison data, multiresponse tables, multiway tables and square transition tables, amongst others. They have further reported that, since it is oriented toward categorical data, it can be used to analyze almost any type of tabular data after suitable data transformation and recoding. In correspondence analysis it is claimed that no underlying distribution has to be assumed and no model has to be hypothesized; rather, a decomposition of the data is obtained in order to study their “structure” (Panagiotakos & Pitsavos, 2004).
Similar to PCA, the rows or columns of the data matrix are assumed to be points in a high-dimensional Euclidean space, and the method aims to redefine the dimensions of the space so that the principal dimensions capture the most variance possible, allowing for lower-dimensional descriptions of the data (Greenacre & Blasius, 2006).
The basic method underlying CA is discussed in Greenacre & Blasius (2006) and is detailed below.
Greenacre (1984) presents the theory of CA in terms of the singular value decomposition (SVD) of a suitably transformed matrix. Greenacre & Blasius (2006) present the theory in the same context and describe the CA algorithm by use of a single cross-table, or a two-way
contingency table $\mathbf{N}$, with $I$ rows and $J$ columns, whose $(i,j)$th element is denoted $n_{ij}$. The correspondence matrix $\mathbf{P}$ is calculated as a first step, with elements

$$p_{ij} = \frac{n_{ij}}{n},$$

where $n$ is the sample size. Corresponding to each element $p_{ij}$ of $\mathbf{P}$ is a row sum

$$p_{i\cdot} = \sum_{j=1}^{J} p_{ij},$$

and a column sum

$$p_{\cdot j} = \sum_{i=1}^{I} p_{ij},$$

denoted by $r_i$ and $c_j$ respectively.
Greenacre & Blasius (2006) explain that these marginal relative frequencies, which are called masses, play dual roles in CA by serving to center and to normalize the correspondence matrix, and that under the null hypothesis of independence, the expected values of the relative frequencies $p_{ij}$ are the products $r_i c_j$ of the masses. Further, the process of centering involves calculating the differences $(p_{ij} - r_i c_j)$ between observed and expected relative frequencies, and normalization involves dividing these differences by the square roots of $r_i c_j$, leading to a matrix of standardized residuals with elements
$$s_{ij} = \frac{p_{ij} - r_i c_j}{\sqrt{r_i c_j}}.$$
In matrix notation this is written as

$$\mathbf{S} = \mathbf{D}_r^{-1/2}\left(\mathbf{P} - \mathbf{r}\mathbf{c}^{\mathsf{T}}\right)\mathbf{D}_c^{-1/2},$$

where $\mathbf{r}$ and $\mathbf{c}$ are the vectors of row and column masses, and $\mathbf{D}_r$ and $\mathbf{D}_c$ are diagonal matrices with the masses on their respective diagonals.
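To make these definitions concrete, the following is a minimal Python sketch (using NumPy) that computes the correspondence matrix, the masses and the standardized residuals; the small 3×2 table `N` is an invented example for illustration, not data from this chapter.

```python
import numpy as np

# Hypothetical 3x2 contingency table (invented for illustration only)
N = np.array([[16.0,  4.0],
              [12.0,  8.0],
              [ 2.0, 18.0]])

n = N.sum()                  # grand total (sample size)
P = N / n                    # correspondence matrix P
r = P.sum(axis=1)            # row masses r_i
c = P.sum(axis=0)            # column masses c_j

# Standardized residuals: s_ij = (p_ij - r_i c_j) / sqrt(r_i c_j),
# i.e. S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
```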
The sum of the squared elements of the matrix of standardized residuals, given by

$$\sum_{i=1}^{I}\sum_{j=1}^{J} s_{ij}^{2} = \mathrm{trace}\left(\mathbf{S}\mathbf{S}^{\mathsf{T}}\right),$$

is called the total inertia and is defined as the amount that quantifies the total variance in the cross-table. The standardized residuals in $\mathbf{S}$ resemble those in the calculation of the chi-square statistic, $\chi^2$, apart from the division by $n$ to convert the original frequencies to relative ones. The following relationship exists:

$$\text{total inertia} = \frac{\chi^2}{n}.$$
By the use of the SVD, the association structure in the matrix $\mathbf{S}$ is revealed:

$$\mathbf{S} = \mathbf{U}\mathbf{D}_{\alpha}\mathbf{V}^{\mathsf{T}},$$

where $\mathbf{D}_{\alpha}$ is the diagonal matrix with singular values in descending order: $\alpha_1 \geq \alpha_2 \geq \cdots \geq \alpha_K > 0$, and $K$ is the rank of the matrix $\mathbf{S}$. The columns of $\mathbf{U}$ and $\mathbf{V}$, called left singular vectors and right singular vectors respectively, are orthonormal: $\mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{V}^{\mathsf{T}}\mathbf{V} = \mathbf{I}$.
The connection between the SVD and the eigenvalue decomposition can be understood as follows:

$$\mathbf{S}\mathbf{S}^{\mathsf{T}} = \mathbf{U}\mathbf{D}_{\alpha}\mathbf{V}^{\mathsf{T}}\mathbf{V}\mathbf{D}_{\alpha}\mathbf{U}^{\mathsf{T}} = \mathbf{U}\mathbf{D}_{\alpha}^{2}\mathbf{U}^{\mathsf{T}} = \mathbf{U}\mathbf{D}_{\lambda}\mathbf{U}^{\mathsf{T}},$$

$$\mathbf{S}^{\mathsf{T}}\mathbf{S} = \mathbf{V}\mathbf{D}_{\alpha}\mathbf{U}^{\mathsf{T}}\mathbf{U}\mathbf{D}_{\alpha}\mathbf{V}^{\mathsf{T}} = \mathbf{V}\mathbf{D}_{\alpha}^{2}\mathbf{V}^{\mathsf{T}} = \mathbf{V}\mathbf{D}_{\lambda}\mathbf{V}^{\mathsf{T}},$$

showing that the right singular vectors of $\mathbf{S}$ correspond to the eigenvectors of $\mathbf{S}^{\mathsf{T}}\mathbf{S}$, the left singular vectors correspond to the eigenvectors of $\mathbf{S}\mathbf{S}^{\mathsf{T}}$, and the squared singular values $\alpha_k^2$ in $\mathbf{D}_{\alpha}^{2}$ correspond to the eigenvalues $\lambda_k$ of $\mathbf{S}^{\mathsf{T}}\mathbf{S}$ or $\mathbf{S}\mathbf{S}^{\mathsf{T}}$, where $\mathbf{D}_{\lambda}$ is the diagonal matrix of eigenvalues. These eigenvalues are termed principal inertias, and the sum $\sum_k \lambda_k$ is equal to the total inertia since

$$\mathrm{trace}\left(\mathbf{S}\mathbf{S}^{\mathsf{T}}\right) = \mathrm{trace}\left(\mathbf{S}^{\mathsf{T}}\mathbf{S}\right) = \mathrm{trace}\left(\mathbf{D}_{\alpha}^{2}\right) = \mathrm{trace}\left(\mathbf{D}_{\lambda}\right).$$
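Continuing the sketch above, the SVD of `S` yields the singular values directly, and the trace identity for the total inertia can be checked numerically:

```python
# SVD of the standardized residual matrix (continues the sketch above)
U, alpha, Vt = np.linalg.svd(S, full_matrices=False)

principal_inertias = alpha**2              # lambda_k = alpha_k^2
total_inertia = (S**2).sum()               # trace(S S^T)
assert np.isclose(principal_inertias.sum(), total_inertia)
```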
Hendry et al. (2014) have reported that, from the result of the SVD, the principal coordinates of the points, i.e., coordinates with respect to their principal axes, can be defined. The row principal coordinates are calculated as

$$\mathbf{F} = \mathbf{D}_r^{-1/2}\mathbf{U}\mathbf{D}_{\alpha},$$

whilst the column principal coordinates are calculated as

$$\mathbf{G} = \mathbf{D}_c^{-1/2}\mathbf{V}\mathbf{D}_{\alpha}.$$

The row standard coordinates can be calculated as follows:

$$\mathbf{X} = \mathbf{D}_r^{-1/2}\mathbf{U},$$

and the column standard coordinates as

$$\mathbf{Y} = \mathbf{D}_c^{-1/2}\mathbf{V}.$$

These row and column principal coordinates are used to produce the graphical displays of the points on the CA maps, and the amount of inertia explained by each principal axis is given by the square of the corresponding singular value.
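As a continuation of the same sketch, the four coordinate matrices follow directly from the SVD output; the broadcast division by `np.sqrt(r)` (or `np.sqrt(c)`) stands in for pre-multiplication by $\mathbf{D}_r^{-1/2}$ (or $\mathbf{D}_c^{-1/2}$), and the trailing `* alpha` stands in for post-multiplication by $\mathbf{D}_{\alpha}$:

```python
# Principal and standard coordinates (continues the sketch above)
F = U / np.sqrt(r)[:, None] * alpha     # row principal:    D_r^{-1/2} U D_alpha
G = Vt.T / np.sqrt(c)[:, None] * alpha  # column principal: D_c^{-1/2} V D_alpha
X = U / np.sqrt(r)[:, None]             # row standard:     D_r^{-1/2} U
Y = Vt.T / np.sqrt(c)[:, None]          # column standard:  D_c^{-1/2} V
```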
The most common method used to test for significant associations between rows and columns in a contingency table is the chi-square statistic. Greenacre & Blasius (2006) define the chi-square statistic as the sum of squared deviations between observed and expected frequencies, where the expected frequencies are those calculated under the independence model
$$\chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left(n_{ij} - e_{ij}\right)^2}{e_{ij}},$$

where $e_{ij} = n_{i\cdot} \times n_{\cdot j} / n$.
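For the hypothetical table used in the sketches above, this statistic can be computed directly and, as a cross-check, compared against `scipy.stats.chi2_contingency` (assuming SciPy is available):

```python
from scipy.stats import chi2_contingency

# Expected counts under independence: e_ij = n_i. * n_.j / n
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - E)**2 / E).sum()

# Same statistic from SciPy (no continuity correction)
chi2_scipy = chi2_contingency(N, correction=False)[0]
assert np.isclose(chi2, chi2_scipy)
```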
This calculation can be repeated for the relative frequencies $p_{ij}$ to obtain

$$\frac{\chi^2}{n} = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left(p_{ij} - \hat{p}_{ij}\right)^2}{\hat{p}_{ij}},$$

which is the chi-square statistic divided by the grand total of the table, and where $\hat{p}_{ij} = r_i \times c_j$.
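In the running sketch, this relative-frequency version is a short numerical check: the expected relative frequencies are the outer product of the masses, and the resulting sum equals both $\chi^2/n$ and the total inertia computed earlier.

```python
# Chi-square in relative frequencies: p-hat_ij = r_i c_j (continues the sketch)
P_hat = np.outer(r, c)
assert np.isclose(((P - P_hat)**2 / P_hat).sum(), chi2 / n)
assert np.isclose(((P - P_hat)**2 / P_hat).sum(), total_inertia)
```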
Now, using the above equation, the total inertia can be rewritten as

$$\frac{\chi^2}{n} = \sum_{j=1}^{J} c_j \sum_{i=1}^{I} \frac{\left(p_{ij}/c_j - r_i\right)^2}{r_i},$$

where $p_{ij}/c_j$ is an element of the $j$th column profile, $p_{ij}/c_j = n_{ij}/n_{\cdot j}$, and $r_i$ is the corresponding element of the average column profile, $r_i = n_{i\cdot}/n$. The squared distance is a Euclidean-type distance in which each squared difference is divided by the corresponding average value $r_i$, and the weight of the $j$th column profile is its mass $c_j$. These $\chi^2$ distances between profiles in CA are visualized as ordinary Euclidean distances.
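Finally, this column-profile decomposition can be verified in the same sketch: weighting the squared chi-square distance of each column profile from the average profile by its mass $c_j$ recovers the total inertia $\chi^2/n$.

```python
# Column profiles: the j-th column of P divided by its mass c_j (continues the sketch)
col_profiles = P / c            # broadcasts c_j across each column
avg_profile = r                 # average column profile = row masses

# Mass-weighted chi-square distances sum to the total inertia chi^2 / n
inertia = sum(c[j] * ((col_profiles[:, j] - avg_profile)**2 / avg_profile).sum()
              for j in range(P.shape[1]))
assert np.isclose(inertia, chi2 / n)
```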