Statistica Applicata - Italian Journal of Applied Statistics Vol. 29 (2-3)
doi.org/10.26398/IJAS.0029-006

LINKAGE INDEX OF VARIABLES AND ITS RELATIONSHIP WITH VARIANCE OF EIGENVALUES IN PCA AND MCA

Jean-Luc Durand¹
Laboratoire d’Ethologie Expérimentale et Comparée, LEEC EA4443, Université Paris 13 (Sorbonne Paris Cité), Villetaneuse, France
Brigitte Le Roux
MAP5, UMR 8145, Université Paris Descartes (Sorbonne Paris Cité), Paris
CEVIPOF, UMR 7048, Sciences Po, Paris, France

¹ Corresponding author: Jean-Luc Durand, email: jean-luc.durand@univ-paris13.fr

Abstract. In the present article, we show that, in principal component analysis (PCA) on a correlation matrix as well as in multiple correspondence analysis (MCA), the strength of the relationship between variables is linked to the variance of the eigenvalues, and indicates the axes to which the variables contribute the most. In PCA, we define the linkage index of a variable as the mean of the squared correlations between this variable and the others. We prove that the variance of eigenvalues is proportional to the mean linkage index and that, for each variable, the variance of eigenvalues weighted by the contributions of the variable to axes is proportional to the linkage index of the variable. In MCA, similar properties are proven regarding both categorical variables and categories. We illustrate these properties using two datasets coming from classical articles by Spearman (1904) for PCA and Burt (1950) for MCA.

Keywords: PCA, MCA, Variance of eigenvalues, Contributions to axes.

1. INTRODUCTION
In the present article, we study the variance of eigenvalues in PCA and MCA, i.e., the mean of the squared deviations of the eigenvalues from their mean. The variance of eigenvalues can be seen as an index of departure from sphericity. We examine the relationship between the variance of eigenvalues and the correlations between variables in PCA or the mean square contingency coefficients (usually denoted $\Phi^2$) between categorical variables in MCA. This article develops the properties presented in Durand (1998).
There are few studies on the variance of eigenvalues. However, we can find the expression of the sum of squares of eigenvalues in studies on the number of axes to be used for interpretation, or on confidence intervals (see e.g. Saporta, 2003; Karlis et al., 2003). The variance of eigenvalues is also used in applications, especially in biological studies (Pavlicev et al., 2009).
2. PRINCIPAL COMPONENT ANALYSIS
In this section we study the variance of eigenvalues in the case of PCA on
correlation matrix.
2.1. BASIC PROPERTIES AND NOTATIONS
Let $I$ denote a set of $n$ individuals, $K$ a set indexing $p$ ($p > 1$) non-constant variables on $I$; the $k$-th variable is denoted $x_k = (x_k^i)_{i \in I}$, its mean $\bar{x}_k$ and its variance $v_k$.
Let $r_{kk'}$ be the correlation between variables $x_k$ and $x_{k'}$ and $R = [r_{kk'}]$ the correlation matrix between the $p$ variables. The calculation method of standard PCA is based on the diagonalization of the correlation matrix $R$. Let $L$ be a set indexing the nonnull eigenvalues. If $\Lambda = [\lambda_\ell]$ denotes the diagonal matrix of eigenvalues $(\lambda_\ell)_{\ell \in L}$ and $A = [a_{k\ell}]$ the matrix of eigenvectors, then the PCA of variables $x_k$ can be written $RA = A\Lambda$ with $A^\top A = I$. In the simple linear regression of the standardized initial variable $(x_k - \bar{x}_k)/\sqrt{v_k}$ on the $\ell$-th principal variable with variance 1, the regression coefficient is equal to the correlation coefficient $r_{k\ell}$ between the $k$-th initial variable and the $\ell$-th principal variable. We have the properties:
$$r_{k\ell} = \sqrt{\lambda_\ell}\, a_{k\ell} \quad \text{and} \quad \sum_{\ell \in L} (r_{k\ell})^2 = 1.$$
The contribution of variable $x_k$ to the variance $\lambda_\ell$ of axis $\ell$, denoted $\mathrm{Ctr}_k^\ell$, is equal to $r_{k\ell}^2/\lambda_\ell$, with
$$\forall \ell \in L, \quad \sum_{k \in K} \mathrm{Ctr}_k^\ell = 1.$$
If the correlation matrix has full rank (all eigenvalues are strictly positive), one has the property:
$$\forall k \in K, \quad \sum_{\ell \in L} \mathrm{Ctr}_k^\ell = 1. \qquad (1)$$
This property comes from $\mathrm{Ctr}_k^\ell = r_{k\ell}^2/\lambda_\ell = a_{k\ell}^2$, with $\sum_{\ell \in L} a_{k\ell}^2 = 1$ since matrix $A$ is orthogonal ($A^\top A = A A^\top = I$).
2.2. VARIANCE OF EIGENVALUES
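Theorem 2.1. The variance of eigenvalues, denoted $V(\lambda)$, is such that
$$V(\lambda) = \frac{1}{p} \sum_{k \in K} \sum_{\substack{k' \in K \\ k' \neq k}} r_{kk'}^2$$
Proof. The mean of eigenvalues is $\bar{\lambda} = 1$, the variance is $V(\lambda) = \frac{1}{p} \sum_{\ell \in L} \lambda_\ell^2 - 1$, $\sum_{\ell \in L} \lambda_\ell^2$ being equal to the trace of matrix $R^2$. The entries of $R^2$ are equal to $\sum_{k'' \in K} r_{kk''}\, r_{k''k'}$ and the diagonal entries are $\big(\sum_{k' \in K} r_{kk'}^2\big)_{k \in K}$, hence the trace of $R^2$ is equal to
$$\sum_{k \in K} \Big(1 + \sum_{k' \neq k} r_{kk'}^2\Big) = p + \sum_{k \in K} \sum_{\substack{k' \in K \\ k' \neq k}} r_{kk'}^2.$$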
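As a numerical illustration of Theorem 2.1, the following Python sketch compares the variance of the eigenvalues of a correlation matrix with the sum of squared off-diagonal correlations divided by $p$. It assumes NumPy; the simulated data, the seed and all variable names are illustrative choices and do not come from the article.

```python
import numpy as np

# Hypothetical data: 200 individuals, 5 correlated numerical variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

R = np.corrcoef(X, rowvar=False)      # correlation matrix (p x p)
p = R.shape[0]

eigvals = np.linalg.eigvalsh(R)       # eigenvalues of the symmetric matrix R
var_eig = np.mean((eigvals - eigvals.mean()) ** 2)   # V(lambda)

# Sum of squared correlations over all pairs k != k', divided by p.
off_diag_sq = (R ** 2).sum() - (np.diag(R) ** 2).sum()
rhs = off_diag_sq / p

print(var_eig, rhs)                   # the two values should agree
```

The mean of the eigenvalues is 1 because the trace of a correlation matrix is $p$, which is why the variance reduces to the mean of the squared eigenvalues minus 1.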
This expression of the variance invites us to consider the mean of the
squared correlations between one variable and the others. As we will see
later on, this quantity is an index of the strength of the link between one
variable and the others. Hence the following definition:
Definition 2.1 (Linkage index of a variable). We call linkage index of variable $x_k$, denoted $LI_k$, the mean of the squared correlations between the variable $x_k$ and the $(p-1)$ others:
$$LI_k = \frac{1}{p-1} \sum_{\substack{k' \in K \\ k' \neq k}} r_{kk'}^2$$
Note that the linkage index of a variable is between zero and one.
Proposition 1. The mean linkage index of the $p$ variables, denoted $\overline{LI}$, is equal to the variance of eigenvalues divided by $(p-1)$:
$$\overline{LI} = \frac{1}{p-1} V(\lambda)$$
Comment. The mean linkage index is a measure of both the global magnitude of correlations and of the departure from sphericity. In the particular case of an equicorrelation matrix, all off-diagonal elements of which are equal to $r$ (Morrison, 1976, p. 331), the mean linkage index is equal to $r^2$.
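The equicorrelation case can be checked directly. The sketch below, assuming NumPy and arbitrary illustrative values of $p$ and $r$, builds such a matrix and recovers $r^2$ from the variance of its eigenvalues divided by $(p-1)$.

```python
import numpy as np

p, r = 6, 0.4                          # illustrative values, not from the article

# Equicorrelation matrix: ones on the diagonal, r everywhere else.
R = np.full((p, p), r)
np.fill_diagonal(R, 1.0)

eigvals = np.linalg.eigvalsh(R)        # one eigenvalue 1 + (p-1) r, the others 1 - r
var_eig = eigvals.var()                # population variance (ddof = 0)
mean_li = var_eig / (p - 1)            # mean linkage index (Proposition 1)

print(mean_li, r ** 2)                 # both should be equal to r squared
```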
2.3. VARIABLES AND EIGENVALUES
From now on, we assume that the correlation matrix has full rank.
We will now express the linkage index of a variable as a function of the
eigenvalues and the contributions of this variable to axes.
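Considering the contributions of the initial variable $x_k$ to axes (namely $(\mathrm{Ctr}_k^\ell)_{\ell \in L}$ with $\sum_{\ell \in L} \mathrm{Ctr}_k^\ell = 1$, see Equation 1), we define the weighted mean ($\bar{\lambda}_k$) and the weighted variance ($V_k(\lambda)$) of eigenvalues:
$$\bar{\lambda}_k = \sum_{\ell \in L} \mathrm{Ctr}_k^\ell\, \lambda_\ell \quad \text{and} \quad V_k(\lambda) = \sum_{\ell \in L} \mathrm{Ctr}_k^\ell\, (\lambda_\ell - \bar{\lambda}_k)^2$$
The following two properties can easily be shown:
$$\forall k \in K, \quad \bar{\lambda}_k = 1 \quad \text{and} \quad V_k(\lambda) = \sum_{\ell \in L} \lambda_\ell\, r_{k\ell}^2 - 1$$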
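Theorem 2.2. The linkage index of variable $x_k$ is proportional to $V_k(\lambda)$:
$$LI_k = \frac{1}{p-1} V_k(\lambda)$$
Proof. By the reconstitution formula of the correlation matrix (see Le Roux and Rouanet, 2004, p. 153), one has $R = A \Lambda A^\top$, hence $R^2 = A \Lambda^2 A^\top$. The $k$-th diagonal entry of $R^2$ is equal to $\sum_{k' \in K} r_{kk'}^2 = 1 + \sum_{k' \neq k} r_{kk'}^2$ (see the proof of Theorem 2.1) and also to $\sum_{\ell \in L} \lambda_\ell^2\, a_{k\ell}^2 = \sum_{\ell \in L} \lambda_\ell\, r_{k\ell}^2$ (since $r_{k\ell} = \sqrt{\lambda_\ell}\, a_{k\ell}$). Hence
$$\sum_{\ell \in L} \lambda_\ell\, r_{k\ell}^2 = V_k(\lambda) + 1 = 1 + \sum_{\substack{k' \in K \\ k' \neq k}} r_{kk'}^2.$$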
Corollary 2.2.1. The ratio of the linkage index of variable $x_k$ to the mean linkage index is equal to the ratio of the weighted variance of eigenvalues $V_k(\lambda)$ to the variance of eigenvalues $V(\lambda)$:
$$\frac{LI_k}{\overline{LI}} = \frac{V_k(\lambda)}{V(\lambda)}$$
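Theorem 2.2 and its corollary can be checked numerically along the following lines. This is a minimal sketch assuming NumPy and a full-rank correlation matrix; the simulated data and the names `ctr`, `weighted_var` and `li` are illustrative and not part of the article.

```python
import numpy as np

# Hypothetical data: 300 individuals, 4 correlated variables (full-rank R).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
R = np.corrcoef(X, rowvar=False)
p = R.shape[0]

lam, A = np.linalg.eigh(R)             # R = A diag(lam) A^T, columns of A are unit eigenvectors
ctr = A ** 2                           # ctr[k, l]: contribution of variable k to axis l

# Weighted variance of eigenvalues V_k(lambda); the weighted mean is 1 for every k.
weighted_var = ctr @ (lam - 1.0) ** 2

# Linkage index: mean squared correlation of each variable with the others.
li = ((R ** 2).sum(axis=1) - np.diag(R) ** 2) / (p - 1)

print(np.allclose(li, weighted_var / (p - 1)))               # Theorem 2.2
print(np.allclose(li / li.mean(), weighted_var / lam.var())) # Corollary 2.2.1
```

With a full-rank matrix, each row and each column of `ctr` sums to 1, so the contributions can serve as weights both per axis and per variable.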
Comments
1) The more the linkage index of a variable is superior to the mean, the
more this variable contributes to extreme axes (first and last axes).
2) The more the linkage index of a variable is inferior to the mean, the
more this variable contributes to central axes (axes with variance near 1).
2.4. APPLICATION TO SPEARMAN’S DATA
Table 1 (see Spearman, 1904, p. 291) gives the correlations between performance variables of English pupils in the following subjects: Classics (k1), French (k2), English (k3), Mathematics (k4), Pitch Discrimination (k5), and Music (k6).

Table 1: Correlations, linkage indexes $LI_k$ and ratios $LI_k/\overline{LI}$.

| Variable | k1 | k2 | k3 | k4 | k5 | k6 | $LI_k$ | $LI_k/\overline{LI}$ |
|---|---|---|---|---|---|---|---|---|
| k1 Classics | 1 | 0.83 | 0.78 | 0.70 | 0.66 | 0.63 | 0.524 | 1.34 |
| k2 French | 0.83 | 1 | 0.67 | 0.67 | 0.65 | 0.57 | 0.467 | 1.20 |
| k3 English | 0.78 | 0.67 | 1 | 0.64 | 0.54 | 0.51 | 0.404 | 1.04 |
| k4 Mathematics | 0.70 | 0.67 | 0.64 | 1 | 0.45 | 0.51 | 0.362 | 0.93 |
| k5 Pitch discrim. | 0.66 | 0.65 | 0.54 | 0.45 | 1 | 0.40 | 0.302 | 0.78 |
| k6 Music | 0.63 | 0.57 | 0.51 | 0.51 | 0.40 | 1 | 0.280 | 0.72 |

$\overline{LI} = 0.390$
We notice (see Table 1) that Classics is the variable most correlated with the others, with a linkage index equal to 0.524, which is 34% higher than the average. As we can see in Table 2, this variable is the one that contributes the most to axes 1 and 6, whose variances are the farthest from 1 (“extreme” axes). In sharp contrast, Music is the least correlated with the other variables (linkage index equal to 0.280) and contributes heavily to axes 2 and 3, whose variances are the closest to 1 (“central” axes).
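Table 2: Eigenvalues $\lambda_\ell$, contributions of variables to axes (in %), variance of eigenvalues weighted by contributions $V_k(\lambda)$ and variance ratios $V_k(\lambda)/V(\lambda)$.

| | Axis 1 | Axis 2 | Axis 3 | Axis 4 | Axis 5 | Axis 6 | $V_k(\lambda)$ | $V_k(\lambda)/V(\lambda)$ |
|---|---|---|---|---|---|---|---|---|
| $\lambda_\ell$ | 4.103 | 0.619 | 0.512 | 0.357 | 0.270 | 0.139 | | |
| k1 | 21 | 0 | 0 | 2 | 7 | 70 | 2.62 | 1.34 |
| k2 | 19 | 1 | 0 | 5 | 54 | 20 | 2.33 | 1.20 |
| k3 | 17 | 0 | 12 | 59 | 4 | 9 | 2.02 | 1.04 |
| k4 | 16 | 6 | 31 | 32 | 14 | 0 | 1.81 | 0.93 |
| k5 | 13 | 51 | 15 | 2 | 18 | 0 | 1.51 | 0.78 |
| k6 | 13 | 41 | 42 | 0 | 3 | 1 | 1.40 | 0.72 |

$V(\lambda) = 1.95$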
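The quantities reported for Spearman's data can be recomputed from the correlations of Table 1. The following sketch assumes NumPy; the matrix is transcribed from Table 1, and the variable names are illustrative. Since the published correlations are rounded to two decimals, the output should be close to, but need not exactly match, the printed tables.

```python
import numpy as np

# Correlation matrix transcribed from Table 1 (Classics, French, English,
# Mathematics, Pitch discrimination, Music).
R = np.array([
    [1.00, 0.83, 0.78, 0.70, 0.66, 0.63],
    [0.83, 1.00, 0.67, 0.67, 0.65, 0.57],
    [0.78, 0.67, 1.00, 0.64, 0.54, 0.51],
    [0.70, 0.67, 0.64, 1.00, 0.45, 0.51],
    [0.66, 0.65, 0.54, 0.45, 1.00, 0.40],
    [0.63, 0.57, 0.51, 0.51, 0.40, 1.00],
])
p = R.shape[0]

# Linkage indexes: mean squared correlation with the other variables.
li = ((R ** 2).sum(axis=1) - 1.0) / (p - 1)

# Eigen-decomposition, axes sorted by decreasing eigenvalue.
lam, A = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]

ctr_pct = 100 * A ** 2                 # contributions of variables to axes, in %
v_k = (A ** 2) @ (lam - 1.0) ** 2      # weighted variances V_k(lambda)

print(np.round(li, 3), round(li.mean(), 3))
print(np.round(lam, 3), round(lam.var(), 2))
print(np.round(ctr_pct).astype(int))
print(np.round(v_k, 2), np.round(v_k / lam.var(), 2))
```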
3. MULTIPLE CORRESPONDENCE ANALYSIS
We will now adopt the same approach for MCA.
3.1. BASIC PROPERTIES AND NOTATIONS
Let $I$ denote the set of $n$ individuals and $Q$ the set of categorical variables (questions). The table analyzed by MCA is an $I \times Q$ table such that the entry in cell $(i, q)$ is the category of variable $q$ chosen by individual $i$. The set of categories of variable $q$ is denoted by $K_q$ and its cardinal by $K_q$; the overall set of categories is denoted by $K$ and its cardinal by $K$.
The number of individuals who have chosen category $k$ is denoted by $n_k$ (with $n_k > 0$) and the corresponding relative frequency by $f_k = n_k/n$.
Multiple correspondence on $I \times K$. Let us denote $\delta_{IK} = (\delta_{ik})_{i \in I, k \in K}$ the multiple correspondence on $I \times K$ defined by
$$\delta_{ik} = \begin{cases} 1 & \text{if individual } i \text{ has chosen category } k \\ 0 & \text{if not} \end{cases}$$
Performing the MCA of the $I \times Q$ table is equivalent to proceeding to Correspondence Analysis of the $I \times K$ table $\delta_{IK}$ (Benzécri, 1977; Greenacre, 1984). The solution is given by the diagonalization of the symmetric matrix $S = [s_{kk'}]$ with
$$s_{kk'} = \frac{1}{Q}\, \frac{n_{kk'} - n_k n_{k'}/n}{\sqrt{n_k n_{k'}}}$$
($n_{kk'}$ is the number of individuals who have chosen both categories $k$ and $k'$).
We denote $L$ the set indexing the $K - Q$ nonnull eigenvalues and $(y_k^\ell)_{\ell \in L}$ the principal coordinates of the category point $k$. The sum of eigenvalues $(\lambda_\ell)_{\ell \in L}$ is equal to $(K - Q)/Q$, hence the mean is $\bar{\lambda} = 1/Q$.
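Burt table and mean square contingency coefficients. The Burt table associated with $\delta_{IK}$ is the symmetric $K \times K$ table defined by:
$$b_{kk'} = \sum_{i \in I} \delta_{ik}\, \delta_{ik'} = \begin{cases} n_k & \text{if } k = k' \\ 0 & \text{if } k \neq k' \text{ with } k, k' \in K_q \\ n_{kk'} & \text{if } k \in K_q \text{ and } k' \in K_{q'} \text{ with } q \neq q' \end{cases}$$
Denoting $\Phi^2_{qq'}$ the mean square contingency coefficient of the contingency table crossing variables $q$ and $q'$, one has: $\Phi^2_{qq} = K_q - 1$ and for $q \neq q'$,
$$\Phi^2_{qq'} = \sum_{k \in K_q} \sum_{k' \in K_{q'}} \frac{(f_{kk'} - f_k f_{k'})^2}{f_k f_{k'}}$$
(with $f_{kk'} = n_{kk'}/n$).
The $\Phi^2$ of the Burt table, denoted $\Phi^2_{\mathrm{Burt}}$, is the average of the $\Phi^2$ of the $Q^2$ subtables of the Burt table. Denoting $\overline{\Phi^2}$ the mean of the $\Phi^2$ of the $Q(Q-1)$ non-diagonal subtables, one has:
$$\Phi^2_{\mathrm{Burt}} = \frac{1}{Q^2} \sum_{q \in Q} \sum_{q' \in Q} \Phi^2_{qq'} = \frac{1}{Q}\, \frac{K - Q}{Q} + \frac{1}{Q^2} \sum_{q \in Q} \sum_{\substack{q' \in Q \\ q' \neq q}} \Phi^2_{qq'} = \frac{1}{Q}\, \frac{K - Q}{Q} + \frac{Q - 1}{Q}\, \overline{\Phi^2}$$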
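Contributions of categories and of variables. The squared distance between the category point $k$ and the mean point of the cloud is equal to
$$\frac{1 - f_k}{f_k} = \sum_{\ell \in L} (y_k^\ell)^2.$$
The contribution of category $k$ to axis $\ell$, denoted $\mathrm{Ctr}_k^\ell$, is equal to $\frac{f_k}{Q}\, \frac{(y_k^\ell)^2}{\lambda_\ell}$. We have the two following properties:
$$\forall \ell \in L, \quad \sum_{k \in K} \mathrm{Ctr}_k^\ell = 1 \quad \text{and} \quad \forall k \in K, \quad \sum_{\ell \in L} \mathrm{Ctr}_k^\ell = 1 - f_k$$
The first property follows from the definition of contribution to axes. The second one can be proven as follows: the $\ell$-th unit eigenvector of $S$ associated with nonnull eigenvalue $\lambda_\ell$ is $(c_{k\ell})_{k \in K}$ with $c_{k\ell} = \sqrt{f_k/Q}\, \big(y_k^\ell/\sqrt{\lambda_\ell}\big)$ and the $Q$ ones associated with the null eigenvalue are $(c_{kq})_{q \in Q}$ with $c_{kq} = \sqrt{f_k}$ for $k \in K_q$ and 0 for $k \notin K_q$. Hence
$$\sum_{\ell \in L} c_{k\ell}^2 + \sum_{q \in Q} c_{kq}^2 = \frac{f_k}{Q} \sum_{\ell \in L} \frac{(y_k^\ell)^2}{\lambda_\ell} + f_k = 1.$$
By definition, the contribution of a variable to axis $\ell$ is the sum of the contributions of its categories: $\mathrm{Ctr}_q^\ell = \sum_{k \in K_q} \mathrm{Ctr}_k^\ell$, and we have the two following properties:
$$\forall \ell \in L, \quad \sum_{q \in Q} \mathrm{Ctr}_q^\ell = 1 \quad \text{and} \quad \forall q \in Q, \quad \sum_{\ell \in L} \mathrm{Ctr}_q^\ell = K_q - 1$$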
Burt cloud. The mean point of the subcloud of individuals who have chosen category $k$ is called category mean point. Its profile (obtained from the Burt table) is equal to $\frac{1}{Q} (f_{kk'}/f_k)_{k' \in K}$; its squared distance to the mean point is equal to
$$\sum_{k' \in K} \frac{\frac{1}{Q^2} (f_{kk'}/f_k - f_{k'})^2}{f_{k'}/Q} = \frac{1}{Q f_k} \sum_{k' \in K} \frac{(f_{kk'} - f_k f_{k'})^2}{f_k f_{k'}}.$$
Letting $\varphi^2_{q'}(k) = \sum_{k' \in K_{q'}} \frac{(f_{kk'} - f_k f_{k'})^2}{f_k f_{k'}}$ if $k \in K_q$ and $q' \neq q$, the squared distance can be written:
$$\frac{1}{Q f_k} (1 - f_k) + \frac{1}{Q f_k} \sum_{\substack{q' \in Q \\ q' \neq q}} \varphi^2_{q'}(k), \quad \text{with} \quad \Phi^2_{qq'} = \sum_{k \in K_q} \varphi^2_{q'}(k) \qquad (2)$$
The $K$ category mean points define the Burt cloud (see Le Roux and Rouanet, 2004, pp. 199-200). The principal coordinates of the category mean point $k$ on axis $\ell$ are equal to $y_k^\ell \sqrt{\lambda_\ell}$, hence its squared distance to the mean point is also equal to
$$\sum_{\ell \in L} \lambda_\ell\, (y_k^\ell)^2.$$
The eigenvalues verify the following property:
$$\sum_{\ell \in L} \lambda_\ell^2 = \Phi^2_{\mathrm{Burt}}.$$
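The properties of this section can be illustrated on a small synthetic data set. The sketch below assumes NumPy; the sample size, the numbers of categories and every name in the code are hypothetical choices made for the illustration. It builds the indicator table, the Burt table and the matrix $S$, then checks the number and the sum of the nonnull eigenvalues and the identity between the sum of squared eigenvalues and $\Phi^2_{\mathrm{Burt}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, levels = 200, [3, 2, 4]             # hypothetical: 3 questions with 3, 2 and 4 categories
Q, K = len(levels), sum(levels)

# Random answers, then the I x K indicator (disjunctive) table.
data = np.column_stack([rng.integers(0, m, size=n) for m in levels])
Z = np.hstack([np.eye(m)[data[:, q]] for q, m in enumerate(levels)])

burt = Z.T @ Z                         # Burt table (K x K)
nk = np.diag(burt)                     # category counts n_k
fk = nk / n

# Symmetric matrix S whose nonnull eigenvalues are the MCA eigenvalues.
S = (burt - np.outer(nk, nk) / n) / np.sqrt(np.outer(nk, nk)) / Q
lam = np.linalg.eigvalsh(S)

# Phi^2 of every subtable of the Burt table, then their average.
E = (burt / n - np.outer(fk, fk)) ** 2 / np.outer(fk, fk)
bounds = np.cumsum([0] + levels)
phi2 = np.array([[E[bounds[a]:bounds[a + 1], bounds[b]:bounds[b + 1]].sum()
                  for b in range(Q)] for a in range(Q)])
phi2_burt = phi2.mean()

print(int((lam > 1e-10).sum()), K - Q)       # number of nonnull eigenvalues
print(lam.sum(), (K - Q) / Q)                # sum of eigenvalues
print((lam ** 2).sum(), phi2_burt)           # sum of squared eigenvalues
```

The diagonal subtables give $\Phi^2_{qq} = K_q - 1$, which the array `phi2` reproduces on its diagonal.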
3.2. VARIANCE OF EIGENVALUES
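Theorem 3.1. The variance of eigenvalues, denoted $V(\lambda)$, is such that:
$$V(\lambda) = \frac{1}{K - Q}\, \frac{Q - 1}{Q}\, \overline{\Phi^2}$$
Proof. $\frac{1}{K - Q} \sum_{\ell \in L} (\lambda_\ell - \bar{\lambda})^2 = \frac{1}{K - Q}\, \Phi^2_{\mathrm{Burt}} - \bar{\lambda}^2$ with $\bar{\lambda} = 1/Q$. Hence the variance is
$$V(\lambda) = \frac{1}{K - Q} \left( \frac{1}{Q}\, \frac{K - Q}{Q} + \frac{Q - 1}{Q}\, \overline{\Phi^2} \right) - \frac{1}{Q^2}.$$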
This expression of the variance of eigenvalues leads us to consider the mean of the $\Phi^2$ between one categorical variable and the others, that is, it leads us to the following definition.
Definition 3.1 (Linkage index of categorical variable). The linkage index of categorical variable $q$, denoted $LI_q$, is such that:
$$LI_q = \frac{1}{K_q - 1}\, \frac{1}{Q - 1} \sum_{\substack{q' \in Q \\ q' \neq q}} \Phi^2_{qq'}$$
Note that the linkage index of a categorical variable is between zero and one, since $\Phi^2_{qq'} \leq K_q - 1$.
Property 1 (Mean linkage index). The mean linkage index of the $Q$ categorical variables weighted by $(K_q - 1)_{q \in Q}$, denoted $\overline{LI}$, is such that:
$$\overline{LI} = \frac{Q^2}{Q - 1}\, V(\lambda) = \overline{\Phi^2} \Big/ \left( \frac{K - Q}{Q} \right)$$
Proof. $\sum_{q \in Q} (K_q - 1) = K - Q$, hence the weighted mean of linkage indexes is
$$\frac{1}{K - Q} \sum_{q \in Q} (K_q - 1)\, LI_q = \frac{1}{K - Q} \sum_{q \in Q} \frac{1}{Q - 1} \sum_{\substack{q' \in Q \\ q' \neq q}} \Phi^2_{qq'} = \frac{Q^2}{Q - 1}\, V(\lambda).$$
Definition 3.2 (Linkage index of category). Given a category $k$ of the categorical variable $q$, the linkage index of category $k$, denoted $LI_k$, is defined as follows:
$$LI_k = \frac{1}{1 - f_k} \times \frac{1}{Q - 1} \sum_{\substack{q' \in Q \\ q' \neq q}} \varphi^2_{q'}(k)$$
with, for $q' \neq q$, $\varphi^2_{q'}(k) = \sum_{k' \in K_{q'}} \frac{(f_{kk'} - f_k f_{k'})^2}{f_k f_{k'}}$.
One deduces from $\sum_{k \in K_q} \varphi^2_{q'}(k) = \Phi^2_{qq'}$ that the linkage index of variable $q$ is equal to the mean of the linkage indexes of its categories weighted by $(1 - f_k)_{k \in K_q}$:
$$LI_q = \frac{1}{K_q - 1} \sum_{k \in K_q} (1 - f_k)\, LI_k.$$
3.3. CATEGORIES, CATEGORICAL VARIABLES AND EIGENVALUES
In order to explain the link between categorical variables or categories and
eigenvalues, we will now express the linkage indexes in terms of eigenvalues
weighted by contributions to axes.
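Lemma 3.1. The mean of eigenvalues weighted by the contributions of category $k$ to axes, denoted $\bar{\lambda}_k$, is equal to $1/Q$.
Proof.
$$\bar{\lambda}_k = \sum_{\ell \in L} \mathrm{Ctr}_k^\ell\, \lambda_\ell \Big/ \Big( \sum_{\ell \in L} \mathrm{Ctr}_k^\ell \Big) = \frac{1}{1 - f_k}\, \frac{f_k}{Q} \sum_{\ell \in L} (y_k^\ell)^2 = \frac{1}{1 - f_k}\, \frac{f_k}{Q}\, \frac{1 - f_k}{f_k} = \frac{1}{Q}.$$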
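Lemma 3.2. The variance of eigenvalues weighted by the contributions of category $k$ of variable $q$ to axes is denoted $V_k(\lambda)$ and called $k$-variance of eigenvalues; it is equal to
$$\frac{1}{Q^2 (1 - f_k)} \sum_{\substack{q' \in Q \\ q' \neq q}} \varphi^2_{q'}(k).$$
Proof. The weighted sum of the squared eigenvalues is equal to
$$\sum_{\ell \in L} \frac{f_k}{Q}\, (y_k^\ell)^2\, \lambda_\ell = \sum_{\ell \in L} \frac{f_k}{Q} \big( y_k^\ell \sqrt{\lambda_\ell} \big)^2 = \frac{f_k}{Q} \Big( \frac{1}{Q f_k} (1 - f_k) + \frac{1}{Q f_k} \sum_{q' \neq q} \varphi^2_{q'}(k) \Big)$$
(Equation 2). Hence
$$V_k(\lambda) = \frac{1}{Q^2} \Big( (1 - f_k) + \sum_{q' \neq q} \varphi^2_{q'}(k) \Big) \Big/ (1 - f_k) - \frac{1}{Q^2} = \frac{1}{Q^2 (1 - f_k)} \sum_{q' \neq q} \varphi^2_{q'}(k).$$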
Theorem 3.2. The linkage index of category $k$ is proportional to the variance of eigenvalues weighted by contributions of $k$ to axes.
$$LI_k = \frac{Q^2}{Q - 1}\, V_k(\lambda)$$
Proof.
$$V_k(\lambda) = \frac{Q - 1}{Q^2}\, \frac{1}{(Q - 1)(1 - f_k)} \sum_{q' \neq q} \varphi^2_{q'}(k) = \frac{Q - 1}{Q^2}\, LI_k.$$
From Property 1 and Theorem 3.2 we deduce that:
Corollary 3.2.1. The ratio of the linkage index of category $k$ to the mean linkage index is equal to the ratio of the $k$-variance of eigenvalues to the variance of eigenvalues:
$$\frac{LI_k}{\overline{LI}} = \frac{V_k(\lambda)}{V(\lambda)}$$
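Theorem 3.2 and Corollary 3.2.1 can be checked numerically in the same spirit. The sketch below assumes NumPy and uses synthetic answers in which one question is made to depend on another; the data, the tolerance used to detect null eigenvalues and all names are illustrative assumptions, not the authors' computations.

```python
import numpy as np

rng = np.random.default_rng(42)
n, levels = 300, [3, 2, 4]             # hypothetical sample size and category counts
Q, K = len(levels), sum(levels)

# Synthetic answers; the second question depends on the first one.
x0 = rng.integers(0, 3, size=n)
x1 = (x0 + rng.integers(0, 2, size=n)) % 2
x2 = rng.integers(0, 4, size=n)
data = np.column_stack([x0, x1, x2])

Z = np.hstack([np.eye(m)[data[:, q]] for q, m in enumerate(levels)])
burt = Z.T @ Z
nk = np.diag(burt)
fk = nk / n

S = (burt - np.outer(nk, nk) / n) / np.sqrt(np.outer(nk, nk)) / Q
lam, C = np.linalg.eigh(S)
keep = lam > 1e-10                     # drop the Q null eigenvalues
lam, ctr = lam[keep], C[:, keep] ** 2  # ctr[k, l]: contribution of category k to axis l

# k-variance of eigenvalues (the weights of category k sum to 1 - f_k).
v_k = ctr @ (lam - 1 / Q) ** 2 / (1 - fk)

# Linkage index of each category from the phi^2 terms with the other questions.
E = (burt / n - np.outer(fk, fk)) ** 2 / np.outer(fk, fk)
var_of = np.repeat(np.arange(Q), levels)
other = var_of[:, None] != var_of[None, :]
li_k = (E * other).sum(axis=1) / ((Q - 1) * (1 - fk))

var_eig = np.mean((lam - 1 / Q) ** 2)
mean_li = np.sum((1 - fk) * li_k) / (K - Q)

print(np.allclose(li_k, Q ** 2 / (Q - 1) * v_k))      # Theorem 3.2
print(np.allclose(li_k / mean_li, v_k / var_eig))     # Corollary 3.2.1
```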
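The properties about categorical variables follow from the averaging property of the linkage indexes of categories ($LI_q = \frac{1}{K_q - 1} \sum_{k \in K_q} (1 - f_k)\, LI_k$) and from the summation property of contributions ($\mathrm{Ctr}_q^\ell = \sum_{k \in K_q} \mathrm{Ctr}_k^\ell$). We denote $V_q(\lambda)$ the $q$-variance of eigenvalues (variance of eigenvalues weighted by the contributions of variable $q$):
$$V_q(\lambda) = \frac{1}{K_q - 1} \sum_{\ell \in L} \mathrm{Ctr}_q^\ell \Big( \lambda_\ell - \frac{1}{Q} \Big)^2.$$
One has the following property: $LI_q = \frac{Q^2}{Q - 1}\, V_q(\lambda)$. Hence:
$$\frac{LI_q}{\overline{LI}} = \frac{V_q(\lambda)}{V(\lambda)}$$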
Comments
1) The more the linkage index of a category (or a categorical variable) is
superior to the mean, the more this category (or this variable) contributes
to extreme axes (first and last axes).
2) The more the linkage index of a category (or a categorical variable)
is inferior to the mean, the more this category (or this variable) contributes
to central axes (axes with variance near 1/Q).
3.4. APPLICATION TO BURT’S DATA
Burt’s data (Table 3), reproduced from Burt (1950, p. 171), give, for 100 individuals (men living in Liverpool), the observed response patterns and their absolute frequencies for four attributes (categorical variables), that is, A: Hair (a1: fair, a2: red, a3: dark), B: Eyes (b1: light, b2: mixed, b3: brown), C: Head (c1: narrow, c2: wide), D: Stature (d1: tall, d2: short).
As we can see in Table 4, the categories having the highest linkage indexes are the category b1 (light) of Eyes and the two categories of Stature.
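Table 3: Observed response patterns with their absolute frequencies.

| Pattern | Abs. freq. | Pattern | Abs. freq. | Pattern | Abs. freq. |
|---|---|---|---|---|---|
| a1b1c1d1 | 8 | a2b1c1d1 | 6 | a3b1c1d1 | 9 |
| a1b1c1d2 | 4 | a2b1c2d1 | 2 | a3b1c2d1 | 2 |
| a1b1c2d1 | 2 | a2b2c1d1 | 2 | a3b2c1d1 | 3 |
| a1b2c1d1 | 1 | a2b2c1d2 | 1 | a3b2c1d2 | 12 |
| a1b2c1d2 | 1 | a2b2c2d2 | 2 | a3b2c2d1 | 2 |
| a1b2c2d1 | 2 | a2b3c1d2 | 2 | a3b2c2d2 | 8 |
| a1b2c2d2 | 2 | | | a3b3c1d1 | 1 |
| a1b3c2d2 | 2 | | | a3b3c1d2 | 19 |
| | | | | a3b3c2d1 | 3 |
| | | | | a3b3c2d2 | 4 |

Marginal frequencies of the categories:

| a1 | a2 | a3 | b1 | b2 | b3 | c1 | c2 | d1 | d2 |
|---|---|---|---|---|---|---|---|---|---|
| 22 | 15 | 63 | 33 | 36 | 31 | 69 | 31 | 43 | 57 |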
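Table 4: Eigenvalues $\lambda_\ell$, linkage index $LI_k$, ratio $LI_k/\overline{LI}$ and contributions of categories to axes (in %).

| Category | $LI_k$ | $LI_k/\overline{LI}$ | Axis 1 | Axis 2 | Axis 3 | Axis 4 | Axis 5 | Axis 6 |
|---|---|---|---|---|---|---|---|---|
| $\lambda_\ell$ | | | 0.489 | 0.299 | 0.254 | 0.206 | 0.179 | 0.073 |
| a1 fair | .054 | 0.63 | 9 | 5 | 36 | 1 | 23 | 5 |
| a2 red | .027 | 0.31 | 6 | 0 | 55 | 4 | 20 | 0 |
| a3 dark | .099 | 1.14 | 9 | 1 | 0 | 0 | 25 | 2 |
| b1 light | .211 | 2.42 | 26 | 2 | 1 | 0 | 1 | 36 |
| b2 mixed | .037 | 0.43 | 3 | 26 | 5 | 24 | 0 | 5 |
| b3 brown | .093 | 1.07 | 12 | 16 | 2 | 23 | 3 | 14 |
| c1 narrow | .020 | 0.23 | 0 | 15 | 0 | 14 | 0 | 1 |
| c2 wide | .020 | 0.23 | 1 | 34 | 0 | 31 | 1 | 2 |
| d1 tall | .170 | 1.95 | 20 | 0 | 0 | 2 | 16 | 20 |
| d2 short | .170 | 1.95 | 15 | 0 | 0 | 1 | 12 | 15 |

$\overline{LI} = 0.087$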
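Table 5: Eigenvalues $\lambda_\ell$, linkage index $LI_q$, ratio $LI_q/\overline{LI}$ and contributions of categorical variables to axes (in %).

| Variable | $LI_q$ | $LI_q/\overline{LI}$ | Axis 1 | Axis 2 | Axis 3 | Axis 4 | Axis 5 | Axis 6 |
|---|---|---|---|---|---|---|---|---|
| $\lambda_\ell$ | | | 0.489 | 0.299 | 0.254 | 0.206 | 0.179 | 0.073 |
| q1 Hair | .051 | 0.59 | 23 | 6 | 91 | 5 | 67 | 7 |
| q2 Eyes | .115 | 1.32 | 41 | 45 | 8 | 47 | 4 | 55 |
| q3 Head | .020 | 0.23 | 1 | 49 | 0 | 45 | 1 | 3 |
| q4 Stature | .170 | 1.95 | 34 | 0 | 0 | 3 | 28 | 35 |

$\overline{LI} = 0.087$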
Their linkage indexes are about twice the mean ($LI_{b1}/\overline{LI} = 2.42$ and $LI_{d1}/\overline{LI} = LI_{d2}/\overline{LI} = 1.95$). These three categories contribute heavily to “extreme” axes 1 and 6 (together, they account for 61% of axis 1 and 71% of axis 6). In contrast, both categories of Head and the category a2 (red) of Hair have the smallest linkage indexes, less than a third of the mean ($LI_{c1}/\overline{LI} = LI_{c2}/\overline{LI} = 0.23$ and $LI_{a2}/\overline{LI} = 0.31$). The contributions of these three categories to both “extreme” axes (1 and 6) are very small (7% and 3%, respectively) but they contribute heavily to “central” axes 2, 3 and 4 (49%, 55% and 50%, respectively).
In Table 5, we see that Head has a very small linkage index; this variable hardly contributes to the first axis, nor to the 5th and 6th axes.
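The linkage indexes and eigenvalues for Burt's data can be recomputed from the response patterns of Table 3. The sketch below assumes NumPy; the dictionary `patterns` is transcribed from Table 3 (categories coded by their rank within Hair, Eyes, Head, Stature), and the other names are illustrative. The results should be close to the values printed in Tables 4 and 5, up to rounding.

```python
import numpy as np

# (hair, eyes, head, stature) -> absolute frequency, transcribed from Table 3.
patterns = {
    (1, 1, 1, 1): 8, (1, 1, 1, 2): 4, (1, 1, 2, 1): 2, (1, 2, 1, 1): 1,
    (1, 2, 1, 2): 1, (1, 2, 2, 1): 2, (1, 2, 2, 2): 2, (1, 3, 2, 2): 2,
    (2, 1, 1, 1): 6, (2, 1, 2, 1): 2, (2, 2, 1, 1): 2, (2, 2, 1, 2): 1,
    (2, 2, 2, 2): 2, (2, 3, 1, 2): 2,
    (3, 1, 1, 1): 9, (3, 1, 2, 1): 2, (3, 2, 1, 1): 3, (3, 2, 1, 2): 12,
    (3, 2, 2, 1): 2, (3, 2, 2, 2): 8, (3, 3, 1, 1): 1, (3, 3, 1, 2): 19,
    (3, 3, 2, 1): 3, (3, 3, 2, 2): 4,
}
levels = [3, 3, 2, 2]
Q, K = len(levels), sum(levels)
offsets = np.concatenate([[0], np.cumsum(levels)[:-1]])

# One indicator row per pattern, weighted by its frequency.
Z = np.zeros((len(patterns), K))
for row, pattern in enumerate(patterns):
    for q, cat in enumerate(pattern):
        Z[row, offsets[q] + cat - 1] = 1.0
w = np.array(list(patterns.values()), dtype=float)
n = w.sum()

burt = Z.T @ (Z * w[:, None])          # Burt table
nk = np.diag(burt)
fk = nk / n

# Linkage indexes of categories and of variables, and their weighted mean.
E = (burt / n - np.outer(fk, fk)) ** 2 / np.outer(fk, fk)
var_of = np.repeat(np.arange(Q), levels)
other = var_of[:, None] != var_of[None, :]
li_k = (E * other).sum(axis=1) / ((Q - 1) * (1 - fk))
li_q = np.array([((1 - fk) * li_k)[var_of == q].sum() / (levels[q] - 1)
                 for q in range(Q)])
mean_li = np.sum((np.array(levels) - 1) * li_q) / (K - Q)

# MCA eigenvalues: the K - Q largest eigenvalues of S.
S = (burt - np.outer(nk, nk) / n) / np.sqrt(np.outer(nk, nk)) / Q
lam = np.sort(np.linalg.eigvalsh(S))[::-1][: K - Q]
var_eig = np.mean((lam - 1 / Q) ** 2)

print(np.round(lam, 3))
print(np.round(li_k, 3), np.round(li_q, 3), round(mean_li, 3))
print(Q ** 2 / (Q - 1) * var_eig)      # should equal the mean linkage index
```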
4. CONCLUSION
In this paper, we emphasize that the higher the mean of the linkage indexes of (numerical or categorical) variables, the higher the variance of eigenvalues, that is, the larger the departure of clouds from sphericity.
In addition, further analysis shows that the more the linkage index of a variable exceeds the mean, the more this variable contributes to extreme axes (first and last axes); otherwise this variable contributes to central axes (whose variances are close to the mean). So, if the range of linkage indexes of variables is large, one can predict that the variables with the greatest linkage indexes will play a preponderant role in the interpretation of the first axes. Then, if we decide to reduce the number of active variables in the analysis, linkage indexes will be a useful tool: if a variable with a weak linkage index is discarded, the proportion of variance associated with the first axes will increase and the interpretation will remain the same.
REFERENCES
Benzécri, J.-P. (1977). Sur l’analyse des tableaux binaires associés à une correspondance multiple.
Les cahiers de l’analyse des données, 2(1): 55–71, from a mimeographed note of 1972.
Burt, C. (1950). The factorial analysis of qualitative data. British Journal of Statistical Psychology,
3: 166–185.
Durand, J.-L. (1998). Taux de dispersion des valeurs propres en ACP, AC et ACM. Mathématiques
Informatique et Sciences humaines, 144: 15–28.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic
press.
Karlis, D., Saporta, G. and Spinakis, A. (2003). A simple rule for the selection of principal
components. Communications in Statistics-Theory and Methods, 32(3): 643–666.
Le Roux, B. and Rouanet, H. (2004). Geometric Data Analysis. From Correspondence Analysis to
Structured Data Analysis. Dordrecht: Kluwer.
Morrison, D. (1976). Multivariate Statistical Methods. New York: McGraw-Hill Publ. Co.
Pavlicev, M., Cheverud, J. and Wagner, G. (2009). Measuring morphological integration using eigenvalue variance. Evolutionary Biology, 36(1): 157–170.
Saporta, G. (2003). A control chart approach to select eigenvalues in principal component and
correspondence analysis. 54th Session of the International Statistical Institute-Berlin.
Spearman, C. (1904). ‘General intelligence’, objectively determined and measured. American
Journal of Psychology, 15: 201–292.