On estimation of the noise variance in high-dimensional
probabilistic principal component analysis
Damien Passemier, Zhaoyuan Li and Jianfeng Yao
Department of Statistics and Actuarial Science
The University of Hong Kong
Abstract
In this paper, we develop new statistical theory for probabilistic principal component analysis models in high dimensions. The focus is the estimation of the noise variance, which is an important and unresolved issue when the number of variables is large in comparison with the sample size. We first unveil the reasons for an observed downward bias of the maximum likelihood estimator of the noise variance when the data dimension is high. We then propose a bias-corrected estimator using random matrix theory and establish its asymptotic normality. The superiority of the new and bias-corrected estimator over existing alternatives is checked by Monte-Carlo experiments with various combinations of (p, n) (dimension and sample size). Next, we construct a new criterion based on the bias-corrected estimator to determine the number of principal components, and a consistent estimator is obtained. Its good performance is confirmed by a simulation study and a real data analysis. The bias-corrected estimator is also used to derive new asymptotics for the related goodness-of-fit statistic under the high-dimensional scheme.

Keywords. Probabilistic principal component analysis, high-dimensional data, noise variance estimator, number of principal components, random matrix theory, goodness-of-fit.
1 Introduction
Principal component analysis (PCA) is a very popular technique in multivariate
analysis for dimensionality reduction and feature extraction. Due to dramatic developments in data-collection technology, high-dimensional data are nowadays common in many fields. Natural high-dimensional data, such as images, signals, documents and biological data, often reside in a low-dimensional subspace or low-dimensional manifold (Ding et al., 2011). In financial econometrics, it is commonly believed that the variations in a large number of economic variables can be modeled by a small number of reference variables (Forni et al., 2000; Bai and Ng, 2002; Bai, 2003). Consequently, PCA is a recommended tool for the analysis of such high-dimensional data.
There is an underlying probabilistic model behind PCA, called probabilistic principal component analysis (PPCA), defined as follows. The observation vectors $\{x_i\}_{1\le i\le n}$ are $p$-dimensional and satisfy the equation
$$x_i = \Lambda f_i + e_i + \mu, \qquad i = 1, \dots, n. \tag{1}$$
Here, $f_i$ is an $m$-dimensional vector of principal components with $m \ll p$, $\Lambda$ is a $p \times m$ matrix of loadings, $\mu$ represents the general mean, and $\{e_i\}_{1\le i\le n}$ is a sequence of independent errors with covariance matrix $\Psi = \sigma^2 I_p$. The parameter $\sigma^2$ is the noise variance we are interested in. None of the quantities on the right-hand side of (1) is known or observed (except their sum $x_i$).
To ensure the identification of the model, constraints have to be introduced on the parameters. There are several possibilities for the choice of such constraints, see Table 1 in Bai and Li (2012). A traditional choice is the following (Anderson, 2003, Chapter 14):

• $E f_i = 0$ and $E f_i f_i' = I$;
• the matrix $\Gamma := \Lambda'\Lambda$ is diagonal with distinct diagonal elements.

Therefore, the population covariance matrix of $\{x_i\}_{1\le i\le n}$ is $\Sigma = \Lambda\Lambda' + \sigma^2 I$. Finding a reliable estimator of the noise variance $\sigma^2$ is a nontrivial issue for high-dimensional data, which we now pursue.
The PPCA model (1) can be viewed as a special instance of the approximate
factor model (Chamberlain and Rothschild, 1983) where the noise covariance Ψ
can be a general diagonal matrix (the model is also called a strict factor model in
statistical literature, see Anderson, 2003). For related recent papers on inference
of large approximate (or dynamic) factor models, we refer to Bai (2003), Forni
et al. (2000) and Doz et al. (2012).
Let $\bar x$ be the sample mean and define the sample covariance matrix
$$S_n = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)'. \tag{2}$$
Let $\lambda_{n,1} \ge \lambda_{n,2} \ge \cdots \ge \lambda_{n,p}$ be the eigenvalues of $S_n$. Under the normality assumption on both $\{f_i\}$ and $\{e_i\}$, the maximum likelihood estimator of $\sigma^2$ is
$$\widehat\sigma^2 = \frac{1}{p-m}\sum_{i=m+1}^p \lambda_{n,i}. \tag{3}$$
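As an illustration of these definitions, here is a minimal Python sketch (our own, with hypothetical parameters chosen only for this example) that simulates data from model (1) and computes the m.l.e. (3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (hypothetical, chosen only for this sketch).
p, n, m, sigma2 = 100, 200, 3, 4.0
Lam = np.zeros((p, m))
Lam[:m, :m] = np.diag(np.sqrt([25.0, 16.0, 9.0]))  # Lam'Lam = diag(25, 16, 9), distinct

# Model (1): x_i = Lam f_i + e_i + mu, with f_i ~ N(0, I_m), e_i ~ N(0, sigma2 I_p), mu = 0.
F = rng.standard_normal((n, m))
E = np.sqrt(sigma2) * rng.standard_normal((n, p))
X = F @ Lam.T + E

S = np.cov(X, rowvar=False)                  # sample covariance (2), 1/(n-1) normalization
lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # lambda_{n,1} >= ... >= lambda_{n,p}
sigma2_mle = lam[m:].mean()                  # m.l.e. (3): mean of the p-m smallest eigenvalues
print(sigma2_mle)
```

With p comparable to n, the printed value falls visibly below the true σ²; this is the downward bias studied in this paper.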
In the classic setting where the dimension $p$ is relatively small compared to the sample size $n$ (low-dimensional setting), the consistency of $\widehat\sigma^2$ is established in Anderson and Rubin (1956). Moreover, it is asymptotically normal with the standard $\sqrt{n}$-convergence rate: as $n \to \infty$,
$$\sqrt{n}\left(\widehat\sigma^2 - \sigma^2\right) \xrightarrow{D} \mathcal N(0, s^2), \qquad s^2 = \frac{2\sigma^4}{p-m}. \tag{4}$$
Actually, Anderson and Amemiya (1988) provides a general CLT in an approximate factor model that encompasses the present PPCA model. For the reader's convenience, we provide in the Supplement a detailed derivation of (4) from this general CLT.
The situation is, however, radically different when $p$ is large compared to the sample size $n$. Recent advances in high-dimensional statistics indicate that in such high-dimensional situations, the above asymptotic result is no longer valid; indeed, it has been reported in the literature that $\widehat\sigma^2$ seriously underestimates the true noise variance $\sigma^2$, see Kritchman and Nadler (2008). However, the exact form of this bias has not been determined (to the best of our knowledge). Basically, when $p$ becomes large, the sample principal eigenvalues and principal components are no longer consistent estimates of their population counterparts (Baik and Silverstein, 2006; Johnstone and Lu, 2009; Kritchman and Nadler, 2008). Many estimation methods developed in the low-dimensional setting have been shown to perform poorly even for moderately large $p$ and $n$ (Cragg and Donald, 1997).
As all meaningful inference procedures in the model will unavoidably use some estimator of the noise variance $\sigma^2$, such a severe bias needs to be corrected for high-dimensional data. Several estimators have been proposed to deal with the high-dimensional situation: Kritchman and Nadler (2008) proposes an estimator defined by solving a system of implicit equations; Ulfarsson and Solo (2008) introduces an estimator using the median of the sample eigenvalues $\{\lambda_{n,i}\}$; and Johnstone and Lu (2009) uses the median of the sample variances. However, these estimators have been assessed by Monte-Carlo experiments only and their theoretical properties have not been investigated.
The main aim of this paper is to provide a new estimator of the noise variance for which a rigorous asymptotic theory can be established in the high-dimensional setting. First, by using recent advances in random matrix theory, we find a CLT for the m.l.e. $\widehat\sigma^2$ in the high-dimensional setting, which identifies its bias exactly. Next, using this identification, we propose a new estimator $\widehat\sigma^2_*$ for the noise variance by correcting this bias. The asymptotic normality of the new estimator is then established with explicit asymptotic mean and variance.
Although the asymptotic Gaussian distribution of the new estimator $\widehat\sigma^2_*$ is established under the high-dimensional setting $p \to \infty$, $n \to \infty$ and $p/n \to c > 0$, if we set $c = 0$, i.e. the dimension $p$ is infinitely smaller than $n$, this Gaussian limit coincides with the classical low-dimensional limit given in (4). In this sense, the new asymptotic theory extends in a continuous manner the classical low-dimensional result to the high-dimensional situation. Finite-sample properties of the new estimator $\widehat\sigma^2_*$ have been checked via Monte-Carlo experiments in comparison with the above-mentioned four existing estimators. In terms of mean squared errors and in all the tested scenarios, $\widehat\sigma^2_*$ outperforms three of them very significantly, and is slightly preferable to the other one, see Table 2.
In order to demonstrate further potential benefits of the new estimator $\widehat\sigma^2_*$, we consider an important inference problem in PPCA, namely, the determination of the number of principal components (PCs). Bai and Ng (2002) developed six criteria with penalties on both $p$ and $n$ to identify the number of factors in the approximate factor model. The approximate factor model allows the components of the errors $\{e_i\}$ to be correlated. PPCA can be considered as a simplified instance of this model and indeed, Bai and Ng (2002) also applied their criteria to PPCA. It is worth noticing that the determination of the number $m$ of PCs and the estimation of the noise variance $\sigma^2$ are inter-related, and Bai and Ng's criteria provide a consistent and joint inference on $(m, \sigma^2)$ in the high-dimensional context. However, this consistency is obtained under the assumption that the variances (or the strengths) of the PCs grow to infinity with the dimension (see Assumption B of their paper), while in our context these variances could be bounded. Therefore, we propose a modified estimator of both $(m, \sigma^2)$ by implementing our new estimator $\widehat\sigma^2_*$ in the criteria of Bai and Ng (2002). Furthermore, in order to deal with possibly bounded variances of PCs, a new penalty function is found based on our new estimator $\widehat\sigma^2_*$. The resulting procedure provides a consistent joint estimator of $(m, \sigma^2)$. Moreover, as predicted by our theory, this new procedure has a better performance than the original Bai and Ng procedures in our context with possibly bounded PC variances.
As a final application of the new estimator $\widehat\sigma^2_*$, we consider the goodness-of-fit test for the PPCA model. The likelihood ratio test statistic and its classical (low-dimensional) chi-squared asymptotic theory have been well known since the work of Amemiya and Anderson (1990). These results are again challenged by high-dimensional data, and the classical chi-squared limit is no longer valid. We propose a correction to this goodness-of-fit test statistic involving our new estimator $\widehat\sigma^2_*$ to cope with the high-dimensional effects, and establish its asymptotic normality.
The remaining sections are organized as follows. In Section 2, we present the main results of the paper. The new estimator $\widehat\sigma^2_*$ of the noise variance is proposed first, then a new joint estimator of the number of PCs and the noise variance is constructed. In Section 3, we develop the corrected likelihood ratio test for the goodness-of-fit of a PPCA model in the high-dimensional framework using the new estimator $\widehat\sigma^2_*$. Section 4 concludes. The most important technical proofs are gathered in the Appendix, while the remaining ones are relegated to the supplementary report. This report also contains many numerical results and additional applications. Lastly, all the codes permitting the reproduction of the results in Tables 1-6 of the paper and some of the data sets used in the paper are available at http://web.hku.hk/~jeffyao/papersInfo.html
2 Main results
The PPCA model (1) is a spiked population model (Johnstone, 2001) since the eigenvalues of the population covariance matrix $\Sigma$ are
$$\mathrm{spec}(\Sigma) = (\alpha_1, \dots, \alpha_m, \underbrace{0, \dots, 0}_{p-m}) + \sigma^2(\underbrace{1, \dots, 1}_{p}) = \sigma^2(\alpha^*_1, \dots, \alpha^*_m, \underbrace{1, \dots, 1}_{p-m}), \tag{5}$$
where $\{\alpha_i\}$ are the $m$ non-null eigenvalues of $\Lambda\Lambda'$ and the notation $\alpha^*_i = \alpha_i/\sigma^2 + 1$ is used.
To develop a meaningful asymptotic theory in the high-dimensional context, we assume that $p$ and $n$ are related so that when $n \to \infty$, $c_n = p/(n-1) \to c > 0$; that is, $p$ can be large compared to the sample size $n$ and, for the asymptotic theory, $p$ and $n$ tend to infinity proportionally. Let
$$\phi(\alpha) = \alpha + \frac{c\alpha}{\alpha - 1}, \qquad \alpha \ne 1.$$
Following Baik and Silverstein (2006), it is assumed that $\alpha^*_1 \ge \cdots \ge \alpha^*_m > 1 + \sqrt{c}$, i.e. all the eigenvalues $\alpha_i$ are greater than $\sigma^2\sqrt{c}$. It is then known that, for the spiked sample eigenvalues $\{\lambda_{n,i}\}_{1\le i\le m}$ of $S_n$, almost surely,
$$\lambda_{n,i} \longrightarrow \sigma^2\phi(\alpha^*_i) = \psi(\sigma^2, c, \alpha_i) = \alpha_i + \sigma^2 + \sigma^2 c\left(1 + \frac{\sigma^2}{\alpha_i}\right). \tag{6}$$
Moreover, the remaining sample eigenvalues $\{\lambda_{n,i}\}_{m < i \le p}$, called noise eigenvalues, will converge to a continuous distribution with support interval $[a(c), b(c)]$, where $a(c) = \sigma^2(1-\sqrt{c})^2$ and $b(c) = \sigma^2(1+\sqrt{c})^2$. In particular, for all $1 \le j \le L$ with a prefixed range $L$ and almost surely, $\lambda_{n,m+j} \to b(c)$. It is worth noticing that, if $c \to 0$, we recover the low-dimensional limits $\lambda_{n,i} \to \alpha_i + \sigma^2$ (population spike eigenvalues) and $\lambda_{n,i} \to \sigma^2$ (population noise eigenvalues) discussed earlier. In addition, a CLT for the spiked eigenvalues is established in Bai and Yao (2008): $\sqrt{n}\,(\lambda_{n,i} - \sigma^2\phi(\alpha^*_i))$ is asymptotically Gaussian.
As explained in the Introduction, when the dimension $p$ is large compared to the sample size $n$, the m.l.e. $\widehat\sigma^2$ has a negative bias. In order to identify this bias, we first establish a CLT for $\widehat\sigma^2$ under the high-dimensional scheme.

Theorem 1. Consider the PPCA model (1) with population covariance matrix $\Sigma = \Lambda\Lambda' + \sigma^2 I_p$, where both the principal components and the noise are Gaussian. Assume that $p \to \infty$, $n \to \infty$ and $c_n = p/(n-1) \to c > 0$, and that the non-null eigenvalues $\{\alpha_i\}$ of $\Lambda\Lambda'$ satisfy $\alpha_i > \sigma^2\sqrt{c}$ ($1 \le i \le m$). Then, we have
$$\frac{p-m}{\sigma^2\sqrt{2c}}\left(\widehat\sigma^2 - \sigma^2\right) + b(\sigma^2) \xrightarrow{D} \mathcal N(0, 1),$$
where $b(\sigma^2) = \sqrt{\dfrac{c}{2}}\left(m + \sigma^2\displaystyle\sum_{i=1}^m \frac{1}{\alpha_i}\right)$.
The proof is given in the Appendix. Therefore, for high-dimensional data, the m.l.e. $\widehat\sigma^2$ has an asymptotic bias $b(\sigma^2)$ (after normalization). This bias is a complex function of the noise variance and the $m$ non-null eigenvalues of the loading matrix $\Lambda\Lambda'$. The above CLT is still valid if $\tilde c_n = (p-m)/n$ is substituted for $c$. Now if indeed $p \ll n$, i.e. the dimension $p$ is infinitely smaller than the sample size $n$, then $\tilde c_n \simeq 0$ and $b(\sigma^2) \simeq 0$, and hence
$$\frac{p-m}{\sigma^2\sqrt{2c}}\left(\widehat\sigma^2 - \sigma^2\right) + b(\sigma^2) \simeq \sqrt{\frac{(p-m)\,n}{2\sigma^4}}\left(\widehat\sigma^2 - \sigma^2\right) \xrightarrow{D} \mathcal N(0, 1).$$
This is the CLT (4) for $\widehat\sigma^2$ known under the classical low-dimensional scheme. In a sense, Theorem 1 constitutes a natural and continuous extension of the classical CLT to the high-dimensional context.
Theorem 1 recommends correcting the negative bias of $\widehat\sigma^2$. As the bias depends on $\sigma^2$, which we want to estimate, a natural correction is to use the plug-in estimator
$$\widehat\sigma^2_* = \widehat\sigma^2 + \frac{b(\widehat\sigma^2)}{p-m}\,\widehat\sigma^2\sqrt{2c_n}. \tag{7}$$
This estimator will hereafter be referred to as the bias-corrected estimator. The following CLT is a direct consequence of Theorem 1.
Theorem 2. We assume the same conditions as in Theorem 1. Then, we have
$$\frac{p-m}{\sigma^2\sqrt{2c_n}}\left(\widehat\sigma^2_* - \sigma^2\right) \xrightarrow{D} \mathcal N(0, 1).$$
The proof is given in the Appendix. Compared to the m.l.e. $\widehat\sigma^2$ in Theorem 1, the bias-corrected estimator $\widehat\sigma^2_*$ no longer has a bias after normalization by $\frac{p-m}{\sigma^2\sqrt{2c_n}}$, and it should be a much better estimator than $\widehat\sigma^2$.
For the implementation of $\widehat\sigma^2_*$ in practice, we need the value of $b(\widehat\sigma^2)$, which depends on the unknown spike values $\{\alpha_i\}$. It is remarked that only a consistent estimate of $b(\widehat\sigma^2)$ is needed here, and this is achieved by substituting consistent estimates $\widehat\alpha_i$ for $\alpha_i$ in $b(\widehat\sigma^2)$. This is done as follows. Following Theorem 1, $\widehat\sigma^2 \to_P \sigma^2$. Then, using the function $\psi$ in (6) and solving in $\alpha_i$ the equation $\lambda_{n,i} = \psi(\widehat\sigma^2, p/n, \alpha_i)$, we find an estimator $\widehat\alpha_i$ of $\alpha_i$. Since $p/n \to c$, $\widehat\sigma^2 \to_P \sigma^2$ and $\psi$ is known to be invertible, we easily deduce that $\widehat\alpha_i \to_P \alpha_i$. This procedure will be used for the real data analysis in Section 2.4.
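For concreteness, the following Python sketch (function names are ours) implements this plug-in procedure; the quadratic solved in spike_estimates is just equation (6) rewritten in $\alpha_i$, and the constant $\sqrt{c_n/2}$ comes from the expression of $b(\cdot)$ in Theorem 1:

```python
import numpy as np

def spike_estimates(lam, m, sigma2, cn):
    # Solve lam_i = psi(sigma2, cn, alpha) from (6), i.e.
    # alpha^2 + (sigma2*(1 + cn) - lam_i)*alpha + cn*sigma2^2 = 0; keep the larger root.
    u = lam[:m] - sigma2 * (1.0 + cn)
    disc = np.maximum(u**2 - 4.0 * cn * sigma2**2, 0.0)
    return 0.5 * (u + np.sqrt(disc))

def sigma2_star(lam, m, n):
    # Bias-corrected estimator (7); lam = eigenvalues of S_n in decreasing order.
    p = lam.size
    cn = p / (n - 1.0)
    s2 = lam[m:].mean()                                     # m.l.e. (3)
    alpha = spike_estimates(lam, m, s2, cn)                 # consistent spike estimates
    b = np.sqrt(cn / 2.0) * (m + s2 * np.sum(1.0 / alpha))  # bias term b(sigma2_hat)
    return s2 + b * s2 * np.sqrt(2.0 * cn) / (p - m)
```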
2.1 Monte-Carlo experiments
We first check by simulation the effect of the bias correction obtained in $\widehat\sigma^2_*$ and its asymptotic normality. Independent Gaussian samples of size $n$ are considered in three different settings:

Model 1: $\mathrm{spec}(\Sigma) = (25, 16, 9, 0, \dots, 0) + \sigma^2(1, \dots, 1)$, $\sigma^2 = 4$, $c = 1$;
Model 2: $\mathrm{spec}(\Sigma) = (4, 3, 0, \dots, 0) + \sigma^2(1, \dots, 1)$, $\sigma^2 = 2$, $c = 0.2$;
Model 3: $\mathrm{spec}(\Sigma) = (12, 10, 8, 8, 0, \dots, 0) + \sigma^2(1, \dots, 1)$, $\sigma^2 = 3$, $c = 1.5$.
Table 1: Comparison between the empirical and the theoretical bias.

Mod.   p      n      Empirical bias   Theoretical bias   |Difference|
1      100    100    -0.1556          -0.1589            0.0024
       400    400    -0.0391          -0.0388            0.0003
       800    800    -0.0197          -0.0193            0.0003
2      20     100    -0.0625          -0.0704            0.0052
       80     400    -0.0166          -0.0162            0.0027
       200    1000   -0.0064          -0.0064            0.0011
3      150    100    -0.1609          -0.1634            0.0025
       600    400    -0.0401          -0.0400            0.0001
       1500   1000   -0.0161          -0.0159            0.0002
In Table 1, we compare the empirical bias of $\widehat\sigma^2$ (i.e. the empirical mean of $\widehat\sigma^2 - \sigma^2 = \frac{1}{p-m}\sum_{i=m+1}^p \lambda_{n,i} - \sigma^2$ over 1000 replications) with the theoretical one, $-\sigma^2\sqrt{2c}\,b(\sigma^2)/(p-m)$. In all three models, the empirical and theoretical biases are close to each other. As expected, their difference vanishes when $p$ and $n$ increase. The table also shows that this bias is quite significant even for large dimension and sample size such as $(p, n) = (1500, 1000)$. In addition, we have drawn the histograms from 1000 replications of $\frac{p-m}{\sigma^2\sqrt{2c_n}}(\widehat\sigma^2 - \sigma^2) + b(\sigma^2)$ for the three models above, with sample size $n = 100$ and dimensions $p = c \times n$, and they match very well the density of the standard Gaussian distribution (see the supplementary report).
Next, we compare our bias-corrected estimator $\widehat\sigma^2_*$ to the m.l.e. $\widehat\sigma^2$ and three other existing estimators in the literature. For the reader's convenience, we recall their definitions:
1. The estimator $\widehat\sigma^2_{KN}$ of Kritchman and Nadler (2008): it is defined as the solution of the following non-linear system of $m+1$ equations involving the $m+1$ unknowns $\widehat\rho_1, \dots, \widehat\rho_m$ and $\widehat\sigma^2_{KN}$:
$$\widehat\sigma^2_{KN} - \frac{1}{p-m}\Bigg[\sum_{j=m+1}^p \lambda_{n,j} + \sum_{j=1}^m (\lambda_{n,j} - \widehat\rho_j)\Bigg] = 0, \quad\text{and}$$
$$\widehat\rho_j^2 - \widehat\rho_j\left(\lambda_{n,j} + \widehat\sigma^2_{KN} - \widehat\sigma^2_{KN}\,\frac{p-m}{n}\right) + \lambda_{n,j}\,\widehat\sigma^2_{KN} = 0, \qquad j = 1, \dots, m.$$
We use the code available on the authors' web-page to carry out the simulation. Notice that $\widehat\sigma^2_{KN}$ is only implicitly defined and no precise asymptotic analysis has been provided in Kritchman and Nadler (2008) for $\widehat\sigma^2_{KN}$. We mention one common feature shared by $\widehat\sigma^2_{KN}$ and $\widehat\sigma^2_*$: both methods use the same relationship between the population spike eigenvalues and their asymptotic limits, which is equation (6) in our paper and equation (16) in Kritchman and Nadler (2008) (this leads to the second equation in the system above satisfied by $\widehat\sigma^2_{KN}$).
2. The estimator $\widehat\sigma^2_{US}$ of Ulfarsson and Solo (2008): it is defined as the ratio
$$\widehat\sigma^2_{US} = \frac{\mathrm{median}(\lambda_{n,m+1}, \dots, \lambda_{n,p})}{m_{p/n,1}},$$
where $m_{\alpha,1}$ is the median of the Marčenko-Pastur distribution $F_{\alpha,1}$.
3. The estimator $\widehat\sigma^2_{\mathrm{median}}$ of Johnstone and Lu (2009): it is defined as the median of the $p$ sample variances (the data $\{x_{ij}\}$ are assumed centered)
$$\widehat\sigma^2_{\mathrm{median}} = \mathrm{median}\left(\frac{1}{n}\sum_{i=1}^n x_{ij}^2,\ 1 \le j \le p\right).$$
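As a reference point for these definitions, here is a small Python sketch of the last two estimators (our own implementation; it assumes $p/n < 1$ so that the Marčenko-Pastur median is computed on the bulk support, and centered data for the median estimator):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def mp_median(c):
    # Median of the Marcenko-Pastur distribution F_{c,1}, assuming 0 < c < 1.
    a, b = (1.0 - np.sqrt(c))**2, (1.0 + np.sqrt(c))**2
    dens = lambda x: np.sqrt((b - x) * (x - a)) / (2.0 * np.pi * c * x)
    return brentq(lambda t: quad(dens, a, t)[0] - 0.5, a + 1e-9, b - 1e-9)

def sigma2_US(lam, m, p, n):
    # Ulfarsson-Solo: median noise eigenvalue over the MP median m_{p/n,1}.
    return np.median(lam[m:]) / mp_median(p / n)

def sigma2_median(X):
    # Johnstone-Lu: median of the p sample variances (columns of X centered).
    return np.median(np.mean(X**2, axis=0))
```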
Table 2 presents the ratios of the empirical MSEs of these estimators over the empirical MSE of the bias-corrected estimator $\widehat\sigma^2_*$. The performances of $\widehat\sigma^2_*$ and $\widehat\sigma^2_{KN}$ are similar, but $\widehat\sigma^2_*$ is slightly better. The estimator $\widehat\sigma^2_{\mathrm{median}}$ is better than $\widehat\sigma^2_{US}$ and the m.l.e. $\widehat\sigma^2$.
Table 2: Comparison between four existing estimators and the proposed $\widehat\sigma^2_*$ in terms of ratios of MSEs: $\mathrm{MSE}(\widehat\sigma^2)/\mathrm{MSE}(\widehat\sigma^2_*)$, $\mathrm{MSE}(\widehat\sigma^2_{KN})/\mathrm{MSE}(\widehat\sigma^2_*)$, $\mathrm{MSE}(\widehat\sigma^2_{US})/\mathrm{MSE}(\widehat\sigma^2_*)$ and $\mathrm{MSE}(\widehat\sigma^2_{\mathrm{median}})/\mathrm{MSE}(\widehat\sigma^2_*)$.

Mod.   p      n      σ²   m.l.e.    KN       US        median
1      100    100    4    7.8232    1.0130   14.6394   1.5085
       400    400         8.5905    0.9980   25.5941   1.6429
       800    800         8.1162    1.0019   39.9444   1.6639
2      20     100    2    1.7045    1.0220   2.4980    1.5926
       80     400         2.0406    1.0045   3.8686    1.5433
       200    1000        1.9729    1.0011   3.8731    1.5427
3      150    100    3    19.2114   1.2292   41.7319   1.4274
       600    400         20.8471   0.9958   48.3130   1.6096
       1500   1000        21.6207   1.0001   51.9302   1.8071
But $\widehat\sigma^2_{\mathrm{median}}$ and $\widehat\sigma^2_{US}$ perform poorly compared to $\widehat\sigma^2_*$ and $\widehat\sigma^2_{KN}$. The reader is, however, reminded that the theoretical properties of $\widehat\sigma^2_{KN}$, $\widehat\sigma^2_{US}$ and $\widehat\sigma^2_{\mathrm{median}}$ are unknown, and so far they have been checked via simulations only. A careful look at the defining formulas of both $\widehat\sigma^2_{US}$ and $\widehat\sigma^2_{\mathrm{median}}$ reveals that these estimators are close to $\widehat\sigma^2$, all of them being close to an average or a median of sample eigenvalues. Therefore they might have a similar performance to $\widehat\sigma^2$.
2.2 Extension to non-Gaussian data
In this section, we provide some extension of the main results to cover non-
Gaussian data. Following a common approach in high-dimensional statistics (Bai
and Saranadasa, 1996), we assume xican be generated as
xi=Ayi+µ,(8)
12
where A=Σ1/2and yi={yij}1jphas pi.i.d and standardized components. We
set γ=E|y11|41 (γ= 2 under normal assumption).
Theorem 3. Consider the PPCA model (1) where the observation vectors $\{x_i\}_{1\le i\le n}$ are generated as in (8). Assume that $p \to \infty$, $n \to \infty$ and $c_n = p/(n-1) \to c > 0$. Then, we have
$$\frac{p-m}{\sigma^2\sqrt{\gamma c}}\left(\widehat\sigma^2 - \sigma^2\right) + b(\sigma^2) \xrightarrow{D} \mathcal N(0, 1),$$
where $b(\sigma^2)$ is defined as in Theorem 1 with the constant 2 replaced by $\gamma$, i.e. $b(\sigma^2) = \sqrt{c/\gamma}\left(m + \sigma^2\sum_{i=1}^m \alpha_i^{-1}\right)$.
The proof is given in the Appendix. Similarly to the plug-in estimator $\widehat\sigma^2_*$ in (7) for Gaussian data, we define a new plug-in estimator
$$\widehat\sigma^2_0 = \widehat\sigma^2 + \frac{b(\widehat\sigma^2)}{p-m}\,\widehat\sigma^2\sqrt{\gamma c_n}. \tag{9}$$
Notice that when $\gamma = 2$ (Gaussian case), $\widehat\sigma^2_0$ coincides with $\widehat\sigma^2_*$.
Theorem 4. We assume the same conditions as in Theorem 3. Then, we have
$$\frac{p-m}{\sigma^2\sqrt{\gamma c_n}}\left(\widehat\sigma^2_0 - \sigma^2\right) \xrightarrow{D} \mathcal N(0, 1).$$
The proof of this theorem is omitted because it is the same as for Theorem 2. For the implementation of $\widehat\sigma^2_0$, there is, however, a new problem to solve with non-Gaussian data, namely, the parameter $\gamma$ in (9) needs to be estimated first. Again we use random matrix theory to resolve this issue.
Proposition 1. Under the same conditions as in Theorem 3, as $p, n \to \infty$,
$$\sum_{i=1}^p \lambda_{n,i}^2 - p\left(\beta_2 + c_n\beta_1^2\right) \xrightarrow{D} \mathcal N\left(c\sigma^4(\gamma - 1),\, v\right),$$
where $v > 0$ denotes a (computable) asymptotic variance, and
$$\beta_1 = \sigma^2 + \frac{1}{p}\sum_{j=1}^m \alpha_j, \qquad \beta_2 = \sigma^4 + \frac{1}{p}\sum_{j=1}^m \alpha_j^2 + \frac{2}{p}\,\sigma^2\sum_{j=1}^m \alpha_j.$$
This result does not directly provide a consistent estimator of $\gamma$. Therefore, a bootstrap method is applied to the sample eigenvalues $\{\lambda_{n,i}\}_{1\le i\le p}$ to get, say, $B$ bootstrapped samples $\{\lambda^*_{n,i}\}_{1\le i\le p}$. We find the bootstrap mean $\bar w$ of
$$w = \Big\{\sum_{i=1}^p \lambda_{n,i}^2 - p\left(\beta_2 + c_n\beta_1^2\right)\Big\}\Big/\left(c\sigma^4\right)$$
(here $\sigma^2$ is approximated by the m.l.e. $\widehat\sigma^2$), and finally, by letting $\bar w = \widehat\gamma - 1$, we find a bootstrap estimator $\widehat\gamma$ of the unknown $\gamma$. Plugging $\widehat\gamma$ into (9), the final bias-corrected estimator we propose is
$$\widehat\sigma^2_{**} = \widehat\sigma^2 + \frac{b(\widehat\sigma^2)}{p-m}\,\widehat\sigma^2\sqrt{\widehat\gamma\, c_n}. \tag{10}$$
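The description above leaves some implementation freedom; the following Python sketch encodes one plausible reading of the scheme (eigenvalues resampled with replacement, $c$ approximated by $c_n$, and spike_estimates reused from the sketch following (7); consistently with Theorem 3, $b(\cdot)$ is used with the constant 2 replaced by $\widehat\gamma$):

```python
import numpy as np

def gamma_boot(lam, m, n, B=500, seed=0):
    # Bootstrap estimator of gamma = E|y_11|^4 - 1 based on Proposition 1.
    rng = np.random.default_rng(seed)
    p = lam.size
    cn = p / (n - 1.0)
    s2 = lam[m:].mean()                          # m.l.e. of sigma^2
    alpha = spike_estimates(lam, m, s2, cn)      # plug-in spike estimates
    beta1 = s2 + alpha.sum() / p
    beta2 = s2**2 + (alpha**2).sum() / p + 2.0 * s2 * alpha.sum() / p
    def w(ev):                                   # normalized statistic of Proposition 1
        return (np.sum(ev**2) - p * (beta2 + cn * beta1**2)) / (cn * s2**2)
    w_bar = np.mean([w(rng.choice(lam, size=p, replace=True)) for _ in range(B)])
    return w_bar + 1.0                           # w_bar estimates gamma - 1

def sigma2_star_star(lam, m, n, B=500):
    # Final bias-corrected estimator (10) for non-Gaussian data.
    p = lam.size
    cn = p / (n - 1.0)
    s2 = lam[m:].mean()
    alpha = spike_estimates(lam, m, s2, cn)
    g = gamma_boot(lam, m, n, B)
    b = np.sqrt(cn / g) * (m + s2 * np.sum(1.0 / alpha))
    return s2 + b * s2 * np.sqrt(g * cn) / (p - m)
```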
We conclude the section with some simulation experiments to check the performance of $\widehat\sigma^2_{**}$ for non-Gaussian data. We start with the same settings of covariance matrix as in Table 1 and use independent gamma and continuous uniform distributed variables for the components of the $x_i$'s. The gamma distributed data are drawn from Gamma($k$, $\theta$) with fixed shape parameter $k = 2$ and accordingly determined scale parameter $\theta$; the uniform distributed data are drawn from $U(a, b)$ with fixed $a = 0$ and accordingly determined $b$. The bootstrap sampling is repeated $B = 500$ times. The simulation results are summarized in Table 3. The estimator $\widehat\sigma^2_{**}$ has a very good performance for both tested non-Gaussian distributions. The MAD and MSE decrease when $p$ and $n$ increase in all model settings.
2.3 Determination of the number of PCs
So far we have assumed that the number mof PCs is known. It is however desirable
to estimate mdirectly from the data. In the literature, consistent estimators of
mhave been proposed in the high-dimensional context, e.g. in Kritchman and
Nadler (2008), Ulfarsson and Solo (2008), Onatski (2009) and Passemier and Yao
(2012). As a benchmark work, Bai and Ng (2002) proposes six criteria to determine
the number of PCs (or factors) under the framework of large cross-sections (N)
14
Table 3: Empirical mean, MAD and MSE of $\widehat\sigma^2_{**}$ for gamma and uniform samples.

                           Gamma                         Continuous uniform
Mod.  p     n      σ²   mean     MAD      MSE       mean     MAD      MSE
1     100   100    4    4.0807   0.1697   0.0411    4.0150   0.1052   0.0161
      400   400         4.0561   0.0571   0.0040    4.0424   0.0429   0.0022
      800   800         4.0304   0.0306   0.0011    4.0236   0.0237   0.0006
2     20    100    2    1.9570   0.1150   0.0194    1.9175   0.0849   0.0089
      80    400         1.9978   0.0241   0.0009    1.9858   0.0166   0.0004
      200   1000        2.0016   0.0090   0.0001    1.9968   0.0054   <10^-4
3     150   100    3    3.3637   0.3637   0.1415    3.3336   0.3336   0.1143
      600   400         3.1060   0.1060   0.0116    3.0957   0.0957   0.0093
      1500  1000        3.0434   0.0434   0.0019    3.0391   0.0391   0.0015
These criteria are popular and widely used in the factor modeling literature: for example, they are one of the starting blocks of the newly proposed POET estimator in Fan et al. (2013). Notice that the dimension-sample-size pair is denoted here as $(N, T)$ instead of $(p, n)$. Three of these criteria applicable to PPCA models are
$$PC_j(m) = V(m, \widehat F^m) + m\,\widehat\sigma^2_{BN}\, g_j(N, T), \qquad j \in \{1, 2, 3\}, \tag{11}$$
where $\widehat\sigma^2_{BN}$ is a consistent estimate of $(NT)^{-1}\sum_{i=1}^N\sum_{j=1}^T E(e_{ij})^2$, $V(m, \widehat F^m) = (NT)^{-1}\sum_{i=1}^N \widehat e_i'\,\widehat e_i$, and the $g_j(N, T)$ denote the penalty functions
$$g_1(N, T) = \frac{N+T}{NT}\ln\left(\frac{NT}{N+T}\right), \qquad g_2(N, T) = \frac{N+T}{NT}\ln \widetilde N, \qquad g_3(N, T) = \frac{\ln \widetilde N}{\widetilde N},$$
with $\widetilde N = \min\{N, T\}$. The corresponding estimators of the number of PCs are $\widehat m_j = \arg\min_{0\le m\le m_0} PC_j(m)$, $j \in \{1, 2, 3\}$, where $m_0$ is a predetermined maximum value of $m$. In applications, $\widehat\sigma^2_{BN}$ is replaced by $V(m_0, \widehat F^{m_0})$. The calculations of $\widehat\sigma^2_{BN}$ and $V(m, \widehat F^m)$ have no explicit formula and are based on the estimation of the residuals $\{\widehat e_i\}$. It is worth mentioning that $V(m, \widehat F^m)$ and $\widehat\sigma^2_{BN}$ are indeed estimates of the noise variance if the underlying model is the PPCA model.
To start with, and in order to assess the quality of our bias-corrected estimator $\widehat\sigma^2_*$ in the current context, we substitute $\widehat\sigma^2_*$ for the empirical $V(m, \widehat F^m)$ and $\widehat\sigma^2_{BN}$ in the criteria $PC_j$. Notice that the noise variance estimator $\widehat\sigma^2_*$ depends on the supposed number $m$ of PCs, and we write $\widehat\sigma^2_*(m) = \widehat\sigma^2_*$ to denote this dependency explicitly. The modified criteria and estimators using $\widehat\sigma^2_*(m)$ are thus
$$PC^*_j(m) = \widehat\sigma^2_*(m) + m\,\widehat\sigma^2_*(m_0)\,g_j(N, T),$$
and
$$\widehat m^*_j = \arg\min_{0\le m\le m_0} PC^*_j(m), \qquad j \in \{1, 2, 3\}, \tag{12}$$
respectively. These modified criteria $PC^*_j$ will be compared below to their original counterparts by simulation.
Under appropriate conditions, Bai and Ng (2002) established the consistency of the criteria $PC_j$ when both $N$ and $T$ grow to infinity. A careful examination of their method reveals that the modified criteria $PC^*_j$ are also consistent under the same conditions (this thus means that potential differences between the two families of criteria are of higher asymptotic order). There is, however, a main issue here: such consistency results require that the variances of the PCs (or their strengths) grow to infinity with the dimension $N$, see Assumption B of their paper. This pervasiveness assumption is not satisfied in our context, where these variances can be weaker and remain bounded. Consequently, the proof of Bai and Ng (2002) does not apply here and we are forced to seek a new asymptotic result. We thus introduce a new penalty function
$$g(N, T) = \frac{(c + 2\sqrt{c})\left(1 + \sqrt{T/N^{1+\delta}}\right)}{N}, \tag{13}$$
and define a new criterion
$$PC(m) = \widehat\sigma^2_*(m) + m\,\widehat\sigma^2_*(m_0)\,g(N, T). \tag{14}$$
Here $\delta > 0$ is a small pre-fixed constant. As another main contribution of the paper, we establish the consistency of the corresponding estimator
$$\widehat m = \arg\min_{0\le m\le m_0} PC(m). \tag{15}$$
Theorem 5. We assume the same conditions as in Theorem 1: in particular, $N, T \to \infty$ and $c_n = N/(T-1) \to c > 0$. With the condition $\alpha_i > \sigma^2\sqrt{c}$, we have $\lim_{N,T\to\infty} \mathrm{Prob}(\widehat m = \bar m) = 1$, where $\bar m$ is the true number of PCs.
The proof is given in the Appendix. Simulation experiments are conducted to show the performance of the new estimator $\widehat m$ in comparison with both the modified estimators $\widehat m^*_j$ and the original $\widehat m_j$. As in Bai and Ng (2002), the data are generated from the model
$$X_{it} = \sum_{j=1}^m \lambda_{ij} F_{tj} + \sqrt{\theta}\,e_{it},$$
where the PCs, the loadings and the errors $(e_{it})$ are $N(0, 1)$ variates, the common component of $X_{it}$ has variance $m$ and the idiosyncratic component has variance $\theta$. The noise variance is $\sigma^2 = \theta$ and $\Lambda = (\lambda_{ij})$. Typically, a PC corresponding to $\alpha_j$ is detectable when $\alpha_j \ge \sqrt{N/T}\,\theta$, see (6). We conduct extensive simulations by reproducing the configurations of $N$ and $T$ used in Bai and Ng (2002). In all the experiments, the same value $\delta = 0.05$ is used in (13).
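In code, the estimator (15) is a one-dimensional search over $m$; a minimal Python sketch (reusing sigma2_star from the sketch following (7), with the penalty (13) as reconstructed above) reads:

```python
import numpy as np

def estimate_num_pcs(lam, N, T, m0=8, delta=0.05):
    # Criterion (14)-(15): minimize PC(m) = s2*(m) + m * s2*(m0) * g(N, T),
    # where lam holds the decreasing eigenvalues of the N x N sample covariance
    # computed from T observations.
    c = N / (T - 1.0)
    g = (c + 2.0 * np.sqrt(c)) * (1.0 + np.sqrt(T / N**(1.0 + delta))) / N
    s2_m0 = sigma2_star(lam, m0, T)          # noise variance estimated under m0 PCs
    crit = [sigma2_star(lam, m, T) + m * s2_m0 * g for m in range(m0 + 1)]
    return int(np.argmin(crit))
```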
Tables 4 and 5 report the empirical means of the estimators of the number of PCs over 1000 replications, for $m = 1$ and $m = 5$ respectively, with standard errors in parentheses. When a standard error is actually zero, no standard error is indicated. For all cases, the predetermined maximum number $m_0$ of PCs is set to 8.
Table 4: Comparison between $PC$, the $PC^*_j$'s and the $PC_j$'s for $m = 1$, $\theta = 1$.

N      T     PC            PC*_1         PC*_2         PC*_3         PC_1          PC_2          PC_3
100    40    1.00          1.00          1.00          1.00          1.17(0.37)    1.01(0.10)    3.78(0.75)
100    60    1.00          1.00          1.00          1.00          1.00          1.00          3.63(0.76)
200    60    1.00          1.00          1.00          1.00          1.00          1.00          1.00
500    60    1.00          1.00          1.00          1.00          1.00          1.00          1.00
1000   60    1.00          1.00          1.00          1.00          1.00          1.00          1.00
2000   60    1.00          1.00          1.00          1.00          1.00          1.00          1.00
100    100   1.00          1.00          1.00          1.00          1.00          1.00          5.36(0.80)
40     100   1.00          1.00          1.00          1.00          1.79(0.72)    1.19(0.40)    4.91(0.90)
60     100   1.00          1.00          1.00          1.00          1.01(0.08)    1.00          4.30(0.85)
60     200   1.00          1.00          1.00          1.00          1.00          1.00          1.02(0.16)
10     50    7.19(1.92)    8.00          8.00          8.00          8.00          8.00          8.00
10     100   4.95(3.13)    8.00          8.00          8.00          8.00          8.00          8.00
20     100   1.00(0.03)    1.01(0.15)    1.01(0.12)    1.08(0.53)    6.96(0.88)    6.35(0.98)    7.84(0.40)
100    10    2.10(2.52)    1.08(0.73)    1.03(0.50)    1.15(1.01)    8.00          8.00          8.00
100    20    1.00          1.00(0.03)    1.00(0.03)    1.00(0.03)    5.88(0.76)    5.12(0.77)    7.35(0.63)
When the true number of PCs is 1 (Table 4), the new criterion $PC$ and the modified criteria $PC^*_j$ can correctly detect the number almost surely, and the corresponding standard errors are all zero. In comparison, there are 11 cases where the original criteria $PC_j$ lose efficiency in finding the true number of PCs, with a non-zero standard error. In the small-dimension situations (last five rows), all the modified $PC^*_j$ and the original $PC_j$ fail when the value of $N$ is 10: they all report the maximum value $m_0$. But the new criterion $PC$ outperforms the others in all cases in terms of mean and standard error. Meanwhile, the modified criteria $PC^*_j$ globally perform better than the original $PC_j$'s, and this establishes the superiority of the bias-corrected estimator $\widehat\sigma^2_*$. In Table 5, the common component has variance 5 and the idiosyncratic component has a smaller variance 3, and the situation is a bit more difficult. We can, however, draw the same conclusion: the new criterion $PC$ outperforms the other criteria in all tested cases, and the modified criteria $PC^*_j$ have an overall better performance than the original $PC_j$'s. In both Tables 4 and 5, only a part of the tested combinations of $N$ and $T$ is reported; the other combinations, where all criteria detect the right number $m$ with zero error, are omitted.
Table 5: Comparison between $PC$, the $PC^*_j$'s and the $PC_j$'s for $m = 5$, $\theta = 3$.

N      T     PC            PC*_1         PC*_2         PC*_3         PC_1          PC_2          PC_3
100    40    5.00          4.91(0.30)    4.81(0.41)    4.99(0.11)    5.00(0.03)    5.00          5.59(0.57)
100    60    5.00          5.00(0.04)    4.99(0.11)    5.00          5.00          5.00          5.58(0.57)
200    60    5.00          5.00          5.00          5.00          5.00          5.00          5.00
500    60    5.00          5.00          5.00          5.00          5.00          5.00          5.00
1000   60    5.00          5.00          5.00          5.00          5.00          5.00          5.00
2000   60    5.00          5.00          5.00          5.00          5.00          5.00          5.00
100    100   5.00          5.00          5.00          5.00          5.00          5.00          6.84(0.65)
40     100   4.98(0.14)    4.97(0.17)    4.92(0.27)    5.00(0.04)    5.02(0.12)    5.00          6.22(0.66)
60     100   5.00          5.00(0.04)    4.99(0.08)    5.00          5.00          5.00          6.03(0.64)
60     200   5.00          5.00          5.00          5.00          5.00          5.00          6.03(0.03)
10     50    7.47(1.00)    8.00          8.00          8.00          8.00          8.00          8.00
10     100   5.77(1.53)    8.00          8.00          8.00          8.00          8.00          8.00
20     100   3.74(0.84)    4.74(0.51)    4.62(0.57)    4.92(0.45)    7.11(0.63)    6.65(0.64)    7.85(0.37)
100    10    6.66(1.97)    4.59(1.99)    4.35(1.91)    4.88(2.09)    8.00          8.00          8.00
100    20    4.68(0.51)    3.86(0.79)    3.69(0.81)    4.13(0.73)    6.74(0.63)    6.19(0.62)    7.77(0.43)
Additional simulation results and tables are given in the supplementary report. In conclusion, the proposed criterion has the best performance in determining the number of PCs, and the modified criteria perform better by using the bias-corrected estimator proposed in this paper for the PPCA model.
2.4 Real data
Though the new and modified estimators seem to perform better than the original ones in the simulation experiments, we now compare them on two real data sets. The first data set contains stock returns. Following Bai and Ng (2002), we extract data from the CRSP US Stock Database using the monthly returns of all common stocks listed on the NYSE, Amex and NASDAQ over twenty years (January 1991 to December 2010). Stocks that do not trade for a cumulative two years during the period are deleted. The final data set includes 1913 stocks with 240 monthly returns for each of them ($T = 240$, $N = 1913$).
Table 6: Comparison between the modified and the original criteria.

                      PC   PC*_1   PC*_2   PC*_3   PC_1   PC_2   PC_3
Data 1 (m_0 = 15)     2    2       2       2       4      4      6
Data 2 (m_0 = 20)     13   18      18      19      20     20     20
Notice that the data set does not match exactly the one used in Bai and Ng (2002), as they selected 4883 firms over a shorter period (January 1994 to December 1998); however, this selected data set is not publicly available. The second one is an fMRI data set, freely available on the web-site http://afni.nimh.nih.gov/afni/. A human brain was scanned while the person performed finger-thumb opposition. There are $T = 124$ observations on 21 brain slices. We pick out one brain slice and only keep the variables (pixels) that significantly correspond to brain tissue, so that $N = 1126$ variables are selected. We transform both data sets so that each series has mean zero. The rank estimates of the new and modified criteria on these two data sets are shown in Table 6. The original criteria $PC_j$ display a significant variation for the first data set and fail for the second one by only reporting the maximum value $m_0 = 20$. On the contrary, the new criterion $PC$ and the modified criteria $PC^*_j$ with the proposed variance estimator $\widehat\sigma^2_*$ seem mutually consistent, giving very close if not identical rank estimates. In particular, the original criteria $PC_j$ have a significant over-estimation effect, and this is much reduced and stabilized either with a more accurate estimation of the noise variance by $\widehat\sigma^2_*$ or with the new penalty function $g(N, T)$ in (13).
3 Application to the goodness-of-fit test of a PPCA model

As a third application of the bias-corrected estimator $\widehat\sigma^2_*$, we consider the following goodness-of-fit test for the PPCA model (1). The null hypothesis is
$$H_0:\ \Sigma = \Lambda\Lambda' + \sigma^2 I_p,$$
where the number of PCs $m$ is specified. Following Anderson and Rubin (1956), the likelihood ratio test (LRT) statistic is
$$T_n = -n\,L, \quad\text{with}\quad L = \sum_{j=m+1}^p \log\frac{\lambda_{n,j}}{\widehat\sigma^2},$$
and $\widehat\sigma^2$ is the m.l.e. (3) of the variance. Keeping $p$ fixed while letting $n \to \infty$, the classical low-dimensional theory states that $T_n$ converges to $\chi^2_q$, where $q = p(p+1)/2 + m(m-1)/2 - pm - 1$. However, this classical approximation is again useless in the large-dimensional situation. Indeed, this criterion leads to a high false-positive rate (see Table 3 in the supplementary report).

In a way similar to Section 2, we now construct a corrected version of $T_n$ using the calculations done in Bai et al. (2009) and Zheng (2012). As we consider the logarithm of the eigenvalues of the sample covariance matrix, we will assume in the sequel that $p < n$ and $c < 1$ to avoid null eigenvalues.
Theorem 6. Assume the same conditions as in Theorem 1 and, in addition, $c < 1$. Then, we have
$$v(c)^{-\frac{1}{2}}\left\{L - m(c) - p\,h(c_n) + \eta + (p-m)\log(\beta)\right\} \xrightarrow{D} \mathcal N(0, 1),$$
where
$$m(c) = \frac{\log(1-c)}{2}, \qquad h(c_n) = \frac{c_n - 1}{c_n}\log(1-c_n) - 1, \qquad \eta = \sum_{i=1}^m \log\left(1 + \frac{c\sigma^2}{\alpha_i}\right),$$
$$\beta = 1 - \frac{c}{p-m}\left(m + \sigma^2\sum_{i=1}^m \alpha_i^{-1}\right), \qquad v(c) = -2\log(1-c) + \frac{2c}{\beta^2} - \frac{4c}{\beta}.$$
The above statistic depends on the unknown variance $\sigma^2$ and the spike eigenvalues $\{\alpha_i\}$. First of all, as explained in Section 2, consistent estimates of the $\{\alpha_i\}$ are available. By using these estimates and substituting the bias-corrected estimate $\widehat\sigma^2_*$ for $\sigma^2$, we obtain consistent estimates $\widehat v(c_n)$, $\widehat\eta$ and $\widehat\beta$ of $v(c)$, $\eta$ and $\beta$, respectively. Therefore, to test $H_0$, it is natural to use the statistic
$$\Delta_n := \widehat v(c_n)^{-\frac{1}{2}}\left\{L - m(c_n) - p\,h(c_n) + \widehat\eta + (p-m)\log(\widehat\beta)\right\}.$$
Since $\Delta_n$ is asymptotically standard normal, the critical region $\{\Delta_n > q_\alpha\}$, where $q_\alpha$ is the $\alpha$th upper quantile of the standard normal distribution, will have an asymptotic size $\alpha$. This test is referred to as the corrected likelihood ratio test (CLRT).
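A compact Python sketch of the test (our own; it reuses spike_estimates and sigma2_star from Section 2 and the expressions of Theorem 6 as stated above, and requires $c_n < 1$):

```python
import numpy as np
from scipy.stats import norm

def clrt(lam, m, n, level=0.05):
    # Corrected likelihood ratio test of H0: Sigma = Lam Lam' + sigma^2 I_p.
    p = lam.size
    cn = p / (n - 1.0)                       # must satisfy cn < 1 here
    s2 = sigma2_star(lam, m, n)              # bias-corrected variance estimate
    alpha = spike_estimates(lam, m, s2, cn)  # plug-in spike estimates
    L = np.sum(np.log(lam[m:] / s2))
    m_c = 0.5 * np.log(1.0 - cn)
    h = (cn - 1.0) / cn * np.log(1.0 - cn) - 1.0
    eta = np.sum(np.log1p(cn * s2 / alpha))
    beta = 1.0 - cn * (m + s2 * np.sum(1.0 / alpha)) / (p - m)
    v = -2.0 * np.log(1.0 - cn) + 2.0 * cn / beta**2 - 4.0 * cn / beta
    delta_n = (L - m_c - p * h + eta + (p - m) * np.log(beta)) / np.sqrt(v)
    return delta_n, delta_n > norm.ppf(1.0 - level)   # reject H0 if True
```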
The supplementary report to the paper contains the proof of Theorem 6 to-
gether with numerical illustrations from simulation experiments.
4 Conclusions
In this paper, we propose a bias-corrected estimator of the noise variance for the PPCA model in the high-dimensional framework. The main appeal of our estimator is that it is developed under the assumption that $p/n \to c > 0$ as $p, n \to \infty$ and is thus appropriate for a wide range of large-dimensional data sets. Extensive Monte-Carlo experiments demonstrated the superiority of the proposed estimator over several existing estimators (for which, however, no theoretical justification has been proposed in the literature). In addition, by implementing the proposed estimator of the noise variance within the well-known determination algorithms for the number of principal components proposed by Bai and Ng (2002), we construct a new joint consistent estimator of the pair $(m, \sigma^2)$ with a new penalty function to cope with non-pervasive PCs. In an additional application of our methodology, we develop an asymptotic theory of the goodness-of-fit test for the high-dimensional PPCA model. The overall message of the paper is that in a high-dimensional PPCA model, when an estimator of the noise variance $\sigma^2$ is needed, the bias-corrected estimator $\widehat\sigma^2_*$ from the paper should be recommended.
To conclude, we would like to mention an important question that requires further investigation, namely the impact of the size of $m$ and of the PC eigenvalues on the methodology developed in this paper. In the numerical simulations, $m$ is typically small compared to $\min\{p, n\}$. Can one expect different behavior if (a) some of the PC eigenvalues are fairly big compared to $\sigma^2$, say of the order $O(p)$, as is often the case in some econometric problems (Fan et al., 2008), or (b) $m$ is relatively big, e.g. increasing with $p$, while many of the PC eigenvalues are small, possibly below the size where the phase transition of eigenvalues takes place? Unfortunately, these questions go much beyond the scope of the existing literature, including this paper: for instance, we are not aware of any work capable of integrating both large and small PC eigenvalues. Similarly, the treatment of a varying number $m$ of PC components will require the development of new mathematical techniques. Needless to say, these questions are of fundamental importance and worth much research effort in the future.
References
Y. Amemiya and T. W. Anderson. Asymptotic chi-square tests for a large class of factor analysis models. Ann. Statist., 18(3):1453-1463, 1990.

T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ, third edition, 2003.

T. W. Anderson and Y. Amemiya. The asymptotic normal distribution of estimators in factor analysis under general conditions. Ann. Statist., 16(2):759-771, 1988.

T. W. Anderson and H. Rubin. Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954-1955, vol. V, pages 111-150, Berkeley and Los Angeles, 1956. University of California Press.

J. Bai. Inference theory for factor models of large dimensions. Econometrica, 71(1):135-171, 2003.

J. Bai and K. Li. Statistical analysis of factor models of high dimension. Ann. Statist., 40(1):436-465, 2012.

J. Bai and S. Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191-221, 2002.

Z. Bai and H. Saranadasa. Effect of high dimension: by an example of a two sample problem. Statist. Sinica, 6(2):311-329, 1996.

Z. Bai and J. Yao. Central limit theorems for eigenvalues in a spiked population model. Ann. Inst. Henri Poincaré Probab. Stat., 44(3):447-474, 2008.

Z. Bai, D. Jiang, J. Yao, and S. Zheng. Corrections to LRT on large-dimensional covariance matrix by RMT. Ann. Statist., 37(6B):3822-3840, 2009.

Z. Bai, J. Chen, and J. Yao. On estimation of the population spectral distribution from a high-dimensional sample covariance matrix. Aust. N. Z. J. Stat., 52(4):423-437, 2010.

J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal., 97(6):1382-1408, 2006.

G. Chamberlain and M. Rothschild. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51(5):1281-1304, 1983.

J. G. Cragg and S. G. Donald. Inferring the rank of a matrix. Journal of Econometrics, 76(1):223-250, 1997.

X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419-3430, 2011.

C. Doz, D. Giannone, and L. Reichlin. A quasi-maximum likelihood approach for large, approximate dynamic factor models. Rev. Econ. Stat., 94(4):1014-1024, 2012.

J. Fan, Y. Fan, and J. Lv. High dimensional covariance matrix estimation using a factor model. J. Econometrics, 147:186-197, 2008.

J. Fan, Y. Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal complements. J. R. Statist. Soc. B, 75(4):603-680, 2013.

M. Forni, M. Hallin, M. Lippi, and L. Reichlin. Reference cycles: the NBER methodology revisited. Discussion Paper No. 2400, 2000.

I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295-327, 2001.

I. M. Johnstone and A. Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc., 104(486), 2009.

S. Kritchman and B. Nadler. Determining the number of components in a factor model from limited noisy data. Chemometr. Intell. Lab. Syst., 94, 2008.

A. Onatski. Testing hypotheses about the number of factors in large factor models. Econometrica, 77(5):1447-1479, 2009.

D. Passemier and J. Yao. On determining the number of spikes in a high-dimensional spiked population model. Random Matrices: Theory and Applications, 1(1):1150002, 2012.

M. O. Ulfarsson and V. Solo. Dimension estimation in noisy PCA with SURE and random matrix theory. IEEE Trans. Signal Process., 56(12):5804-5816, 2008.

Q. Wang and J. Yao. On the sphericity test with large-dimensional observations. Electron. J. Stat., 7:2164-2192, 2013.

S. Zheng. Central limit theorems for linear spectral statistics of large dimensional F-matrices. Ann. Inst. Henri Poincaré Probab. Stat., 48(2):444-476, 2012.
Appendix
Proof of Theorem 1
We have $(p-m)\,\widehat\sigma^2 = \sum_{i=1}^p \lambda_{n,i} - \sum_{i=1}^m \lambda_{n,i}$. By (6),
$$\sum_{i=1}^m \lambda_{n,i} \longrightarrow \sum_{i=1}^m\left(\alpha_i + \frac{c\sigma^4}{\alpha_i}\right) + \sigma^2 m(1 + c) \quad a.s. \tag{16}$$
For the first term, we have
$$\sum_{i=1}^p \lambda_i = p\int x\,dF_n(x) = p\int x\,d(F_n - F_{c_n,H_n})(x) + p\int x\,dF_{c_n,H_n}(x) = G_n(x) + p\int x\,dF_{c_n,H_n}(x).$$
By Proposition 1 in the supplementary report, the first term is asymptotically normal:
$$G_n(x) = \sum_{i=1}^p \lambda_{n,i} - p\int x\,dF_{c_n,H_n}(x) \xrightarrow{D} \mathcal N(m(x), v(x)),$$
with asymptotic mean
$$m(x) = 0 \tag{17}$$
and asymptotic variance
$$v(x) = 2c\sigma^4. \tag{18}$$
Furthermore, by Lemma 1 of Bai et al. (2010),
$$\int x\,dF_{c_n,H_n}(x) = \int t\,dH_n(t) = \sigma^2 + \frac{1}{p}\sum_{i=1}^m \alpha_i.$$
So we have
$$\sum_{i=1}^p \lambda_{n,i} - p\sigma^2 - \sum_{i=1}^m \alpha_i \xrightarrow{D} \mathcal N(0, 2c\sigma^4). \tag{19}$$
By (16) and (19) and using Slutsky's lemma, we obtain
$$(p-m)\left(\widehat\sigma^2 - \sigma^2\right) + c\sigma^2\left(m + \sigma^2\sum_{i=1}^m \frac{1}{\alpha_i}\right) \xrightarrow{D} \mathcal N(0, 2c\sigma^4).$$
Dividing both sides by $\sigma^2\sqrt{2c}$ and noting that $c\sigma^2\left(m + \sigma^2\sum_{i=1}^m \alpha_i^{-1}\right) = \sigma^2\sqrt{2c}\;b(\sigma^2)$ yields the theorem.
Proof of Theorem 2
We have
$$\frac{p-m}{\sigma^2\sqrt{2c_n}}\left(\widehat\sigma^2_* - \sigma^2\right) = \frac{p-m}{\sigma^2\sqrt{2c_n}}\left(\widehat\sigma^2 - \sigma^2 + \frac{b(\widehat\sigma^2)\,\widehat\sigma^2\sqrt{2c_n}}{p-m}\right) = \frac{p-m}{\sigma^2\sqrt{2c_n}}\left(\widehat\sigma^2 - \sigma^2\right) + b(\sigma^2) + \frac{1}{\sigma^2}\left\{b(\widehat\sigma^2)\,\widehat\sigma^2 - b(\sigma^2)\,\sigma^2\right\}.$$
Since $\widehat\sigma^2 \to_P \sigma^2$, by continuity, the last term tends to 0 in probability and the conclusion follows from Theorem 1.
Proof of Theorem 3
By Lemma 2.2 of Wang and Yao (2013), we have
$$G_n(x) = \sum_{i=1}^p \lambda_{n,i} - p\int x\,dF_{c_n,H_n}(x) \xrightarrow{D} \mathcal N(m(x), v(x)).$$
The asymptotic mean does not change for non-Gaussian data, $m(x) = 0$, but the asymptotic variance is $v(x) = c\gamma\sigma^4$. The rest of the proof is identical to that of Theorem 1.
Proof of Theorem 5
We prove that $\lim_{N,T\to\infty} P[PC(m) < PC(\bar m)] = 0$ for all $m \ne \bar m$ with $m \le m_0$. Notice that, by definition,
$$PC(m) - PC(\bar m) < 0 \iff \widehat\sigma^2_*(m) - \widehat\sigma^2_*(\bar m) < (\bar m - m)\,\widehat\sigma^2_*(m_0)\,g(N, T) \iff \widehat\sigma^2_*(\bar m) - \widehat\sigma^2_*(m) > (m - \bar m)\,\widehat\sigma^2_*(m_0)\,g(N, T).$$
Consider first $m < \bar m$. We have by (7)
$$\widehat\sigma^2_*(m) - \widehat\sigma^2_*(\bar m) = \left\{\widehat\sigma^2(m) - \widehat\sigma^2(\bar m)\right\}\left\{1 + o_p(1)\right\}.$$
Moreover,
$$(N-m)\left\{\widehat\sigma^2(m) - \widehat\sigma^2(\bar m)\right\} = \sum_{m < i \le \bar m}\lambda_i - (\bar m - m)\,\widehat\sigma^2(\bar m) \ge (\bar m - m)\left\{\lambda_{\bar m} - \widehat\sigma^2(\bar m)\right\}.$$
Since $\lambda_{\bar m} \to \sigma^2\left[\frac{\alpha_{\bar m}}{\sigma^2} + 1 + c\left(1 + \frac{\sigma^2}{\alpha_{\bar m}}\right)\right]$ and $\widehat\sigma^2(\bar m) \to \sigma^2$ (in probability), the lower bound above converges to $(\bar m - m)\,\sigma^2\left\{\frac{\alpha_{\bar m}}{\sigma^2} + c\left(1 + \frac{\sigma^2}{\alpha_{\bar m}}\right)\right\}$, which is positive. The conclusion $P[PC(m) < PC(\bar m)] \to 0$ will then follow if the penalty satisfies
$$(N-m)\,g(N, T) < \frac{\sigma^2}{\widehat\sigma^2_*(m_0)}\left[\frac{\alpha_{\bar m}}{\sigma^2} + c\left(1 + \frac{\sigma^2}{\alpha_{\bar m}}\right)\right] \tag{20}$$
for large $N, T$. Notice that by assumption $\alpha_{\bar m}/\sigma^2 > \sqrt{c}$, which implies that $\frac{\alpha_{\bar m}}{\sigma^2} + c\left(1 + \frac{\sigma^2}{\alpha_{\bar m}}\right) > c + 2\sqrt{c}$. On the other hand, we have $\sigma^2/\widehat\sigma^2_*(m_0) = 1 + \beta/T + o_p(1/T)$, where $\beta$ is some constant (depending on $m_0$, $c$ and $\sigma^2$). So, with the $g(N, T)$ in (13), we have $(N-m)\,g(N, T) = (c + 2\sqrt{c})\big(1 + \sqrt{T/N^{1+\delta}}\big)(1 + o(1)) \to c + 2\sqrt{c}$, while the right-hand side of (20) stays above $(c + 2\sqrt{c})(1 + \beta/T + o_p(1/T))$ by a fixed margin, and the conclusion follows. Next, consider the case $m > \bar m$. We have
$$(N-m)\left\{\widehat\sigma^2(\bar m) - \widehat\sigma^2(m)\right\} = \sum_{\bar m < i \le m}\lambda_i - (m - \bar m)\,\widehat\sigma^2(\bar m) \xrightarrow{i.p.} (m - \bar m)\,\sigma^2(c + 2\sqrt{c}),$$
due to $\lambda_i \xrightarrow{i.p.} \sigma^2(1 + \sqrt{c})^2$ for $\bar m < i \le m$. Notice that $\widehat\sigma^2_*(m_0) \xrightarrow{i.p.} \sigma^2$ and, with the $g(N, T)$ in (13), we have
$$\liminf_{N,T\to\infty}\,(N-m)\,g(N, T) \ge c + 2\sqrt{c}.$$
The conclusion follows.
On estimation of the noise variance in high-dimensional probabilistic
principal component analysis
(Supplementary report)
Damien Passemier, Zhaoyuan Li and Jianfeng Yao
Department of Statistics and Actuarial Science
The University of Hong Kong
1 Figure for Section 2.1
Figure 1 presents the histograms from 1000 replications of
$$\frac{p-m}{\sigma^2\sqrt{2c_n}}\left(\widehat\sigma^2 - \sigma^2\right) + b(\sigma^2)$$
for the three models in Section 2.1, with sample size $n = 100$ and dimensions $p = c \times n$, compared to the density of the standard Gaussian distribution. The sampling distribution is almost normal.

[Figure 1: histograms of $\frac{p-m}{\sigma^2\sqrt{2c}}(\widehat\sigma^2 - \sigma^2) + b(\sigma^2)$ for (a) Model 1 ($p = n = 100$), (b) Model 2 ($p = 20$, $n = 100$) and (c) Model 3 ($p = 150$, $n = 100$), each compared with the density of a standard Gaussian distribution.]
2 More Monte-Carlo experiments for Section 2.3
Tables 1 and 2 report the empirical means of the estimators of the number of PCs over 1000 replications, for $m = 3$ and $m = 5$ respectively, with standard errors in parentheses. When a standard error is actually zero, no standard error is indicated. For all cases, the predetermined maximum number $m_0$ of PCs is set to 8.
3 Application to the SURE criterion
Ulfarsson and Solo (2008) proposes to use the SURE criterion to choose the number of PCs. This criterion uses the noise variance estimator $\widehat\sigma^2_{US}$ defined in Section 2. It aims at minimizing the Euclidean distance between the underlying estimator of the population mean $\mu$ and its true value. The proposed SURE criterion for $m$ PCs (to be minimized) is
$$R_m = (p-m)\,\widehat\sigma^2_{US} + \widehat\sigma^4_{US}\sum_{j=1}^m \frac{1}{\lambda_j} + 2\,\widehat\sigma^2_{US}\,(1-1/n)\,m - 2\,\widehat\sigma^4_{US}\,(1-1/n)\sum_{j=1}^m\frac{1}{\lambda_j} + \frac{4(1-1/n)\,\widehat\sigma^4_{US}}{n}\sum_{j=1}^m\frac{1}{\lambda_j} + C_m, \tag{1}$$
where
$$C_m = \frac{4(1-1/n)\,\widehat\sigma^2_{US}}{n}\sum_{j=1}^m\sum_{i=m+1}^p \frac{\lambda_j - \widehat\sigma^2_{US}}{\lambda_j - \lambda_i} + \frac{2(1-1/n)\,\widehat\sigma^2_{US}}{n}\,m(m-1) - \frac{2(1-1/n)\,\widehat\sigma^2_{US}}{n}\,(p-1)\sum_{j=1}^m\left(1 - \frac{\widehat\sigma^2_{US}}{\lambda_j}\right).$$
Recall that $\widehat\sigma^2_{US}$ is also related to $m$. From Section 2, we have seen that $\widehat\sigma^2_{US}$ is not as good as our bias-corrected estimator. To examine this difference further, we replace $\widehat\sigma^2_{US}$ with $\widehat\sigma^2_*$ in (1); the resulting criterion is referred to as SURE*, and we check whether the performance of SURE can thus be improved.

Simulation experiments are then conducted to check the performance of SURE*. The setup follows the paper Ulfarsson and Solo (2008) and the data are simulated according to the PPCA model (1) of the main paper with the parameters $p = 64$, $p/n \in \{2/3, 1/2, 2/5\}$, $m \in \{5, 10, 15, 20\}$ and $\sigma^2 = 1$. The loading matrix is set to $\Lambda = F D^{1/2}$, where $F$ is constructed by generating a $p \times m$ matrix of Gaussian random variables and then orthogonalizing the resulting matrix, and $D = \mathrm{diag}\big((m+1)^2, m^2, \dots, 3^2, \lambda_m\big)$ with $\lambda_m = 1.5$. All simulations were repeated 1500 times. Table 3 presents the percentage of correct selection of the number of PCs for SURE and SURE*; the results for SURE are taken from Table II of Ulfarsson and Solo (2008). SURE* largely outperforms SURE in all of the tested cases, most of the time by a wide margin. All the percentages of correct selection of SURE* are larger than 90% and, in 4 out of 12 cases, the detection rate is 100%.
Table 1: Comparison between the $PC^*_j$'s and the $PC_j$'s in terms of the mean estimated numbers of PCs for $m = 3$, $\theta = 3$.

N      T      PC*_1         PC*_2         PC*_3         PC_1          PC_2          PC_3
100    40     2.98(0.15)    2.95(0.22)    3.00(0.06)    3.00          3.00          3.90
100    60     3.00(0.03)    3.00(0.04)    3.00          3.01(0.08)    3.00          4.37(0.64)
200    60     3.00          3.00          3.00          3.00          3.00          4.18(0.63)
500    60     3.00          3.00          3.00          3.00          3.00          3.00
1000   60     3.00          3.00          3.00          3.00          3.00          3.00
2000   60     3.00          3.00          3.00          3.00          3.00          3.00
100    100    3.00          3.00          3.00          3.00          3.00          5.62(0.72)
200    100    3.00          3.00          3.00          3.00          3.00          3.00
500    100    3.00          3.00          3.00          3.00          3.00          3.00
1000   100    3.00          3.00          3.00          3.00          3.00          3.00
2000   60     3.00          3.00          3.00          3.00          3.00          3.00
40     100    2.99(0.10)    2.98(0.14)    3.00          3.07(0.26)    3.01(0.07)    5.04(0.72)
60     100    3.00          3.00(0.03)    3.00          3.00          3.00          4.65(0.69)
60     200    3.00          3.00          3.00          3.00          3.00          3.00
60     500    3.00          3.00          3.00          3.00          3.00          3.00
60     1000   3.00          3.00          3.00          3.00          3.00          3.00
60     2000   3.00          3.00          3.00          3.00          3.00          3.00
4000   60     3.00          3.00          3.00          3.00          3.00          3.00
4000   100    3.00          3.00          3.00          3.00          3.00          3.00
8000   60     3.00          3.00          3.00          3.00          3.00          3.00
8000   100    3.00          3.00          3.00          3.00          3.00          3.00
60     4000   3.00          3.00          3.00          3.00          3.00          3.00
100    4000   3.00          3.00          3.00          3.00          3.00          3.00
60     8000   3.00          3.00          3.00          3.00          3.00          3.00
100    8000   3.00          3.00          3.00          3.00          3.00          3.00
10     50     8.00          8.00          8.00          8.00          8.00          8.00
10     100    8.00          8.00          8.00          8.00          8.00          8.00
20     100    2.89(0.32)    2.85(0.37)    2.95(0.27)    6.55(0.74)    5.96(0.77)    7.62(0.55)
100    10     2.57(1.35)    2.43(1.19)    2.77(1.54)    8.00          8.00          8.00
100    20     2.46(0.63)    2.37(0.65)    2.65(0.52)    6.15(0.69)    5.46(0.68)    7.49(0.59)
Table 2: Comparison between the $PC^*_j$'s and the $PC_j$'s in terms of the mean estimated numbers of PCs for $m = 5$, $\theta = 5$.

N      T      PC*_1         PC*_2         PC*_3         PC_1          PC_2          PC_3
100    40     3.83(0.77)    3.49(0.77)    4.51(0.58)    5.00(0.07)    4.98(0.15)    5.36(0.51)
100    60     4.66(0.50)    4.36(0.61)    4.98(0.13)    5.00(0.03)    5.00(0.06)    5.27(0.45)
200    60     4.95(0.22)    4.90(0.30)    4.99(0.08)    5.00          5.00          5.00
500    60     5.00(0.04)    5.00(0.07)    5.00(0.03)    5.00          5.00          5.00
1000   60     5.00(0.04)    5.00(0.04)    5.00          5.00          5.00          5.00
2000   60     5.00(0.03)    5.00(0.03)    5.00(0.03)    5.00          5.00          5.00
100    100    4.(0.12)      4.90(0.30)    5.00          5.00          5.00          6.18(0.63)
200    100    5.00          5.00          5.00          5.00          5.00          5.00
500    100    5.00          5.00          5.00          5.00          5.00          5.00
1000   100    5.00          5.00          5.00          5.00          5.00          5.00
2000   60     5.00          5.00          5.00          5.00          5.00          5.00
40     100    4.25(0.68)    3.92(0.75)    4.77(0.44)    4.98(0.04)    5.66(0.14)    5.66(0.57)
60     100    4.76(0.44)    4.47(0.60)    4.76(0.10)    5.00(0.03)    4.99(0.08)    5.46(0.56)
60     200    4.97(0.17)    4.94(0.24)    5.00          5.00          5.00          5.00
60     500    5.00(0.05)    5.00(0.06)    5.00(0.04)    5.00          5.00          5.00
60     1000   5.00(0.03)    5.00(0.03)    5.00          5.00          5.00          5.00
60     2000   5.00          5.00          5.00          5.00          5.00          5.00
4000   60     5.00          5.00          5.00          5.00          5.00          5.00
4000   100    5.00          5.00          5.00          5.00          5.00          5.00
8000   60     5.00          5.00          5.00          5.00          5.00          5.00
8000   100    5.00          5.00          5.00          5.00          5.00          5.00
60     4000   5.00          5.00          5.00          5.00          5.00          5.00
100    4000   5.00          5.00          5.00          5.00          5.00          5.00
60     8000   5.00          5.00          5.00          5.00          5.00          5.00
100    8000   5.00          5.00          5.00          5.00          5.00          5.00
10     50     8.00          8.00          8.00          8.00          8.00          8.00
10     100    8.00          8.00          8.00          8.00          8.00          8.00
20     100    3.64(0.91)    3.38(0.94)    4.08(0.79)    6.65(0.64)    6.12(0.64)    7.63(0.51)
100    10     3.10(2.01)    2.83(1.86)    3.53(2.27)    8.00          8.00          8.00
100    20     2.18(0.92)    1.93(0.92)    2.65(0.90)    6.56(0.62)    5.97(0.62)    7.66(0.50)
Table 3: Comparison between SURE and SURE* in terms of percentage of correct selection of PCs.

m                 p/n = 2/3   p/n = 1/2   p/n = 2/5
5     SURE*       1.000       1.000       1.000
      SURE        0.408       0.621       0.807
10    SURE*       0.992       1.000       0.998
      SURE        0.512       0.739       0.858
15    SURE*       0.920       0.978       0.989
      SURE        0.598       0.783       0.911
20    SURE*       0.909       0.966       0.990
      SURE        0.617       0.810       0.899
Therefore, by implementing our bias-corrected estimator of the noise variance instead of the one originally proposed by its authors, the SURE criterion achieves a much better performance.
4 Monte-Carlo experiments for Section 3
We consider again Models 1 and 2 described in Section 2, and a new one (Model 4):

Model 1: $\mathrm{spec}(\Sigma) = (25, 16, 9, 0, \dots, 0) + \sigma^2(1, \dots, 1)$, $\sigma^2 = 4$, $c = 0.9$;
Model 2: $\mathrm{spec}(\Sigma) = (4, 3, 0, \dots, 0) + \sigma^2(1, \dots, 1)$, $\sigma^2 = 2$, $c = 0.2$;
Model 4: $\mathrm{spec}(\Sigma) = (8, 7, 0, \dots, 0) + \sigma^2(1, \dots, 1)$, $\sigma^2 = 1$, varying $c$.

Table 4 presents the empirical sizes of the classical likelihood ratio test (LRT) and the new corrected likelihood ratio test (CLRT). For the LRT, we use the correction proposed by Bartlett (1950), that is, replacing $T_n = -nL$ by $\widetilde T_n = -\left(n - (2p+11)/6 - 2m/3\right)L$. The computations are done under 10000 independent replications and the nominal test level is 0.05.

The empirical sizes of the new CLRT are very close to the nominal one, except when the ratio $p/n$ is very small (less than 0.1). On the contrary, the empirical sizes of the classical LRT are much higher than the nominal level, especially when $c$ is not too small, and the test will always reject the null hypothesis when $p$ becomes large. In particular, when $p/n \ge 1/2$, the LRT tends to reject the null automatically.
Table 4: Comparison of the empirical sizes of the classical likelihood ratio test (LRT) and the corrected likelihood ratio test (CLRT) in various settings.

Mod.  p     n      Empirical size of CLRT   Empirical size of LRT
1     90    100    0.0522                   0.9997
      180   200    0.0515                   1.0000
      720   800    0.0483                   1.0000
2     20    100    0.0375                   0.0321
      80    400    0.0440                   0.0368
      200   1000   0.0481                   0.0514
4     5     500    0.0122                   0.0475
      10    500    0.0217                   0.0482
      50    500    0.0421                   0.0419
      100   500    0.0438                   0.0424
      200   500    0.0498                   0.2216
      250   500    0.0501                   0.7416
      300   500    0.0461                   0.9991
5 Proofs
Before giving the proofs, we first recall some important results from random matrix theory which lay the foundation for the proofs of the main results of the paper.
5.1 Useful results from random matrix theory
Random matrix theory has become a powerful tool for addressing new inference problems in the high-dimensional scheme. For general background and references, we refer to the review papers Johnstone (2007) and Johnstone and Titterington (2009).
Let $H$ be a probability measure on $\mathbb R^+$ and $c > 0$ a constant. We define the map
$$g(s) = g_{c,H}(s) = -\frac{1}{s} + c\int \frac{t}{1+ts}\,dH(t) \tag{2}$$
on the set $\mathbb C^+ = \{z \in \mathbb C : \Im z > 0\}$. The map $g$ is a one-to-one mapping from $\mathbb C^+$ onto itself (see Bai and Silverstein, 2010, Chapter 6), and the inverse map $\underline m = g^{-1}$ satisfies all the requirements of the Stieltjes transform of a probability measure on $[0, \infty)$. We call this measure $\underline F_{c,H}$. Next, a companion measure $F_{c,H}$ is introduced by the equation $c\,F_{c,H} = (c-1)\,\delta_0 + \underline F_{c,H}$ (note that in this equation, measures can be signed). The measure $F_{c,H}$ is referred to as the generalized Marčenko-Pastur distribution with index $(c, H)$.
Let $F_n = \frac{1}{p}\sum_{i=1}^p \delta_{\lambda_{n,i}}$ be the empirical spectral distribution (ESD) of the sample covariance matrix $S_n$ defined in (2) of the main paper, with the $\{\lambda_{n,i}\}$ denoting its eigenvalues. Then, it is well known that under suitable moment conditions, $F_n$ converges to the Marčenko-Pastur distribution of index $(c, \delta_{\sigma^2})$, simply denoted by $F_{c,\sigma^2}$, with density function
$$p_{c,\sigma^2}(x) = \begin{cases} \dfrac{1}{2\pi x c \sigma^2}\sqrt{\{b(c) - x\}\{x - a(c)\}}, & a(c) \le x \le b(c),\\[4pt] 0, & \text{otherwise}.\end{cases}$$
The distribution has an additional mass $(1 - 1/c)$ at the origin if $c > 1$. The ESD $H_n$ of $\Sigma$ is
$$H_n = \frac{p-m}{p}\,\delta_{\sigma^2} + \frac{1}{p}\sum_{i=1}^m \delta_{\alpha_i + \sigma^2},$$
and $H_n \to \delta_{\sigma^2}$. Define the normalized empirical process
$$G_n(f) = p\int_{\mathbb R} f(x)\,[F_n - F_{c_n,H_n}](dx), \qquad f \in \mathcal A,$$
where $\mathcal A$ is the set of analytic functions $f: \mathcal U \to \mathbb C$, with $\mathcal U$ an open set of $\mathbb C$ such that $[\mathbb 1_{(0,1)}(c)\,a(c),\, b(c)] \subset \mathcal U$. We will need the following CLT, which is a combination of Theorem 1.1 of Bai and Silverstein (2004) and a recent addition proposed in Zheng et al. (2014).
Proposition 1. We assume the same conditions as in Theorem 1. Then, for any functions $f_1, \dots, f_k \in \mathcal A$, the random vector $(G_n(f_1), \dots, G_n(f_k))$ converges to a $k$-dimensional Gaussian vector with mean vector
$$m(f_j) = \frac{f_j(a(c)) + f_j(b(c))}{4} - \frac{1}{2\pi}\int_{a(c)}^{b(c)} \frac{f_j(x)}{\sqrt{4c\sigma^4 - \left(x - \sigma^2(1+c)\right)^2}}\,dx, \qquad j = 1, \dots, k,$$
and covariance function
$$v(f_j, f_l) = -\frac{1}{2\pi^2}\oint_{\mathcal C_1}\oint_{\mathcal C_2}\frac{f_j(z_1)\,f_l(z_2)}{\left(\underline m(z_1) - \underline m(z_2)\right)^2}\,d\underline m(z_1)\,d\underline m(z_2), \qquad j, l = 1, \dots, k, \tag{3}$$
where $\underline m(z)$ is the Stieltjes transform of $\underline F_{c,\sigma^2} = (1-c)\,\delta_0 + c\,F_{c,\sigma^2}$. The contours $\mathcal C_1$ and $\mathcal C_2$ are non-overlapping and both contain the support of $F_{c,\sigma^2}$.
An important and subtle point here is that the centering term in $G_n(f)$ in the above CLT is defined with respect to the Marčenko-Pastur distribution $F_{c_n,H_n}$ with "current" index $(c_n, H_n)$ instead of the limiting distribution $F_{c,\sigma^2}$ with index $(c, \sigma^2)$. In contrast, the limiting mean function $m(f_j)$ and covariance function $v(f_j, f_l)$ depend on the limiting distribution $F_{c,\sigma^2}$ only.
5.2 Proof of Proposition 1 in main paper
By Lemma 2.2 of Wang and Yao (2013),
$$\sum_{i=1}^p \lambda_{n,i}^2 - p\int x^2\,dF_{c_n,H_n}(x) \xrightarrow{D} \mathcal N\left(m(x^2), v\right),$$
with $m(x^2) = c\sigma^4(\gamma - 1)$ and some computable $v > 0$. Furthermore, by Lemma 1 of Bai et al. (2010),
$$\int x^2\,dF_{c_n,H_n}(x) = \beta_2 + \frac{p}{n}\,\beta_1^2,$$
where
$$\beta_1 = \sigma^2 + \frac{1}{p}\sum_{j=1}^m \alpha_j, \qquad \beta_2 = \sigma^4 + \frac{1}{p}\sum_{j=1}^m \alpha_j^2 + \frac{2}{p}\,\sigma^2\sum_{j=1}^m \alpha_j.$$
The conclusion follows.
5.3 Proof of Theorem 6 in main paper
We have
$$L = \sum_{i=m+1}^p \log\frac{\lambda_{n,i}}{\widehat\sigma^2} = \sum_{i=m+1}^p \log\frac{\lambda_{n,i}}{\sigma^2} - \sum_{i=m+1}^p \log\frac{\widehat\sigma^2}{\sigma^2} = \sum_{i=m+1}^p \log\frac{\lambda_{n,i}}{\sigma^2} - (p-m)\log\left(\frac{1}{p-m}\sum_{i=m+1}^p \frac{\lambda_{n,i}}{\sigma^2}\right) = L_1 - (p-m)\log\frac{L_2}{p-m},$$
where we have defined the two-dimensional vector $(L_1, L_2) = \left(\sum_{i=m+1}^p \log\frac{\lambda_{n,i}}{\sigma^2},\ \sum_{i=m+1}^p \frac{\lambda_{n,i}}{\sigma^2}\right)$.
CLT when $\sigma^2=1$. To start with, we consider the case $\sigma^2=1$. We have
$$L_1 = p\int \log(x)\,dF_n(x) - \sum_{i=1}^m \log\lambda_{n,i} = p\int \log(x)\,d(F_n - F_{c_n,H_n})(x) + p\int \log(x)\,dF_{c_n,H_n}(x) - \sum_{i=1}^m \log\lambda_{n,i}.$$
Similarly, we have
$$L_2 = p\int x\,d(F_n - F_{c_n,H_n})(x) + p\int x\,dF_{c_n,H_n}(x) - \sum_{i=1}^m \lambda_{n,i}.$$
By Proposition 1, we find that
$$p\begin{pmatrix} \int \log(x)\,d(F_n-F_{c_n,H_n})(x) \\ \int x\,d(F_n-F_{c_n,H_n})(x) \end{pmatrix} \xrightarrow{\;D\;} \mathcal{N}\left(\begin{pmatrix} m_1(c) \\ m_2(c) \end{pmatrix}, \begin{pmatrix} v_1(c) & v_{1,2}(c) \\ v_{1,2}(c) & v_2(c) \end{pmatrix}\right) \qquad (4)$$
with $m_2(c)=0$, $v_2(c)=2c$, and
$$m_1(c) = \frac{\log(1-c)}{2}, \qquad (5)$$
$$v_1(c) = -2\log(1-c), \qquad (6)$$
$$v_{1,2}(c) = 2c. \qquad (7)$$
The formulae for $m_2$ and $v_2$ have been established in the proof of Theorem 1, and the others are derived in the next subsection.
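A Monte-Carlo sketch (our illustration, assuming Gaussian data with $\Sigma=I_p$, i.e. $\sigma^2=1$ and no spikes) can be used to check (5)-(7): the centering term for the log statistic uses $h(c_n)$ from (8), and $\int x\,dF_{c_n,1}(x)=1$ for the linear statistic.

```python
# Monte-Carlo sketch (illustrative): with Sigma = I_p, check (5)-(7) for the
# pair (G_n(log x), G_n(x)). Centering uses h(c_n) from (8) and int x dF = 1.
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 100, 200, 2000
cn = p / n
h = (cn - 1) / cn * np.log(1 - cn) - 1      # h(c_n) from (8)

G_log = np.empty(reps)
G_lin = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((p, n))
    lam = np.linalg.eigvalsh(X @ X.T / n)
    G_log[r] = np.sum(np.log(lam)) - p * h  # p * int log d(F_n - F_{c_n,1})
    G_lin[r] = np.sum(lam) - p              # p * int x   d(F_n - F_{c_n,1})

c = cn
print("G_log: mean", G_log.mean(), "vs", np.log(1 - c) / 2,
      "| var", G_log.var(), "vs", -2 * np.log(1 - c))
print("G_lin: mean", G_lin.mean(), "vs 0.0 | var", G_lin.var(), "vs", 2 * c)
print("cov:", np.cov(G_log, G_lin)[0, 1], "vs", 2 * c)
```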
In Theorem 1, with $\sigma^2=1$, we found that
$$\int x\,dF_{c_n,H_n}(x) = 1 + \frac{1}{p}\sum_{i=1}^m \alpha_i,$$
and
$$\sum_{i=1}^m \lambda_{n,i} \xrightarrow{\;a.s.\;} \sum_{i=1}^m \left(\alpha_i + \frac{c}{\alpha_i}\right) + m(1+c).$$
For the last term of $L_1$, by (6) in the main paper, we have
$$\log\lambda_{n,i} \xrightarrow{\;a.s.\;} \log(\phi(\alpha_i+1)) = \log\left((\alpha_i+1)(1+c\alpha_i^{-1})\right).$$
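Numerically, these almost-sure limits are easy to observe; the sketch below (ours, with arbitrary spikes $\alpha_i > \sqrt{c}$ so that they are detectable) compares the two largest sample eigenvalues with $\phi(\alpha_i+1) = (\alpha_i+1)(1+c\alpha_i^{-1})$.

```python
# Numerical illustration (ours): the largest sample eigenvalues approach
# phi(alpha_i + 1) = (alpha_i + 1)(1 + c / alpha_i) for alpha_i > sqrt(c).
import numpy as np

rng = np.random.default_rng(4)
p, n = 500, 1000
c = p / n
alpha = np.array([5.0, 2.0])                # arbitrary detectable spikes
diag = np.concatenate([alpha + 1.0, np.ones(p - alpha.size)])

X = np.sqrt(diag)[:, None] * rng.standard_normal((p, n))
lam = np.linalg.eigvalsh(X @ X.T / n)[::-1] # decreasing order
print("two largest sample eigenvalues:", lam[:2])
print("predicted a.s. limits:         ", (alpha + 1) * (1 + c / alpha))
```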
Furthermore, by Wang et al. (2014), we have
$$\int \log(x)\,dF_{c_n,H_n}(x) = \frac{1}{p}\sum_{i=1}^m \log(\alpha_i+1) + h(c_n) + o\left(\frac{1}{p}\right),$$
where
$$h(c_n) = \int \log(x)\,dF_{c_n,1}(x) = \frac{c_n-1}{c_n}\log(1-c_n) - 1 \qquad (8)$$
can be calculated using the density of the Marčenko-Pastur law (see the proof of (8) in Section 5.4). Summarising, we have obtained that
$$L_1 - m_1(c) - p\,h(c_n) + \eta(c,\alpha) \xrightarrow{\;D\;} \mathcal{N}(0, v_1(c)),$$
where $\eta(c,\alpha) = \sum_{i=1}^m \log(1+c\alpha_i^{-1})$. Similarly, we have
$$L_2 - (p-m) + \rho(c,\alpha) \xrightarrow{\;D\;} \mathcal{N}(0, v_2(c)),$$
where $\rho(c,\alpha) = c\left(m + \sum_{i=1}^m \alpha_i^{-1}\right)$.
Using (4) and Slutsky's lemma,
$$\begin{pmatrix} L_1 \\ L_2 \end{pmatrix} \xrightarrow{\;D\;} \mathcal{N}\left(\begin{pmatrix} m_1(c) + p\,h(c_n) - \eta(c,\alpha) \\ p-m-\rho(c,\alpha) \end{pmatrix}, \begin{pmatrix} v_1(c) & v_{1,2}(c) \\ v_{1,2}(c) & v_2(c) \end{pmatrix}\right),$$
with $h(c_n)$, $\eta(c,\alpha)$ and $\rho(c,\alpha)$ as defined above.

CLT with general $\sigma^2$. When $\sigma^2=1$,
$$\mathrm{spec}(\Sigma) = (\alpha_1+1,\dots,\alpha_m+1,1,\dots,1),$$
whereas in the general case
$$\mathrm{spec}(\Sigma) = (\alpha_1+\sigma^2,\dots,\alpha_m+\sigma^2,\sigma^2,\dots,\sigma^2) = \sigma^2\left(\frac{\alpha_1}{\sigma^2}+1,\dots,\frac{\alpha_m}{\sigma^2}+1,1,\dots,1\right).$$
Thus, if we consider $\lambda_i/\sigma^2$, we find the same CLT with the $(\alpha_i)_{1\le i\le m}$ replaced by $\alpha_i/\sigma^2$.
Furthermore, we divide $L_2$ by $p-m$ to find
$$\begin{pmatrix} L_1 \\ \dfrac{L_2}{p-m} \end{pmatrix} \xrightarrow{\;D\;} \mathcal{N}\left(\begin{pmatrix} m_1(c) + p\,h(c_n) - \eta(c,\alpha/\sigma^2) \\ 1 - \dfrac{\rho(c,\alpha/\sigma^2)}{p-m} \end{pmatrix}, \begin{pmatrix} -2\log(1-c) & \dfrac{2c}{p-m} \\[4pt] \dfrac{2c}{p-m} & \dfrac{2c}{(p-m)^2} \end{pmatrix}\right), \qquad (9)$$
with $\eta(c,\alpha/\sigma^2) = \sum_{i=1}^m \log(1+c\sigma^2\alpha_i^{-1})$, $\rho(c,\alpha/\sigma^2) = c\left(m + \sigma^2\sum_{i=1}^m \alpha_i^{-1}\right)$ and $h(c_n) = \frac{c_n-1}{c_n}\log(1-c_n) - 1$.

Asymptotic distribution of $L$. We have $L = g(L_1, L_2/(p-m))$, with $g(x,y) = x - (p-m)\log(y)$. We will apply the multivariate delta method to (9) with the function $g$. We have
$$\nabla g(x,y) = \left(1,\; -\frac{p-m}{y}\right)$$
and
$$L \xrightarrow{\;D\;} \mathcal{N}\Big(\beta_1 - (p-m)\log(\beta_2),\; \nabla g(\beta_1,\beta_2)\,\mathrm{cov}(L_1, L_2/(p-m))\,\nabla g(\beta_1,\beta_2)'\Big),$$
with $\beta_1 = m_1(c) + p\,h(c_n) - \eta(c,\alpha/\sigma^2)$ and $\beta_2 = 1 - \frac{\rho(c,\alpha/\sigma^2)}{p-m}$. After some standard calculation, we finally find
$$L \xrightarrow{\;D\;} \mathcal{N}\left(m_1(c) + p\,h(c_n) - \eta\left(c,\frac{\alpha}{\sigma^2}\right) - (p-m)\log(\beta_2),\; -2\log(1-c) - \frac{4c}{\beta_2} + \frac{2c}{\beta_2^2}\right).$$
5.4 Complementary proofs

Proof of (4) in main paper

The general theory of the m.l.e. for the PPCA model (1) in the classical setting has been developed in Anderson and Amemiya (1988), with in particular the following result.

Proposition 2. Let $\Theta = (\theta_{ij})_{1\le i,j\le p} = \Psi - \Lambda(\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'$. Assume that $(\theta_{ij}^2)_{1\le i,j\le p}$ is nonsingular, that $\Lambda$ and $\Psi$ are identified by the condition that $\Lambda'\Psi^{-1}\Lambda$ is diagonal with distinct and ordered diagonal elements, that $S_n \to \Lambda\Lambda' + \Psi$ in probability, and that $\sqrt{n}(S_n-\Sigma)$ has a limiting distribution. Then $\sqrt{n}(\widehat\Lambda-\Lambda)$ and $\sqrt{n}(\widehat\Psi-\Psi)$ have a limiting distribution. The covariance of $\sqrt{n}(\widehat\Psi_{ii}-\Psi_{ii})$ and $\sqrt{n}(\widehat\Psi_{jj}-\Psi_{jj})$ in the limiting distribution is $2\Psi_{ii}^2\Psi_{jj}^2\xi_{ij}$ $(1\le i,j\le p)$, where $(\xi_{ij}) = (\theta_{ij}^2)^{-1}$.
To prove the CLT (4) in the main paper, by Proposition 2 we know that the inverse of the Fisher information matrix is $I^{-1}(\psi_{11},\dots,\psi_{pp}) = (2\psi_{ii}^2\psi_{jj}^2\xi_{ij})_{ij}$. We have to change the parametrization: in our case, $\psi_{11} = \cdots = \psi_{pp} = \sigma^2$. Let $g:\mathbb{R}\to\mathbb{R}^p$, $a\mapsto(a,\dots,a)$. The information matrix in this new parametrization becomes
$$I(\sigma^2) = J'\, I(g(\sigma^2))\, J,$$
where $J$ is the Jacobian matrix of $g$. As
$$I(g(\sigma^2)) = \frac{1}{2\sigma^8}(\theta_{ij}^2)_{ij},$$
we have
$$I(\sigma^2) = \frac{1}{2\sigma^8}\sum_{i,j=1}^p \theta_{ij}^2,$$
and
$$\Theta = (\theta_{ij})_{ij} = \Psi - \Lambda(\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda' = \sigma^2\left(I_p - \Lambda(\Lambda'\Lambda)^{-1}\Lambda'\right).$$
By hypothesis, we have $\Lambda'\Lambda = \mathrm{diag}(d_1^2,\dots,d_m^2)$. Consider the singular value decomposition of $\Lambda$, $\Lambda = UDV$, where $U$ is a $p\times p$ matrix such that $UU' = I_p$, $V$ is an $m\times m$ matrix such that $V'V = I_m$, and $D$ is a $p\times m$ diagonal matrix with $d_1,\dots,d_m$ as diagonal elements. As $\Lambda'\Lambda$ is diagonal, $V = I_m$, so $\Lambda = UD$. By elementary calculus, one can find that
$$\Lambda(\Lambda'\Lambda)^{-1}\Lambda' = \mathrm{diag}(\underbrace{1,\dots,1}_{m},\underbrace{0,\dots,0}_{p-m}),$$
so
$$\Theta = \sigma^2\,\mathrm{diag}(\underbrace{0,\dots,0}_{m},\underbrace{1,\dots,1}_{p-m}).$$
Finally,
$$I(\sigma^2) = \frac{1}{2\sigma^8}(p-m)\sigma^4 = \frac{p-m}{2\sigma^4},$$
and the asymptotic variance of $\widehat\sigma^2$ is
$$s^2 = I^{-1}(\sigma^2) = \frac{2\sigma^4}{p-m}.$$
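This classical (fixed $p$, $n\to\infty$) variance is easy to reproduce by simulation; in the sketch below (ours, with arbitrary spikes and dimensions), $\widehat\sigma^2$ is taken as the average of the $p-m$ smallest eigenvalues of $S_n$.

```python
# Monte-Carlo sketch (illustrative) of the classical fixed-p asymptotics:
# n * Var(sigma2hat) should approach 2 sigma^4 / (p - m).
import numpy as np

rng = np.random.default_rng(6)
p, m, n, sigma2, reps = 10, 2, 2000, 1.0, 2000
alpha = np.array([4.0, 2.0])                # arbitrary spike values
diag = np.concatenate([alpha + sigma2, np.full(p - m, sigma2)])

est = np.empty(reps)
for r in range(reps):
    X = np.sqrt(diag)[:, None] * rng.standard_normal((p, n))
    lam = np.linalg.eigvalsh(X @ X.T / n)   # ascending eigenvalues of S_n
    est[r] = lam[: p - m].mean()            # MLE of the noise variance

print("n * Var(sigma2hat):", n * est.var())
print("theory 2*sigma^4/(p-m):", 2 * sigma2**2 / (p - m))
```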
Proof of (17) in main paper
By Proposition 1, for g(x) = x, by using the variable change x=σ2(1 + c2ccos θ),
0θπ, we have
m(g) = g(a(c)) + g(b(c))
41
2πZb(c)
a(c)
x
p44(xσ22)2dx,j= 1, . . . , k
=σ2(1 + c)
2σ2
2πZπ
0
(1 + c2ccos θ) dθ
= 0.
12
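A one-line quadrature (our illustrative check, with arbitrary $c$ and $\sigma^2$) confirms the cancellation: after the change of variable, the integral term equals $\sigma^2(1+c)/2$ exactly.

```python
# Quadrature check (illustrative; arbitrary c and sigma^2): after the change
# of variable, the integral term equals sigma^2 (1 + c) / 2, so m(g) = 0.
import numpy as np
from scipy.integrate import quad

c, sigma2 = 0.4, 1.3
a = sigma2 * (1 - np.sqrt(c)) ** 2          # a(c)
b = sigma2 * (1 + np.sqrt(c)) ** 2          # b(c)
val, _ = quad(lambda t: sigma2 * (1 + c - 2 * np.sqrt(c) * np.cos(t)), 0, np.pi)
print((a + b) / 4 - val / (2 * np.pi))      # approximately 0
```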
Proof of (18) in main paper

Let $s(z)$ be the Stieltjes transform of $(1-c)\mathbf{1}_{[0,\infty)} + cF_{c,\delta_1}$. One can show that
$$m(z) = \frac{1}{\sigma^2}\, s\!\left(\frac{z}{\sigma^2}\right).$$
Then, in Proposition 1, we have
$$v(f_j,f_l) = -\frac{1}{2\pi^2}\oint\oint \frac{f_j(\sigma^2 z_1)\, f_l(\sigma^2 z_2)}{(s(z_1)-s(z_2))^2}\,ds(z_1)\,ds(z_2), \qquad j,l = 1,\dots,k. \qquad (10)$$
For $g(x)=x$, we have
$$v(g) = -\frac{1}{2\pi^2}\oint\oint \frac{g(\sigma^2 z_1)\,g(\sigma^2 z_2)}{(s(z_1)-s(z_2))^2}\,ds(z_1)\,ds(z_2) = -\frac{\sigma^4}{2\pi^2}\oint\oint \frac{z_1 z_2}{(s(z_1)-s(z_2))^2}\,ds(z_1)\,ds(z_2) = 2c\sigma^4,$$
where $-\frac{1}{2\pi^2}\oint\oint \frac{z_1 z_2}{(s(z_1)-s(z_2))^2}\,ds(z_1)\,ds(z_2) = 2c$ is calculated in Bai et al. (2009) (it corresponds to $v(z_1,z_2)$ in their Section 5, proof of (3.4)).
Proof of (5)

By Proposition 1, for $\sigma^2=1$ and $g(x)=\log(x)$, using the change of variable $x = 1+c-2\sqrt{c}\cos\theta$, $0\le\theta\le\pi$, we have
$$m(g) = \frac{g(a(c))+g(b(c))}{4} - \frac{1}{2\pi}\int_{a(c)}^{b(c)} \frac{\log x}{\sqrt{4c - (x-1-c)^2}}\,dx = \frac{\log(1-c)}{2} - \frac{1}{2\pi}\int_0^\pi \log(1+c-2\sqrt{c}\cos\theta)\,d\theta$$
$$= \frac{\log(1-c)}{2} - \frac{1}{4\pi}\int_0^{2\pi} \log|1-\sqrt{c}e^{i\theta}|^2\,d\theta = \frac{\log(1-c)}{2},$$
where $\int_0^{2\pi} \log|1-\sqrt{c}e^{i\theta}|^2\,d\theta = 0$ is calculated in Bai and Silverstein (2010).
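The vanishing integral in the last step can be verified directly by numerical quadrature (our check; the value of $c<1$ is arbitrary).

```python
# Quadrature check (illustrative): the circular average of
# log|1 - sqrt(c) e^{i theta}|^2 vanishes for c < 1.
import numpy as np
from scipy.integrate import quad

c = 0.6
f = lambda t: np.log(np.abs(1 - np.sqrt(c) * np.exp(1j * t)) ** 2)
val, _ = quad(f, 0, 2 * np.pi)
print(val)                                  # approximately 0
```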
Proof of (6)

By Proposition 1 and (10), for $\sigma^2=1$ and $g(x)=\log(x)$, we have
$$v(g) = -\frac{1}{2\pi^2}\oint\oint \frac{g(z_1)\,g(z_2)}{(s(z_1)-s(z_2))^2}\,ds(z_1)\,ds(z_2) = -\frac{1}{2\pi^2}\oint\oint \frac{\log(z_1)\log(z_2)}{(s(z_1)-s(z_2))^2}\,ds(z_1)\,ds(z_2) = -2\log(1-c),$$
where the last integral is calculated in Bai and Silverstein (2010).
Proof of (8)

$F_{c_n,1}$ is the Marčenko-Pastur distribution of index $c_n$. Using the change of variable $x = 1+c_n-2\sqrt{c_n}\cos\theta$, $0\le\theta\le\pi$, we have
$$\int \log(x)\,dF_{c_n,1}(x) = \int_{a(c_n)}^{b(c_n)} \frac{\log x}{2\pi x c_n}\sqrt{(b(c_n)-x)(x-a(c_n))}\,dx = \frac{1}{2\pi c_n}\int_0^\pi \frac{\log(1+c_n-2\sqrt{c_n}\cos\theta)}{1+c_n-2\sqrt{c_n}\cos\theta}\, 4c_n\sin^2\theta\,d\theta$$
$$= \frac{1}{2\pi}\int_0^{2\pi} \frac{2\sin^2\theta}{1+c_n-2\sqrt{c_n}\cos\theta}\,\log|1-\sqrt{c_n}e^{i\theta}|^2\,d\theta = \frac{c_n-1}{c_n}\log(1-c_n) - 1,$$
where the last integral is calculated in Bai and Silverstein (2010).
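The closed form (8) can be confirmed by integrating $\log(x)$ against the Marčenko-Pastur density numerically (our check; $c_n$ is an arbitrary value in $(0,1)$).

```python
# Quadrature check (illustrative; arbitrary c_n in (0,1)) of the closed
# form (8) for the integral of log(x) against the Marcenko-Pastur density.
import numpy as np
from scipy.integrate import quad

cn = 0.5
a = (1 - np.sqrt(cn)) ** 2
b = (1 + np.sqrt(cn)) ** 2
dens = lambda x: np.sqrt((b - x) * (x - a)) / (2 * np.pi * x * cn)
val, _ = quad(lambda x: np.log(x) * dens(x), a, b)
closed = (cn - 1) / cn * np.log(1 - cn) - 1
print(val, "vs", closed)                    # both approximately -0.3069
```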
Proof of (7)

In the normal case with $\sigma^2=1$, Zheng (2012) gives the following equivalent expression of (3):
$$v(f_j,f_l) = -\lim_{r\downarrow 1}\frac{\kappa}{4\pi^2}\oint_{|\xi_1|=1}\oint_{|\xi_2|=1} \frac{f_j(|1+h\xi_1|^2)\,f_l(|1+h\xi_2|^2)}{(\xi_1-r\xi_2)^2}\,d\xi_1\,d\xi_2,$$
where $\kappa=2$ in the real case and $h=\sqrt{c}$ in our case. We take $f_j(x)=\log(x)$ and $f_l(x)=x$, so we need to calculate
$$v(\log(x), x) = -\lim_{r\downarrow 1}\frac{1}{2\pi^2}\oint_{|\xi_1|=1}\oint_{|\xi_2|=1} \frac{|1+\sqrt{c}\xi_2|^2\,\log(|1+\sqrt{c}\xi_1|^2)}{(\xi_1-r\xi_2)^2}\,d\xi_1\,d\xi_2.$$
We follow the calculations done in Zheng (2012): when $|\xi|=1$, $|1+\sqrt{c}\xi|^2 = (1+\sqrt{c}\xi)(1+\sqrt{c}\xi^{-1})$, so $\log(|1+\sqrt{c}\xi|^2) = \frac{1}{2}\left[\log(1+\sqrt{c}\xi)^2 + \log(1+\sqrt{c}\xi^{-1})^2\right]$. Consequently,
$$\oint_{|\xi_1|=1}\frac{\log(|1+\sqrt{c}\xi_1|^2)}{(\xi_1-r\xi_2)^2}\,d\xi_1 = \frac{1}{2}\oint_{|\xi_1|=1}\frac{\log(1+\sqrt{c}\xi_1)^2}{(\xi_1-r\xi_2)^2}\,d\xi_1 + \frac{1}{2}\oint_{|\xi_1|=1}\frac{\log(1+\sqrt{c}\xi_1^{-1})^2}{(\xi_1-r\xi_2)^2}\,d\xi_1$$
$$= \frac{1}{2}\oint_{|\xi_1|=1}\log(1+\sqrt{c}\xi_1)^2\left[\frac{1}{(\xi_1-r\xi_2)^2} + \frac{1}{(1-r\xi_1\xi_2)^2}\right]d\xi_1 = 0 + \frac{1}{2}\cdot\frac{2\pi i}{(r\xi_2)^2}\cdot\frac{2\sqrt{c}}{1+\frac{\sqrt{c}}{r\xi_2}} = \frac{2\pi i\sqrt{c}}{r\xi_2(r\xi_2+\sqrt{c})}.$$
Thus,
$$v(\log(x),x) = \frac{\sqrt{c}}{i\pi}\oint_{|\xi_2|=1} \frac{|1+\sqrt{c}\xi_2|^2}{\xi_2(\xi_2+\sqrt{c})}\,d\xi_2 = \frac{\sqrt{c}}{i\pi}\oint_{|\xi|=1} \frac{1+c+\sqrt{c}(\xi+\xi^{-1})}{\xi(\xi+\sqrt{c})}\,d\xi$$
$$= \frac{1}{i\pi}\oint_{|\xi|=1}\left[\frac{\sqrt{c}(1+c)}{\xi(\xi+\sqrt{c})} + \frac{c}{\xi+\sqrt{c}} + \frac{c}{\xi^2(\xi+\sqrt{c})}\right]d\xi = 2\left\{(1+c) - (1+c) + c + 1 - 1\right\} = 2c.$$
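The final contour integral can also be checked numerically by parametrizing $\xi = e^{i\theta}$ on the unit circle (our illustration; $c$ is arbitrary in $(0,1)$): a periodic-grid average reproduces $2c$.

```python
# Numerical check (illustrative) of the final contour integral: parametrize
# xi = e^{i theta}; the periodic-grid average reproduces v(log(x), x) = 2c.
import numpy as np

c = 0.3
theta = np.linspace(0.0, 2 * np.pi, 200_000, endpoint=False)
xi = np.exp(1j * theta)
integrand = np.abs(1 + np.sqrt(c) * xi) ** 2 / (xi + np.sqrt(c))
# (sqrt(c)/pi) * integral over [0, 2pi] equals 2*sqrt(c)*(mean of integrand)
val = 2 * np.sqrt(c) * integrand.mean()
print(val.real, "vs theory:", 2 * c)        # imaginary part is ~ 0
```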
References

T. W. Anderson and Y. Amemiya. The asymptotic normal distribution of estimators in factor analysis under general conditions. Ann. Statist., 16(2):759-771, 1988.

Z. Bai and J. W. Silverstein. CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab., 32(1A):553-605, 2004.

Z. Bai and J. W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer, New York, second edition, 2010.

Z. Bai, D. Jiang, J. Yao, and S. Zheng. Corrections to LRT on large-dimensional covariance matrix by RMT. Ann. Statist., 37(6B):3822-3840, 2009.

Z. Bai, J. Chen, and J. Yao. On estimation of the population spectral distribution from a high-dimensional sample covariance matrix. Aust. N. Z. J. Stat., 52(4):423-437, 2010.

M. S. Bartlett. Tests of significance in factor analysis. Brit. Jour. Psych., 3:97-104, 1950.

I. M. Johnstone. High dimensional statistical inference and random matrices. In International Congress of Mathematicians, Vol. I, pages 307-333. Eur. Math. Soc., Zürich, 2007.

I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci., 367(1906):4237-4253, 2009.

M. O. Ulfarsson and V. Solo. Dimension estimation in noisy PCA with SURE and random matrix theory. IEEE Trans. Signal Process., 56(12):5804-5816, 2008.

Q. Wang and J. Yao. On the sphericity test with large-dimensional observations. Electron. J. Stat., 7:2164-2192, 2013.

Q. Wang, J. W. Silverstein, and J. Yao. A note on the CLT of the LSS for sample covariance matrix from a spiked population model. J. Multivariate Anal., 130:194-207, 2014.

S. Zheng. Central limit theorems for linear spectral statistics of large dimensional F-matrices. Ann. Inst. Henri Poincaré Probab. Stat., 48(2):444-476, 2012.

S. Zheng, Z. Bai, and J. Yao. Substitution principle for CLT of linear spectral statistics of high-dimensional sample covariance matrices with applications to hypothesis testing. Ann. Statist., to appear, 2014.