arXiv:1203.4572v1 [math.ST] 20 Mar 2012
Optimal Estimation and Prediction for Dense
Signals in High-Dimensional Linear Models
Lee Dicker
Department of Statistics and Biostatistics
Rutgers University
501 Hill Center, 110 Frelinghuysen Road
Piscataway, NJ 08854
e-mail: ldicker@stat.rutgers.edu
Abstract: Estimation and prediction problems for dense signals are often framed in
terms of minimax problems over highly symmetric parameter spaces. In this paper, we
study minimax problems over ℓ2-balls for high-dimensional linear models with Gaussian
predictors. We obtain sharp asymptotics for the minimax risk that are applicable in
any asymptotic setting where the number of predictors diverges and prove that ridge
regression is asymptotically minimax. Adaptive asymptotic minimax ridge estimators
are also identified. Orthogonal invariance is heavily exploited throughout the paper
and, beyond serving as a technical tool, provides additional insight into the problems
considered here. Most of our results follow from an apparently novel analysis of an
equivalent non-Gaussian sequence model with orthogonally invariant errors. As with
many dense estimation and prediction problems, the minimax risk studied here has rate
d/n, where d is the number of predictors and n is the number of observations; however,
when d ≍ n the minimax risk is influenced by the spectral distribution of the predictors
and is notably different from the linear minimax risk for the Gaussian sequence model
(Pinsker, 1980) that often appears in other dense estimation and prediction problems.
AMS 2000 subject classifications: Primary 62J05; secondary 62C20.
Keywords and phrases: adaptive estimation, asymptotic minimax, non-Gaussian se-
quence model, oracle estimators, ridge regression.
1. Introduction
This paper is about estimation and prediction problems involving non-sparse (or “dense”)
signals in high-dimensional linear models. By contrast, a great deal of recent research into
high-dimensional linear models has focused on sparsity. Though there are many notions of
sparsity (e.g. ℓp-sparsity (Abramovich et al., 2006)), a vector β ∈ Rd is typically considered to be sparse if many of its coordinates are very close to 0. Perhaps one general principle that has emerged from the literature on sparse high-dimensional linear models may be summarized as follows: if the parameter of interest is sparse, then this can often be
leveraged to develop methods that perform very well, even when the number of predictors
is much larger than the number of observations. Indeed, powerful theoretical performance
guarantees are available for many methods developed under this paradigm, provided the parameter of interest is sparse (Bickel et al., 2009; Bunea et al., 2007; Candès and Tao, 2007; Fan and Lv, 2011; Rigollet and Tsybakov, 2011; Zhang, 2010). Furthermore, in many applications, especially in engineering and signal processing, sparsity assumptions have been repeatedly validated (Donoho, 1995; Duarte et al., 2008; Erlich et al., 2010; Lustig et al., 2007;
Wright et al., 2008). However, there is less certainty about the manifestations of sparsity in
other important applications where high-dimensional data is abundant. For example, several recent papers have questioned the degree of sparsity in modern genomic datasets (see, for instance, Hall et al. (2009) and the references therein, including Goldstein (2009), Hirschhorn (2009), and Kraft and Hunter (2009), and, more recently, Bansal et al. (2010) and Manolio (2010)). In situations like these, sparse methods may be sub-optimal and methods
designed for dense problems may be more appropriate.
Let d and n denote the number of predictors and observations, respectively, in a linear
regression problem. In dense estimation and prediction problems, where the parameter of
interest is not assumed to be sparse, d/n → 0 is often required to ensure consistency. Indeed,
this is the case for the problems considered in this paper. In this sense, dense problems are
more challenging than sparse problems, where consistency may be possible when d/n → ∞.
This lends credence to Friedman et al.’s (2004) “bet on sparsity” principle for high-dimensional
data analysis:
Use a procedure that does well in sparse problems, since no procedure does well in dense problems.
The “bet on sparsity” principle has proven to be very useful, especially in applications where
sparsity prevails, and it may help to explain some of the recent emphasis on understanding
sparse problems. However, the emergence of important problems in high-dimensional data
analysis where the role of sparsity is less clear highlights the importance of characterizing
and thoroughly understanding dense problems in high-dimensional data analysis. This paper
addresses some of these problems.
In many statistical settings, minimax problems over highly symmetric parameter spaces have been equated with dense estimation problems (Donoho and Johnstone, 1994; Johnstone,
2011). In this paper, we study the minimax risk over ℓ2-balls for high-dimensional linear
models with Gaussian predictors. We identify several informative, asymptotically equivalent
formulations of the problem and provide a complete asymptotic solution when the number of
predictors d grows large. In particular, we obtain sharp asymptotics for the minimax risk that
are applicable in any asymptotic setting where d → ∞ and we show that ridge regression es-
timators (Hoerl and Kennard, 1970; Tikhonov, 1943) are asymptotically minimax. Adaptive
asymptotic minimax ridge estimators are also discussed. Our results follow from carefully analyzing an equivalent non-Gaussian sequence model with orthogonally invariant errors and the
novel use of two classical tools – Brown’s identity (Brown, 1971) and Stam’s inequality (Stam,
1959) – to relate this sequence model to the Gaussian sequence model with iid errors. The re-
sults in this paper share some similarities with those found in (Goldenshluger and Tsybakov,
2001, 2003), which address minimax prediction over ℓ2-ellipsoids. However, the implications
of our results and the methods used to prove them differ substantially from Goldenshluger
and Tsybakov’s (this is discussed in more detail in Sections 2.2-2.3 below).
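As an informal illustration of the estimators at the center of the paper: ridge regression takes the familiar form β̂λ = (XTX + λI)^{-1}XTy, and the penalty λ = d/c² (with c = ||β||) mirrors the quantity d/c² appearing in the proofs below. The sketch that follows uses arbitrary illustrative dimensions and an oracle choice of λ (an idealization, since ||β|| is unknown in practice) to compare ridge against least squares on dense signals:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, reps = 50, 100, 200
ridge_err = ols_err = 0.0
for _ in range(reps):
    beta = rng.normal(size=d) / np.sqrt(d)        # dense signal, ||beta|| roughly 1
    X = rng.normal(size=(n, d))                   # Gaussian predictors as in (2)
    y = X @ beta + rng.normal(size=n)             # linear model (1)
    lam = d / (beta @ beta)                       # oracle-style penalty d / ||beta||^2
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    ridge_err += np.sum((b_ridge - beta) ** 2) / reps
    ols_err += np.sum((b_ols - beta) ** 2) / reps

# With d comparable to n, shrinkage toward 0 typically beats least squares in l2 loss.
print(ridge_err < ols_err)
```

With d/n = 1/2 here, the average ridge error is well below the least-squares error, in line with the d/n rate discussed above.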
2. Background and preliminaries
2.1. Statistical setting
Suppose that the observed data consist of outcomes y1,...,yn ∈ R and d-dimensional predictors x1,...,xn ∈ Rd. The outcomes and predictors follow a linear model and are related via the equation

yi = xiTβ + ǫi,    i = 1,...,n,    (1)
where β = (β1,...,βd)T ∈ Rd is an unknown parameter vector (also referred to as “the signal”) and ǫ1,...,ǫn are unobserved errors. To simplify notation, let y = (y1,...,yn)T ∈ Rn, X = (x1,...,xn)T, and ǫ = (ǫ1,...,ǫn)T. Then (1) may be rewritten as y = Xβ + ǫ. In many
high-dimensional settings it is natural to consider the predictors xi to be random. In this
paper, we assume that
x1,...,xn ∼iid N(0, I)   and   ǫ1,...,ǫn ∼iid N(0, 1)    (2)

are independent, where I = Id is the d × d identity matrix. These distributional assumptions
impose significant additional structure on the linear model (1). However, similar models have
been studied previously (Baranchik, 1973; Breiman and Freedman, 1983; Brown, 1990; Leeb,
2009; Stein, 1960) and we believe that the insights imparted by the resulting simplifications
are worthwhile. For the results in this paper, perhaps the most noteworthy simplifying con-
sequence of the normality assumption (2) is that the distributions of X and ǫ are invariant
under orthogonal transformations.
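As a quick sanity check on the model (1)-(2) (illustrative only; the dimensions below are arbitrary): under (2), xiTβ ∼ N(0, ||β||²) independently of ǫi, so each yi is marginally N(0, ||β||² + 1).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 30, 200_000
beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)       # normalize so ||beta||^2 = 1
X = rng.normal(size=(n, d))        # rows x_i ~ N(0, I_d), as in (2)
y = X @ beta + rng.normal(size=n)  # the linear model (1)

# Marginally y_i ~ N(0, ||beta||^2 + 1) = N(0, 2); the empirical variance agrees.
print(round(y.var(), 2))
```

The empirical variance lands within Monte Carlo error of ||β||² + 1 = 2.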
We point out that the assumption E(xi) = 0 (which is implicit in (2)) is not particularly limiting: if E(xi) ≠ 0, then we can reduce to the mean 0 case by centering and decorrelating the data. If Var(ǫi) = σ2 ≠ 1 and σ2 is known, then this can easily be reduced to the case where Var(ǫi) = 1. If σ2 is unknown and d < n, then σ2 can be effectively estimated and one can reduce to the case where Var(ǫi) = 1 (Dicker, 2012). We conjecture that σ2 can be effectively estimated when d > n, provided sup d/n < ∞ (for sparse β, Sun and Zhang (2011) and Fan et al. (2012) have shown that σ2 can be estimated when d ≫ n). Dicker (2012) has
discussed the implications if Cov(xi) = Σ ≠ I. Essentially, when the emphasis is on prediction and non-sparse signals, if a norm-consistent estimator for Cov(xi) = Σ is available, then it is possible to reduce to the case where Cov(xi) = I; if a norm-consistent estimator is not available, then certain limitations arise, though these may not be overly restrictive (this is discussed further in Section 3.2 below).
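The reduction to Cov(xi) = I can be sketched numerically: if Σ = LLT is known (an idealization), whitening the predictors by L^{-1} returns them to the identity-covariance setting, with the signal transformed to LTβ.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 100_000
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)        # an arbitrary illustrative covariance
L = np.linalg.cholesky(Sigma)      # Sigma = L L^T
X = rng.normal(size=(n, d)) @ L.T  # rows ~ N(0, Sigma)

Z = X @ np.linalg.inv(L).T         # whitened predictors: rows ~ N(0, I)

# If y = X beta + eps, then y = Z (L^T beta) + eps, so results for identity
# covariance apply with the transformed signal L^T beta.
print(np.allclose(Z.T @ Z / n, np.eye(d), atol=0.05))
```

In practice Σ must be estimated, which is exactly where the norm-consistency caveat above enters.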
Let ||·|| = ||·||2 denote the ℓ2-norm. In this paper we study the performance of estimators β̂ for β with respect to the risk function

R(β̂, β) = Rd,n(β̂, β) = Eβ||β̂ − β||²,    (3)

where the expectation is taken over (ǫ, X) and the subscript β in Eβ indicates that y = Xβ + ǫ (below, for expectations that do not involve y, we will often omit this subscript). We emphasize that the expectation in (3) is taken over the predictors X as well as the errors ǫ, i.e. it is not conditional on X. The risk R(β̂, β) is a measure of estimation error. However, it can also be interpreted as the unconditional out-of-sample prediction error (predictive risk) associated with the estimator β̂ (Breiman and Freedman, 1983; Leeb, 2009; Stein, 1960).
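Because the expectation in (3) runs over X as well as ǫ, a Monte Carlo approximation of R(β̂, β) must redraw the design on every replication. A sketch using least squares, whose unconditional risk under (2) has the closed form d/(n − d − 1) to check against (the dimensions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, reps = 10, 50, 2000
beta = np.ones(d) / np.sqrt(d)

total = 0.0
for _ in range(reps):
    X = rng.normal(size=(n, d))             # the design is redrawn every time:
    y = X @ beta + rng.normal(size=n)       # the risk (3) is unconditional on X
    b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    total += np.sum((b_hat - beta) ** 2)
risk = total / reps

# Closed form for least squares under (2): E||b_hat - beta||^2 = d/(n - d - 1).
print(abs(risk - d / (n - d - 1)) < 0.03)
```

Conditioning on X instead would give a different (random) quantity; the unconditional risk is the one studied throughout the paper.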
2.2. Dense signals, sparse signals, and ellipsoids
Let B(c) = Bd(c) = {β ∈ Rd; ||β|| ≤ c} denote the ℓ2-ball of radius c ≥ 0. Though a
given signal β ∈ Rdis often considered to be dense if it has many nonzero entries, when
studying broader properties of dense signals and dense estimators it is common to consider
minimax problems over highly symmetric, convex (or loss-convex (Donoho and Johnstone,
1994)) parameter spaces. Following this approach, one of the primary quantities that we
use as a benchmark for evaluating estimators and determining performance limits in dense
estimation problems is the minimax risk over B(c):

R(b)(c) = R(b)d,n(c) = inf_{β̂} sup_{β∈B(c)} R(β̂, β).    (4)

The infimum on the right-hand side in (4) is taken over all measurable estimators β̂ and the superscript “b” in R(b)(c) indicates that the relevant parameter space is the ℓ2-ball.
A basic consequence of the results in this paper is R(b)(c) ≍ d/n. Thus, one must have
d/n → 0 in order to ensure consistent estimation over B(c). This is a well-known feature of
dense estimation problems and, as mentioned in Section 1, contrasts with many results on
sparse estimation that imply β may be consistently estimated when d/n → ∞. However, the
sparsity conditions on β that are required for these results may not hold in general and our
motivating interest lies precisely in such situations. In this paper we derive sharp asymptotics
for R(b)(c) and related quantities in settings where d/n → 0, d/n → ρ ∈ (0,∞), and d/n → ∞
(we assume that d → ∞ throughout). Though consistent estimation is only guaranteed when
d/n → 0, there are important situations where one might hope to analyze high-dimensional
datasets with d/n substantially larger than 0, even if there is little reason to believe that
sparsity assumptions are valid. The results in this paper provide detailed information that
may be useful in situations like these.
In addition to sparse estimation problems, minimax rates faster than d/n have also been
obtained for minimax problems over ℓ2-ellipsoids, which have been studied extensively in situa-
tions similar to those considered here (Cavalier and Tsybakov, 2002; Goldenshluger and Tsybakov,
2001, 2003; Pinsker, 1980). Much of this work has been motivated by problems in nonpara-
metric function estimation. The results in this paper are related to many of these existing
results, however, there are important differences – both in their implications and the tech-
niques used to prove them. Goldenshluger and Tsybakov’s (2001, 2003) work may be most
closely related to ours. Define the ℓ2-ellipsoid B(c, α) = {β ∈ Rd; Σ_{i=1}^d αiβi² ≤ c²}, with α = (α1,...,αd)T ∈ Rd, 0 ≤ α1 ≤ ··· ≤ αd. Goldenshluger and Tsybakov studied minimax problems over ℓ2-ellipsoids for a linear model with random predictors similar to the model considered here (in fact, Goldenshluger and Tsybakov’s results apply to infinite-dimensional non-Gaussian xi, though the xi are required to have Gaussian tails and independent coordinates). They identified asymptotically minimax estimators over B(c, α) and adaptive asymptotically minimax estimators, and showed that the minimax rate may be substantially faster than d/n. However, their results also require the axes of B(c, α) to decay rapidly (i.e. αd/c → ∞ quickly) and do not apply to ℓ2-balls B(c) = B(c, (1,...,1)T) unless d/n → 0. Though these decay conditions are natural for many inverse problems in nonparametric function estimation, they drive the improved minimax rates obtained by Goldenshluger and Tsybakov and may be overly restrictive in other settings, such as the genomics applications discussed in Section 1 above.
2.3. The sequence model
Minimax problems over restricted parameter spaces have been studied extensively in the
context of the sequence model. In the sequence model, given an index set J, one observes

zj = θj + δj,    j ∈ J,    (5)

where θ = (θj)j∈J is an unknown parameter and δ = (δj)j∈J is a random error. The sequence model is extremely flexible, and many existing results about the Gaussian sequence model (where the coordinates of δ are iid Gaussian random variables) have implications for high-dimensional linear models (Cavalier and Tsybakov, 2002; Pinsker, 1980). However, these results tend to apply in linear models where one conditions on the predictors, as opposed to random-predictor models like the one considered here.
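A toy instance of the sequence model (5) with iid Gaussian errors, together with the oracle linear shrinkage rule that underlies Pinsker-type results (the shrinkage factor uses the true ||θ||², so this is an idealized benchmark rather than a practical estimator):

```python
import numpy as np

rng = np.random.default_rng(4)
p, reps = 500, 200
theta = rng.normal(size=p)          # the unknown parameter (theta_j), j in J
s2 = theta @ theta                  # oracle knowledge of ||theta||^2
shrink = s2 / (s2 + p)              # risk-optimal linear factor when delta_j ~ N(0, 1)

raw_err = lin_err = 0.0
for _ in range(reps):
    z = theta + rng.normal(size=p)  # one draw from the sequence model (5)
    raw_err += np.sum((z - theta) ** 2) / reps
    lin_err += np.sum((shrink * z - theta) ** 2) / reps

# Linear shrinkage beats using the raw observations z as the estimate of theta.
print(lin_err < raw_err)
```

The factor ||θ||²/(||θ||² + p) minimizes E||cz − θ||² over constants c, cutting the risk roughly in half in this example.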
Since R{β̂r(c), β} − R{β̂unif(c), β} = R{β̂r(c), β} − R(e)(β) ≥ 0, Lemma 3 implies

|R{β̂r(c), β} − R(e)(β)|
≤ E tr{XXT + (d/c²)I}^{-1} − E[(s1 + 2(1 − s1/(nsn))) tr{XXT + (n(d − 2)/(c²(n − 2)))I}^{-1}]
≤ (1/n) E sn tr(XXT + (d/c²)I)^{-1} + ((d − n)/(c²(n − 2))) E tr(XXT + (d/c²)I)^{-2}.

Theorem 1 (b) follows.
Acknowledgements
The author thanks Sihai Zhao for his thoughtful comments and suggestions.
References
Abramovich, F., Benjamini, Y., Donoho, D. and Johnstone, I. (2006). Adapting to
unknown sparsity by controlling the false discovery rate. Annals of Statistics 34 584–653.
Bai, Z. (1993). Convergence rate of expected spectral distributions of large random matrices.
Part II. Sample covariance matrices. Annals of Probability 21 649–672.
Bansal, V., Libiger, O., Torkamani, A. and Schork, N. (2010). Statistical analysis
strategies for association studies involving rare variants. Nature Reviews Genetics 11 773–
785.
Baranchik, A. (1973). Inadmissibility of maximum likelihood estimators in some multiple
regression problems with three or more independent variables. Annals of Statistics 1 312–
321.
Beran, R. (1996). Stein estimation in high dimensions: A retrospective. In Research devel-
opments in probability and statistics: Festschrift in honor of Madan L. Puri on the occasion
of his 65th birthday. VSP International Science Publishers.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer.
Bickel, P. (1981). Minimax estimation of the mean of a normal distribution when the
parameter space is restricted. Annals of Statistics 9 1301–1309.
Bickel, P., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of lasso and
Dantzig selector. Annals of Statistics 37 1705–1732.
Borel, É. (1914). Introduction géométrique à quelques théories physiques. Gauthier-Villars.
Breiman, L. and Freedman, D. (1983). How many variables should be entered in a
regression equation? Journal of the American Statistical Association 78 131–136.
Brown, L. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value
problems. Annals of Mathematical Statistics 42 855–903.
Brown, L. (1990). An ancillarity paradox which appears in multiple linear regression. Annals
of Statistics 18 471–493.
Brown, L. and Gajek, L. (1990). Information inequalities for the Bayes risk. Annals of Statistics 18 1578–1594.
Brown, L. and Low, M. (1991). Information inequality bounds on the minimax risk (with
an application to nonparametric regression). Annals of Statistics 19 329–337.
Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the
lasso. Electronic Journal of Statistics 1 169–194.
Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35 2313–2351.
Cavalier, L. and Tsybakov, A. (2002). Sharp adaptation for inverse problems with ran-
dom noise. Probability Theory and Related Fields 123 323–354.
DasGupta, A. (2010). False vs. missed discoveries, Gaussian decision theory, and the Donsker-Varadhan principle. In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown. Institute of Mathematical Statistics.
Diaconis, P. and Freedman, D. (1987). A dozen de Finetti-style results in search of a theory. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 23 397–423.
Dicker, L. (2012). Dense signals, linear estimators, and out-of-sample prediction for high-
dimensional linear models. Preprint.
Donoho, D. (1995). De-noising by soft-thresholding. Information Theory, IEEE Transac-
tions on 41 613–627.
Donoho, D. and Johnstone, I. (1994). Minimax risk over ℓp-balls for ℓq-error. Probability
Theory and Related Fields 99 277–303.
Duarte, M., Davenport, M., Takhar, D., Laska, J., Sun, T., Kelly, K. and Bara-
niuk, R. (2008). Single-pixel imaging via compressive sampling. Signal Processing Maga-
zine, IEEE 25 83–91.
Erlich, Y., Gordon, A., Brand, M., Hannon, G. and Mitra, P. (2010). Compressed
genotyping. Information Theory, IEEE Transactions on 56 706–723.
Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation
in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Sta-
tistical Methodology) 74 37–65.
Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. Information Theory, IEEE Transactions on 57 5467–5484.
Friedman, J., Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). Discussion of boosting papers. Annals of Statistics 32 102–107.
Goldenshluger, A. and Tsybakov, A. (2001). Adaptive prediction and estimation in
linear regression with infinitely many parameters. Annals of Statistics 29 1601–1619.
Goldenshluger, A. and Tsybakov, A. (2003). Optimal prediction for linear regression
with infinitely many parameters. Journal of Multivariate Analysis 84 40–60.
Goldstein, D. (2009). Common genetic variation and human traits. New England Journal
of Medicine 360 1696–1698.
Hall, P., Jin, J. and Miller, H. (2009). Feature selection when there are many influential features. arXiv preprint arXiv:0911.4076.
Hirschhorn, J. (2009). Genomewide association studies – illuminating biologic pathways.
New England Journal of Medicine 360 1699–1701.
Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics 12 55–67.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability: held at the Statis-
tical Laboratory, University of California, June 20-July 30, 1960. University of California
Press.
Johnstone, I. (2011). Gaussian Estimation: Sequence and Wavelet Models. Unpublished
manuscript.
Kraft, P. and Hunter, D. (2009). Genetic risk prediction – are we there yet? New England
Journal of Medicine 360 1701–1703.
Leeb, H. (2009). Conditional predictive inference post model selection. Annals of Statistics
37 2838–2876.
Lévy, P. (1922). Leçons d'Analyse Fonctionnelle. Gauthier-Villars.
Lustig, M., Donoho, D. and Pauly, J. (2007). Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine 58 1182–1195.
Manolio, T. (2010). Genomewide association studies and assessment of the risk of disease.
New England Journal of Medicine 363 166–176.
Marčenko, V. and Pastur, L. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR–Sbornik 1 457–483.
Marchand, E. (1993). Estimation of a multivariate mean with constraints on the norm.
Canadian Journal of Statistics 21 359–366.
Pinsker, M. (1980). Optimal filtration of functions from L2 in Gaussian noise. Problems of Information Transmission 16 52–68.
Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse
estimation. Annals of Statistics 39 731–771.
Robert, C. (1990). Modified Bessel functions and their applications in probability and statistics. Statistics & Probability Letters 9 155–161.
Stam, A. (1959). Some inequalities satisfied by the quantities of information of Fisher and Shannon. Information and Control 2 101–112.
Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1.
Stein, C. (1960). Multiple regression. In Contributions to Probability and Statistics: Essays
in Honor of Harold Hotelling. Stanford University Press.
Sun, T. and Zhang, C. (2011). Scaled sparse linear regression. arXiv preprint arXiv:1104.4595.
Tikhonov, A. (1943). On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39
195–198.
Wright, J., Yang, A., Ganesh, A., Sastry, S. and Ma, Y. (2008). Robust face recog-
nition via sparse representation. IEEE Transactions on Pattern Analysis and Machine
Intelligence 31 210–227.
Zamir, R. (1998). A proof of the fisher information inequality via a data processing argument.
Information Theory, IEEE Transactions on 44 1246–1250.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals
of Statistics 38 894–942.