
arXiv:1203.4572v1 [math.ST] 20 Mar 2012

Optimal Estimation and Prediction for Dense Signals in High-Dimensional Linear Models

Lee Dicker

Department of Statistics and Biostatistics

Rutgers University

501 Hill Center, 110 Frelinghuysen Road

Piscataway, NJ 08854

e-mail: ldicker@stat.rutgers.edu

Abstract: Estimation and prediction problems for dense signals are often framed in

terms of minimax problems over highly symmetric parameter spaces. In this paper, we

study minimax problems over ℓ2-balls for high-dimensional linear models with Gaussian

predictors. We obtain sharp asymptotics for the minimax risk that are applicable in

any asymptotic setting where the number of predictors diverges and prove that ridge

regression is asymptotically minimax. Adaptive asymptotic minimax ridge estimators

are also identified. Orthogonal invariance is heavily exploited throughout the paper

and, beyond serving as a technical tool, provides additional insight into the problems

considered here. Most of our results follow from an apparently novel analysis of an

equivalent non-Gaussian sequence model with orthogonally invariant errors. As with

many dense estimation and prediction problems, the minimax risk studied here has rate

d/n, where d is the number of predictors and n is the number of observations; however,

when d ≍ n the minimax risk is influenced by the spectral distribution of the predictors

and is notably different from the linear minimax risk for the Gaussian sequence model

(Pinsker, 1980) that often appears in other dense estimation and prediction problems.

AMS 2000 subject classifications: Primary 62J05; secondary 62C20.

Keywords and phrases: adaptive estimation, asymptotic minimax, non-Gaussian se-

quence model, oracle estimators, ridge regression.

1. Introduction

This paper is about estimation and prediction problems involving non-sparse (or “dense”)

signals in high-dimensional linear models. By contrast, a great deal of recent research into

high-dimensional linear models has focused on sparsity. Though there are many notions of sparsity (e.g. ℓp-sparsity (Abramovich et al., 2006)), a vector β ∈ R^d is typically considered to be sparse if many of its coordinates are very close to 0. Perhaps one of the general principles that has emerged from the literature on sparse high-dimensional linear models may be summarized as follows: if the parameter of interest is sparse, then this can often be


leveraged to develop methods that perform very well, even when the number of predictors

is much larger than the number of observations. Indeed, powerful theoretical performance

guarantees are available for many methods developed under this paradigm, provided the parameter of interest is sparse (Bickel et al., 2009; Bunea et al., 2007; Candès and Tao, 2007;

Fan and Lv, 2011; Rigollet and Tsybakov, 2011; Zhang, 2010). Furthermore, in many applications – especially in engineering and signal processing – sparsity assumptions have been repeatedly validated (Donoho, 1995; Duarte et al., 2008; Erlich et al., 2010; Lustig et al., 2007; Wright et al., 2008). However, there is less certainty about the manifestations of sparsity in other important applications where high-dimensional data is abundant. For example, several recent papers have questioned the degree of sparsity in modern genomic datasets (see, for instance, (Hall et al., 2009), and the references contained therein – including (Goldstein, 2009; Hirschhorn, 2009; Kraft and Hunter, 2009) – and, more recently, (Bansal et al., 2010; Manolio, 2010)). In situations like these, sparse methods may be sub-optimal and methods designed for dense problems may be more appropriate.

Let d and n denote the number of predictors and observations, respectively, in a linear

regression problem. In dense estimation and prediction problems, where the parameter of

interest is not assumed to be sparse, d/n → 0 is often required to ensure consistency. Indeed,

this is the case for the problems considered in this paper. In this sense, dense problems are

more challenging than sparse problems, where consistency may be possible when d/n → ∞.

This lends credence to Friedman et al.’s (2004) “bet on sparsity” principle for high-dimensional

data analysis:

Use a procedure that does well in sparse problems, since no procedure does well in dense problems.

The “bet on sparsity” principle has proven to be very useful, especially in applications where

sparsity prevails, and it may help to explain some of the recent emphasis on understanding

sparse problems. However, the emergence of important problems in high-dimensional data

analysis where the role of sparsity is less clear highlights the importance of characterizing

and thoroughly understanding dense problems in high-dimensional data analysis. This paper

addresses some of these problems.

Minimax problems over highly symmetric parameter spaces have often been equated with

dense estimation problems in many statistical settings (Donoho and Johnstone, 1994; Johnstone,

2011). In this paper, we study the minimax risk over ℓ2-balls for high-dimensional linear

models with Gaussian predictors. We identify several informative, asymptotically equivalent

formulations of the problem and provide a complete asymptotic solution when the number of

predictors d grows large. In particular, we obtain sharp asymptotics for the minimax risk that

are applicable in any asymptotic setting where d → ∞ and we show that ridge regression estimators (Hoerl and Kennard, 1970; Tikhonov, 1943) are asymptotically minimax. Adaptive

asymptotic minimax ridge estimators are also discussed. Our results follow from carefully analyzing an equivalent non-Gaussian sequence model with orthogonally invariant errors and the

novel use of two classical tools – Brown’s identity (Brown, 1971) and Stam’s inequality (Stam,

1959) – to relate this sequence model to the Gaussian sequence model with iid errors. The results in this paper share some similarities with those found in (Goldenshluger and Tsybakov,

2001, 2003), which address minimax prediction over ℓ2-ellipsoids. However, the implications

of our results and the methods used to prove them differ substantially from Goldenshluger

and Tsybakov’s (this is discussed in more detail in Sections 2.2-2.3 below).

2. Background and preliminaries

2.1. Statistical setting

Suppose that the observed data consist of outcomes y1,...,yn ∈ R and d-dimensional predictors x1,...,xn ∈ R^d. The outcomes and predictors follow a linear model and are related via the equation

yi = xi^T β + ǫi,  i = 1,...,n,  (1)

where β = (β1,...,βd)^T ∈ R^d is an unknown parameter vector (also referred to as "the signal") and ǫ1,...,ǫn are unobserved errors. To simplify notation, let y = (y1,...,yn)^T ∈ R^n, X = (x1,...,xn)^T, and ǫ = (ǫ1,...,ǫn)^T. Then (1) may be rewritten as y = Xβ + ǫ. In many

high-dimensional settings it is natural to consider the predictors xi to be random. In this

paper, we assume that

x1,...,xn ∼ iid N(0,I)  and  ǫ1,...,ǫn ∼ iid N(0,1)  (2)

are independent, where I = Id is the d × d identity matrix. These distributional assumptions impose significant additional structure on the linear model (1). However, similar models have been studied previously (Baranchik, 1973; Breiman and Freedman, 1983; Brown, 1990; Leeb, 2009; Stein, 1960) and we believe that the insights imparted by the resulting simplifications are worthwhile. For the results in this paper, perhaps the most noteworthy simplifying consequence of the normality assumption (2) is that the distributions of X and ǫ are invariant under orthogonal transformations.

We point out that the assumption E(xi) = 0 (which is implicit in (2)) is not particularly limiting: if E(xi) ≠ 0, then we can reduce to the mean-0 case by centering and decorrelating the data. If Var(ǫi) = σ^2 ≠ 1 and σ^2 is known, then this can easily be reduced to the case where Var(ǫi) = 1. If σ^2 is unknown and d < n, then σ^2 can be effectively estimated and one can reduce to the case where Var(ǫi) = 1 (Dicker, 2012). We conjecture that σ^2 can be effectively estimated when d > n, provided sup d/n < ∞ (for sparse β, Sun and Zhang (2011) and Fan et al. (2012) have shown that σ^2 can be estimated when d ≫ n). Dicker (2012) has


discussed the implications if Cov(xi) = Σ ≠ I. Essentially, when the emphasis is prediction and non-sparse signals, if a norm-consistent estimator for Cov(xi) = Σ is available, then it is possible to reduce to the case where Cov(xi) = I; if a norm-consistent estimator is not available, then there are limitations; however, these limitations may not be overly restrictive (this is discussed further in Section 3.2 below).

Let ||·|| = ||·||2 denote the ℓ2-norm. In this paper we study the performance of estimators ˆβ for β with respect to the risk function

R(ˆβ,β) = R_{d,n}(ˆβ,β) = Eβ||ˆβ − β||^2,  (3)

where the expectation is taken over (ǫ,X) and the subscript β in Eβ indicates that y = Xβ + ǫ (below, for expectations that do not involve y, we will often omit this subscript). We emphasize that the expectation in (3) is taken over the predictors X as well as the errors ǫ, i.e. it is not conditional on X. The risk R(ˆβ,β) is a measure of estimation error. However, it can also be interpreted as the unconditional out-of-sample prediction error (predictive risk) associated with the estimator ˆβ (Breiman and Freedman, 1983; Leeb, 2009; Stein, 1960).
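Because the expectation in (3) runs over both X and ǫ, the risk can be approximated by plain simulation. The numpy sketch below is purely illustrative (the helper names and constants are ours, not the paper's):

```python
import numpy as np

def simulate(n, d, beta, rng):
    """Draw (y, X) from the linear model (1) under the Gaussian assumptions (2)."""
    X = rng.standard_normal((n, d))      # rows x_i ~ N(0, I_d), iid
    eps = rng.standard_normal(n)         # errors ~ N(0, 1), independent of X
    return X @ beta + eps, X

def risk(estimator, beta, n, d, reps=200, seed=0):
    """Monte Carlo approximation of R(bhat, beta) = E ||bhat - beta||^2 in (3);
    the average is over both X and eps (the risk is NOT conditional on X)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        y, X = simulate(n, d, beta, rng)
        errs.append(float(np.sum((estimator(y, X) - beta) ** 2)))
    return float(np.mean(errs))

# Example: the trivial estimator bhat = 0 has risk exactly ||beta||^2.
n, d = 100, 20
beta = np.ones(d) / np.sqrt(d)           # ||beta|| = 1
r_zero = risk(lambda y, X: np.zeros(d), beta, n, d)
```

For instance, the zero estimator ˆβ = 0 has risk exactly ||β||^2, which the simulation reproduces.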

2.2. Dense signals, sparse signals, and ellipsoids

Let B(c) = Bd(c) = {β ∈ R^d; ||β|| ≤ c} denote the ℓ2-ball of radius c ≥ 0. Though a given signal β ∈ R^d is often considered to be dense if it has many nonzero entries, when studying broader properties of dense signals and dense estimators it is common to consider minimax problems over highly symmetric, convex (or loss-convex (Donoho and Johnstone, 1994)) parameter spaces. Following this approach, one of the primary quantities that we use as a benchmark for evaluating estimators and determining performance limits in dense estimation problems is the minimax risk over B(c):

R(b)(c) = R(b)_{d,n}(c) = inf_{ˆβ} sup_{β∈B(c)} R(ˆβ,β).  (4)

The infimum on the right-hand side in (4) is taken over all measurable estimators ˆβ and the superscript "b" in R(b)(c) indicates that the relevant parameter space is the ℓ2-ball.

A basic consequence of the results in this paper is R(b)(c) ≍ d/n. Thus, one must have

d/n → 0 in order to ensure consistent estimation over B(c). This is a well-known feature of

dense estimation problems and, as mentioned in Section 1, contrasts with many results on

sparse estimation that imply β may be consistently estimated when d/n → ∞. However, the

sparsity conditions on β that are required for these results may not hold in general and our

motivating interest lies precisely in such situations. In this paper we derive sharp asymptotics

for R(b)(c) and related quantities in settings where d/n → 0, d/n → ρ ∈ (0,∞), and d/n → ∞


(we assume that d → ∞ throughout). Though consistent estimation is only guaranteed when

d/n → 0, there are important situations where one might hope to analyze high-dimensional

datasets with d/n substantially larger than 0, even if there is little reason to believe that

sparsity assumptions are valid. The results in this paper provide detailed information that

may be useful in situations like these.

In addition to sparse estimation problems, minimax rates faster than d/n have also been obtained for minimax problems over ℓ2-ellipsoids, which have been studied extensively in situations similar to those considered here (Cavalier and Tsybakov, 2002; Goldenshluger and Tsybakov,

2001, 2003; Pinsker, 1980). Much of this work has been motivated by problems in nonparametric function estimation. The results in this paper are related to many of these existing results; however, there are important differences – both in their implications and the techniques used to prove them. Goldenshluger and Tsybakov's (2001, 2003) work may be most

closely related to ours. Define the ℓ2-ellipsoid B(c,α) = {β ∈ R^d; Σ_{i=1}^d αiβi^2 ≤ c^2}, with α = (α1,...,αd)^T ∈ R^d, 0 ≤ α1 ≤ ··· ≤ αd. Goldenshluger and Tsybakov studied minimax problems over ℓ2-ellipsoids for a linear model with random predictors similar to the model considered here (in fact, Goldenshluger and Tsybakov's results apply to infinite-dimensional non-Gaussian xi, though the xi are required to have Gaussian tails and independent coordinates). They identified asymptotically minimax estimators over B(c,α) and adaptive asymptotically minimax estimators and showed that the minimax rate may be substantially faster than d/n. However, their results also require the axes of B(c,α) to decay rapidly (i.e. αd/c → ∞ quickly) and do not apply to ℓ2-balls B(c) = B(c,(1,...,1)^T) unless d/n → 0. Though these decay conditions are natural for many inverse problems in nonparametric function estimation, they drive the improved minimax rates obtained by Goldenshluger and Tsybakov and may be overly restrictive in other settings, such as the genomics applications discussed in Section 1 above.

2.3. The sequence model

Minimax problems over restricted parameter spaces have been studied extensively in the

context of the sequence model. In the sequence model, given an index set J,

zj = θj + δj,  j ∈ J,  (5)

are observed, θ = (θj)_{j∈J} is an unknown parameter, and δ = (δj)_{j∈J} is a random error. The

sequence model is extremely flexible, and many existing results about the Gaussian sequence

model (where the coordinates of δ are iid Gaussian random variables) have implications for

high-dimensional linear models (Cavalier and Tsybakov, 2002; Pinsker, 1980). However, these

results tend to apply in linear models where one conditions on the predictors, as opposed to

random predictor models like the one considered here.


In order to prove the main result in this paper (Theorem 1), we study a sequence model

with non-Gaussian orthogonally invariant errors that is equivalent to the linear model (1).

Goldenshluger and Tsybakov (2001) also studied a non-Gaussian sequence model that derives

from a high-dimensional linear model with random predictors, but their results have limitations in settings where d/n → ρ > 0, as discussed in Section 2.2 above. In our analysis,

orthogonal invariance is heavily exploited to obtain precise results in any asymptotic setting

where d → ∞. This appears to be a key difference between our analysis and Goldenshluger

and Tsybakov’s.

2.4. Minimax problems over ℓ2-spheres and orthogonal equivariance

Define the ℓ2-sphere of radius c, S(c) = Sd(c) = {β ∈ R^d; ||β|| = c}. Though it is common in dense estimation problems to study the minimax risk over ℓ2-balls, R(b)(c), which is one of the primary objects of study here, we find it convenient and informative to consider a closely related quantity, the minimax risk over S(c),

R(s)(c) = R(s)_{d,n}(c) = inf_{ˆβ} sup_{β∈S(c)} R(ˆβ,β)

(the superscript "s" in R(s)(c) stands for "sphere"). For our purposes, the primary significance of considering ℓ2-spheres comes from connections with orthogonal invariance and equivariance. Let O(d) denote the group of d × d orthogonal matrices.

Definition 1. An estimator ˆβ = ˆβ(y,X) for β is orthogonally equivariant if

U^T ˆβ(y,X) = ˆβ(y,XU)  (6)

for all U ∈ O(d).

Orthogonally equivariant estimators are compatible with orthogonal transformations of the

predictor basis. They may be appropriate when there is little information carried in the given

predictor basis vis-à-vis the outcome; by contrast, knowledge about sparsity is exactly one

such piece of information. Indeed, sparsity assumptions generally imply that in the given basis

some predictors are significantly more influential than others. Sparse estimators attempt to

take advantage of this to improve performance and are typically not orthogonally equivariant.

The concept of equivariance plays an important role in statistical decision theory (e.g.

(Berger, 1985), Chapter 6). However, it seems to have received relatively little attention in the

context of linear models. Significant aspects of equivariance include: (i) in certain cases, one

can show that it suffices to consider equivariant estimators when studying minimax problems

and (ii) equivariance may provide a convenient means for identifying minimax estimators. This


is basically the content of the Hunt-Stein theorem and both of these features prevail in the

present circumstances. To make this more precise, define the class of equivariant estimators

E = E(n,d) = {ˆβ; ˆβ is an orthogonally equivariant estimator for β}

and define

R(e)(β) = R(e)_{d,n}(β) = inf_{ˆβ∈E} R(ˆβ,β).

Additionally, let πc denote the uniform measure on S(c) and let

ˆβunif(c) = ˆβunif(y,X;c) = Eπc(β|y,X)

be the posterior mean of β under the assumption that β ∼ πc is independent of (ǫ,X). Since, for U ∈ O(d),

U^T ˆβunif(y,X;c) = Eπc(U^Tβ|y,X) = Eπc(β|y,XU) = ˆβunif(y,XU;c),

it follows that ˆβunif(c) ∈ E. The next result follows directly from the Hunt-Stein theorem and its proof is omitted.

Proposition 1. Suppose that ||β|| = c. Then

R(s)(c) = R(e)(β) = R{ˆβunif(c),β}.  (7)

Furthermore, if ˆβ ∈ E, then R(ˆβ,β) depends on β only through c.

In a sense, Proposition 1 completely solves the minimax problem over S(c). On the other hand, the minimax estimator ˆβunif(c) is challenging to compute and it is desirable to identify good estimators that have a simpler form. Moreover, though ˆβunif(c) solves the minimax problem over S(c), it is unclear how R(s)(c) relates to the minimax risk over ℓ2-balls, which is a more commonly studied quantity in dense estimation problems. Finally, the minimax estimator ˆβunif(c) depends on c = ||β||, which is typically unknown in practice. All of these issues must be addressed in order to identify practical estimators that perform well in dense problems for high-dimensional linear models. This is accomplished below, where we show: (i) a linear estimator (ridge regression) is asymptotically equivalent to ˆβunif(c), (ii) R(b)(c) ∼ R(s)(c) (i.e. R(b)(c)/R(s)(c) → 1), and (iii) under certain conditions c = ||β|| may be effectively estimated. Similar results have been obtained for the Gaussian sequence model with iid errors (Beran, 1996; Marchand, 1993). Our results rely on an inequality of Marchand's (Proposition 11 below) and extend Marchand's and Beran's results to linear models with Gaussian predictors.

Proposition 1 and the related discussion imply that equivariant estimators have certain nice

properties and are closely linked with dense estimation problems. On the other hand, the next


result describes some of the limitations of orthogonally equivariant estimators when d > n

and is indicative of some of the challenges inherent in dense estimation problems beyond the

consistency requirement d/n → 0.

Lemma 1. Suppose that ˆβ = ˆβ(y,X) ∈ E. Then ˆβ is orthogonal to the null-space of X.

Proof. Suppose that rank(X) = r < d and let X = UDV^T be the singular value decomposition of X, where U ∈ O(n), V ∈ O(d),

D = ( D0  0 ; 0  0 )

is an n × d matrix, and D0 is an r × r diagonal matrix with rank r. Let V0 denote the first r columns of V and let V1 denote the remaining d − r columns of V. Finally, suppose that W1 ∈ O(d − r) and let

W = ( I  0 ; 0  W1 ) ∈ O(d).

Then the null space of X is equal to the column space of V1 and it suffices to show that V1^T ˆβ = 0. By equivariance,

ˆβ = V W ˆβ(y,XVW) = V W ˆβ(y,UD).  (8)

Thus,

V1^T ˆβ = V1^T V W ˆβ(y,UD) = (0  W1) ˆβ(y,UD).  (9)

Since ˆβ(y,UD) does not depend on W1 and (9) holds for all W1 ∈ O(d − r), it follows that V1^T ˆβ = 0, as was to be shown.

Lemma 1 is a non-estimability result for orthogonally equivariant estimators. It will be used

in Sections 3.3 and 6 below.
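Lemma 1 is easy to see numerically. The sketch below is illustrative (not from the paper): it uses the minimum-norm OLS estimator ˆβols = X^+y, which is orthogonally equivariant, in a setting with d > n, and checks that its output is orthogonal to the null space of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 50                      # d > n, so X has a (d - n)-dimensional null space
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm OLS via the Moore-Penrose pseudoinverse; this estimator
# is orthogonally equivariant: bhat(y, X U) = U^T bhat(y, X).
bhat = np.linalg.pinv(X) @ y

# The null space of X is spanned by the last d - rank(X) right-singular vectors.
_, _, Vt = np.linalg.svd(X)
V1 = Vt[n:].T                      # rank(X) = n with probability 1
proj = np.abs(V1.T @ bhat).max()   # Lemma 1 predicts this is 0
```

The same check passes for any equivariant estimator, e.g. ridge regression, since its output also lies in the row space of X.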

2.5. Linear estimators: Ridge regression

Linear estimators play an important role in dense estimation problems in many statistical

settings. Fundamental references include (James and Stein, 1961; Pinsker, 1980; Stein, 1955).

Pinsker (1980) showed that under certain conditions, linear estimators in the Gaussian se-

quence model are asymptotically minimax over ℓ2-ellipsoids. In the linear model, linear es-

timators have the formˆβ = Ay, where A is a data-dependent d × n matrix, and they are

convenient because of their simplicity. Define the ridge regression estimator

ˆβr(c) = (XTX + d/c2I)−1XTy, c ∈ [0,∞].

Page 9

L. Dicker/Dense Signals and High-Dimensional Linear Models9

By convention, we takeˆβr(0) = 0 andˆβr(∞) =ˆβols= (XTX)−1XTy to be the ordinary least

squares (OLS) estimator. Furthermore, throughout the paper, if a matrix A is not invertible,

then A−1is taken to be its Moore-Penrose pseudoinverse (thus, the OLS estimator is defined

for all d,n). Clearly,ˆβr(c) is a linear estimator. Furthermore, it is easy to check thatˆβr(c) ∈ E.

Dicker (2012) studied finite sample and asymptotic properties of R{ˆβr(c),β}. Some of these

properties will be used in this paper and are summarized presently.
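As a concrete reference point, a direct numpy implementation of ˆβr(c) with the conventions above might look as follows (an illustrative sketch, not code from the paper):

```python
import numpy as np

def ridge(y, X, c):
    """Sketch of bhat_r(c) = (X^T X + (d/c^2) I)^{-1} X^T y, with the
    conventions bhat_r(0) = 0 and bhat_r(inf) = OLS via pseudoinverse."""
    n, d = X.shape
    if c == 0:
        return np.zeros(d)
    if np.isinf(c):
        return np.linalg.pinv(X) @ y       # minimum-norm OLS, defined for all d, n
    return np.linalg.solve(X.T @ X + (d / c**2) * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)

# As c grows, the penalty d/c^2 vanishes and bhat_r(c) approaches OLS.
gap_ols = np.abs(ridge(y, X, 1e6) - np.linalg.pinv(X) @ y).max()
```

One can also verify numerically that this map is orthogonally equivariant, i.e. ridge(y, X @ U, c) agrees with U.T @ ridge(y, X, c) for any orthogonal U.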

2.5.1. Oracle estimators

Define the oracle ridge regression estimator

ˆβ∗r = ˆβr(||β||).

This estimator is called an oracle estimator because it depends on ||β||, which is typically unknown. Proposition 5 of (Dicker, 2012) implies

R(ˆβ∗r,β) = inf_{c∈[0,∞]} R{ˆβr(c),β} = E tr{X^TX + (d/||β||^2)I}^{-1}  (10)

and, furthermore,

R{ˆβr(||β||),β0} ≤ R{ˆβr(||β||),β},  if ||β0|| ≤ ||β||.  (11)

The next result gives an expression for the asymptotic predictive risk of ˆβ∗r. Its proof relies heavily on properties of the Marčenko-Pastur distribution (Bai, 1993; Marčenko and Pastur, 1967).

Proposition 2 (Proposition 8 from (Dicker, 2012)). Suppose that 0 < ρ− ≤ d/n ≤ ρ+ < ∞ for some fixed constants ρ−, ρ+ ∈ R and define

r>0(ρ,c) = (1/(2ρ)) [ c^2(ρ − 1) − ρ + ( {c^2(ρ − 1) − ρ}^2 + 4c^2ρ^2 )^{1/2} ].

(a) If 0 < ρ− < ρ+ < 1 or 1 < ρ− < ρ+ < ∞, and |n − d| > 5, then

|R(ˆβ∗r,β) − r>0(d/n,||β||)| = O{ ||β||^2/(||β||^2 + 1) · n^{-1/4} }.

(b) If 0 < ρ− < 1 < ρ+ < ∞, then

|R(ˆβ∗r,β) − r>0(d/n,||β||)| = O(||β||^2 n^{-5/48}).
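The limit r>0(ρ,c) is easy to tabulate. The short sketch below (illustrative; the function names are ours) implements the display from Proposition 2 and checks a few boundary behaviors: r>0(ρ,0) = 0, r>0(ρ,c) → ρ/(1 − ρ) as c → ∞ for ρ < 1 (the limiting OLS risk), and, for small ρ, r>0(ρ,c) is close to the linear minimax risk r0(ρ,c) = c^2ρ/(c^2 + ρ) of Proposition 4 below.

```python
import numpy as np

def r_pos(rho, c):
    """r>0(rho, c) from Proposition 2 (asymptotic oracle ridge risk)."""
    t = c**2 * (rho - 1.0) - rho
    return (t + np.sqrt(t**2 + 4.0 * c**2 * rho**2)) / (2.0 * rho)

def r_0(rho, c):
    """Linear minimax risk r0(rho, c) = c^2 rho / (c^2 + rho)."""
    return c**2 * rho / (c**2 + rho)
```

For example, r_pos(0.5, c) increases from 0 at c = 0 toward 0.5/(1 − 0.5) = 1 as c grows.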


Notice that Proposition 2 implies that the asymptotic predictive risk of ˆβ∗r is non-vanishing if d/n → ρ > 0. The main results in this paper are essentially asymptotic optimality results for ˆβ∗r. In particular, we show that ˆβ∗r is asymptotically minimax over ℓ2-balls and ℓ2-spheres, and asymptotically optimal among the class of orthogonally equivariant estimators. Combined with Propositions 2-3, these results immediately yield sharp asymptotics for R(b)(c), R(s)(c), and R(e)(β).

Taking a Bayesian point-of-view, our optimality results for ˆβ∗r are not surprising. Indeed, in Section 2.4 we observed that if ||β|| = c, then ˆβunif(c) = Eπc(β|y,X) is minimax over S(c) and is optimal among orthogonally equivariant estimators for β. On the other hand, if ||β|| = c, then the oracle ridge estimator ˆβ∗r = ˆβr(c) = E_{N(0,c^2/d I)}(β|y,X) may be interpreted as the posterior mean of β under the assumption that β ∼ N{0,(c^2/d)I} is independent of ǫ and X. Furthermore, if d is large, then the normal distribution N{0,(c^2/d)I} is "close" to πc (there is an enormous body of literature that makes this idea more precise – Diaconis and Freedman (1987) attribute early work to Borel (1914) and Lévy (1922)). Thus, it is reasonable to expect that ˆβunif(c) ≈ ˆβr(c) and that, asymptotically, the oracle ridge estimator shares the optimality properties of ˆβunif(c), which is indeed the case.

2.5.2. Adaptive estimators

Adaptive ridge estimators will also be discussed in this paper. As mentioned above, ||β|| is

typically unknown; hence,ˆβ

imated by an adaptive estimator where ||β|| is replaced with an estimate – this estimator

“adapts” to the unknown quantity ||β||. Define

?||y||2

and define the adaptive ridge estimator

∗

ris typically non-implementable. However,ˆβ

∗

rmay be approx-

?

||β||

2= max

n

− 1,0

?

ˇβ

∗

r=ˆβr(?

||β||).(12)

Note that?

If n − d > 5, then

||β||

2is a consistent estimator of ||β||2, as n → ∞.

Proposition 3. Suppose that 0 < ρ−≤ d/n ≤ ρ+< 1 for some fixed constants ρ−,ρ+∈ R.

???R(ˆβ

∗

r,β) − R(ˇβ

∗

r,β)

??? = O

?

1

||β||2+ 1n−1/2

?

.


The proof of Proposition 3 is nearly identical to the proof of Proposition 10 from (Dicker, 2012) and is omitted. Proposition 3 implies that if d/n → ρ ∈ (0,1), then the adaptive ridge estimator has nearly the same asymptotic risk as the oracle ridge estimator. Note the restriction d/n < 1 in Proposition 3. This restriction also appears in (Dicker, 2012), where Var(ǫi) = σ^2 is unknown and the signal-to-noise ratio ||β||^2/σ^2, as opposed to ||β||^2, is the quantity that must be estimated to obtain an adaptive ridge estimator; in this context, d/n < 1 is a fairly natural condition for estimating σ^2. It is possible to extend Proposition 3 to settings where d/n > 1. However, if d/n > 1, then the corresponding error term in Proposition 3 is no longer uniformly bounded in ||β||^2. Additionally, notice that Proposition 3 does not apply to settings where d/n → 0. A more careful analysis may lead to extensions in this direction as well. Since adaptive estimation is not the main focus of this article, these issues are not pursued further here; however, future research into these issues may prove interesting.
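For concreteness, the adaptive ridge estimator (12) can be sketched in a few lines of numpy (illustrative only; the dimensions, signal, and seed are arbitrary choices of ours). Since E||y||^2 = n(||β||^2 + 1) under (1)-(2), ||y||^2/n − 1 is a natural estimate of ||β||^2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 200
beta = np.full(d, 3.0 / np.sqrt(d))        # true ||beta||^2 = 9
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)

# Plug-in estimate of ||beta||^2, truncated at zero as in the text.
norm2_hat = max(float(np.sum(y**2)) / n - 1.0, 0.0)

# Adaptive ridge: substitute the estimate for ||beta|| in bhat_r.
# (Here norm2_hat > 0; the convention bhat_r(0) = 0 covers the truncated case.)
beta_check = np.linalg.solve(X.T @ X + (d / norm2_hat) * np.eye(d), X.T @ y)
```

In this run the plug-in estimate lands near the true value ||β||^2 = 9, and the resulting estimation error is close to the oracle risk predicted by Proposition 2.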

2.6. Outline of the paper

The main results of the paper are stated in Section 3. Most of these results follow from

Theorem 1, which is stated at the beginning of the section. The remainder of the paper is

devoted to proving Theorem 1. In Section 4, the equivalence between the linear model and the

sequence model is formalized. The first part of Theorem 1, which applies to the setting where

d ≤ n, is proved in Section 5. This part of the proof involves converting error bounds for

the Gaussian sequence model with iid errors into useful bounds for the relevant non-Gaussian

sequence model. The second part of Theorem 1 (d > n) is proved in Section 6. When d > n,

XTX does not have full rank. The major steps in the proof for d > n involve reducing the

problem to a full rank problem.

3. Main results

The results in this section are presented in terms of the linear model. However, most have

equivalent formulations in terms of the sequence model introduced in Section 4 below.

Theorem 1. Suppose that n > 2 and let s1 ≥ ··· ≥ s_{d∧n} > 0 denote the nonzero (with probability 1) eigenvalues of (X^TX)^{-1}.

(a) If d ≤ n, then

|R(ˆβ∗r,β) − R(e)(β)| ≤ (1/d) E[ (s1/sd) tr{X^TX + (d/||β||^2)I}^{-1} ].

(b) If d > n, then

|R(ˆβ∗r,β) − R(e)(β)| ≤ (1/n) E[ (s1/sn) tr{XX^T + (d/||β||^2)I}^{-1} ] + {2(d − n)/(n − 2)} · (1/||β||^2) E tr{XX^T + (d/||β||^2)I}^{-2}.

From (10) and Proposition 1, it is clear that R(ˆβ∗r,β) and R(e)(β) are finite. Moreover, basic properties of the Wishart and inverse Wishart distributions imply that the upper bounds in Theorem 1 are finite, provided |n − d| > 1; when |n − d| ≤ 1, these bounds are infinite. However, if |n − d| ≤ 1, then the inequalities R_{d,n}(ˆβ∗r,β) ≤ R_{d,n−1}(ˆβ∗r,β) and R(e)_{d,n}(β) ≤ R(e)_{d,n−1}(β) may be combined with Theorem 1 (b) to obtain nontrivial bounds.

In what remains of this section, we discuss some of the consequences of Theorem 1 and related results in three asymptotic settings: d/n → 0 (with d → ∞, as well), d/n → ρ ∈ (0,∞), and d/n → ∞.

3.1. d/n → 0

Proposition 4. Define

r0(ρ,c) = c^2ρ/(c^2 + ρ).

If d/n → 0 and d → ∞, then

R(ˆβ∗r,β) ∼ R(e)(β) ∼ R(s)(||β||) ∼ R(b)(||β||) ∼ r0(d/n,||β||)

uniformly for β ∈ R^d.

Proof. If d + 1 < n, then (10) and Jensen's inequality imply that

(d/n)/{1 + d/(n||β||^2)} ≤ R(ˆβ∗r,β) ≤ (d/n)/{1 − (d + 1)/n + d/(n||β||^2)}.  (13)

It follows that R(ˆβ∗r,β) ∼ r0(d/n,||β||). By Theorem 1, in order to prove

R(e)(β) ∼ r0(d/n,||β||),  (14)

it suffices to show that

(1/d) E[(s1/sd) tr{X^TX + (d/||β||^2)I}^{-1}] = o{r0(d/n,||β||)}.


But this is clear:

(1/d) E[(s1/sd) tr{X^TX + (d/||β||^2)I}^{-1}] ≤ ||β||^2/{d(||β||^2 + d/n)} · E[(s1/sd)(d s1 + d/n)] = O{r0(d/n,||β||)/d} = o{r0(d/n,||β||)},  (15)

where we have used the facts E(s1^k) = O(n^{-k}) and E(sd^{-k}) = O(n^k) (Lemma A2, (Dicker, 2012)). Thus, (14) holds. Since R(s)(||β||) = R(e)(β), all that is left to prove is R(b)(||β||) ∼ R(s)(||β||). This follows because

R(s)(||β||) ≤ R(b)(||β||) ≤ R(ˆβ∗r,β) ∼ R(s)(||β||),  (16)

where we have used (11) to obtain the second inequality.
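The two-sided bound (13) can be corroborated directly by simulation through the identity (10). The numpy sketch below is illustrative only (the constants n, d, and c2 are arbitrary choices, not from the paper):

```python
import numpy as np

# Monte Carlo check of the bounds (13), using the identity (10):
# R(bhat*_r, beta) = E tr(X^T X + d/||beta||^2 I)^{-1}.
rng = np.random.default_rng(0)
n, d, c2 = 200, 20, 1.0                    # c2 plays the role of ||beta||^2
vals = []
for _ in range(300):
    X = rng.standard_normal((n, d))
    M = X.T @ X + (d / c2) * np.eye(d)
    vals.append(np.trace(np.linalg.inv(M)))
mc_risk = float(np.mean(vals))

lower = (d / n) / (1 + d / (n * c2))                  # left side of (13)
upper = (d / n) / (1 - (d + 1) / n + d / (n * c2))    # right side of (13)
```

With d/n = 0.1 the two bounds are already tight (roughly 0.091 and 0.101 here), consistent with the rate-d/n behavior of the minimax risk.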

The asymptotic risk r0(ρ,c) appears frequently in the analysis of linear estimators for the Gaussian sequence model (Pinsker, 1980) and is often referred to as the "linear minimax risk." The condition d → ∞ in Proposition 4 is important because it drives the approximation πc ≈ N(0, c^2/d I), which enables us to conclude R(e)(β) ∼ R(ˆβ∗r,β) (re: the discussion at the end of Section 2.4). Notice that lim_{ρ→0} r0(ρ,c) = 0. Thus, the minimax risk vanishes when d/n → 0.

Proposition 4 implies that the ridge estimator ˆβ∗r is asymptotically minimax if d/n → 0 and d → ∞. On the other hand, other simple linear estimators are also asymptotically minimax in this setting. Define the estimator

ˆβ∗scal = [{1 − (d + 1)/n} / {1 − (d + 1)/n + d/(n||β||^2)}] ˆβols.

Note that ˆβ∗scal is a scalar multiple of the OLS estimator and that ˆβ∗scal is defined for all d, n since ˆβols is defined using pseudoinverses. Various versions of ˆβscal have been studied previously (Baranchik, 1973; Brown, 1990; Stein, 1960). Dicker (2012) showed that if d + 1 < n, then

R(ˆβ∗r,β) ≤ R(ˆβ∗scal,β) = (d/n)/{1 − (d + 1)/n + d/(n||β||^2)} ≤ R(ˆβols,β) = (d/n)/{1 − (d + 1)/n}.  (17)

The following corollary to Proposition 4 follows immediately.


Corollary 1. (a) If d/n → 0 and d → ∞, then

R(ˆβ∗scal,β) ∼ R(b)(||β||)

uniformly for β ∈ R^d.

(b) If d/n → 0, d → ∞, and d/(n||β||^2) → s ≥ 0, then

R(ˆβols,β)/R(b)(||β||) → 1 + s.

In other words, if d/n → 0 and d → ∞, then ˆβ∗scal is asymptotically minimax over ℓ2-balls (and, moreover, asymptotically equivalent to ˆβ∗r). Furthermore, the OLS estimator may be asymptotically minimax over ℓ2-balls, but this depends on the magnitude of the signal β: if ||β||^2 is large, then the OLS estimator is asymptotically minimax; if ||β||^2 is small, then it is not.

3.2. d/n → ρ ∈ (0,∞)

The setting where d/n → ρ ∈ (0,∞) may be the most interesting one for the dense estimation problems considered here. The minimax risk is non-vanishing in this setting; however, informative closed-form expressions for the limiting minimax risk are available. Moreover, differences between the linear estimators ˆβ∗scal and ˆβ∗r which are insignificant when d/n → 0 become pronounced when d/n → ρ ∈ (0,∞). These differences are largely attributable to the spectral distribution of n^{-1}X^TX, which is asymptotically trivial if d/n → 0 and converges to the Marčenko-Pastur distribution (Marčenko and Pastur, 1967) if d/n → ρ ∈ (0,∞).

Proposition 5. Suppose that ρ ∈ (0,∞) and let R∗(β) denote any of R(ˆβ∗r,β), R(e)(β), R(s)(||β||), or R(b)(||β||). If ρ ≠ 1, then

lim_{d/n→ρ} sup_{β∈R^d} |R∗(β) − r>0(d/n,||β||)| = 0,  (18)

where r>0(ρ,c) is defined in Proposition 2 above. Furthermore, as d/n → ρ,

R(ˆβ∗r,β) ∼ R(e)(β) ∼ R(s)(||β||) ∼ R(b)(||β||) ∼ r>0(d/n,||β||).  (19)

If ρ ≠ 1, then the implied convergence in (19) holds uniformly for β ∈ R^d; if ρ = 1, then the convergence is uniform over B(c) for any fixed c ∈ (0,∞).


Proof. Proposition 2 implies that |R(ˆβ∗_r, β) − r>0(d/n, ||β||)| → 0 and R(ˆβ∗_r, β) ∼
r>0(d/n, ||β||), with the appropriate uniformity conditions when ρ ≠ 1 or ρ = 1. For ρ ≤ 1,
the asymptotic equivalences |R^(e)(β) − R(ˆβ∗_r, β)| → 0 and R^(e)(β) ∼ R(ˆβ∗_r, β) follow
from (13) and (15); to prove the equivalences for ρ > 1, notice that

    (1/n) E[(s_1/s_n) tr(XX^T + d/c^2 I)^{-1}] + {2(d − n)/(c^2(n − 2))} E tr(XX^T + d/c^2 I)^{-2}
        = O( ||β||^2 / {n(||β||^2 + 1)} ),

where c = ||β||. Since R^(e)(β) = R^(s)(||β||), it suffices to show that

    lim_{d/n→ρ} sup_{β∈R^d} |R^(s)(||β||) − R^(b)(||β||)| = 0

and that R^(s)(||β||) ∼ R^(b)(||β||) uniformly for β ∈ R^d in order to prove the proposition; both
follow from (16).

Two types of asymptotic equivalence are addressed in Proposition 5: differences (18) and

quotients (19). The equivalence (18) is more informative for large ||β||; (19) is more informative

for small ||β||. Notice that for fixed ||β|| = c ∈ (0,∞), lim_{d/n→ρ} r>0(d/n, c) = r>0(ρ, c) > 0,
and it follows that (18) and (19) are equivalent.

For d/n → 0, we saw that ˆβ∗_scal and ˆβ∗_r were asymptotically equivalent (and that, in some
instances, both were also asymptotically equivalent to the OLS estimator; Corollary 1). When
d/n → ρ ∈ (0,∞), ˆβ∗_r and ˆβ∗_scal are not asymptotically equivalent. Indeed, (17) implies
that, just as for d/n → 0, we have

    R(ˆβ∗_scal, β) ∼ r_scal(d/n, ||β||),

where

    r_scal(ρ, c) = ρ / (1 − ρ + ρ/c^2).

One easily checks that for ρ > 0, r>0(ρ, c) ≤ r_scal(ρ, c), with equality if and only if c = 0. Thus,
ˆβ∗_scal is not asymptotically minimax over ℓ2-balls when d/n → ρ ∈ (0,∞).

Despite its suboptimal performance, the estimator ˆβ∗_scal may be useful in certain situations.
Indeed, if Cov(x_i) = Σ ≠ I, then it is straightforward to implement a modified version of
ˆβ∗_scal with similar properties (replace ||β||^2 in ˆβ∗_scal with β^T Σβ); on the other hand, if Σ is
unknown and a norm-consistent estimator for Σ is not available, then this may have a more
dramatic effect on the ridge estimator ˆβ∗_r. This is discussed in detail in Dicker (2012), where
it is argued that in dense problems where little is known about Cov(x_i), an appropriately
modified version of ˆβ∗_scal is a reasonable alternative to ridge regression (note, for instance,
that R(ˆβ∗_scal, β)/R(ˆβ∗_r, β) = O(1) if d/n → ρ ∈ (0,∞)).


3.3. d/n → ∞

Theorem 1 plays a crucial role in our asymptotic analysis when d/n → ρ < ∞. It is less

relevant in the setting where d/n → ∞. Instead, Lemma 1 from Section 2.4 plays the key

role. We have the following proposition.

Proposition 6. Suppose that d > n and that ˆβ ∈ E. Then

    R(ˆβ, β) ≥ {(d − n)/d} ||β||^2.

Proof. Let X = UDV^T be the singular value decomposition of X, as in the proof of Lemma
1. Let V_0 and V_1 be the first r and the remaining d − r columns of V, respectively, where
r = rank(X) (note that r = n with probability 1). Then

    R(ˆβ, β) = E||ˆβ − β||^2
             = E||V_0^T(ˆβ − β)||^2 + E||V_1^T β||^2
             ≥ E||V_1^T β||^2                                (20)
             = {(d − n)/d} ||β||^2,                          (21)

where (20) follows from Lemma 1 and (21) follows from symmetry.
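Proposition 6 can be illustrated with any orthogonally equivariant estimator; ridge regression ˆβ_λ = (X^T X + λI)^{-1}X^T y is one example. The sketch below (an illustrative simulation with assumed settings, not from the paper) checks that its Monte Carlo risk respects the bound (d − n)||β||^2/d when d > n:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, reps = 100, 400, 200
beta = np.ones(d) / np.sqrt(d)   # ||beta||^2 = 1
lam = d                          # ridge penalty; lam = d/||beta||^2 is the oracle choice

risks = []
for _ in range(reps):
    X = rng.standard_normal((n, d))
    y = X @ beta + rng.standard_normal(n)
    # Ridge via the kernel identity X^T (X X^T + lam I)^{-1} y,
    # which only requires an n x n solve when d > n
    beta_hat = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    risks.append(np.sum((beta_hat - beta) ** 2))

mc_risk = np.mean(risks)
lower_bound = (d - n) / d        # (d - n)/d * ||beta||^2 = 0.75 here
```

The estimated risk sits between the equivariant lower bound 0.75 and the null risk ||β||^2 = 1, as Propositions 6 and 7 predict.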

The proof of Proposition 6 essentially implies that for d > n, the squared bias of an equiv-
ariant estimator must be at least ||β||^2(d − n)/d. This highlights one of the major challenges
in high-dimensional dense estimation problems, especially in settings where d ≫ n. The next
proposition, which is the main result in this subsection, implies that if d/n → ∞, then the
trivial estimator ˆβ_null = 0 is asymptotically minimax. In a sense, this means that in dense
problems β is completely non-estimable when d/n → ∞.

Proposition 7. Let ˆβ_null = 0. Then R(ˆβ_null, β) = ||β||^2. Furthermore, if d/n → ∞, then

    R(ˆβ∗_r, β) ∼ R^(e)(β) ∼ R^(s)(||β||) ∼ R^(b)(||β||) ∼ R(ˆβ_null, β) ∼ ||β||^2

uniformly for β ∈ R^d.

Proof. Clearly, R(ˆβ_null, β) = ||β||^2. It follows from Proposition 6 that for d > n,

    {(d − n)/d} ||β||^2 ≤ R^(e)(β) = R^(s)(||β||) ≤ R^(b)(||β||) ≤ R(ˆβ∗_r, β) ≤ R(ˆβ_null, β) = ||β||^2.

The proposition follows by dividing by ||β||^2 and taking d/n → ∞.


3.4. Adaptive estimators

The results in Sections 3.1–3.3 imply that the oracle ridge estimator ˆβ∗_r = ˆβ_r(||β||) is
asymptotically minimax over ℓ2-balls and ℓ2-spheres and is asymptotically optimal among
equivariant estimators for β in any asymptotic setting where d → ∞. The next result describes
asymptotic optimality properties of the adaptive ridge estimator ˇβ∗_r (defined in (12)), which
does not depend on ||β||.

Proposition 8. Suppose that ρ ∈ (0,1) and let R∗(β) denote any of R(ˆβ∗_r, β), R^(e)(β),
R^(s)(||β||), R^(b)(||β||), or r>0(d/n, ||β||). Let {a_n}_{n=1}^∞ ⊆ R denote a sequence of positive
real numbers such that a_n n^{1/2} → ∞. Then

    lim_{d/n→ρ} sup_{β∈R^d} |R∗(β) − R(ˇβ∗_r, β)| = 0    and    lim_{d/n→ρ} sup_{||β||^2 ≥ a_n} R(ˇβ∗_r, β)/R∗(β) = 1.

Proposition 8 follows immediately from Propositions 3 and 5. The restriction ||β||^2 ≫ n^{-1/2}
in the second part of Proposition 8 is related to the fact that for d/n → ρ ∈ (0,∞), R(ˆβ∗_r, β) =
O(||β||^2) and the error bound in Proposition 3 is O(n^{-1/2}). As discussed in Section 2.5.2, more
detailed results on adaptive ridge estimators are likely possible (that may apply, for instance,
in settings where d/n → 0 or d/n → ρ ≥ 1), but this is not pursued further here.

4. An equivalent sequence model

The rest of the paper is devoted to proving Theorem 1. In this section and Section 5, we

assume that d ≤ n. In Section 6, we address the case where d > n. The major goal in this

section is to relate the linear model (1) to an equivalent non-Gaussian sequence model.

4.1. The model

Let Σ be a random orthogonally invariant m × m positive definite matrix with rank m, almost
surely (by orthogonally invariant, we mean that Σ and UΣU^T have the same distribution for
any U ∈ O(m)). Additionally, let δ_0 ∼ N(0, I_m) be an m-dimensional Gaussian random vector
that is independent of Σ. Recall that in the sequence model (5), the vector z = (z_j)_{j∈J} =
θ + δ is observed and J is an index set. In the formulation considered here, J = {1, ..., m},
δ = Σ^{1/2}δ_0, and Σ is observed along with z. Thus, the available data are (z, Σ) and

    z = θ + δ = θ + Σ^{1/2}δ_0 ∈ R^m.    (22)


Notice that δ is in general non-Gaussian. However, conditional on Σ, δ is a Gaussian random
vector with covariance Σ. We are interested in the risk for estimating θ under squared error
loss. For an estimator ˆθ = ˆθ(z, Σ), this is defined by

    ˜R(ˆθ, θ) = E_θ||ˆθ(z, Σ) − θ||^2 = E_θ||ˆθ − θ||^2,

where the expectation is taken with respect to δ_0 and Σ (we use "˜," as in ˜R, to denote
quantities related to the sequence model, as opposed to the linear model).

4.2. Equivariance and optimality concepts

Most of the key concepts initially introduced in the context of the linear model have analogues

in the sequence model (22). In this subsection, we describe some that will be used in our proof

of Theorem 1.

Definition 2. Let ˆθ = ˆθ(z, Σ) be an estimator for θ. Then ˆθ is an orthogonally equivariant
estimator for θ if

    Uˆθ(z, Σ) = ˆθ(Uz, UΣU^T)

for all U ∈ O(m).

Let

    ˜E = ˜E_d = {ˆθ; ˆθ is an orthogonally equivariant estimator for θ}

denote the class of orthogonally equivariant estimators for θ. Also define the posterior mean
for θ under the assumption that θ ∼ π_c,

    ˆθ_unif(c) = E_{π_c}(θ|z, Σ),

and the posterior mean under the assumption that θ ∼ N(0, c^2/m I),

    ˆθ_r(c) = E_{N(0, c^2/m I)}(θ|z, Σ) = c^2/m (Σ + c^2/m I)^{-1} z

(for both of these Bayes estimators we assume that θ is independent of δ_0 and Σ). The
estimators ˆθ_unif(c) and ˆθ_r(c) for θ are analogous to the estimators ˆβ_unif(c) and ˆβ_r(c) for β,
respectively. Moreover, they are both orthogonally equivariant, i.e. ˆθ_unif(c), ˆθ_r(c) ∈ ˜E, and
ˆθ_r(c) is a linear estimator. Now define the minimal equivariant risk for the sequence model

    ˜R^(e)(θ) = ˜R^(e)_m(θ) = inf_{ˆθ∈˜E} ˜R(ˆθ, θ)


and the minimax risk over the ℓ2-sphere of radius c,

    ˜R^(s)(c) = ˜R^(s)_m(c) = inf_{ˆθ} sup_{θ∈S(c)} ˜R(ˆθ, θ),

where the infimum above is taken over all measurable estimators ˆθ = ˆθ(z, Σ). The Hunt-Stein
theorem yields the following result, which is entirely analogous to Proposition 1.

Proposition 9. Suppose that ||θ|| = c. Then

    ˜R^(s)(c) = ˜R^(e)(θ) = ˜R{ˆθ_unif(c), θ}.

Furthermore, if ˆθ ∈ ˜E, then ˜R(ˆθ, θ) depends on θ only through c.

4.3. Equivalence of the sequence model and the linear model

The next proposition helps characterize the equivalence between the linear model (1) and the

sequence model (22).

Proposition 10. Suppose that d ≤ n, m = d, and Σ = (X^T X)^{-1}.

(a) If β = θ and z = (X^T X)^{-1}X^T y, then ˆβ_unif(c) = ˆθ_unif(c), ˆβ_r(c) = ˆθ_r(c), and
R{ˆβ_r(c), β} = ˜R{ˆθ_r(c), θ}.

(b) If ||θ|| = ||β|| = c, then

    R{ˆβ_unif(c), β} = R^(e)(β) = R^(s)(c) = ˜R^(s)(c) = ˜R^(e)(θ) = ˜R{ˆθ_unif(c), θ}.

Part (a) of Proposition 10 is obvious; part (b) follows from the fact that ((X^T X)^{-1}X^T y,
(X^T X)^{-1}) is sufficient for β and the Rao-Blackwell inequality. Proposition 10 implies that it
suffices to consider the sequence model in order to prove Theorem 1.
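Part (a) can also be checked directly: with Σ = (X^T X)^{-1} and z the OLS estimate, the sequence-model rule ˆθ_r(c) = c^2/m (Σ + c^2/m I)^{-1} z reproduces ridge regression with penalty d/c^2, which is consistent with the risk formulas for ˆβ_r(c) used in Section 6. A small numerical sketch (assuming these forms, with arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, c = 50, 10, 2.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

G = X.T @ X
Sigma = np.linalg.inv(G)               # Sigma = (X^T X)^{-1}
z = Sigma @ X.T @ y                    # z = OLS estimate of beta

m = d
theta_r = (c**2 / m) * np.linalg.solve(Sigma + (c**2 / m) * np.eye(m), z)
beta_r = np.linalg.solve(G + (d / c**2) * np.eye(d), X.T @ y)   # ridge, penalty d/c^2
# theta_r and beta_r coincide, as in Proposition 10 (a)
```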

5. Proof of Theorem 1 (a): Normal approximation for the uniform prior

It follows from Proposition 9 that the Bayes estimator ˆθ_unif(c) is optimal among all orthog-
onally equivariant estimators for θ, if ||θ|| = c. In this section, we prove Theorem 1 (a) by
bounding

    |˜R{ˆθ_r(c), θ} − ˜R{ˆθ_unif(c), θ}|    (23)

and applying Proposition 10. Marchand (1993) studied the relationship between ˆθ_unif(c) and
ˆθ_r(c) under the assumption that ||θ|| = c and Σ = τ^2 I (i.e. in the Gaussian sequence model
with iid errors). Marchand proved the following result, which is one of the keys to the proof of
Theorem 1 (a).


Proposition 11 (Theorem 3.1 from Marchand (1993)). Suppose that Σ = τ^2 I and ||θ|| = c.
Then

    |˜R{ˆθ_r(c), θ} − ˜R{ˆθ_unif(c), θ}| ≤ (1/m) · {c^2 τ^2 m/(c^2 + τ^2 m)} = (1/m) ˜R{ˆθ_r(c), θ}.

Thus, in the Gaussian sequence model with iid errors, the risk of ˆθ_r(c) is nearly as small as
that of ˆθ_unif(c). Marchand's result relies on somewhat delicate calculations involving modified
Bessel functions (Robert, 1990). A direct approach to bounding (23) for general Σ might
involve attempting to mimic these calculations. However, this seems daunting (Bickel, 1981).
Brown's identity, which relates the risk of a Bayes estimator to the Fisher information, allows
us to sidestep these calculations and apply Marchand's result directly.

Define the Fisher information of a random vector ξ ∈ R^m, with density f_ξ (with respect to
Lebesgue measure on R^m), by

    I(ξ) = ∫_{R^m} {∇f_ξ(t)∇f_ξ(t)^T / f_ξ(t)} dt,

where ∇f_ξ(t) is the gradient of f_ξ(t). Brown's identity has typically been used for univari-
ate problems or problems in the sequence model with iid Gaussian errors (Bickel, 1981;
Brown and Gajek, 1990; Brown and Low, 1991; DasGupta, 2010). The next proposition is
a straightforward generalization to the correlated multivariate Gaussian setting. Its proof is
based on Stein's lemma.

Proposition 12 (Brown’s Identity). Suppose that rank(Σ) = m, with probability 1. Let

IΣ(θ + Σ1/2δ0) denote the Fisher information of θ + Σ1/2δ0, conditional on Σ, under the

assumption that θ ∼ πcis independent of δ0and Σ. If ||θ|| = c, then

˜R{ˆθunif(c),θ} = Etr(Σ) − Etr?Σ2IΣ(θ + Σ1/2δ)?.

Proof. Suppose that c = ||θ|| and let

    f(z) = ∫_{S(c)} (2π)^{-m/2} det(Σ)^{-1/2} e^{-(1/2)(z−θ)^T Σ^{-1}(z−θ)} dπ_c(θ)

be the density of z = θ + Σ^{1/2}δ_0, conditional on Σ and under the assumption that θ ∼ π_c.
Then

    ˆθ_unif(c) = E_{π_c}(θ|z, Σ) = z − E_{π_c}(Σ^{1/2}δ_0|z, Σ) = z + Σ∇f(z)/f(z).


It follows that

    E||ˆθ_unif(c) − θ||^2 = E|| Σ^{1/2}δ_0 + Σ∇f(z)/f(z) ||^2
        = E tr(Σ) + 2E{δ_0^T Σ^{3/2}∇f(z)/f(z)} + E{∇f(z)^T Σ^2 ∇f(z)/f(z)^2}
        = E tr(Σ) + 2E{δ_0^T Σ^{3/2}∇f(z)/f(z)} + E tr{Σ^2 I_Σ(θ + Σ^{1/2}δ_0)}.    (24)

By Stein's lemma (integration by parts),

    E{δ_0^T Σ^{3/2}∇f(z)/f(z)} = E tr{Σ^2 ∇^2 log f(z)} = −E tr{Σ^2 I_Σ(θ + Σ^{1/2}δ_0)}.    (25)

Brown's identity follows by combining (24) and (25).

Using Brown’s identity, Fisher information bounds may be converted to risk bounds, and

vice-versa. Its usefulness in the present context springs from (i) the decomposition

z = θ + Σ1/2δ0=?θ + (γsm)1/2δ1

where δ1,δ2

∼ N(0,Im) are independent of Σ, smis the smallest eigenvalue of Σ, and 0 < γ <

1 is a constant and (ii) Stam’s inequality for the Fisher information of sums of independent

random variables.

?+ (Σ − γsm)1/2δ2, (26)

iid

Proposition 13 (Stam’s inequality; this version due to Zamir (1998)). Let v,w ∈ Rmbe

independent random variables that are absolutely continuous with respect to Lebesgue measure

on Rm. For every m × m positive definite matrix Σ,

tr?Σ2I(v + w)?≤ tr

Notice that conditional on Σ, the term θ + (γsm)1/2δ1in (26) may be viewed as an ob-

servation from the Gaussian sequence model with iid errors. The necessary bound on (23) is

obtained by piecing together Brown’s identity, the decomposition (26), and Stam’s inequality,

so that Marchand’s inequality (Proposition 11) may be applied to θ + (γsm)1/2δ1.

?

Σ2?I(v)−1+ I(w)−1?−1?

.


Proposition 14. Suppose that Σ has rank m with probability 1 and that ||θ|| = c. Let
s_1 ≥ ··· ≥ s_m ≥ 0 denote the eigenvalues of Σ. Then

    |˜R{ˆθ_r(c), θ} − ˜R{ˆθ_unif(c), θ}| ≤ (1/m) E[ (s_1/s_m) tr{Σ^{-1} + m/c^2 I}^{-1} ].

Proof. It is straightforward to check that

    ˜R{ˆθ_r(c), θ} = E tr(Σ^{-1} + m/c^2 I)^{-1}.    (27)

Thus, Brown’s identity and (27) imply

˜R{ˆθr(c),θ} −˜R{ˆθunif(c),θ} = Etr?Σ2IΣ(θ + δ)?

+Etr(Σ−1+ m/c2I)−1− Etr(Σ)

= Etr?Σ2IΣ(θ + δ)?

−Etr?Σ2(Σ + c2/mI)−1?.

Taking v = θ + (γsm)1/2δ1and w = (Σ − γsm)1/2δ2in Stam’s inequality, where δ1, δ2, and

0 < γ < 1 are given in (26), one obtains

˜R{ˆθr(c),θ} −˜R{ˆθunif(c),θ} ≤ Etr

?

Σ2?

+Σ − γsmI

−Etr?Σ2(Σ + c2/mI)−1?

IΣ{θ + (γsm)1/2δ1}−1

?−1?

By orthogonal invariance, I_Σ{θ + (γ s_m)^{1/2}δ_1} = ζ I_m for some ζ ≥ 0. Marchand's inequality,
another application of Brown's identity, and (27) with Σ = γ s_m I_m imply that

    ζ ≤ {1/(γ s_m)} · (γ s_m + c^2/m^2)/(γ s_m + c^2/m).

Since

    1/ζ − γ s_m ≥ (m − 1) γ s_m c^2/(γ s_m m^2 + c^2),

it follows that

    ˜R{ˆθ_r(c), θ} − ˜R{ˆθ_unif(c), θ} ≤ E tr[ Σ^2 { Σ + (m − 1) γ s_m c^2/(γ s_m m^2 + c^2) I }^{-1} ]
        − E tr{Σ^2(Σ + c^2/m I)^{-1}}.


Taking γ ↑ 1,

    ˜R{ˆθ_r(c), θ} − ˜R{ˆθ_unif(c), θ} ≤ E tr[ Σ^2 { Σ + (m − 1) s_m c^2/(s_m m^2 + c^2) I }^{-1} ]
        − E tr{Σ^2(Σ + c^2/m I)^{-1}}
        ≤ (1/m) E[ (s_1/s_m) tr{Σ^{-1} + m/c^2 I}^{-1} ].

The proposition follows because ˜R{ˆθ_unif(c), θ} ≤ ˜R{ˆθ_r(c), θ}.

Theorem 1 (a) follows immediately from Propositions 10 and 14.
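Display (27) is also easy to verify by simulation: taking Σ = (X^T X)^{-1} for a Gaussian X (an orthogonally invariant choice, as in Section 4.1), the conditional risk of ˆθ_r(c) given Σ, averaged over Σ, matches E tr(Σ^{-1} + m/c^2 I)^{-1}. A sketch of such a check (illustrative settings only):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_obs, c, reps = 20, 60, 1.5, 3000
theta = np.zeros(m)
theta[0] = c                               # fixed theta with ||theta|| = c
I = np.eye(m)

mc, formula = [], []
for _ in range(reps):
    X = rng.standard_normal((n_obs, m))
    Sigma = np.linalg.inv(X.T @ X)         # random orthogonally invariant Sigma
    Minv = np.linalg.inv(Sigma + (c**2 / m) * I)
    # Exact conditional risk of theta_r(c) given Sigma: squared bias + variance
    bias = Sigma @ Minv @ theta
    var = (c**2 / m) ** 2 * np.trace(Minv @ Sigma @ Minv)
    mc.append(bias @ bias + var)
    formula.append(np.trace(np.linalg.inv(X.T @ X + (m / c**2) * I)))

# Averaged over Sigma, the two quantities agree, as in (27)
```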

6. Proof of Theorem 1 (b): d > n

It only remains to prove Theorem 1 (b), which is achieved through a sequence of lemmas.

The first step of the proof focuses on the linear model (as opposed to the sequence model)

and on reducing the problem where d > n and XTX is not invertible to a full rank problem.

This step builds on Lemma 1 from Section 2.4.

Suppose that d > n and let X = UDV^T be the singular value decomposition of X, where
U ∈ O(n), V ∈ O(d), D = (D_0 0), and D_0 is a rank n diagonal matrix (with probability 1).
Let W ∈ O(n) be uniformly distributed on O(n) (according to Haar measure) and independent
of ǫ and X. Define the n × n matrix X_0 = UD_0W^T and consider the full rank linear model

    y_0 = X_0β_0 + ǫ,    (28)

where β_0 ∈ R^n. Notice that unlike X, the entries in X_0 are not iid N(0,1). However, X_0^T X_0
is orthogonally invariant. As with the linear model (1), one can consider estimators ˆβ_0 =
ˆβ_0(y_0, X_0) for β_0 and compute the risk

    R_0(ˆβ_0, β_0) = E_{β_0}||ˆβ_0 − β_0||^2,    (29)

where the expectation in (29) is taken over ǫ and X_0. We have the following lemma.

Lemma 2. Suppose that d > n, ||β|| = c, and ˆβ ∈ E(n,d). Let P_0 denote any fixed n × d
projection matrix with orthogonal rows. Then there is an orthogonally equivariant estimator
P_0ˆβ ∈ E(n,n) such that

    R(ˆβ, β) = ∫_{S_d(c)} R_0(P_0ˆβ, P_0b) dπ_c(b) + {(d − n)/d} c^2.


Proof. As above, let X = UDV^T be the singular value decomposition of X. Let V_0 denote
the first n columns of V and let V_1 denote the remaining d − n columns of V. By (8),

    ˆβ(y, X) = V_0ˆβ_0(y, UD_0),

where P_0ˆβ(y, UD_0) = ˆβ_0(y, UD_0) is the first n coordinates of ˆβ(y, UD). Furthermore, it is
easy to check that P_0ˆβ is orthogonally equivariant, i.e. P_0ˆβ ∈ E(n,n). Thus,

    R(ˆβ, β) = E_β||ˆβ_0(y, UD_0) − V_0^T β||^2 + E_β||V_1^T β||^2
             = E_β||ˆβ_0(y, UD_0) − V_0^T β||^2 + {(d − n)/d} c^2.

To prove the lemma, it suffices to show that

    E_β||ˆβ_0(y, UD_0) − V_0^T β||^2 = ∫_{S_d(c)} R_0(ˆβ_0, P_0b) dπ_c(b).

By Proposition 1, orthogonal invariance of π_c, and orthogonal equivariance of ˆβ_0,

    E_β||ˆβ_0(y, UD_0) − V_0^T β||^2 = ∫_{S_d(c)} E_b||ˆβ_0(y, UD_0) − V_0^T b||^2 dπ_c(b)
        = E[ ∫_{S_d(c)} ||ˆβ_0(UD_0V_0^T b + ǫ, UD_0) − V_0^T b||^2 dπ_c(b) ]
        = E[ ∫_{S_d(c)} ||ˆβ_0(UD_0W^T P_0b + ǫ, UD_0) − W^T P_0b||^2 dπ_c(b) ]
        = ∫_{S_d(c)} E||ˆβ_0(y_0, X_0) − P_0b||^2 dπ_c(b),

as was to be shown.

Lemma 2 allows us to express the risk of an equivariant estimator for β in the linear model
(1) with d > n in terms of the risk of another equivariant estimator in a different linear model
(28) with d = n. Though the linear model (28) differs from the original linear model with
Gaussian predictors – thus, Theorem 1 (a) does not apply directly – (28) is equivalent to the
sequence model (22), with m = n and Σ = (X_0^T X_0)^{-1}.


Lemma 3. Suppose that 2 < m = n < d and Σ = (X_0^T X_0)^{-1} in the sequence model (22).
Also suppose that ||β|| = c. Let P_0 be a fixed n × d projection matrix with orthogonal rows
and let s_1 ≥ ··· ≥ s_n ≥ 0 denote the eigenvalues of (X_0^T X_0)^{-1}. Then

    R{ˆβ_unif(c), β} ≥ ∫_{S_d(c)} ˜R{ˆθ_unif(P_0t), P_0t} dπ_c(t) + {(d − n)/d} c^2
        ≥ ∫_{S_d(c)} E[ {1 − s_1/(n s_n)} tr{XX^T + n/||P_0t||^2 I}^{-1} ] dπ_c(t) + {(d − n)/d} c^2
        ≥ E[ {1 − s_1/(n s_n)} tr{XX^T + n(d − 2)/(c^2(n − 2)) I}^{-1} ] + {(d − n)/d} c^2.

Proof. The first inequality follows from Lemma 2 and a suitably modified version of Proposi-
tion 10 that describes the equivalence between the linear model (28) and the sequence model
(22). The second inequality follows from Proposition 14 and the fact that X_0^T X_0 and XX^T
have the same eigenvalues:

    ˜R{ˆθ_unif(P_0t), P_0t} ≥ ˜R{ˆθ_r(P_0t), P_0t} − (1/n) E[ (s_1/s_n) tr{X_0^T X_0 + n/||P_0t||^2 I}^{-1} ]
        = E[ {1 − s_1/(n s_n)} tr{X_0^T X_0 + n/||P_0t||^2 I}^{-1} ]
        = E[ {1 − s_1/(n s_n)} tr{XX^T + n/||P_0t||^2 I}^{-1} ].

The last inequality in the lemma follows from Jensen's inequality and the identity

    ∫_{S_d(c)} {1/||P_0t||^2} dπ_c(t) = (d − 2)/(c^2(n − 2)).
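The sphere-average identity used in this last step can be checked by simulation, since ||P_0t||^2/c^2 follows a Beta(n/2, (d − n)/2) distribution when t is uniform on S_d(c). A quick Monte Carlo sketch (illustrative values of d, n, c):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, c, N = 20, 10, 1.0, 200000

t = rng.standard_normal((N, d))
t *= c / np.linalg.norm(t, axis=1, keepdims=True)   # uniform draws on the sphere S_d(c)
proj_sq = np.sum(t[:, :n] ** 2, axis=1)             # ||P_0 t||^2 with P_0 = first n coordinates

mc = np.mean(1.0 / proj_sq)
exact = (d - 2) / (c**2 * (n - 2))                  # = 2.25 for these values
```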

We now have the tools to complete the proof of Theorem 1 (b). Suppose that d > n and
||β|| = c. Then

    R(ˆβ∗_r, β) = E tr{XX^T + d/c^2 I}^{-1} + {(d − n)/d} c^2.


Since R{ˆβ_r(c), β} − R{ˆβ_unif(c), β} = R{ˆβ_r(c), β} − R^(e)(β) ≥ 0, Lemma 3 implies

    |R{ˆβ_r(c), β} − R^(e)(β)| ≤ E tr{XX^T + d/c^2 I}^{-1}
        − E[ {1 − s_1/(n s_n)} tr{XX^T + n(d − 2)/(c^2(n − 2)) I}^{-1} ]
        ≤ (1/n) E[ (s_1/s_n) tr(XX^T + d/c^2 I)^{-1} ] + {2(d − n)/(c^2(n − 2))} E tr(XX^T + d/c^2 I)^{-2}.

Theorem 1 (b) follows.

Acknowledgements

The author thanks Sihai Zhao for his thoughtful comments and suggestions.

References

Abramovich, F., Benjamini, Y., Donoho, D. and Johnstone, I. (2006). Adapting to

unknown sparsity by controlling the false discovery rate. Annals of Statistics 34 584–653.

Bai, Z. (1993). Convergence rate of expected spectral distributions of large random matrices.

Part II. Sample covariance matrices. Annals of Probability 21 649–672.

Bansal, V., Libiger, O., Torkamani, A. and Schork, N. (2010). Statistical analysis

strategies for association studies involving rare variants. Nature Reviews Genetics 11 773–

785.

Baranchik, A. (1973). Inadmissibility of maximum likelihood estimators in some multiple

regression problems with three or more independent variables. Annals of Statistics 1 312–

321.

Beran, R. (1996). Stein estimation in high dimensions: A retrospective. In Research developments
in probability and statistics: Festschrift in honor of Madan L. Puri on the occasion

of his 65th birthday. VSP International Science Publishers.

Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer.

Bickel, P. (1981). Minimax estimation of the mean of a normal distribution when the

parameter space is restricted. Annals of Statistics 9 1301–1309.

Bickel, P., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of lasso and

Dantzig selector. Annals of Statistics 37 1705–1732.

Borel, É. (1914). Introduction géométrique à quelques théories physiques. Gauthier-Villars.


Breiman, L. and Freedman, D. (1983). How many variables should be entered in a

regression equation? Journal of the American Statistical Association 78 131–136.

Brown, L. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value

problems. Annals of Mathematical Statistics 42 855–903.

Brown, L. (1990). An ancillarity paradox which appears in multiple linear regression. Annals

of Statistics 18 471–493.

Brown, L. and Gajek, L. (1990). Information inequalities for the Bayes risk. Annals of
Statistics 18 1578–1594.

Brown, L. and Low, M. (1991). Information inequality bounds on the minimax risk (with

an application to nonparametric regression). Annals of Statistics 19 329–337.

Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the

lasso. Electronic Journal of Statistics 1 169–194.

Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much

larger than n. Annals of Statistics 35 2313–2351.

Cavalier, L. and Tsybakov, A. (2002). Sharp adaptation for inverse problems with ran-

dom noise. Probability Theory and Related Fields 123 323–354.

DasGupta, A. (2010). False vs. missed discoveries, Gaussian decision theory, and the Donsker-
Varadhan principle. In Borrowing Strength: Theory Powering Applications, A Festschrift

for Lawrence D. Brown. Institute of Mathematical Statistics.

Diaconis, P. and Freedman, D. (1987). A dozen de Finetti-style results in search of a
theory. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 23 397–423.

Dicker, L. (2012). Dense signals, linear estimators, and out-of-sample prediction for high-

dimensional linear models. Preprint.

Donoho, D. (1995). De-noising by soft-thresholding. Information Theory, IEEE Transactions
on 41 613–627.

Donoho, D. and Johnstone, I. (1994). Minimax risk over ℓp-balls for ℓq-error. Probability

Theory and Related Fields 99 277–303.

Duarte, M., Davenport, M., Takhar, D., Laska, J., Sun, T., Kelly, K. and Baraniuk,
R. (2008). Single-pixel imaging via compressive sampling. Signal Processing Magazine,
IEEE 25 83–91.

Erlich, Y., Gordon, A., Brand, M., Hannon, G. and Mitra, P. (2010). Compressed

genotyping. Information Theory, IEEE Transactions on 56 706–723.

Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation

in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 74 37–65.

Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. Infor-
mation Theory, IEEE Transactions on 57 5467–5484.

Friedman, J., Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). Discussion


of boosting papers. Annals of Statistics 32 102–107.

Goldenshluger, A. and Tsybakov, A. (2001). Adaptive prediction and estimation in

linear regression with infinitely many parameters. Annals of Statistics 29 1601–1619.

Goldenshluger, A. and Tsybakov, A. (2003). Optimal prediction for linear regression

with infinitely many parameters. Journal of Multivariate Analysis 84 40–60.

Goldstein, D. (2009). Common genetic variation and human traits. New England Journal

of Medicine 360 1696–1698.

Hall, P., Jin, J. and Miller, H. (2009). Feature selection when there are many influential

features. Arxiv preprint arXiv:0911.4076.

Hirschhorn, J. (2009). Genomewide association studies – illuminating biologic pathways.

New England Journal of Medicine 360 1699–1701.

Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal

problems. Technometrics 12 55–67.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the

Fourth Berkeley Symposium on Mathematical Statistics and Probability: held at the Statistical
Laboratory, University of California, June 20-July 30, 1960. University of California

Press.

Johnstone, I. (2011). Gaussian Estimation: Sequence and Wavelet Models. Unpublished

manuscript.

Kraft, P. and Hunter, D. (2009). Genetic risk prediction – are we there yet? New England

Journal of Medicine 360 1701–1703.

Leeb, H. (2009). Conditional predictive inference post model selection. Annals of Statistics

37 2838–2876.

Lévy, P. (1922). Leçons d'Analyse Fonctionnelle. Gauthier-Villars.

Lustig, M., Donoho, D. and Pauly, J. (2007). Sparse MRI: The application of compressed
sensing for rapid MR imaging. Magnetic Resonance in Medicine 58 1182–1195.

Manolio, T. (2010). Genomewide association studies and assessment of the risk of disease.

New England Journal of Medicine 363 166–176.

Marčenko, V. and Pastur, L. (1967). Distribution of eigenvalues for some sets of random

matrices. Mathematics of the USSR–Sbornik 1 457–483.

Marchand, E. (1993). Estimation of a multivariate mean with constraints on the norm.

Canadian Journal of Statistics 21 359–366.

Pinsker, M. (1980). Optimal filtration of functions from ℓ2 in Gaussian noise. Problems of

Information Transmission 16 52–68.

Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse

estimation. Annals of Statistics 39 731–771.

Robert, C. (1990). Modified Bessel functions and their applications in probability and
statistics. Statistics & Probability Letters 9 155–161.


Stam, A. (1959). Some inequalities satisfied by the quantities of information of Fisher and
Shannon. Information and Control 2 101–112.

Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal

distribution. In Proceedings of the Third Berkeley symposium on mathematical statistics

and probability, vol. 1.

Stein, C. (1960). Multiple regression. In Contributions to Probability and Statistics: Essays

in Honor of Harold Hotelling. Stanford University Press.

Sun, T. and Zhang, C. (2011). Scaled sparse linear regression. Arxiv preprint
arXiv:1104.4595.

Tikhonov, A. (1943). On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39

195–198.

Wright, J., Yang, A., Ganesh, A., Sastry, S. and Ma, Y. (2008). Robust face recognition
via sparse representation. IEEE Transactions on Pattern Analysis and Machine

Intelligence 31 210–227.

Zamir, R. (1998). A proof of the fisher information inequality via a data processing argument.

Information Theory, IEEE Transactions on 44 1246–1250.

Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals

of Statistics 38 894–942.
