
arXiv:1203.4572v1 [math.ST] 20 Mar 2012

Optimal Estimation and Prediction for Dense Signals in High-Dimensional Linear Models

Lee Dicker
Department of Statistics and Biostatistics
Rutgers University
501 Hill Center, 110 Frelinghuysen Road
Piscataway, NJ 08854
e-mail: ldicker@stat.rutgers.edu

Abstract: Estimation and prediction problems for dense signals are often framed in terms of minimax problems over highly symmetric parameter spaces. In this paper, we study minimax problems over ℓ2-balls for high-dimensional linear models with Gaussian predictors. We obtain sharp asymptotics for the minimax risk that are applicable in any asymptotic setting where the number of predictors diverges and prove that ridge regression is asymptotically minimax. Adaptive asymptotic minimax ridge estimators are also identified. Orthogonal invariance is heavily exploited throughout the paper and, beyond serving as a technical tool, provides additional insight into the problems considered here. Most of our results follow from an apparently novel analysis of an equivalent non-Gaussian sequence model with orthogonally invariant errors. As with many dense estimation and prediction problems, the minimax risk studied here has rate d/n, where d is the number of predictors and n is the number of observations; however, when d ≍ n the minimax risk is influenced by the spectral distribution of the predictors and is notably different from the linear minimax risk for the Gaussian sequence model (Pinsker, 1980) that often appears in other dense estimation and prediction problems.

AMS 2000 subject classifications: Primary 62J05; secondary 62C20.
Keywords and phrases: adaptive estimation, asymptotic minimax, non-Gaussian sequence model, oracle estimators, ridge regression.

1. Introduction

This paper is about estimation and prediction problems involving non-sparse (or "dense") signals in high-dimensional linear models. By contrast, a great deal of recent research into high-dimensional linear models has focused on sparsity. Though there are many notions of sparsity (e.g., ℓp-sparsity (Abramovich et al., 2006)), a vector β ∈ R^d is typically considered to be sparse if many of its coordinates are very close to 0. Perhaps the main general principle that has emerged from the literature on sparse high-dimensional linear models may be summarized as follows: if the parameter of interest is sparse, then this can often be


leveraged to develop methods that perform very well, even when the number of predictors is much larger than the number of observations. Indeed, powerful theoretical performance guarantees are available for many methods developed under this paradigm, provided the parameter of interest is sparse (Bickel et al., 2009; Bunea et al., 2007; Candès and Tao, 2007; Fan and Lv, 2011; Rigollet and Tsybakov, 2011; Zhang, 2010). Furthermore, in many applications, especially in engineering and signal processing, sparsity assumptions have been repeatedly validated (Donoho, 1995; Duarte et al., 2008; Erlich et al., 2010; Lustig et al., 2007; Wright et al., 2008). However, there is less certainty about the manifestations of sparsity in other important applications where high-dimensional data is abundant. For example, several recent papers have questioned the degree of sparsity in modern genomic datasets (see, for instance, Hall et al. (2009) and the references therein, including Goldstein (2009), Hirschhorn (2009), and Kraft and Hunter (2009), and, more recently, Bansal et al. (2010) and Manolio (2010)). In situations like these, sparse methods may be sub-optimal and methods designed for dense problems may be more appropriate.

Let d and n denote the number of predictors and observations, respectively, in a linear regression problem. In dense estimation and prediction problems, where the parameter of interest is not assumed to be sparse, d/n → 0 is often required to ensure consistency. Indeed, this is the case for the problems considered in this paper. In this sense, dense problems are more challenging than sparse problems, where consistency may be possible even when d/n → ∞. This lends credence to Friedman et al.'s (2004) "bet on sparsity" principle for high-dimensional data analysis:

Use a procedure that does well in sparse problems, since no procedure does well in dense problems.

The "bet on sparsity" principle has proven to be very useful, especially in applications where sparsity prevails, and it may help to explain some of the recent emphasis on understanding sparse problems. However, the emergence of important problems in high-dimensional data analysis where the role of sparsity is less clear highlights the importance of characterizing and thoroughly understanding dense problems. This paper addresses some of these problems.

Minimax problems over highly symmetric parameter spaces have often been equated with dense estimation problems in many statistical settings (Donoho and Johnstone, 1994; Johnstone, 2011). In this paper, we study the minimax risk over ℓ2-balls for high-dimensional linear models with Gaussian predictors. We identify several informative, asymptotically equivalent formulations of the problem and provide a complete asymptotic solution when the number of predictors d grows large. In particular, we obtain sharp asymptotics for the minimax risk that are applicable in any asymptotic setting where d → ∞, and we show that ridge regression estimators (Hoerl and Kennard, 1970; Tikhonov, 1943) are asymptotically minimax. Adaptive asymptotic minimax ridge estimators are also discussed. Our results follow from carefully analyzing an equivalent non-Gaussian sequence model with orthogonally invariant errors and the novel use of two classical tools, Brown's identity (Brown, 1971) and Stam's inequality (Stam, 1959), to relate this sequence model to the Gaussian sequence model with iid errors. The results in this paper share some similarities with those found in Goldenshluger and Tsybakov (2001, 2003), which address minimax prediction over ℓ2-ellipsoids. However, the implications of our results and the methods used to prove them differ substantially from Goldenshluger and Tsybakov's (this is discussed in more detail in Sections 2.2-2.3 below).

2. Background and preliminaries

2.1. Statistical setting

Suppose that the observed data consist of outcomes y_1, ..., y_n ∈ R and d-dimensional predictors x_1, ..., x_n ∈ R^d. The outcomes and predictors follow a linear model and are related via the equation

y_i = x_i^T β + ǫ_i,  i = 1, ..., n,  (1)

where β = (β_1, ..., β_d)^T ∈ R^d is an unknown parameter vector (also referred to as "the signal") and ǫ_1, ..., ǫ_n are unobserved errors. To simplify notation, let y = (y_1, ..., y_n)^T ∈ R^n, X = (x_1, ..., x_n)^T, and ǫ = (ǫ_1, ..., ǫ_n)^T. Then (1) may be rewritten as y = Xβ + ǫ. In many high-dimensional settings it is natural to consider the predictors x_i to be random. In this paper, we assume that

x_1, ..., x_n iid∼ N(0, I) and ǫ_1, ..., ǫ_n iid∼ N(0, 1)  (2)

are independent, where I = I_d is the d × d identity matrix. These distributional assumptions impose significant additional structure on the linear model (1). However, similar models have been studied previously (Baranchik, 1973; Breiman and Freedman, 1983; Brown, 1990; Leeb, 2009; Stein, 1960) and we believe that the insights imparted by the resulting simplifications are worthwhile. For the results in this paper, perhaps the most noteworthy simplifying consequence of the normality assumption (2) is that the distributions of X and ǫ are invariant under orthogonal transformations.
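For concreteness, the model (1) under assumption (2) can be simulated directly. The sketch below does this; the dimensions d, n and the signal β are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 50, 200
beta = rng.normal(size=d) / np.sqrt(d)   # hypothetical signal beta in R^d

X = rng.normal(size=(n, d))              # rows x_i ~ N(0, I_d), iid, as in (2)
eps = rng.normal(size=n)                 # errors eps_i ~ N(0, 1), independent of X
y = X @ beta + eps                       # y = X beta + eps, as in (1)

print(X.shape, y.shape)
```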

We point out that the assumption E(x_i) = 0 (which is implicit in (2)) is not particularly limiting: if E(x_i) ≠ 0, then we can reduce to the mean-0 case by centering the data. If Var(ǫ_i) = σ² ≠ 1 and σ² is known, then this can easily be reduced to the case where Var(ǫ_i) = 1. If σ² is unknown and d < n, then σ² can be effectively estimated and one can reduce to the case where Var(ǫ_i) = 1 (Dicker, 2012). We conjecture that σ² can be effectively estimated when d > n, provided sup d/n < ∞ (for sparse β, Sun and Zhang (2011) and Fan et al. (2012) have shown that σ² can be estimated when d ≫ n). Dicker (2012) has discussed the implications if Cov(x_i) = Σ ≠ I. Essentially, when the emphasis is on prediction and non-sparse signals, if a norm-consistent estimator for Cov(x_i) = Σ is available, then it is possible to reduce to the case where Cov(x_i) = I; if a norm-consistent estimator is not available, then some limitations follow, though these limitations may not be overly restrictive (this is discussed further in Section 3.2 below).

Let ‖·‖ = ‖·‖_2 denote the ℓ2-norm. In this paper we study the performance of estimators β̂ for β with respect to the risk function

R(β̂, β) = R_{d,n}(β̂, β) = E_β‖β̂ − β‖²,  (3)

where the expectation is taken over (ǫ, X) and the subscript β in E_β indicates that y = Xβ + ǫ (below, for expectations that do not involve y, we will often omit this subscript). We emphasize that the expectation in (3) is taken over the predictors X as well as the errors ǫ, i.e., it is not conditional on X. The risk R(β̂, β) is a measure of estimation error. However, it can also be interpreted as the unconditional out-of-sample prediction error (predictive risk) associated with the estimator β̂ (Breiman and Freedman, 1983; Leeb, 2009; Stein, 1960).
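Because the risk (3) averages over both ǫ and X, it can be approximated by redrawing the design in every Monte Carlo replicate. The sketch below does this for a ridge estimator; the tuning parameter lam and the dimensions are illustrative choices, not the minimax tuning derived later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge(X, y, lam):
    # Ridge estimator (X'X + lam I)^{-1} X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mc_risk(beta, n, lam, reps=200):
    # Monte Carlo approximation of (3): average ||b_hat - beta||^2 over
    # independent draws of (X, eps), i.e. unconditional on X
    d = beta.shape[0]
    losses = []
    for _ in range(reps):
        X = rng.normal(size=(n, d))        # rows x_i ~ N(0, I_d)
        y = X @ beta + rng.normal(size=n)  # y = X beta + eps
        b_hat = ridge(X, y, lam)
        losses.append(np.sum((b_hat - beta) ** 2))
    return float(np.mean(losses))

d, n = 20, 100
beta = np.ones(d) / np.sqrt(d)             # hypothetical signal with ||beta|| = 1
risk = mc_risk(beta, n, lam=float(d))      # lam = d is an illustrative tuning choice
print(risk)                                # on the order of d/n = 0.2
```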

2.2. Dense signals, sparse signals, and ellipsoids

Let B(c) = B_d(c) = {β ∈ R^d : ‖β‖ ≤ c} denote the ℓ2-ball of radius c ≥ 0. Though a given signal β ∈ R^d is often considered to be dense if it has many nonzero entries, when studying broader properties of dense signals and dense estimators it is common to consider minimax problems over highly symmetric, convex (or loss-convex (Donoho and Johnstone, 1994)) parameter spaces. Following this approach, one of the primary quantities that we use as a benchmark for evaluating estimators and determining performance limits in dense estimation problems is the minimax risk over B(c):

R^(b)(c) = R^(b)_{d,n}(c) = inf_{β̂} sup_{β∈B(c)} R(β̂, β).  (4)

The infimum on the right-hand side of (4) is taken over all measurable estimators β̂, and the superscript "b" in R^(b)(c) indicates that the relevant parameter space is the ℓ2-ball.

A basic consequence of the results in this paper is R^(b)(c) ≍ d/n. Thus, one must have d/n → 0 in order to ensure consistent estimation over B(c). This is a well-known feature of dense estimation problems and, as mentioned in Section 1, contrasts with many results on sparse estimation that imply β may be consistently estimated when d/n → ∞. However, the sparsity conditions on β that are required for these results may not hold in general and our motivating interest lies precisely in such situations. In this paper we derive sharp asymptotics for R^(b)(c) and related quantities in settings where d/n → 0, d/n → ρ ∈ (0, ∞), and d/n → ∞ (we assume that d → ∞ throughout). Though consistent estimation is only guaranteed when d/n → 0, there are important situations where one might hope to analyze high-dimensional datasets with d/n substantially larger than 0, even if there is little reason to believe that sparsity assumptions are valid. The results in this paper provide detailed information that may be useful in situations like these.
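As a numerical sanity check on the d/n rate: under (1)-(2) with d < n − 1, the least-squares estimator has exact risk E‖β̂ − β‖² = d/(n − d − 1) (a standard inverse-Wishart computation), which is of order d/n. The specific d, n, and β below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, reps = 10, 60, 400
beta = np.full(d, 1.0 / np.sqrt(d))        # hypothetical signal, ||beta|| = 1
losses = []
for _ in range(reps):
    X = rng.normal(size=(n, d))            # Gaussian design, as in (2)
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    losses.append(np.sum((b_ols - beta) ** 2))

mc = float(np.mean(losses))                # Monte Carlo estimate of the risk (3)
exact = d / (n - d - 1)                    # = 10/49, of order d/n
print(mc, exact)
```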

In addition to sparse estimation problems, minimax rates faster than d/n have also been obtained for minimax problems over ℓ2-ellipsoids, which have been studied extensively in situations similar to those considered here (Cavalier and Tsybakov, 2002; Goldenshluger and Tsybakov, 2001, 2003; Pinsker, 1980). Much of this work has been motivated by problems in nonparametric function estimation. The results in this paper are related to many of these existing results; however, there are important differences, both in their implications and in the techniques used to prove them. Goldenshluger and Tsybakov's (2001, 2003) work may be most closely related to ours. Define the ℓ2-ellipsoid

B(c, α) = {β ∈ R^d : Σ_{i=1}^d α_i β_i² ≤ c²},

with α = (α_1, ..., α_d)^T ∈ R^d, 0 ≤ α_1 ≤ ··· ≤ α_d. Goldenshluger and Tsybakov studied minimax problems over ℓ2-ellipsoids for a linear model with random predictors similar to the model considered here (in fact, Goldenshluger and Tsybakov's results apply to infinite-dimensional non-Gaussian x_i, though the x_i are required to have Gaussian tails and independent coordinates). They identified asymptotically minimax estimators over B(c, α), as well as adaptive asymptotically minimax estimators, and showed that the minimax rate may be substantially faster than d/n. However, their results also require the axes of B(c, α) to decay rapidly (i.e., α_d/c → ∞ quickly) and do not apply to ℓ2-balls B(c) = B(c, (1, ..., 1)^T) unless d/n → 0. Though these decay conditions are natural for many inverse problems in nonparametric function estimation, they drive the improved minimax rates obtained by Goldenshluger and Tsybakov and may be overly restrictive in other settings, such as the genomics applications discussed in Section 1 above.

2.3. The sequence model

Minimax problems over restricted parameter spaces have been studied extensively in the context of the sequence model. In the sequence model, given an index set J, one observes

z_j = θ_j + δ_j,  j ∈ J,  (5)

where θ = (θ_j)_{j∈J} is an unknown parameter and δ = (δ_j)_{j∈J} is a random error. The sequence model is extremely flexible, and many existing results about the Gaussian sequence model (where the coordinates of δ are iid Gaussian random variables) have implications for high-dimensional linear models (Cavalier and Tsybakov, 2002; Pinsker, 1980). However, these results tend to apply in linear models where one conditions on the predictors, as opposed to random-predictor models like the one considered here.
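The sketch below instantiates (5) with iid Gaussian errors and applies a simple linear shrinkage rule θ̂_j = w·z_j, the kind of rule that linear minimax theory for the Gaussian sequence model (Pinsker, 1980) concerns; the index set size, the noise level, and the weight w are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

J = 500                                  # size of the index set (illustrative)
theta = rng.uniform(-1.0, 1.0, size=J)   # unknown parameter (theta_j), j in J
delta = 0.5 * rng.normal(size=J)         # iid Gaussian errors delta_j
z = theta + delta                        # observed: z_j = theta_j + delta_j, as in (5)

w = 0.6                                  # a fixed linear shrinkage weight (illustrative)
theta_hat = w * z                        # linear estimator theta_hat_j = w * z_j

mse_shrunk = float(np.mean((theta_hat - theta) ** 2))
mse_raw = float(np.mean((z - theta) ** 2))
print(mse_shrunk, mse_raw)
```

With a well-chosen weight, the linear rule improves on the raw observations z, which is the basic phenomenon behind linear minimax estimators over symmetric parameter spaces.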