MM Algorithms for Minimizing Nonsmoothly Penalized Objective Functions
arXiv:1001.4776v1 [stat.CO] 26 Jan 2010
Elizabeth D. Schifano∗, Robert L. Strawderman∗, Martin T. Wells∗
Abstract
The use of regularization, or penalization, has become increasingly common in high-
dimensional statistical analysis over the past decade, where a common goal is to simultaneously
select important variables and estimate their effects. It has been shown by several authors that these
goals can be achieved by minimizing some parameter-dependent “goodness of fit” function (e.g.,
a negative loglikelihood) subject to a penalization that promotes sparsity. Penalty functions that
are nonsmooth (i.e. not differentiable) at the origin have received substantial attention, arguably
beginning with LASSO (Tibshirani, 1996).
The current literature tends to focus on specific combinations of smooth data fidelity (i.e.,
goodness-of-fit) and nonsmooth penalty functions. One result of this combined specificity has been
a proliferation in the number of computational algorithms designed to solve fairly narrow classes of
optimization problems involving objective functions that are not everywhere continuously differen-
tiable. In this paper, we propose a general class of algorithms for optimizing an extensive variety of
nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed
framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine.
The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing
for fast, stable updating that avoids the need for any high-dimensional matrix inversion. We establish
a local convergence theory for this class of algorithms under weaker assumptions than previously
considered in the statistical literature. We also demonstrate the exceptional effectiveness of new ac-
celeration methods, originally proposed for the EM algorithm, in this class of problems. Simulation
results and a microarray data example are provided to demonstrate the algorithm’s capabilities and
versatility.
Keywords: Iterative Soft Thresholding, MIST, MM algorithm.
1 Introduction
Variable selection is an important and challenging issue in the rapidly growing realm of high-
dimensional statistical modeling. In such cases, it is often of interest to identify a few important variables
in a veritable sea of noise. Modern methods, increasingly based on the principle of penalized likelihood
estimation applied to high dimensional regression problems, attempt to achieve this goal through an
adaptive variable selection process that simultaneously permits estimation of regression effects. Indeed,
the literature on the penalization of a “goodness of fit” function (e.g., negative loglikelihood), with a
penalty singular at the origin, is quickly becoming vast, proliferating in part due to the consideration of
∗Department of Statistical Science, 301 Malott Hall, Cornell University, Ithaca NY 14853 USA
specific combinations of data fidelity (i.e., goodness-of-fit) and penalty functions, the associated statisti-
cal properties of resulting estimators, and the development of several combination-specific optimization
algorithms (e.g., Tibshirani, 1996; Zou, 2006; Zou and Hastie, 2005; Zou and Zhang, 2009; Fan and Li,
2001; Park and Hastie, 2007; Friedman et al., 2008).
In this paper, we propose a unified optimization framework that appeals to the Majorization-
Minimization (MM) algorithm (Lange, 2004) as the primary optimization tool. The resulting class
of algorithms is referred to as MIST, an acronym for Minimization by Iterative Soft Thresholding. The
MM algorithm has been considered before for solving specific classes of singularly penalized likelihood
estimation problems (e.g., Daubechies et al., 2004; Hunter and Li, 2005; Zou and Li, 2008); to a large
extent, this work is motivated by these ideas. A distinct advantage of the proposed work is the excep-
tional versatility of the class of MIST algorithms, their associated ease of implementation and numerical
stability, and the development of a fixed point convergence theory that permits weaker assumptions than
existing papers in this area. We emphasize here that the focus of this paper is on the development of a stable
and versatile class of algorithms applicable to a wide variety of singularly penalized estimation prob-
lems. In particular, the consideration of asymptotic and oracle properties of estimators derived from
particular combinations of fidelity and penalty functions, as well as methods for effectively choosing
associated penalty parameters, are not focal points of this paper. A comprehensive treatment of these
results may be found in Johnson et al. (2008), where asymptotics and oracle properties for estimators
derived from a general class of penalized estimating equations are developed in some detail.
The paper is organized as follows. In Section 2, we introduce notation and provide sufficient condi-
tions for local convergence of the MM algorithm applied to a large class of data-fidelity and non-smooth
penalty functions. In Section 3, we present a specialized version of this general algorithm, demonstrat-
ing in particular how the minimization step of the MM algorithm can be carried out using iterated
soft-thresholding. In its most general form, iterated soft-thresholding is required at each minimization
step; we further demonstrate how to carry out this minimization step in one iteration through a judicious
choice of majorization function. As a consequence, we present a simplified class of iterative algorithms
that are applicable to a wide class of singularly penalized, generalized linear regression models.
Simulation results are provided in Section 4, while an application in survival analysis to Diffuse
Large B Cell Lymphoma expression data (Rosenwald et al., 2002) is presented in Section 5. We con-
clude with a discussion in Section 6. Proofs and other relevant results are collected in the Appendix.
2 MM algorithms for nonsmooth objective functions
Let ξ(β) denote a real-valued objective function to be minimized for β = (β_1,...,β_p)^T in some convex
subset B of R^p. Let ξ^SUR(β,α) denote a real-valued “surrogate” objective function, where α ∈ B.
Define the minimization map
\[ M(\alpha) = \operatorname*{arg\,min}_{\beta \in B} \; \xi^{SUR}(\beta,\alpha). \tag{1} \]
Then, if ξSUR(β,α) majorizes ξ(β) for each α, a generic MM algorithm for minimizing ξ(β) takes the
following form (e.g., Lange, 2004):
1. Initialize β^(0).
2. For n ≥ 0, compute β^(n+1) = M(β^(n)), iterating until convergence.
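The two steps above can be sketched in a few lines. The following is a minimal illustration only: the function name `mm_minimize`, the stopping rule, and the scalar toy problem are assumptions introduced here and do not come from the paper. The toy objective is ξ(β) = (β − 2)² + |β|, with surrogate ξ^SUR(β,α) = ξ(β) + (β − α)², for which the minimization map has the closed form M(α) = S((2 + α)/2; 1/4), S being soft-thresholding.

```python
import numpy as np


def soft_threshold(u, v):
    """Componentwise soft-thresholding: sign(u) * max(|u| - v, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)


def mm_minimize(M, beta0, tol=1e-10, max_iter=10_000):
    """Generic MM loop: iterate beta <- M(beta) until the iterates stabilize.

    M is the minimization map of the surrogate; the relative stopping rule
    used here is an illustrative choice, not one prescribed by the paper.
    """
    beta = np.asarray(beta0, dtype=float)
    for n in range(1, max_iter + 1):
        beta_new = np.asarray(M(beta), dtype=float)
        if np.linalg.norm(beta_new - beta) <= tol * (1.0 + np.linalg.norm(beta)):
            return beta_new, n
        beta = beta_new
    return beta, max_iter


# Hypothetical toy problem:
#   xi(beta)            = (beta - 2)^2 + |beta|
#   xi_SUR(beta, alpha) = xi(beta) + (beta - alpha)^2   (equals xi at beta = alpha)
# Minimizing xi_SUR in beta gives M(alpha) = S((2 + alpha) / 2; 1/4).
def M_toy(alpha):
    return soft_threshold((2.0 + alpha) / 2.0, 0.25)


beta_hat, n_iter = mm_minimize(M_toy, beta0=np.array([0.0]))
```

The fixed point of `M_toy` coincides with the minimizer of the toy objective, which can be found by hand from 2(β − 2) + 1 = 0.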
Provided that the objective function, its surrogate and the mapping M(·) satisfy certain regularity condi-
tions, one can establish convergence of this algorithm to a local or global solution. Lange (2004, Ch. 10)
develops such a theory assuming that the objective functions ξ(β) and ξSUR(β,α) are twice continuously
differentiable. For problems that lack this degree of smoothness (e.g., all singularly penalized regres-
sion problems, including lasso, Tibshirani (1996); adaptive lasso, Zou (2006); and SCAD, Fan and Li
(2001)), a more general theory of local convergence is required. One such theory is summarized in
Appendix A.1; related results for the EM algorithm may be found in Wu (1983), Tseng (2004) and
Chrétien and Hero (2008).
Let ‖·‖ denote the usual Euclidean vector norm. Based on the theory summarized in Appendix A.1,
we propose a new and general class of algorithms for minimizing penalized objective functions of the
form
\[ \xi(\beta) = g(\beta) + p(\beta;\lambda) + \lambda\varepsilon\|\beta\|^2, \qquad \lambda > 0,\ \varepsilon \ge 0 \tag{2} \]
where g(β) and p(β;λ) are respectively data fidelity (e.g., negative loglikelihood) and penalty functions
that satisfy regularity conditions to be delineated below. As will be shown later, the class of problems
represented by (2) contains all of the penalized regression problems commonly considered in the current
literature. It also covers numerous other problems by expanding the class of permissible fidelity and
penalty functions in a substantial way.
We assume throughout that g(β) is convex and coercive for β ∈ B, where B is an open convex subset
of Rp. We further assume that
\[ p(\beta;\lambda) = \sum_{j=1}^{p} \tilde{p}(|\beta_j|;\lambda_j), \tag{3} \]
where the vector λ = (λ_1^T,...,λ_p^T)^T and λ_j denotes the block of λ associated with β_j. It is assumed that
each λ_j has dimension greater than or equal to one, that all blocks have the same dimension, and that
λ_{j1} = λ for each j ≥ 1. Evidently, the case where dim(λ_j) = 1 for j ≥ 1 simply corresponds to
the setting in which each coefficient is penalized in exactly the same way; permitting the dimension of
λ_j to exceed one allows the penalty to depend on additional parameters (e.g., weights, such as in the
case of the adaptive lasso considered in Zou (2006)). We are interested in problems with penalization;
therefore, λ is assumed bounded and strictly positive throughout this paper. Several specific examples
will be discussed below. For any bounded θ with λ > 0 as the first element, and the remainder of θ
collecting any additional parameters used to define the penalty, the scalar function p̃(r;θ) is assumed to
satisfy the following condition:
(P1) p̃(r;θ) > 0 for r > 0; p̃(0;θ) = 0; p̃(r;θ) is a continuously differentiable concave function with
p̃′(r;θ) ≥ 0 for r > 0, and p̃′(0+;θ) ∈ [M_θ^{-1}, M_θ] for some finite M_θ > 0.
Evidently, (P1) implies that p̃′(r;θ) > 0 for r ∈ (0,K_θ), where K_θ > 0 may be finite or infinite. The
combination of the concavity and nonnegative derivative conditions thus implies that the penalty increases
away from the origin, but with a decreasing rate of growth that may become zero. The case where (3)
is identically zero for r > 0 is ruled out by the positivity of the right derivative at the origin imposed in
(P1); similarly, the concavity assumption also rules out the possibility of a strictly convex penalty term.
Neither of these restrictions is particularly problematic. Our specific interest lies in the development
of algorithms for estimation problems subject to a penalty singular at the origin. Were (3) absent, or
replaced by a strictly convex penalty term, the convexity of g(β) implies (2) can be minimized directly
using any suitable convex optimization algorithm, such as that discussed in Theorem 3.2 below.
Theorem 2.1 establishes local convergence of the indicated class of MM algorithms for minimizing
objective functions of the form (2). A proof is provided in Appendix A.2, where it is shown that
conditions imposed in the statement of the theorem are sufficient conditions for the application of the
general MM local convergence theory summarized in Appendix A.1.
Theorem 2.1. Let g(β) be convex and coercive and assume p(β;λ) satisfies both (3) and condition (P1).
Let h(β,α) ≥ 0 be a real-valued, continuous function of β and α that is continuously differentiable in β
for each α and satisfies h(β,α) = 0 when β = α. Let
\[ q(\beta,\alpha;\lambda) = \sum_{j=1}^{p} \tilde{q}(|\beta_j|,|\alpha_j|;\lambda_j), \tag{4} \]
where q̃(r, s;θ) = p̃(s;θ) + p̃′(s;θ)(r − s) for r, s ≥ 0, and define
ψ(β,α) = h(β,α) + q(β,α;λ) − p(β;λ).
Assume the set of stationary points S for ξ(β), β ∈ B is finite and isolated. Then:
(i) ξ(β) in (2) is locally Lipschitz continuous and coercive;
(ii) q(β,α;λ) − p(β;λ) is either identically zero or non-negative for all β ≠ α;
(iii) ξ^SUR(β,α) ≡ ξ(β) + ψ(β,α) majorizes ξ(β) and the MM algorithm derived from ξ^SUR(β,α) con-
verges to a stationary point of ξ(β) if ξ^SUR(β,α) is uniquely minimized in β for each α and at
least one of h(β,α) or q(β,α;λ) − p(β;λ) is strictly positive for each β ≠ α.
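The non-negativity in part (ii) rests on the tangent-line bound for a concave function: q̃(r, s;θ) is the tangent to p̃(·;θ) at s, so it lies on or above p̃. The sketch below checks this numerically on a grid. The choice of the logarithmic penalty λ log(δr + 1) and the values λ = 1, δ = 2 are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

# Illustrative parameter values (assumptions, not taken from the paper).
lam, delta = 1.0, 2.0


def p_tilde(r):
    """Concave penalty lam * log(delta * r + 1) on r >= 0."""
    return lam * np.log(delta * r + 1.0)


def p_tilde_prime(r):
    """Derivative of the penalty above."""
    return lam * delta / (delta * r + 1.0)


def q_tilde(r, s):
    """Tangent line to p_tilde at s, evaluated at r."""
    return p_tilde(s) + p_tilde_prime(s) * (r - s)


grid = np.linspace(0.0, 5.0, 501)
R, S = np.meshgrid(grid, grid)      # R[i, j] = grid[j], S[i, j] = grid[i]
gap = q_tilde(R, S) - p_tilde(R)    # tangent minus penalty on the whole grid

min_gap = gap.min()                 # should be >= 0 up to rounding
diag_gap = np.diag(gap)             # r == s: tangent touches the penalty
```

Because this particular penalty is strictly concave, the gap vanishes only on the diagonal r = s; for SCAD or MCP the gap can also vanish off the diagonal.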
Condition (iii) of Theorem 2.1 establishes convergence under the assumption that ξSUR(β,α) strictly
majorizes ξ(β) and has a unique minimizer in β for each α. Such a uniqueness condition is shown by
Vaida (2005) to ensure convergence of the EM and MM algorithms to a stationary point under more
restrictive differentiability conditions. Importantly, the assumption of globally strict majorization is
only a sufficient condition for convergence; this condition is only important insofar as it guarantees a
strict decrease in the objective function at every iteration. As can be seen from the proof, it is possible
to relax this condition to locally strict majorization, in which ξSUR(β,α) majorizes ξ(β), with strict
majorization being necessary only in an open neighborhood containing M(α).
The use of the MM algorithm and selection of (4) are motivated by the results of Zou and Li (2008);
we refer the reader to Remark 3.1 below for further comments in this direction. The assumptions on
g(β) clearly cover the case of the linear and canonically parameterized generalized linear models upon
setting g(β) = −ℓ(β), where ℓ(β) denotes the corresponding loglikelihood function. Estimation under
the semiparametric Cox regression model (Cox, 1972) and accelerated failure time models are also
covered upon setting g(β) to be either the negative logarithm of the partial likelihood function (e.g.,
Andersen et al., 1993, Thm VII.2.1) or the Gehan objective function (e.g., Fygenson and Ritov, 1994;
Johnson and Strawderman, 2009).
The assumption (P1) on the penalty function covers a wide variety of popular and interesting exam-
ples; see Figure 1 for illustration. For example, the lasso (LAS; e.g., Tibshirani, 1996), adaptive lasso
(ALAS; e.g., Zou, 2006), elastic net (EN; e.g., Zou and Hastie, 2005), and adaptive elastic net (AEN;
e.g., Zou and Zhang, 2009) penalties are all recovered as special cases upon considering the combi-
nation of (3) and the ridge-type penalty λε‖β‖². Specifically, with λ_j = (λ,ω_j)^T for ω_j ≥ 0, taking
p̃(r;λ_j) = λω_j r in (3) gives LAS (ω_j = 1, ε = 0), ALAS (ω_j > 0, ε = 0), EN (ω_j = 1, ε > 0) and the
AEN (ω_j > 0, ε > 0) penalties. It is easy to see that selecting p̃(r;λ_j) = λω_j r also implies the equality
of (3) and (4), a result relevant in both (ii) and (iii) of Theorem 2.1 above.
[Figure 1 omitted: three panels plotting p̃(|r|; θ) against r ∈ [−4, 4]. Panel titles: LASSO [θ = λ = 1]; SCAD [θ = (λ,a) = (1.25, 3.7)]; Geman & Reynolds, 1992 [θ = (λ,δ) = (5, 1)].]
Figure 1: Three examples of penalties satisfying (P1).
The proposed penalty specification also covers the smoothly clipped absolute deviation (SCAD;
e.g., Fan and Li, 2001) penalty upon setting p̃(r;λ_j) = p̃_S(r;λ,a) for each j ≥ 1, where p̃_S(r;λ,a) is
defined as the definite integral of
\[ \tilde{p}_S'(u;\lambda,a) = \lambda\left[ I(u \le \lambda) + \frac{(a\lambda - u)_+}{(a-1)\lambda}\, I(u > \lambda) \right] \tag{5} \]
on the interval 0 ≤ u ≤ r and some fixed value of a > 2 (e.g., a = 3.7). The resulting penalty function is
continuously differentiable and concave on r ∈ [0,∞). The concavity of p̃_S(·;λ,a) on [0,∞), combined
with p̃_S(0;λ,a) = 0 and the fact that p̃′_S(0+;λ,a) is finite, implies
\[ \tilde{p}_S(r;\lambda,a) \le \tilde{p}_S(s;\lambda,a) + \tilde{p}_S'(s;\lambda,a)(r - s) \tag{6} \]
for each r, s ≥ 0, the boundary cases for r = 0 and/or s = 0 following from
Hiriart-Urruty and Lemaréchal (1996, Remark 4.1.2, p. 21). In other words, p̃_S(r;λ,a) can be majorized
by a linear function of r.
The lasso penalty, its variants, and SCAD have received the greatest attention in the literature. More
recently, Zhang (2007) introduced the minimax concave penalty (MCP), which similarly to SCAD is
defined in terms of its derivative. Specifically, one takes p̃(r;λ_j) = p̃_M(r;λ,a) for each j ≥ 1 in (3),
where p̃_M(r;λ,a) is defined for a > 2 as the definite integral of
\[ \tilde{p}_M'(u;\lambda,a) = \left( \lambda - \frac{u}{a} \right)_+ \tag{7} \]
on the interval 0 ≤ u ≤ r and some fixed value of a > 2 (e.g., a = 3.7 as in Fan et al., 2009b). Further
examples of differentiable concave penalties include p̃(r;λ_j) = p̃_G(r;λ,δ) for
\[ \tilde{p}_G(r;\lambda,\delta) = \lambda\,\frac{\delta r}{1 + \delta r}, \qquad \delta > 0 \tag{8} \]
(e.g. Geman and Reynolds, 1992; Nikolova, 2000); and p̃(r;λ_j) = p̃_Y(r;λ,δ) for
\[ \tilde{p}_Y(r;\lambda,\delta) = \lambda \log(\delta r + 1), \qquad \delta > 0; \tag{9} \]
(e.g. Antoniadis et al., 2009). These penalties represent just a small sample of the set of possible penal-
ties satisfying (P1) that one might reasonably consider.
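Since the algorithms in this paper use a penalty only through its derivative p̃′(r;θ), it can help to see the derivatives of the penalties above side by side. The sketch below is illustrative: the function names are hypothetical, and the derivatives of (8) and (9) were obtained here by direct differentiation rather than quoted from the paper.

```python
import numpy as np


def lasso_prime(r, lam, w=1.0):
    """Derivative of lam * w * r (lasso for w = 1, adaptive lasso otherwise)."""
    return lam * w * np.ones_like(np.asarray(r, dtype=float))


def scad_prime(r, lam, a=3.7):
    """SCAD derivative: lam for r <= lam, then lam * (a*lam - r)_+ / ((a-1)*lam)."""
    r = np.asarray(r, dtype=float)
    tail = np.maximum(a * lam - r, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(r <= lam, 1.0, tail)


def mcp_prime(r, lam, a=3.7):
    """MCP derivative: (lam - r/a)_+."""
    return np.maximum(lam - np.asarray(r, dtype=float) / a, 0.0)


def geman_prime(r, lam, delta):
    """Derivative of lam * delta * r / (1 + delta * r)."""
    return lam * delta / (1.0 + delta * np.asarray(r, dtype=float)) ** 2


def log_prime(r, lam, delta):
    """Derivative of lam * log(delta * r + 1)."""
    return lam * delta / (delta * np.asarray(r, dtype=float) + 1.0)


# Evaluate every derivative on a common grid of r >= 0.
grid = np.linspace(0.0, 6.0, 601)
derivs = {
    "lasso": lasso_prime(grid, lam=1.0),
    "scad": scad_prime(grid, lam=1.0),
    "mcp": mcp_prime(grid, lam=1.0),
    "geman": geman_prime(grid, lam=5.0, delta=1.0),
    "log": log_prime(grid, lam=1.0, delta=2.0),
}
```

Each derivative is nonnegative and nonincreasing in r, which is the numerical counterpart of the concavity and monotonicity required by (P1); the SCAD and MCP derivatives reach zero at r = aλ.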
Remark 2.2. The SCAD and MCP penalties are not strictly concave and lead to surrogate majorizers
that fail to satisfy the globally strict majorization condition in (iii) of Theorem 2.1 unless h(β,α) is
strictly positive whenever β ≠ α; see Remark 3.1 for further discussion and also Theorem 3.4 below.
3 MIST: Minimization by Iterated Soft Thresholding
3.1 The MIST algorithm
In general, the statistical literature on penalized estimation has proposed optimization algorithms tai-
lored for specific combinations of fidelity and penalty functions. The class of MM algorithms suggested
by Theorem 2.1 provides a very general and useful framework for proposing new algorithms, the key to
which is a methodology for solving the minimization problem (1), a step repeated with each iteration of
the MM algorithm. In this regard, it is helpful to note that the problem of minimizing ξSUR(β,α) for a
given α is equivalent to minimizing
\[ g(\beta) + \lambda\varepsilon\|\beta\|^2 + h(\beta,\alpha) + \sum_{j=1}^{p} \tilde{p}'(|\alpha_j|;\lambda_j)\,|\beta_j| \tag{10} \]
in β. In particular, if g(β) + λε‖β‖² + h(β,α) is strictly convex for each bounded α, which clearly occurs
if both g(β) and h(β,α) are convex in β and at least one is strictly convex, then (10) is also strictly
convex and the corresponding minimization problem has a unique solution.
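The equivalence between minimizing ξ^SUR(β,α) and minimizing (10) follows by substituting the definitions of ψ and q̃ from Theorem 2.1 and discarding terms that do not depend on β. The short derivation below is a reconstruction from those definitions, offered as a reading aid:

```latex
\begin{align*}
\xi^{SUR}(\beta,\alpha)
  &= \xi(\beta) + \psi(\beta,\alpha) \\
  &= g(\beta) + p(\beta;\lambda) + \lambda\varepsilon\|\beta\|^2
     + h(\beta,\alpha) + q(\beta,\alpha;\lambda) - p(\beta;\lambda) \\
  &= g(\beta) + \lambda\varepsilon\|\beta\|^2 + h(\beta,\alpha)
     + \sum_{j=1}^{p} \tilde{p}'(|\alpha_j|;\lambda_j)\,|\beta_j|
     + \underbrace{\sum_{j=1}^{p}\bigl[\tilde{p}(|\alpha_j|;\lambda_j)
       - \tilde{p}'(|\alpha_j|;\lambda_j)\,|\alpha_j|\bigr]}_{\text{does not depend on } \beta}.
\end{align*}
```

The penalty p(β;λ) cancels exactly, so the nonconvex penalty enters the surrogate only through a weighted ℓ1 term with weights p̃′(|α_j|;λ_j).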
Remark 3.1. For ε = h(β,α) = 0 and g(β) = −ℓ(β) for ℓ(β) = ∑_{i=1}^n ℓ_i(β) with ℓ_i(β) a twice continuously
differentiable loglikelihood function, the majorizer used by the MM algorithm induced by the surrogate
function (10) corresponds (up to sign) to the minorizer employed in the LLA algorithm of Zou and Li
(2008), an improvement on the so-called LQA algorithm proposed in Hunter and Li (2005). Zou and Li
(2008, Proposition 1) assert convergence of their LLA algorithm under imprecisely stated assumptions
and are additionally unclear as to the nature of the convergence result actually established. For example,
while Zou and Li (2008, Theorem 1) demonstrate that the LLA algorithm does indeed have an ascent
property, their result appears to be insufficient to ensure that the proper analog of condition Z3(ii) of
Theorem A.1 holds in the case of the SCAD penalty. As a consequence, it is unclear whether even weak
“subsequence” convergence results (cf. Wu, 1983) hold with useful generality in the case of the LLA
algorithm. In contrast, Theorem 2.1 shows that strict majorization, under a few precisely stated condi-
tions, is sufficient to ensure local convergence of the resulting MM algorithm to a stationary point of (2).
In Section 3.2, it is further demonstrated how a particular choice of h(β,α) yields a strict majorizer that
permits both closed form minimization and componentwise updating at each step of the MM algorithm,
even in the case of penalties that fail to be strictly concave.
Numerous methods exist for minimizing a differentiable convex objective function (e.g.,
Boyd and Vandenberghe, 2004). However, because (10) is not differentiable, such methods do not apply
in the current setting. Specialized methods exist for nonsmooth problems of the form (10) in settings
where g(β) has a special structure; a well-known example here is LARS (Efron et al., 2004), which can
be used to efficiently solve lasso-type problems in the case where g(β) is replaced by a least squares ob-
jective function. Recently, Combettes and Wajs (2005, Proposition 3.1; Theorem 3.4) proposed a very
general class of fixed point algorithms for minimizing f_1(h) + f_2(h), where f_i(·), i = 1,2, are each convex
and h takes values in some real Hilbert space H. Hale et al. (2008, Theorem 4.5) specialize the results of
Combettes and Wajs (2005) to the case where H is some subset of R^p and f_2(h) = ∑_{j=1}^p |h_j|. The collec-
tive application of these results to the problem of minimizing (10) generates an iterated soft-thresholding
procedure with an appealingly simple structure. Theorem 3.2, given below, states the algorithm, along
with conditions under which the algorithm is guaranteed to converge; a proof is provided in Appendix
A.3. The resulting class of procedures, that is, MM algorithms in which the minimization of (10) is car-
ried out via this iterated soft-thresholding procedure, is hereafter referred to as MIST, an acronym for
(M)inimization by (I)terated (S)oft (T)hresholding. Two important and useful features of MIST include
the absence of high-dimensional matrix inversion and the ability to update each individual parameter
separately.
Theorem 3.2. Suppose the conditions of Theorem 2.1 hold. Let m(β) = g(β) + h(β,α) + λε‖β‖² be
strictly convex with a Lipschitz continuous derivative of order L^{-1} > 0 for each bounded α. Then, for
any such α and a constant ϖ ∈ (0,2L), the unique minimizer of (10) can be obtained in a finite number
of iterations using iterated soft-thresholding:
1. Set n = 1 and initialize b^(0).
2. Compute d^(n) = b^(n−1) − ϖ∇m(b^(n−1)).
3. Compute b^(n) = S(d^(n); ϖτ), where for any vectors u,v ∈ R^p,
\[ S(u;v) = \sum_{j=1}^{p} s(u_j,v_j)\,e_j, \tag{11} \]
e_j denotes the jth unit vector for R^p,
\[ s(u_j,v_j) = \mathrm{sign}(u_j)\,(|u_j| - v_j)_+ \tag{12} \]
is the univariate soft-thresholding operator, and
τ = (p̃′(|α_1|;λ_1),..., p̃′(|α_p|;λ_p))^T.
4. Stop if converged; else, set n = n + 1 and return to Step 2.
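Steps 1 to 4 translate almost line for line into code. The sketch below is a minimal illustration under stated assumptions: the function names, the stopping tolerance, and the quadratic test function m(b) = ½‖b − c‖² are hypothetical choices made here. For that m the gradient is b − c with Lipschitz constant 1, so any step ϖ ∈ (0, 2) is admissible, and the minimizer of m(b) + ∑_j τ_j|b_j| is known in closed form as S(c; τ).

```python
import numpy as np


def soft_threshold(u, v):
    """S(u; v): componentwise sign(u_j) * (|u_j| - v_j)_+ as in (11)-(12)."""
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)


def iterated_soft_threshold(grad_m, tau, b0, step, tol=1e-12, max_iter=100_000):
    """Minimize m(b) + sum_j tau_j |b_j| by the iteration of Theorem 3.2.

    grad_m : callable returning the gradient of the smooth part m
    tau    : nonnegative thresholds, tau_j = p~'(|alpha_j|; lambda_j)
    step   : the constant written as varpi in the text
    """
    b = np.asarray(b0, dtype=float)
    for n in range(1, max_iter + 1):
        d = b - step * grad_m(b)                 # Step 2: gradient step
        b_new = soft_threshold(d, step * tau)    # Step 3: soft-threshold
        if np.linalg.norm(b_new - b) <= tol:     # Step 4: convergence check
            return b_new, n
        b = b_new
    return b, max_iter


# Hypothetical test problem: m(b) = 0.5 * ||b - c||^2, gradient b - c.
c = np.array([3.0, -0.5, 1.0])
tau = np.array([1.0, 1.0, 0.25])
b_hat, n_iter = iterated_soft_threshold(lambda b: b - c, tau, np.zeros(3), step=0.5)
```

Note that no matrix is inverted and each coordinate is updated independently once the gradient is available, which are the two features of MIST highlighted in the text.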
The proposed algorithm, as originally developed in Combettes and Wajs (2005), is suitable for min-
imizing the sum of a differentiable convex function m(·) and another convex function; hence, under
similar conditions, one could employ this algorithm directly to minimize (2) in cases where the penalty
(3) is derived from some convex function p̃(·;θ). Theorem 3.4 of Combettes and Wajs (2005) further
shows that the update in Step 3 can be generalized to
\[ b^{(n)} = b^{(n-1)} + \delta_n \left[ S\left( b^{(n-1)} - \varpi_n \nabla m(b^{(n-1)});\ \varpi_n \tau \right) - b^{(n-1)} \right], \]
where, for every n, ϖ_n ∈ (0,2L) and δ_n ∈ (0,1] is a suitable sequence of relaxation constants. Judicious
selection of these constants, possibly updated at each step, may improve the convergence rate of this
algorithm.
Theorem 3.2 imposes the relatively strong condition that the gradient of m(β) is L^{-1}-Lipschitz con-
tinuous. The role of this condition, also imposed in Combettes and Wajs (2005, Proposition 3.1; The-
orem 3.4), is to ensure that the update at each step of the proposed algorithm is a contraction, thereby
guaranteeing its convergence to a fixed point. To see this, note that the update from b^(n) to b^(n+1) in the
algorithm of Theorem 3.2 involves the mapping S(b − ϖ∇m(b); ϖτ). For any bounded b and a, it is
easily shown that
\[ \| S(b - \varpi\nabla m(b);\varpi\tau) - S(a - \varpi\nabla m(a);\varpi\tau) \| \le \| b - a - \varpi(\nabla m(b) - \nabla m(a)) \|. \]
When ∇m(b) = ∇m(a), the right-hand side reduces to ‖b − a‖, and the resulting mapping is only nonex-
pansive (i.e., not necessarily contractive). However, under strict convexity, this situation can occur only
if b = a. In particular, suppose that b^(n) ≠ b^(n−1); then, ∇m(b^(n)) ≠ ∇m(b^(n−1)) and, using the mean value
theorem,
\begin{align*}
\| b^{(n+1)} - b^{(n)} \|
  &= \left\| S\left( b^{(n)} - \varpi\nabla m(b^{(n)});\varpi\tau \right) - S\left( b^{(n-1)} - \varpi\nabla m(b^{(n-1)});\varpi\tau \right) \right\| \\
  &\le \| I - \varpi H(b^{(n)},b^{(n-1)}) \| \, \| b^{(n)} - b^{(n-1)} \|,
\end{align*}
where H(b,a) = ∫_0^1 ∇²m(b + t(a − b)) dt. The assumption that the gradient of m(β) is L^{-1}-Lipschitz
continuous now implies that choosing ϖ as indicated guarantees ‖I − ϖH(b^(n),b^(n−1))‖ < 1, thereby
producing a contraction.
In view of the generality of the Contraction Mapping Theorem (e.g., Luenberger and Ye, 2008,
Thm. 10.2.1), it is possible to relax the requirement that ∇m(β) is globally L^{-1}-Lipschitz continuous
provided that one selects a suitable starting point. The relevant extension is summarized in the corollary
below; one may prove this result in a manner similar to Theorem 4.5 of Hale et al. (2008).
Corollary 3.3. Let the conditions of Theorem 2.1 hold. Suppose α is a bounded vector and assume that
m(β) = g(β) + h(β,α) + λε‖β‖² is strictly convex and twice continuously differentiable. Then, for a given
bounded α, there exists a unique minimizer β*. Let Ω be a bounded convex set containing β* and define
λ_max(β) to be the largest eigenvalue of ∇²m(β). Then, the algorithm of Theorem 3.2 converges to β* in
a finite number of iterations provided that b^(0) ∈ Ω, λ* = max_{β∈Ω} λ_max(β) < ∞, and ϖ ∈ (0,2/λ*).
Some useful insight into the form of the proposed thresholding algorithm can be gained by consid-
ering the behavior of the penalty derivative term p̃′(r;θ). As suggested earlier, (P1) implies that p̃′(r;θ)
decreases from its maximum value towards zero as r moves away from the origin. For some penal-
ties (e.g., SCAD, MCP), this derivative actually becomes zero at some finite value of r > 0, resulting
in situations for which τ_j = p̃′(|α_j|;λ_j) = 0 for at least one j. If this occurs for component j, then
the jth component of the vector S(b^(n) − ϖ∇m(b^(n)); ϖτ) simply reduces to the jth component of the ar-
gument vector b^(n) − ϖ∇m(b^(n)). In the extreme case where τ = 0, the proposed update reduces to
b^(n+1) = b^(n) − ϖ∇m(b^(n)), an inexact Newton step in which the inverse hessian matrix is replaced by
ϖI_p, I_p denoting the p × p identity matrix, and with step-size chosen to ensure that this update yields a
contraction. Hence, if each of the components of b^(n) − ϖ∇m(b^(n)) are sufficiently large in magnitude,
the proposed algorithm simply takes an inexact Newton step towards the solution; otherwise, one or
more components of this Newton-like update are subject to soft-thresholding.
3.2 Penalized estimation for generalized linear models
The combination of Theorems 2.1, 3.2 and Corollary 3.3 leads to a useful and stable class of algorithms
with the ability to deal with a wide range of penalized regression problems. In settings where g(β) is
strictly convex and twice continuously differentiable, one can safely assume that h(β,α) = 0 for all
choices of β and α provided that p̃′(r;θ) in (P1) is strictly positive for r > 0; important examples of sta-
tistical estimation problems here include many commonly used linear and generalized linear regression
models, semiparametric Cox regression (Cox, 1972), and smoothed versions of the accelerated failure
time regression model (cf. Johnson and Strawderman, 2009). The SCAD and MCP penalizations, as
well as other penalties having p̃′(r;θ) ≥ 0 for r > 0, can also be used; however, additional care is
required. In particular, and as pointed out in an earlier remark, if one sets h(β,α) = 0 for all β and α
then convergence of the resulting algorithm to a stationary point is no longer guaranteed by the above
results due to the resulting failure of these penalties to induce strict majorization.
The need to use an iterative algorithm for repeatedly minimizing (10) is not unusual for the class
of MM algorithms. However, it turns out that for certain choices of g(β), a suitable choice of h(β,α)
in Theorem 3.2 guarantees both strict majorization and permits one to minimize (10) in a single iter-
ation, resulting in a single soft-thresholding update at each iteration. Below, we demonstrate how the
MIST algorithm simplifies in settings where g(β) corresponds to the negative loglikelihood function of a
canonically parameterized generalized linear regression model having a bounded hessian function. The
result applies to all penalties satisfying condition (P1), including SCAD and MCP. A proof is provided
in Appendix A.4.
Theorem 3.4. Let y be N × 1 and suppose the probability distribution of y follows a generalized linear
model with a canonical link and linear predictor X̃β̃, where X̃ = [1_N, X] is N × (p + 1) and β̃ = [β_0, β^T]^T
is (p + 1) × 1 with β_0 denoting an intercept. Assume that g(β̃) = −ℓ(β̃), where
\[ \ell(\tilde\beta) = 1^T\bigl(c(y) - b(\tilde\eta)\bigr) + y^T\tilde\eta \]
denotes the corresponding loglikelihood; here, η̃ = X̃β̃ and E[y_i] = b′(η̃_i) for i = 1...N for b(·) strictly
convex and twice continuously differentiable. Let the penalty function be defined as in (3) and satisfy
(P1); note that β_0 remains unpenalized. Define
\[ h(\tilde\beta,\tilde\alpha) = \ell(\tilde\beta) - \ell(\tilde\alpha) - \nabla\ell(\tilde\alpha)^T(\tilde\beta - \tilde\alpha) + \varpi^{-1}(\tilde\beta - \tilde\alpha)^T(\tilde\beta - \tilde\alpha); \tag{13} \]
where α̃ ≡ [α_0, α^T]^T is (p + 1) × 1, and ϖ is defined as in Corollary 3.3. Then:
1. The objective function (2), say ξ_glm(β̃), is majorized by
\[ \xi^{SUR}_{glm}(\tilde\beta,\tilde\alpha) = -\ell(\tilde\alpha) - \nabla\ell(\tilde\alpha)^T(\tilde\beta - \tilde\alpha) + \varpi^{-1}(\tilde\beta - \tilde\alpha)^T(\tilde\beta - \tilde\alpha) + \sum_{j=1}^{p}\bigl(\tau_j|\beta_j| + \gamma_j + \lambda\varepsilon\beta_j^2\bigr) \tag{14} \]
where τ_j = p̃′(|α_j|;λ_j) and γ_j = p̃(|α_j|;λ_j) − p̃′(|α_j|;λ_j)|α_j| are bounded, nonnegative, and func-
tionally independent of β̃.
2. The functions g(β̃) = −ℓ(β̃) and h(β̃, α̃) satisfy the regularity conditions of Theorems 2.1 and 3.2;
hence, the corresponding MM algorithm converges to a stationary point of (2).
3. For each bounded α̃,
(a) the minimizer β̃* of ξ^SUR_glm(β̃, α̃) is unique and satisfies
\[ \beta^* = \frac{1}{1 + \varpi\lambda\varepsilon}\, S\left( \alpha + \frac{\varpi}{2}[\nabla\ell(\tilde\alpha)]_A;\ \frac{\varpi}{2}\tau \right), \qquad \beta_0^* = \alpha_0 + \frac{\varpi}{2}[\nabla\ell(\tilde\alpha)]_0 \tag{15} \]
where S(·;·) is the soft-thresholding operator defined in (11) and A = {1,..., p}.
(b) for each κ̃ ≡ [κ_0, κ^T]^T ∈ R^(p+1),
\[ \xi^{SUR}_{glm}(\tilde\beta^* + \tilde\kappa, \tilde\alpha) \ge \xi^{SUR}_{glm}(\tilde\beta^*, \tilde\alpha) + \varpi^{-1}\|\tilde\kappa\|^2. \tag{16} \]
In view of previous results, the result in #3 of Theorem 3.4 shows that the resulting MM algorithm
takes a very simple form: given the current iterate β̃^(n),
1. update the unpenalized intercept β_0^(n):
\[ \beta_0^{(n+1)} = \beta_0^{(n)} + \frac{\varpi}{2}\left[ \nabla\ell(\tilde\beta^{(n)}) \right]_0 \]
2. update the remaining parameters β^(n):
\[ \beta^{(n+1)} = \frac{1}{1 + \varpi\lambda\varepsilon}\, S\left( \beta^{(n)} + \frac{\varpi}{2}[\nabla\ell(\tilde\beta^{(n)})]_A;\ \frac{\varpi}{2}\tau^{(n)} \right), \tag{17} \]
where τ^(n) = (p̃′(|β_1^(n)|;λ_1),..., p̃′(|β_p^(n)|;λ_p))^T.
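The two-step update above can be sketched for a concrete model. The example below is an illustration under explicit assumptions, not the authors' implementation: it uses logistic regression (for which the hessian of −ℓ is bounded by X̃ᵀX̃/4), the MCP penalty, simulated data, and the step ϖ = 1/λ*, where λ* is that hessian bound; this choice keeps h in (13) nonnegative. All function names are hypothetical.

```python
import numpy as np


def soft_threshold(u, v):
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)


def mcp(r, lam, a):
    """MCP penalty: integral of (lam - u/a)_+ over [0, r]."""
    return np.where(r <= a * lam, lam * r - r**2 / (2.0 * a), 0.5 * a * lam**2)


def mcp_prime(r, lam, a):
    return np.maximum(lam - r / a, 0.0)


def objective(beta0, beta, X, y, lam, eps, a):
    """Penalized objective (2) with g = negative logistic loglikelihood."""
    eta = beta0 + X @ beta
    neg_loglik = np.logaddexp(0.0, eta).sum() - y @ eta
    return neg_loglik + mcp(np.abs(beta), lam, a).sum() + lam * eps * (beta @ beta)


def mist_logistic(X, y, lam, eps=0.0, a=3.7, n_iter=200):
    N, p = X.shape
    Xt = np.column_stack([np.ones(N), X])
    # Assumed bound: hessian of -loglik is Xt' W Xt with W <= 1/4 elementwise.
    lam_star = 0.25 * np.linalg.eigvalsh(Xt.T @ Xt).max()
    varpi = 1.0 / lam_star                      # lies in (0, 2 / lam_star)
    beta0, beta = 0.0, np.zeros(p)
    history = [objective(beta0, beta, X, y, lam, eps, a)]
    for _ in range(n_iter):
        resid = y - 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # y - E[y]
        tau = mcp_prime(np.abs(beta), lam, a)                    # tau^(n)
        new_beta0 = beta0 + 0.5 * varpi * resid.sum()            # intercept step
        beta = soft_threshold(beta + 0.5 * varpi * (X.T @ resid),
                              0.5 * varpi * tau) / (1.0 + varpi * lam * eps)
        beta0 = new_beta0
        history.append(objective(beta0, beta, X, y, lam, eps, a))
    return beta0, beta, np.array(history)


# Simulated data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
beta_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
prob = 1.0 / (1.0 + np.exp(-(0.5 + X @ beta_true)))
y = rng.binomial(1, prob).astype(float)

b0_hat, b_hat, hist = mist_logistic(X, y, lam=5.0, eps=0.01)
```

Because each iterate minimizes a majorizer of the objective, the recorded objective values in `hist` should never increase; monitoring this is a cheap sanity check on any implementation of the update.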
The specific choice of function h(β̃, α̃) clearly serves two useful purposes: (i) it leads to
componentwise soft-thresholding; and, (ii) it leads to strict majorization, as is required in condition
(iii) of Theorem 2.1, allowing one to establish the convergence of MIST for SCAD and other penalties
having p̃′(r;θ) = 0 at some finite r > 0.
Evidently, the algorithm above immediately covers the setting of penalized linear regression. For
example, suppose that y has been centered to remove β_0 from consideration and that the problem has
also been rescaled so that X, which is now N × p, satisfies the indicated conditions. Then, the results of
Theorem 3.4 apply directly with
\[ -\ell(\beta) = \tfrac{1}{2}\|X\beta - y\|^2, \qquad \nabla\ell(\beta) = X^T(y - X\beta), \qquad h(\beta,\alpha) = \varpi^{-1}\|\beta - \alpha\|^2 - \tfrac{1}{2}\|X\beta - X\alpha\|^2, \]
where ϖ is defined as in Corollary 3.3. For the class of adaptive elastic net penalties (i.e., p̃(r;λ_j) = λω_j r
in (3)), the resulting iterative scheme is exactly that proposed in De Mol et al. (2008, pg. 17), specialized
to the setting of a Euclidean parameter. In particular, we have τ_j = λω_j and γ_j = 0 in Theorem 3.4, and
the proposed update reduces to
\[ \beta^{(k+1)} = \frac{1}{\nu + 2\lambda\varepsilon}\, S\left( (\nu I - X'X)\,\beta^{(k)} + X'y;\ \lambda \right), \]
where ν = 2ϖ^{-1}. Setting ν = 1 and ε = 0 yields the iterative procedure proposed in Daubechies et al.
(2004), provided that X′X is scaled such that I − X′X is positive definite. The proposed MIST algorithm
extends these iterative componentwise soft-thresholding procedures to a much wider class of penalty
and data fidelity functions.
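For penalized least squares the update is short enough to write out in full. The sketch below is illustrative only: the function name, the choice ν = 1.01 × (largest eigenvalue of X′X), and the fixed iteration count are assumptions made here, and the threshold is written as λω_j so that ω_j = 1 gives the lasso or elastic net case.

```python
import numpy as np


def soft_threshold(u, v):
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)


def ist_linear(X, y, lam, eps=0.0, w=None, n_iter=500):
    """Iterated soft-thresholding for
        0.5 * ||X b - y||^2 + lam * sum_j w_j |b_j| + lam * eps * ||b||^2.

    Each pass applies b <- S((nu I - X'X) b + X'y; lam * w) / (nu + 2 lam eps).
    """
    p = X.shape[1]
    w = np.ones(p) if w is None else np.asarray(w, dtype=float)
    XtX, Xty = X.T @ X, X.T @ y
    # Assumed step choice: nu slightly above the largest eigenvalue of X'X,
    # so that nu I - X'X is positive definite.
    nu = 1.01 * np.linalg.eigvalsh(XtX).max()
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = soft_threshold(nu * beta - XtX @ beta + Xty, lam * w) / (nu + 2.0 * lam * eps)
    return beta


# Orthonormal design, where the solution is known in closed form:
# beta_j = S(y_j; lam) / (1 + 2 * lam * eps).
X = np.eye(3)
y = np.array([3.0, -0.2, 1.5])
lam, eps = 0.5, 0.5
beta_hat = ist_linear(X, y, lam=lam, eps=eps)
beta_closed_form = soft_threshold(y, lam) / (1.0 + 2.0 * lam * eps)
```

With an orthonormal design the iteration can be checked against the closed-form elastic net solution, which makes this a convenient first test before moving to correlated predictors.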
The restriction to canonical generalized linear models in Theorem 3.4 is imposed to ensure strict
convexity of the negative loglikelihood. Our results are easily modified to handle non-canonical
generalized linear models, provided the negative loglikelihood remains strictly convex in $\tilde{\beta}$ and the hessian
can be appropriately bounded. Interestingly, not all canonically parameterized generalized linear models
satisfy the regularity conditions of Theorem 3.4. One such important class of problems is penalized
likelihood estimation for Poisson regression models. For example, in the classical setting of N independent
Poisson observations with $E[Y_i \mid \tilde{X}_i] = d_i \exp\{\tilde{x}_i^T \tilde{\beta}\}$ for a known set of constants $d_1, \ldots, d_N$, we have
(i.e., up to irrelevant constants) $\ell(\tilde{\beta}) = -\sum_{i=1}^{N} f_i(\tilde{x}_i^T \tilde{\beta})$, where
$$ f_i(u) = d_i e^{u} - y_i u. $$
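It is easy to see that $\nabla\ell(\tilde{\beta})$, hence $\nabla m(\tilde{\beta})$, is locally but not globally Lipschitz continuous; hence, it is
not possible to choose a matrix $C = \varpi^{-1} I$ such that (14) everywhere majorizes $\xi_{glm}(\tilde{\beta})$. Nevertheless,
progress remains possible. For example, Corollary 3.3 implies that one can still use a single update
of the form (17) provided that a suitable Ω, hence C and $\tilde{\beta}^{(0)}$, can be identified. Alternatively, using
results summarized in Becker et al. (1997), one can instead majorize $-\ell(\tilde{\beta})$ for any bounded α using
$$ k(\tilde{\beta},\tilde{\alpha}) = \sum_{j=0}^{p} k_j(\beta_j;\alpha_j) \quad \text{for} \quad k_j(\beta_j;\alpha_j) = \sum_{i=1}^{N} I\{x_{ij} \neq 0\}\, \theta_{ij}\, f_i\!\left( \frac{x_{ij}}{\theta_{ij}}(\beta_j - \alpha_j) + \tilde{x}_i^T \tilde{\alpha} \right), $$
where, for every i, $\theta_{ij} \geq 0$ are any sequence of constants satisfying $\sum_{j=0}^{p} \theta_{ij} = 1$ and $\theta_{ij} > 0$ if $x_{ij} \neq 0$.
Of importance here is the fact that $k_j(\beta_j;\alpha_j)$ is a strictly convex function of $\beta_j$ and does not depend on $\beta_k$
for $k \neq j$. One may now take $h(\tilde{\beta},\tilde{\alpha})$ in Theorem 2.1 as being equal to $k(\tilde{\beta},\tilde{\alpha}) + \ell(\tilde{\beta})$, leading to the
minimization of
$$ \xi_{SUR}(\tilde{\beta},\tilde{\alpha}) \propto \sum_{j=1}^{p} \left[ k_j(\beta_j;\alpha_j) + \lambda\epsilon\beta_j^2 + \tilde{p}'(|\alpha_j|;\lambda_j)\,|\beta_j| \right] + k_0(\beta_0,\alpha_0). \qquad (18) $$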
In particular, componentwise soft-thresholding is replaced by componentwise minimization of (18), the
latter being possible using any algorithm capable of minimizing a continuous nonlinear function of one
variable.
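The componentwise minimization of (18) can be sketched as follows for Poisson regression with a lasso-type penalty. This is only an illustration under several assumptions that are not in the text: there is no intercept, the penalty derivative is the constant λ for every coordinate, λ > 0, no row of X is identically zero, and the weights are chosen as θ_ij = |x_ij| / Σ_k |x_ik|. The helper names and the bisection-based one-dimensional solver are likewise hypothetical choices.

```python
import math

def penalized_objective(X, y, d, beta, lam, eps):
    """Negative loglikelihood (up to constants) plus lasso and ridge terms."""
    total = 0.0
    for xi, yi, di in zip(X, y, d):
        eta = sum(a * b for a, b in zip(xi, beta))
        total += di * math.exp(eta) - yi * eta
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return total + lam * l1 + lam * eps * l2

def minimize_coordinate(s, w, n_bisect=200):
    """Minimize K(b) + w*|b|, where s = K' is increasing and w > 0."""
    s0 = s(0.0)
    if abs(s0) <= w:              # zero lies in the subdifferential at b = 0
        return 0.0
    sign = 1.0 if s0 < 0 else -1.0  # side on which the minimizer lies
    g = lambda b: s(b) + sign * w   # derivative of the objective on that side
    lo, hi = 0.0, sign
    while g(hi) * sign < 0:         # expand until the root is bracketed
        lo, hi = hi, 2.0 * hi
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if g(mid) * sign < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson_mm(X, y, d, lam, eps=0.0, n_iter=100):
    """MM iterations using the separable majorizer k_j; returns (beta, history)."""
    n, p = len(X), len(X[0])
    row_abs = [sum(abs(v) for v in row) for row in X]
    beta = [0.0] * p
    history = []
    for _ in range(n_iter):
        alpha = list(beta)
        eta = [sum(a * b for a, b in zip(row, alpha)) for row in X]
        history.append(penalized_objective(X, y, d, alpha, lam, eps))
        for j in range(p):
            rows = [i for i in range(n) if X[i][j] != 0.0]

            def s(b, j=j, rows=rows):
                # derivative of k_j(b; alpha_j) + lam*eps*b^2
                total = 2.0 * lam * eps * b
                for i in rows:
                    scale = X[i][j] * row_abs[i] / abs(X[i][j])  # x_ij / theta_ij
                    u = scale * (b - alpha[j]) + eta[i]
                    total += X[i][j] * (d[i] * math.exp(u) - y[i])
                return total

            beta[j] = minimize_coordinate(s, lam)
    return beta, history
```

Because each coordinate problem is solved against the same α, the p updates within one iteration are independent of each other; the recorded objective values should be nonincreasing, which is the descent property expected of an MM scheme.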
Remark 3.5. The Cox proportional hazards model (Cox, 1972), while not a generalized linear model,
shares the essential features of the generalized linear model utilized in Theorem 3.4. In particular, the
negative log partial likelihood, say g(β) = −ℓp(β), is strictly convex, twice continuously differentiable,
and has a bounded hessian (e.g., Bohning and Lindsay, 1988; Andersen et al., 1993). Consequently,
Theorem 3.4 and its proof are easily modified for this setting upon taking g(β) as indicated, setting
$h(\beta,\alpha) = \ell_p(\beta) - \ell_p(\alpha) - \nabla\ell_p(\alpha)^T(\beta - \alpha) + \varpi^{-1}\|\beta - \alpha\|^2$, and taking $\varpi$ as defined in Corollary 3.3.
3.3 Accelerating Convergence
Similarly to the EM algorithm, the stability and simplicity of the MM algorithm frequently come
at the price of a slow convergence rate. Numerous methods of accelerating the EM algorithm
have been proposed in the literature; see McLachlan and Krishnan (2008) for a review. Recently,
Varadhan and Roland (2008) proposed a new method for EM called SQUAREM, obtained by “squaring”
an iterative Steffensen-type (STEM) acceleration method. Both STEM and SQUAREM are structured
for use with iterative mappings of the form $\theta_{n+1} = M(\theta_n)$, n = 0,1,2,..., hence applicable to both the
EM and MM algorithms. Specifically, the acceleration update for SQUAREM is given by
$$ \theta_{n+1} = \theta_n - 2\gamma_n\,(M(\theta_n) - \theta_n) + \gamma_n^2\,[M(M(\theta_n)) - 2M(\theta_n) + \theta_n] = \theta_n - 2\gamma_n r_n + \gamma_n^2 v_n, \qquad (19) $$
where $r_n = M(\theta_n) - \theta_n$ and $v_n = (M(M(\theta_n)) - M(\theta_n)) - r_n$ for an adaptive steplength $\gamma_n$.
Varadhan and Roland (2008) suggest several steplength options, with preference for choice $\gamma_n = -\|r_n\|/\|v_n\|$.
Roland and Varadhan (2005) provide a proof of local convergence for SQUAREM under
restrictive conditions on the EM mapping M(θ), while Varadhan and Roland (2008) outline a proof for
global convergence for versions of SQUAREM that employ a back-tracking strategy. We study the
effectiveness of SQUAREM applied to the simplified form of the MIST algorithm, hereafter denoted
SQUAREM2, in Section 4.3.
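The SQUAREM update (19) with steplength γ_n = −‖r_n‖/‖v_n‖ can be sketched for a generic fixed-point mapping as follows. The function name `squarem`, the stopping rule on ‖r_n‖, and the fallback to the plain iterate when v_n vanishes are assumptions made for illustration; in particular, this sketch omits the back-tracking safeguard mentioned above, so it carries no global convergence guarantee.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def squarem(M, theta0, max_iter=1000, tol=1e-10):
    """Accelerate the fixed-point iteration theta <- M(theta) via update (19).

    M maps a list of floats to a list of floats of the same length.
    """
    theta = list(theta0)
    for _ in range(max_iter):
        m1 = M(theta)
        r = [a - b for a, b in zip(m1, theta)]           # r_n = M(theta) - theta
        if _norm(r) < tol:                               # (near) fixed point
            return m1
        m2 = M(m1)
        v = [(c - b) - ri for c, b, ri in zip(m2, m1, r)]  # v_n
        norm_v = _norm(v)
        if norm_v == 0.0:
            theta = m2            # assumption: fall back to the plain iterate
            continue
        gamma = -_norm(r) / norm_v
        theta = [t - 2.0 * gamma * ri + gamma ** 2 * vi
                 for t, ri, vi in zip(theta, r, v)]
    return theta
```

For a linear contraction with a common rate in every coordinate, such as M(θ) = 0.5θ + b, one accelerated step lands on the fixed point up to rounding, which makes a convenient check of the algebra.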
4 Simulation Results
The simulation results summarized below are intended to compare the estimates of β obtained from
existing methods to those obtained using the simplified MIST algorithm of Theorem 3.4. In partic-
ular, we consider the performance of MIST for the class of penalized linear and generalized linear
models, demonstrating its capability of recovering the solutions provided by existing algorithms when
both algorithms are forced to use the same set of “tuning” parameters (i.e., penalty parameter(s), plus
any additional parameters required to define the penalty itself). In cases where multiple local minima
can arise, we further show that the MIST algorithm often tends to find solutions with lower objective
function evaluations in comparison with existing algorithms, provided these algorithms utilize the same
choice of starting value.
4.1 Example 1: Linear Model
Let $1_m$ and $0_m$ respectively denote m-dimensional vectors of ones and zeros. Then, following
Zou and Zhang (2009), we generated data from the linear regression model
$$ y = x'\beta^* + \varepsilon \qquad (20) $$
where $\beta^* = (3 \cdot 1_q^T, 0_{p-q}^T)^T$ is a p-dimensional vector with intrinsic dimension q = 3[p/9], $\varepsilon \sim N(0,\sigma^2)$,
and x follows a p-dimensional multivariate normal distribution with zero mean and covariance matrix Σ
having elements $\Sigma_{j,k} = \rho^{|j-k|}$, 1 ≤ k, j ≤ p. We considered σ ∈ {1,3}, ρ ∈ {0.0,0.5,0.75} and p ∈ {35,81}
for N = 100 independent observations.
Penalized least squares estimation is considered for five popular choices of penalty functions, all of
which are currently implemented in the R software language (R Development Core Team, 2005): LAS,
ALAS, EN, AEN, and SCAD. The LAS, ALAS, EN and AEN penalties are all convex and lead to
unique solutions under mild conditions; the SCAD penalty is concave and the resulting minimization
problem may have multiple solutions. In each case, we used existing software for computing solutions
subject to these penalizations and compared those results to the solutions computed using the MIST
algorithm.
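The data-generating design of model (20) can be sketched as below. Two points are assumptions rather than statements from the text: the bracket in q = 3[p/9] is read as the integer part, and the covariance ρ^|j−k| is produced through a stationary AR(1) recursion across the coordinates of x, which yields that covariance without a matrix factorization. The function name `simulate_example1` and the `seed` argument are illustrative.

```python
import math
import random

def simulate_example1(n, p, rho, sigma, seed=0):
    """Draw n observations from y = x' beta* + eps with AR(1)-type covariance."""
    rng = random.Random(seed)
    q = 3 * (p // 9)                       # assumption: [.] is the integer part
    beta_star = [3.0] * q + [0.0] * (p - q)
    scale = math.sqrt(1.0 - rho ** 2)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _j in range(1, p):
            # x_j = rho * x_{j-1} + sqrt(1 - rho^2) * z_j gives Cov = rho^|j-k|
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        noise = rng.gauss(0.0, sigma)
        X.append(x)
        y.append(sum(a * b for a, b in zip(x, beta_star)) + noise)
    return X, y, beta_star
```

Under this reading, p = 35 gives q = 9 nonzero coefficients and p = 81 gives q = 27.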
Regarding existing methods, we respectively used the lars (Hastie and Efron, 2007) and elasticnet
(Zou and Hastie, 2008) packages for computing solutions in the case of the LAS and EN penalties.
For the ALAS and AEN penalties, we used software kindly provided by Zou and Zhang (2009) that also
makes use of the elasticnet package. The weights for the AEN penalty are computed using $\omega_j = |\hat{\beta}_j^{EN}|^{-\gamma}$,
j = 1,..., p, where $\hat{\beta}^{EN}$ is an EN estimator and γ is a positive constant. Using EN-based weights in
the AEN fitting algorithm necessitates tuning parameter specification for both EN and AEN. As in
Zou and Zhang (2009), the ℓ1 parameters λ (λ1 in their notation) are allowed to differ, whereas the ℓ2
parameters $\epsilon$ (λ2 in their notation) are forced to be the same. Evidently, setting $\epsilon = 0$ (λ2 = 0) results
in the ALAS solution. For the SCAD penalty, we considered the estimator of Kim et al. (2008) (HD),
as well as the one-step SCAD (1S) and LLA estimators of Zou and Li (2008). The code for the first two
methods was kindly provided by their respective authors; the LLA estimator was computed using the
R package SIS. The choice a = 3.7 was used for all implementations of SCAD.
We considered finding solutions using penalties in the set Λ = {0.1,1,5,10,20,100}. In particular,
for LAS and SCAD, λ = λ1 ∈ Λ. For EN, both λ = λ1 ∈ Λ and $\lambda\epsilon$ = λ2 ∈ Λ. For simplicity, we
fixed the weights for AEN for a given λ2 by selecting the ‘best’ $\hat{\beta}^{EN}$ among the six estimators involving
λ = λ1 ∈ Λ based on a BIC-like criterion. Likewise for ALAS, the weights were computed using the
‘best’ $\hat{\beta}^{LAS}$ among the six estimators involving λ = λ1 ∈ Λ. The parameter γ for the ALAS and AEN
penalties was respectively set to three and five for p = 35 and p = 81.
For the strictly convex objective functions associated with the LAS, ALAS, EN, and AEN penalties,
we simply used a starting value of $\beta^{(0)} = 0_p$. For SCAD, three different starting values for the MIST, HD,
and LLA SCAD algorithms were considered: $\beta^{(0)} = 0_p$, $\beta^{(0)} = \tilde{\beta}_{ml}$ (i.e., the unpenalized least squares
estimate), and $\beta^{(0)} = \tilde{\beta}_{1S,\lambda}$ (i.e., the one-step estimate computed using the penalty λ). As in Zou and Li
(2008), the one-step estimator 1S is computed using $\tilde{\beta}_{ml}$, an appropriate choice when N > p.
The convergence criteria used by the existing software packages were used without alteration in this
simulation study. The convergence criteria used for MIST were as follows: the algorithm stopped if
either (i) the normed difference of successive iterates was less than $10^{-6}$ (convergence of coefficients);
or, (ii) the difference of the objective function evaluated at successive iterates was less than $10^{-6}$ and the
number of iterations exceeded $10^{6}$ (convergence of optimization). Due to the large number of comparisons
and highly intensive nature of the computations, we ran B = 100 simulations for each choice of ρ,
σ, and p. We report the results for the convex penalties in Table 1 and those for the SCAD penalty in
Tables 2 and 3.
In Table 1, we summarize the average normed difference between the solution obtained using existing
software and that obtained using MIST, $\|\hat{\beta}_{exist} - \hat{\beta}_{mist}\|$, over the B = 100 simulations; in particular,
we report in the two leftmost panels the maximum value of this difference, computed across all
combinations of tuning parameters. These maximum differences (all of which are multiplied by $10^{5}$) are
remarkably small for all (A)LAS and (A)EN penalties, indicating that MIST recovers the same (unique)
solutions as the existing algorithms. Interestingly, the values for LAS are slightly larger than the rest,
where the maximum differences all resulted from the smallest value of λ considered (λ = 0.1). In
these cases, the algorithm tended to stop using the objective function criterion rather than the (stricter)
coefficient criterion, resulting in slightly larger differences on average.
The results for SCAD are reported in Tables 2 (p = 35) and 3 (p = 81) and display (i) the average
normed differences, multiplied by $10^{3}$, for each combination of λ, ρ, σ, p and starting value; and, (ii) the
proportion of simulated datasets in which the MIST solution yields a lower evaluation of the objective
function in comparison with the solution obtained using another method for the indicated choice of
Table 1: Maximum average normed differences (×10^5) over B = 100 simulations for Examples 1 (LM) and 2
(GLM).

                LM: σ = 1              LM: σ = 3                GLM
ρ            0     0.5    0.75      0     0.5    0.75      0     0.5    0.75
p = 35 (LM), q = 25 (GLM)
LAS        0.10   0.35   1.45     0.10   0.37   1.56     0.07   4.28   6.17
ALAS       0.03   0.14   0.64     0.05   0.21   1.00     1.84   2.86   3.76
EN         0.07   0.19   0.50     0.07   0.20   0.51     2.30   5.61   8.68
AEN        0.03   0.10   0.33     0.04   0.13   0.36     1.47   3.35   5.27
p = 81 (LM), q = 75 (GLM)
LAS        1.73   3.82  11.76     2.33   5.78  18.99     0.10   6.97   9.94
ALAS       0.12   0.38   1.58     0.35   1.03   4.39     1.34   2.55   3.30
EN         0.31   0.49   0.87     0.31   0.49   0.88     2.35   4.64   6.56
AEN        0.14   0.22   0.56     0.16   0.26   0.56     1.27   2.29   2.85

starting value. The results for λ = 100 are not shown, as the solution was $0_p$ in all cases. In comparison
with the convex penalties, larger normed differences are observed, even when controlling for the use of
the same starting value. Such differences are a result of two important features of the SCAD optimization
problem: (i) the possible existence of several local minima; and, (ii) the fact that the MIST, HD, and
LLA algorithms each take a different path from a given starting value towards one of these solutions.
For example, while each of the LLA, MIST, and HD algorithms involves majorization of the objective
function using a lasso-type surrogate objective function, both the majorization and minimization of the
resulting surrogate function are carried out differently in each case. In particular, the LLA algorithm, as
implemented in SIS, majorizes only the penalty term and adapts the lasso code in glmpath in order to
minimize the corresponding surrogate objective function at each step. The HD algorithm is similar in
spirit, but instead decomposes the penalty term into a sum of a concave and convex function and utilizes
the algorithm of Rosset and Zhu (2007) to minimize the corresponding surrogate objective function.
The MIST algorithm instead uses the same penalty majorization as the LLA algorithm, but additionally
majorizes the negative loglikelihood term in a way that permits minimization of the surrogate function
in a single soft-thresholding step. Each procedure therefore takes a different path towards a solution,
even when given the same starting value.
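The single soft-thresholding step described above can be sketched for SCAD-penalized least squares as follows. The SCAD derivative is the standard one with a = 3.7 as the default; the form of the step, S(να_j + [X′(y − Xα)]_j; p′(|α_j|))/ν with no ridge term, is an assumption obtained by replacing the fixed threshold of the linear-model update with the penalty derivative evaluated at the current iterate. The names `scad_derivative` and `mist_scad_step` are hypothetical, and ν is assumed to be supplied by the caller and to exceed the largest eigenvalue of X′X.

```python
def soft_threshold(z, t):
    """S(z; t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def scad_derivative(r, lam, a=3.7):
    """Derivative of the SCAD penalty at r >= 0."""
    if r <= lam:
        return lam
    return max(a * lam - r, 0.0) / (a - 1.0)

def mist_scad_step(X, y, alpha, lam, nu, a=3.7):
    """One majorize-and-threshold step from the current iterate alpha."""
    n, p = len(X), len(alpha)
    resid = [y[i] - sum(X[i][k] * alpha[k] for k in range(p)) for i in range(n)]
    new = []
    for j in range(p):
        # linearized penalty: the threshold is the SCAD derivative at |alpha_j|
        thresh = scad_derivative(abs(alpha[j]), lam, a)
        z = nu * alpha[j] + sum(X[i][j] * resid[i] for i in range(n))
        new.append(soft_threshold(z, thresh) / nu)
    return new
```

Because the SCAD derivative vanishes for |α_j| ≥ aλ, large coefficients are updated without shrinkage, while coefficients at zero face the full lasso threshold λ.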
We remark here that differences must also be expected between any of LLA, HD, MIST and the one-step
solution 1S; from an optimization perspective, the one-step estimate is the result of running just one
iteration of the LLA algorithm, starting from the unpenalized least squares estimator $\tilde{\beta}_{ml}$ (Zou and Li,
2008), and only provides an approximation to the solution of the desired minimization problem. All other
methods (LLA, MIST, HD) iterate until some local minimizer (or stationary point) is reached. For example,
when using either $\tilde{\beta}_{ml}$ or $\tilde{\beta}_{1S,\lambda}$ as the starting value, MIST always found a solution that produced
a lower evaluation of the objective function in comparison to $\tilde{\beta}_{1S,\lambda}$. However, when using the null starting
value of $0_p$, the one-step estimator did occasionally result in a lower objective function evaluation in
cases involving smaller values of λ. This behavior is not terribly surprising; with small λ, the one-step
solution should generally be close to the unpenalized least squares solution, as the objective function
itself is likely to be dominated by the least squares term.
Of all the SCAD algorithms considered here, MIST and LLA tended to find the most similar solutions
(i.e., have the smallest normed differences). For the cases in which the LLA solution had lower
objective function evaluations, all of the MIST solutions were also LLA solutions; i.e., when starting the
LLA algorithm with the MIST solution, the algorithm terminated at the starting value (i.e., the LLA solution
coincides with the MIST solution). With the exception of three of these cases, starting the MIST
algorithm with the LLA solution also resulted in the same behavior. For the most part, the HD and MIST
algorithms also gave similar results, with one source of difference being the respective stopping criteria
used. The stopping criteria for HD, assessed in order, are as follows: (1) ‘convergence of optimization’:
stop if the absolute value of the difference of the objective evaluated at successive iterates is less than
1e-6; (2) ‘convergence of penalty gradient’: stop if the sum of the absolute value of the differences of
the derivative of the centered penalty evaluated at successive iterates is less than 1e-6; (3) ‘convergence
of coefficients’: stop if the sum of the absolute value of the differences of successive iterates is less than
1e-6; and (4) ‘jump-over’ criterion: stop if the objective at the previous iterate plus 1e-6 was less than the
objective at the current iterate. After careful analysis of the results, we can assert the following:
• The MIST solution usually has the same or a lower evaluation of the objective function in com-
parison with HD, regardless of starting value.
• HD tends to have the greatest difficulty in cases of high correlation between predictors, a likely
result of the fact that this algorithm relies on the variance of the unpenalized least squares estimator,
hence matrix inversion, to take steps towards a solution. In contrast, MIST requires no matrix
inversion.
On balance, the MIST algorithm performs as well as or better than LLA and HD in locating minimizers
in nearly all cases. As suggested above, variation in the solutions found can be traced to the path each
algorithm takes towards a solution and differences in stopping criteria. Remarkably, in cases when the
correlation among predictors was low, the choice of starting value made little difference for MIST; either
the same solution was found for all starting values or none of the starting values dominated in terms of
finding the lower or equivalent objective evaluations. In settings involving higher correlation, however,
using either $0_p$ or the 1S starting values tended to result in lower evaluations of the objective function
in comparison with using the unpenalized least squares solution. Similar behavior was observed for the
LLA algorithm. In contrast, the choice of starting value had a much larger impact on the performance of
the HD estimator; in particular, the use of $0_p$ as a starting value typically resulted in the lowest objective
function evaluations when compared to using a non-null starting value.
4.2 Example 2: Binary Logistic Regression
As in Example 1, we considered the LAS, ALAS, EN, AEN, and SCAD penalties. There are at least
two R packages that allow penalization using the LAS and EN penalties: glmpath (Park and Hastie,
2007), which handles binomial and Poisson regression using a “predictor-corrector” method, and glmnet
(Friedman et al., 2008), which handles binomial and multinomial regression using cyclical coordinate
descent. Both methods can be tuned to find the same solutions, so for ease of presentation we only con-
sider the results of glmnet for comparison in the tables and analysis below. The SIS package (Fan et al.,
2009a) permits computations with the ALAS, AEN, and SCAD penalties using modifications of the
Park and Hastie (2007) code. For SCAD, we compared the results of MIST to the results from the one-
step (1S) algorithm (GLM version, Zou and Li, 2008) using the code provided by the authors and the
LLA algorithm as implemented in Fan et al. (2009a).