Proximal Methods for Sparse Hierarchical Dictionary Learning
Rodolphe Jenatton [1]  RODOLPHE.JENATTON@INRIA.FR
Julien Mairal [1]  JULIEN.MAIRAL@INRIA.FR
Guillaume Obozinski  GUILLAUME.OBOZINSKI@INRIA.FR
Francis Bach  FRANCIS.BACH@INRIA.FR
INRIA - WILLOW Project, Laboratoire d’Informatique de l’Ecole Normale Supérieure (INRIA/ENS/CNRS UMR 8548)
23, avenue d’Italie, 75214 Paris, France
Abstract
We propose to combine two approaches for modeling data admitting sparse representations: on the one hand, dictionary learning has proven effective for various signal processing tasks. On the other hand, recent work on structured sparsity provides a natural framework for modeling dependencies between dictionary elements. We thus consider a tree-structured sparse regularization to learn dictionaries embedded in a hierarchy. The involved proximal operator is computable exactly via a primal-dual method, allowing the use of accelerated gradient techniques. Experiments show that for natural image patches, learned dictionary elements organize themselves in such a hierarchical structure, leading to improved performance for restoration tasks. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models.
1. Introduction
Learned sparse representations, initially introduced by Olshausen & Field (1997), have been the focus of much research in machine learning and signal processing, leading notably to state-of-the-art algorithms for several problems in image processing (Elad & Aharon, 2006). Modeling signals as a linear combination of a few “basis” vectors offers more flexibility than decompositions based on principal component analysis and its variants. Indeed, sparsity allows for overcomplete dictionaries, whose number of basis vectors is greater than the original signal dimension.

[1] Contributed equally.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).
As far as we know, while much attention has been given to efficiently solving the corresponding optimization problem (Lee et al., 2007; Mairal et al., 2010), there are few attempts in the literature to make the model richer by adding structure between dictionary elements (Bengio et al., 2009; Kavukcuoglu et al., 2009). We propose to use recent work on structured sparsity (Zhao et al., 2009; Jenatton et al., 2009; Kim & Xing, 2009) to embed the dictionary elements in a hierarchy.

Hierarchies of latent variables, typically used in neural networks and deep learning architectures (see Bengio, 2009, and references therein), have emerged as a natural structure in several applications, notably to model text documents. Indeed, in the context of topic models (Blei et al., 2003), hierarchical models using Bayesian non-parametric methods have been proposed by Blei et al. (2010). Quite recently, hierarchies have also been considered in the context of kernel methods (Bach, 2009). Structured sparsity has been used to regularize dictionary elements by Jenatton et al. (2010), but to the best of our knowledge, it has never been used to model dependencies between them.

This paper makes three contributions:
• We propose to use a structured sparse regularization to learn a dictionary embedded in a tree.
• We show that the proximal operator for a tree-structured sparse regularization can be computed exactly in a finite number of operations using a primal-dual approach, with a complexity linear, or close to linear, in the number of variables. Accelerated gradient methods (e.g., Nesterov, 2007) can then be applied to solve tree-structured sparse decomposition problems, which may be useful in other applications of tree-structured norms (Kim & Xing, 2009; Bach, 2009).
• Our method establishes a bridge between dictionary learning for sparse coding and hierarchical topic models (Blei et al., 2010), which builds upon the interpretation of topic models as multinomial PCA (Buntine, 2002), and can learn similar hierarchies of topics. See Section 5 for a discussion.
2. Problem Statement
2.1. Dictionary Learning
Let us consider a set X = [x^1, . . . , x^n] ∈ ℝ^{m×n} of n signals of dimension m. Dictionary learning is a matrix factorization problem that aims to represent these signals as linear combinations of dictionary elements, denoted here by the columns of a matrix D = [d^1, . . . , d^p] ∈ ℝ^{m×p}. More precisely, the dictionary D is learned along with a matrix of decomposition coefficients A = [α^1, . . . , α^n] ∈ ℝ^{p×n}, so that x^i ≈ Dα^i for every signal x^i, as measured by any convex loss, e.g., the square loss in this paper.

While learning D and A simultaneously, one may want to encode specific prior knowledge about the task at hand, such as, for example, the positivity of the decomposition (Lee & Seung, 1999), or the sparsity of A (Olshausen & Field, 1997; Lee et al., 2007; Mairal et al., 2010). This leads to penalizing or constraining (D, A) and results in the following formulation:
    min_{D∈𝒟, A∈𝒜}  (1/n) Σ_{i=1}^{n} [ (1/2) ‖x^i − Dα^i‖_2^2 + λ Ω(α^i) ],        (1)
where 𝒜 and 𝒟 denote two convex sets and Ω is a regularization term, usually a norm, whose effect is controlled by the regularization parameter λ > 0. Note that 𝒟 is assumed to be bounded to avoid any degenerate solutions of Eq. (1). For instance, the standard sparse coding formulation takes Ω to be the ℓ1 norm, 𝒟 to be the set of matrices in ℝ^{m×p} whose columns are in the unit ball of the ℓ2 norm, and 𝒜 = ℝ^{p×n} (Lee et al., 2007; Mairal et al., 2010).

However, this classical setting treats each dictionary element independently from the others, and does not exploit possible relationships between them. We address this potential limitation of the ℓ1 norm by embedding the dictionary in a tree structure, through a hierarchical norm introduced by Zhao et al. (2009) and Bach (2009), which we now present.
2.2. Hierarchical Sparsity-Inducing Norms
We organize the dictionary elements in a rooted-tree
T composed of p nodes, one for each dictionary ele-
ment d
j
, j {1, . . . , p}. We will identify these indices
j in {1, . . . , p} and the nodes of T . We want to exploit the
structure of T in the following sense: the decomposition
of any vector x can involve a dictionary element d
j
only if
the ancestors of d
j
in T are themselves part of the decom-
position. Equivalently, one can say that when a dictionary
element d
j
is not involved in the decomposition of a vec-
tor x then its descendants in T should not be part of the
decomposition. While these two views are equivalent, the
latter leads to an intuitive penalization term.
Figure 1. Left: example of a tree-structured set of groups G (dashed contours in red), corresponding to a tree T = {1, . . . , 6} (in black). Right: example of a sparsity pattern. The groups {2, 4}, {4} and {6} are set to zero, so that the corresponding nodes (in gray), which form subtrees of T, are removed. The remaining nonzero variables {1, 3, 5} are such that, if a node is selected, the same goes for all its ancestors.
To obtain models with the desired property, one considers, for all j in T, the group g_j ⊆ {1, . . . , p} of dictionary elements that contains j and all its descendants, and penalizes the number of such groups that are involved in the decomposition of x (a group being “involved in the decomposition” meaning here that at least one of its dictionary elements is part of the decomposition). We call G this set of groups (Figure 1).
While this penalization is non-convex, a convex proxy has been introduced by Zhao et al. (2009) and was further considered by Bach (2009) and Kim & Xing (2009) in the context of regression. For any vector α ∈ ℝ^p, let us define

    Ω(α) ≜ Σ_{g∈G} w_g ‖α_{|g}‖,

where α_{|g} is the vector of size p whose coordinates are equal to those of α for indices in the set g, and 0 otherwise.[2] The norm ‖·‖ stands either for the ℓ∞ or the ℓ2 norm, and (w_g)_{g∈G} denotes some positive weights.[3] As analyzed by Zhao et al. (2009), when penalizing by Ω, some of the vectors α_{|g} are set to zero for some g ∈ G. Therefore, the components of α corresponding to some entire subtrees of T are set to zero, which is exactly the desired effect (Figure 1).
Note that, even though we have presented this hierarchical norm in the context of a single tree with a single element at each node for simplicity, it can easily be extended to the case of forests of trees, and/or trees containing several dictionary elements at each node. More generally, this formulation can be extended with the notion of tree-structured groups, which we now present.
[2] Note the difference with the notation α_g, which is often used in works on structured sparsity, where α_g is a vector of size |g|.

[3] For a complete definition of Ω for any ℓ_q norm, a discussion of the choice of q, and a strategy for choosing the weights w_g, see Zhao et al. (2009) and Kim & Xing (2009).
Definition 1 (Tree-structured set of groups). A set of groups G = {g}_{g∈G} is said to be tree-structured in {1, . . . , p} if ∪_{g∈G} g = {1, . . . , p} and if, for all g, h ∈ G, (g ∩ h ≠ ∅) ⇒ (g ⊆ h or h ⊆ g). For such a set of groups, there exists a (non-unique) total order relation ⪯ such that

    g ⪯ h  ⇒  g ⊆ h or g ∩ h = ∅.
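In code, one convenient total order ⪯ is obtained by sorting the groups by increasing size: a strict subset is strictly smaller, and two groups of equal size are either equal or disjoint. A minimal check of Definition 1 (our sketch, with 0-based indices) is given below.

```python
def is_tree_structured(groups, p):
    """Check Definition 1: the groups cover {0, ..., p-1} and are laminar
    (any two groups are either nested or disjoint)."""
    sets = [set(g) for g in groups]
    covers = set().union(*sets) == set(range(p))
    laminar = all(a <= b or b <= a or not (a & b) for a in sets for b in sets)
    return covers and laminar

# An order compatible with the relation above: process descendants before their ancestors.
# groups_sorted = sorted(groups, key=len)
```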
Having introduced sparse hierarchical norms, we now address the optimization problems involving them.
3. Optimization
Optimization for dictionary learning has already been intensively studied, and a typical scheme alternating between the variables D and A = [α^1, . . . , α^n], i.e., minimizing over one while keeping the other one fixed, yields good results in general (Lee et al., 2007). The main difficulty of our problem lies essentially in the optimization of the vectors α^i, i ∈ {1, . . . , n}, for D fixed, since n may be large and since this step requires dealing with the nonsmooth regularization term Ω. The optimization of the dictionary D (for A fixed) is in general easier, as discussed in Section 3.5.
Within the context of regression, several optimization methods to cope with Ω have already been proposed. A boosting-like technique with a path-following strategy is used by Zhao et al. (2009). Kim & Xing (2009) use a reweighted least-squares scheme when ‖·‖ is the ℓ2 norm. The same approach is considered by Bach (2009), but built upon an active set strategy. In this paper, we propose to perform the updates of the vectors α^i based on a proximal approach, which we now introduce.
3.1. Proximal Operator for the Norm Ω
Proximal methods have drawn increasing attention in the machine learning community (e.g., Ji & Ye, 2009, and references therein), especially because of their convergence rates (optimal for the class of first-order techniques) and their ability to deal with large nonsmooth convex problems (e.g., Nesterov, 2007; Beck & Teboulle, 2009). In a nutshell, these methods can be seen as a natural extension of gradient-based techniques when the objective function to minimize has a nonsmooth part. In our context, when the dictionary D is fixed and 𝒜 = ℝ^{p×n}, we minimize for each signal x the following convex nonsmooth objective function w.r.t. α ∈ ℝ^p:

    f(α) + λ Ω(α),

where f(α) stands for the data-fitting term (1/2)‖x − Dα‖_2^2. At each iteration of the proximal algorithm, f is linearized around the current estimate α̂, and the current value of α is updated as the solution of the proximal problem:

    min_{α∈ℝ^p}  f(α̂) + (α − α̂)^⊤ ∇f(α̂) + λ Ω(α) + (L/2) ‖α − α̂‖_2^2.
The quadratic term keeps the update in a neighborhood where f is close to its linear approximation, and L > 0 is a parameter. This problem can be rewritten as

    min_{α∈ℝ^p}  (1/2) ‖α − (α̂ − (1/L) ∇f(α̂))‖_2^2 + (λ/L) Ω(α).
Solving this problem efficiently and exactly is crucial to enjoy the fast convergence rates of proximal methods. In addition, when the nonsmooth term Ω is not present, the previous proximal problem exactly leads to the standard gradient update rule. More generally, the proximal operator associated with our regularization term λΩ is the function that maps a vector u ∈ ℝ^p to the (unique) solution of

    min_{v∈ℝ^p}  (1/2) ‖u − v‖_2^2 + λ Ω(v).        (2)
In the simpler setting where G is the set of singletons, Ω is the ℓ1 norm, and the proximal operator is the (elementwise) soft-thresholding operator u_j ↦ sign(u_j) max(|u_j| − λ, 0), j ∈ {1, . . . , p}. Similarly, when the groups in G form a partition of the set of variables, we have a group-Lasso-like penalty, and the proximal operator can be computed in closed form (see Bengio et al., 2009, and references therein). This is a priori no longer possible as soon as some groups in G overlap, which is always the case in our hierarchical setting with tree-structured groups.
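As an illustration of these two special cases, here are minimal NumPy implementations (our sketches, not the paper's code) of the soft-thresholding operator and of the group-wise proximal operator when the groups form a partition:

```python
import numpy as np

def prox_l1(u, lam):
    """Soft-thresholding: prox of lam * ||.||_1, applied elementwise."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def prox_group_lasso(u, lam, partition, weights):
    """Prox of lam * sum_g w_g ||v_g||_2 when the groups form a partition:
    each block is shrunk toward zero by a group-wise soft-thresholding."""
    v = np.zeros_like(u)
    for g, w in zip(partition, weights):
        norm_g = np.linalg.norm(u[g])
        if norm_g > lam * w:
            v[g] = (1.0 - lam * w / norm_g) * u[g]
    return v

u = np.array([0.3, -1.2, 0.8, -0.1])
print(prox_l1(u, 0.5))
print(prox_group_lasso(u, 0.5, partition=[[0, 1], [2, 3]], weights=[1.0, 1.0]))
```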
3.2. Primal-Dual Interpretation
We now show that Eq. (2) can be solved with a primal-dual approach. The procedure solves a dual formulation of Eq. (2) involving the dual norm[4] of ‖·‖, denoted by ‖·‖_* and defined by ‖κ‖_* = max_{‖z‖≤1} z^⊤κ for any vector κ in ℝ^p. The formulation is described in the following lemma, which relies on conic duality (Boyd & Vandenberghe, 2004):
Lemma 1 (Dual of the proximal problem). Let u ∈ ℝ^p and let us consider the problem

    max_{ξ∈ℝ^{p×|G|}}  −(1/2) [ ‖u − Σ_{g∈G} ξ^g‖_2^2 − ‖u‖_2^2 ]
    s.t.  ∀ g ∈ G,  ‖ξ^g‖_* ≤ λ w_g  and  ξ^g_j = 0 if j ∉ g,        (3)

where ξ = (ξ^g)_{g∈G} and ξ^g_j denotes the j-th coordinate of the vector ξ^g in ℝ^p. Then, problems (2) and (3) are dual to each other and strong duality holds. In addition, the pair of primal-dual variables {v, ξ} is optimal if and only if ξ is a feasible point of the optimization problem (3), and

    v = u − Σ_{g∈G} ξ^g,  and  ∀ g ∈ G:  either  v_{|g}^⊤ ξ^g = ‖v_{|g}‖ ‖ξ^g‖_*  and  ‖ξ^g‖_* = λ w_g,  or  v_{|g} = 0.

[4] It is easy to show that the dual norm of the ℓ2 norm is the ℓ2 norm itself. The dual norm of the ℓ∞ norm is the ℓ1 norm.
Due to space limitations, we have omitted the detailed proofs from this section; they will be available in a longer version of this paper. Note that we focus here on specific tree-structured groups, but the previous lemma is valid regardless of the nature of G.

The structure of the dual problem of Eq. (3), i.e., the separability of the (convex) constraints for each vector ξ^g, g ∈ G, makes it possible to use block coordinate ascent (Bertsekas, 1999). Such a procedure is presented in Algorithm 1. It optimizes Eq. (3) sequentially with respect to the variable ξ^g, while keeping fixed the other variables ξ^h, h ≠ g. It is easy to see from Eq. (3) that such an update for a group g in G amounts to the orthogonal projection of the vector u_{|g} − Σ_{h≠g} ξ^h_{|g} onto the ball of radius λw_g of the dual norm ‖·‖_*. We denote this projection by Π_{λw_g}.
Algorithm 1 Block coordinate ascent in the dual
  Inputs: u ∈ ℝ^p and set of groups G.
  Outputs: (v, ξ) (primal-dual solutions).
  Initialization: v = u, ξ = 0.
  while (maximum number of iterations not reached) do
    for g ∈ G do
      v ← u − Σ_{h≠g} ξ^h.
      ξ^g ← Π_{λw_g}(v_{|g}).
    end for
  end while
  v ← u − Σ_{g∈G} ξ^g.
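The following NumPy sketch (ours) transcribes one pass of Algorithm 1 for the case ‖·‖ = ℓ2, whose dual norm is again the ℓ2 norm. It assumes the groups are already ordered so that each group comes before any group containing it (e.g., sorted by increasing size, as discussed in Section 2.2); as shown in Section 3.3, a single such pass already returns the exact solution. It is a direct O(p²) transcription, not the optimized O(p) recursion of Section 3.4.

```python
import numpy as np

def project_l2_ball(kappa, radius):
    """Orthogonal projection onto the l2 ball of the given radius."""
    norm = np.linalg.norm(kappa)
    return kappa if norm <= radius else (radius / norm) * kappa

def prox_tree_l2(u, lam, groups, weights):
    """One pass of Algorithm 1 for Omega built on the l2 norm."""
    p = len(u)
    xi = [np.zeros(p) for _ in groups]              # one dual variable per group
    for i, g in enumerate(groups):
        v = u - sum(xi[j] for j in range(len(groups)) if j != i)
        xi[i] = np.zeros(p)
        xi[i][g] = project_l2_ball(v[g], lam * weights[i])
    return u - sum(xi)                              # primal solution v of Eq. (2)

# Example, reusing the unit-weight groups from the Section 2.2 snippet:
# v = prox_tree_l2(u, lam=0.1, groups=sorted(groups, key=len), weights=weights)
```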
3.3. Convergence in One Pass
In general, Algorithm 1 is not guaranteed to solve Eq. (2) exactly in a finite number of iterations. However, when ‖·‖ is the ℓ2 or ℓ∞ norm, and provided that the groups in G are appropriately ordered, we now prove that only one pass of Algorithm 1, i.e., only one iteration over all groups, is sufficient to obtain the exact solution of Eq. (2). This result constitutes the main technical contribution of the paper.

Before stating this result, we need to introduce a key lemma showing that, given two nested groups g, h such that g ⊆ h ⊆ {1, . . . , p}, if ξ^g is updated before ξ^h in Algorithm 1, then the optimality condition for ξ^g is not perturbed by the update of ξ^h.
Lemma 2 (Projections with nested groups). Let ‖·‖ denote either the ℓ2 or ℓ∞ norm, and let g and h be two nested groups, that is, g ⊆ h ⊆ {1, . . . , p}. Let v be a vector in ℝ^p, and let us consider the successive projections

    κ^g = Π_{t_g}(v_{|g})  and  κ^h = Π_{t_h}(v_{|h} − κ^g),

with t_g, t_h > 0. Then, we also have κ^g = Π_{t_g}(v_{|g} − κ^h_{|g}).
The previous lemma establishes the convergence in one pass of Algorithm 1 in the case where G contains only two nested groups g ⊆ h, provided that ξ^g is computed before ξ^h. In the following proposition, this lemma is extended to general tree-structured sets of groups G:

Proposition 1 (Convergence in one pass). Suppose that the groups in G are ordered according to ⪯ and that the norm ‖·‖ is either the ℓ2 or ℓ∞ norm.[5] Then, after initializing ξ to 0, one pass of Algorithm 1 with the order ⪯ gives the solution of Eq. (2).

Proof sketch. The proof relies on Lemma 2. We proceed by induction, showing that the optimality conditions of Eq. (3) remain satisfied after each update in Algorithm 1. The induction is initialized at the leaves. Once the induction reaches the last group, i.e., after one complete pass over G, the dual variable ξ satisfies the optimality conditions of Eq. (3), which implies that {v, ξ} is optimal. Since strong duality holds, v is the solution of Eq. (2).
3.4. Efficient Computation of the Proximal Operator
Since one pass of Algorithm 1 involves |G| = p projections onto the ball of the dual norm (respectively the ℓ2 and the ℓ1 norm) of vectors in ℝ^p, a naive implementation leads to a complexity in O(p^2), since each of these projections can be obtained in O(p) operations (see Mairal et al., 2010, and references therein). However, the primal solution v, which is the quantity of interest, can be obtained with a better complexity, as stated below:

Proposition 2 (Complexity of the procedure).
i) For the ℓ2 norm, the primal solution v of Algorithm 1 can be obtained in O(p) operations.
ii) For the ℓ∞ norm, v can be obtained in O(pd) operations, where d is the depth of the tree.

The linear complexity in the ℓ2 norm case results from a recursive implementation. It exploits the fact that each projection amounts to a scaling, whose factor can be found without explicitly performing the full projection at each iteration. As for the ℓ∞ norm, since the groups at a given depth k ∈ {1, . . . , d} do not overlap, the cost of performing all the projections at this depth is O(p), which leads to a total complexity of O(dp). Note that d could depend on p as well: in an unbalanced tree, the worst case could be d = O(p), while in a balanced tree, one could have d = O(log(p)). In practice, the structures we have considered are relatively flat, with a depth not exceeding d = 5. Details will be provided in a longer version of this paper.
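For the ℓ∞ version of Ω, the dual norm is the ℓ1 norm, so each group update of Algorithm 1 becomes a Euclidean projection onto an ℓ1 ball. A standard sort-based O(|g| log |g|) projection is sketched below (our code; the paper's optimized variant is what achieves the O(pd) bound of Proposition 2). Substituting it for `project_l2_ball` in the earlier sketch, with radius λw_g, gives the ℓ∞ variant of the one-pass procedure.

```python
import numpy as np

def project_l1_ball(kappa, radius):
    """Euclidean projection of kappa onto the l1 ball {z : ||z||_1 <= radius}."""
    if np.abs(kappa).sum() <= radius:
        return kappa.copy()
    a = np.sort(np.abs(kappa))[::-1]                     # magnitudes in decreasing order
    cumsum = np.cumsum(a)
    ks = np.arange(1, len(a) + 1)
    k = np.nonzero(a > (cumsum - radius) / ks)[0][-1]    # last index passing the KKT test
    tau = (cumsum[k] - radius) / (k + 1)                 # soft-threshold level
    return np.sign(kappa) * np.maximum(np.abs(kappa) - tau, 0.0)
```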
3.5. Learning the Dictionary
We alternate between the updates of D and A, minimizing
over one while keeping the other variable fixed.
[5] Interestingly, we have observed that this was not true in general when ‖·‖ is an ℓ_q norm, for q ≠ 2 and q ≠ ∞.
Updating D. We have chosen to follow the matrix-inversion-free procedure of Mairal et al. (2010) for updating the dictionary. This method consists in a block-coordinate scheme over the columns of D. Specifically, we assume that the domain set 𝒟 has the form

    𝒟_µ ≜ {D ∈ ℝ^{m×p} :  µ ‖d^j‖_1 + (1 − µ) ‖d^j‖_2^2 ≤ 1, for all j},        (4)

or 𝒟_µ^+ ≜ 𝒟_µ ∩ ℝ_+^{m×p}, with µ ∈ [0, 1]. The choice of these particular domain sets is motivated by the experiments of Section 4. For natural image patches, the dictionary elements are usually constrained to be in the unit ℓ2 norm ball (i.e., 𝒟 = 𝒟_0), while for topic modeling, the dictionary elements are distributions of words and therefore belong to the simplex (i.e., 𝒟 = 𝒟_1^+). The update of each dictionary element amounts to performing a Euclidean projection, which can be computed efficiently (Mairal et al., 2010). Concerning the stopping criterion, we follow the strategy of the same authors and go over the columns of D only a few times, typically 5 times in our experiments.
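As an illustration of this update (our simplified sketch, not the code of Mairal et al., 2010), a block-coordinate pass over the columns of D for the unit ℓ2-ball constraint (the µ = 0 case of Eq. (4)) can be written as follows; it recomputes the matrices B = XAᵀ and C = AAᵀ from scratch instead of maintaining them online.

```python
import numpy as np

def update_dictionary(D, X, A, n_passes=5, eps=1e-12):
    """Block-coordinate update of D for fixed codes A (square loss),
    with each column projected onto the unit l2 ball."""
    B = X @ A.T                              # m x p
    C = A @ A.T                              # p x p
    for _ in range(n_passes):                # a few passes over the columns (Section 3.5)
        for j in range(D.shape[1]):
            if C[j, j] < eps:                # atom unused by the current codes: skip it
                continue
            u = D[:, j] + (B[:, j] - D @ C[:, j]) / C[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D
```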
Updating the vectors α^i. The procedure for updating the columns of A is built upon the results derived in Section 3.2. We have shown that the proximal operator from Eq. (2) can be computed exactly and efficiently. This makes it possible to use fast proximal techniques suited to nonsmooth convex optimization.
Specifically, we have tried the accelerated schemes of both Nesterov (2007) and Beck & Teboulle (2009), and finally opted for the latter since, for a comparable level of precision, fewer calls to the proximal operator are required. The procedure of Beck & Teboulle (2009) basically follows Section 3.1, except that the proximal operator is not applied directly to the current estimate, but to an auxiliary sequence of points that linearly combines past estimates. This algorithm has an optimal convergence rate in the class of first-order techniques, and also allows warm restarts, which is crucial in our alternating scheme.
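Schematically, the resulting accelerated proximal loop for one signal x with D fixed looks as follows (our sketch of a FISTA-type iteration built around the tree proximal operator sketched earlier; the step size 1/L, iteration count, and function names are our choices):

```python
import numpy as np

def decompose(x, D, lam, prox, n_iter=100):
    """Accelerated proximal gradient for min_a 0.5*||x - D a||_2^2 + lam*Omega(a)."""
    p = D.shape[1]
    L = np.linalg.norm(D, ord=2) ** 2        # Lipschitz constant of the smooth part
    alpha = np.zeros(p)
    y, t = alpha.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)             # gradient of the data-fitting term at y
        alpha_new = prox(y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = alpha_new + ((t - 1.0) / t_new) * (alpha_new - alpha)
        alpha, t = alpha_new, t_new
    return alpha

# e.g., prox = lambda u, lam: prox_tree_l2(u, lam, groups_sorted, weights)
```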
Furthermore, positivity constraints can be added on the domain of A, by noticing that, for our norm Ω and any u ∈ ℝ^p, adding these constraints when computing the proximal operator is equivalent to solving

    min_{v∈ℝ^p}  (1/2) ‖[u]_+ − v‖_2^2 + λ Ω(v),

with ([u]_+)_j = max{u_j, 0}. We will indeed use positive decompositions to model text corpora in Section 4.
Finally, we monitor the convergence of the algorithm by checking the relative decrease in the cost function. We also investigated the derivation of a duality gap, but this requires computing the dual norm Ω^*, for which no closed form is available; computing approximations of Ω^* based on bounds from Jenatton et al. (2009) turned out to be too slow for our experiments.
4. Experiments
4.1. Natural Image Patches
This experiment studies whether a hierarchical structure can help dictionaries for denoising natural image patches, and in which noise regime the potential gain is significant. We aim at reconstructing corrupted patches from a test set, after having learned dictionaries on a training set of non-corrupted patches. Though not typical in machine learning, this setting is reasonable in the context of images, where lots of non-corrupted patches are easily available.[6]
We have extracted 100,000 patches of size m = 8 × 8 pixels from the Berkeley segmentation database of natural images,[7] which contains a high variability of scenes. We have then split this dataset into a training set X_tr, a validation set X_val, and a test set X_te, respectively of size 50,000, 25,000, and 25,000 patches. All the patches are centered and normalized to have unit ℓ2 norm.
For the first experiment, the dictionary D is learned on X_tr using the formulation of Eq. (1), with µ = 0 for 𝒟_µ defined in Eq. (4). The validation and test sets are corrupted by removing a certain percentage of pixels, the task being to reconstruct the missing pixels from the known pixels. We thus introduce, for each element x of the validation/test set, a vector x̃ equal to x for the known pixel values and 0 otherwise. In the same way, we define D̃ as the matrix equal to D, except for the rows corresponding to missing pixel values, which are set to 0. By decomposing x̃ on D̃, we obtain a sparse code α, and the estimate of the reconstructed patch is defined as Dα. Note that this procedure assumes that we know which pixels are missing and which are not for every element x.
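A sketch of this inpainting protocol (our code; `solver` stands for any sparse decomposition routine, e.g., the accelerated scheme of Section 3.5 with the tree proximal operator, and `mask` is a boolean vector marking the known pixels):

```python
import numpy as np

def reconstruct_patch(x, mask, D, lam, solver):
    """Decompose the observed pixels on the masked dictionary, then
    reconstruct the full patch with the complete dictionary D."""
    x_tilde = np.where(mask, x, 0.0)       # known pixel values kept, missing ones set to 0
    D_tilde = D * mask[:, None]            # rows corresponding to missing pixels set to 0
    alpha = solver(x_tilde, D_tilde, lam)  # sparse code computed from the observed part
    return D @ alpha                       # estimate of the reconstructed patch
```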
The parameters of the experiment are the regularization parameter λ_tr used during the training step, the regularization parameter λ_te used during the validation/test step, and the structure of the tree. For every reported result, these parameters have been selected as the ones offering the best performance on the validation set, before reporting any result from the test set. The values of the regularization parameters λ_tr, λ_te were tested on a logarithmic scale {2^{−10}, 2^{−9}, . . . , 2^{−2}}, and then further refined on a finer logarithmic scale of factor 2^{1/4}. For simplicity, we have chosen arbitrarily to use the ℓ∞ norm in the structured norm Ω, with all the weights equal to one. We have tested 21 balanced tree structures of depth 3 and 4, with different branching factors p_1, p_2, . . . , p_{d−1}, where d is the depth of the tree and p_k, k ∈ {1, . . . , d − 1}, is the number of children of the nodes at depth k.

[6] Note that we study the ability of the model to reconstruct independent patches, and additional work is required to apply our framework to a full image processing task, where patches usually overlap (Elad & Aharon, 2006).

[7] www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/

Table 1. Quantitative results of the reconstruction task on natural image patches. First row: percentage of missing pixels. Second and third rows: mean square error multiplied by 100, respectively for classical sparse coding (flat) and tree-structured sparse coding (tree).

    noise   50 %          60 %          70 %          80 %          90 %
    flat    19.3 ± 0.1    26.8 ± 0.1    36.7 ± 0.1    50.6 ± 0.0    72.1 ± 0.0
    tree    18.6 ± 0.1    25.7 ± 0.1    35.0 ± 0.1    48.0 ± 0.0    65.9 ± 0.3
The branching factors tested for the trees of depth 3 were p_1 ∈ {5, 10, 20, 40, 60, 80, 100} and p_2 ∈ {2, 3}, and for trees of depth 4, p_1 ∈ {5, 10, 20, 40}, p_2 ∈ {2, 3} and p_3 = 2, giving 21 possible structures associated with dictionaries with at most 401 elements.
For each tree structure, we evaluated the performance obtained with the tree-structured dictionary along with that of the non-structured dictionary containing the same number of elements. These experiments were carried out four times, each time with a different initialization and a different noise realization. Quantitative results are reported in Table 1. For every number of missing pixels, the tree-structured dictionary outperforms the “unstructured” one, and the most significant improvement is obtained in the noisiest setting. Note that having more dictionary elements is worthwhile when using the tree structure. To study the influence of the chosen structure, we report in Figure 2 the results obtained by the 14 tested structures of depth 3, along with those obtained with the unstructured dictionaries containing the same number of elements, when 90% of the pixels are missing. For every number of dictionary elements, the tree-structured dictionary significantly outperforms the unstructured ones. An example of a learned tree-structured dictionary is presented in Figure 3. Dictionary elements naturally organize themselves in groups of patches, often with low frequencies near the root of the tree and high frequencies near the leaves. Dictionary elements tend to be highly correlated with their parents.
Figure 2. Mean square error multiplied by 100 obtained with the 14 tested structures, with error bars, sorted by number of dictionary elements. Red plain bars represent the tree-structured dictionaries; white bars correspond to the flat dictionary model containing the same number of dictionary elements as the tree-structured one. For readability purposes, the y-axis of the graph starts at 50.
Figure 3. Learned dictionary with a tree structure of depth 4. The root of the tree is in the middle of the figure. The branching factors are p_1 = 10, p_2 = 2, p_3 = 2. The dictionary is learned on 50,000 patches of size 16 × 16 pixels.
4.2. Text Documents
This second experimental section shows that our approach can also be applied to model text corpora. The goal of probabilistic topic models is to find a low-dimensional representation of a collection of documents, where the representation should provide a semantic description of the collection. Within a parametric Bayesian framework, latent Dirichlet allocation (LDA) (Blei et al., 2003) models documents as a mixture of a predefined number of latent topics that are distributions over a fixed vocabulary. When one marginalizes over the topic random variable, one gets multinomial PCA (Buntine, 2002). The number of topics is usually small compared to the size of the vocabulary (e.g., 100 against 10,000), so that the topic proportions of each document give a compact representation of the corpus. For instance, these new features can be used to feed a classifier in a subsequent classification task. We will similarly use our dictionary learning approach to find low-dimensional representations of text corpora.
Suppose that the signals X = [x^1, . . . , x^n] ∈ ℝ^{m×n} are n documents over a vocabulary of m words, the k-th component of x^i standing for the frequency of the k-th word in document i. If we further assume that the entries of D and A are nonnegative, and that the dictionary elements d^j have unit ℓ1 norm, the decomposition DA can be seen as a mixture of p topics. The regularization Ω organizes these topics in a tree, so that, if a document involves a certain topic, then all its ancestors in the tree are also present in the topic decomposition. Since the hierarchy is shared by all documents, the topics located at the top of the tree will be part of every decomposition, and should therefore correspond to topics common to all documents. Conversely, the deeper the topics in the tree, the more specific they should be. It is worth mentioning the extension of LDA that considers hierarchies of topics from a non-parametric Bayesian viewpoint (Blei et al., 2010). We plan to compare to this model in future work.

Figure 4. Example of a topic hierarchy estimated from 1714 NIPS proceedings papers (from 1988 through 1999). Each node corresponds to a topic whose 5 most important words are displayed. Single characters such as n, t, r are part of the vocabulary and often appear in NIPS papers, and their place in the hierarchy is semantically relevant to the children topics.
Visualization of NIPS proceedings. We first qualitatively illustrate our dictionary learning approach on the NIPS proceedings papers from 1988 through 1999.[8] After removing words appearing fewer than 10 times, the dataset is composed of 1714 articles, with a vocabulary of 8274 words. As explained above, we consider 𝒟_1^+ and take 𝒜 to be ℝ_+^{p×n}. Figure 4 displays an example of a learned dictionary with 13 topics, obtained by using the ℓ∞ norm in Ω and selecting manually λ = 2^{−15}. Similarly to Blei et al. (2010), we interestingly capture the stopwords at the root of the tree, and the different subdomains of the conference such as neuroscience, optimization or learning theory.
Posting classification. We now consider a binary classification task of postings from the 20 Newsgroups data set.[9] We classify the postings from the two newsgroups alt.atheism and talk.religion.misc, following the setting of Zhu et al. (2009).

[8] http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

[9] See http://people.csail.mit.edu/jrennie/20Newsgroups/

Figure 5. Binary classification of two newsgroups: classification accuracy (%), as a function of the number of topics (3, 7, 15, 31, 63), for different dimensionality reduction techniques (PCA, NMF, LDA, SpDL, SpHDL) coupled with a linear SVM classifier. The bars and the errors are respectively the mean and the standard deviation, based on 10 random splits of the data set. Best seen in color.
After removing words appearing fewer than 10 times and standard stopwords, these postings form a data set of 1425 documents over a vocabulary of 13312 words. We compare different dimensionality reduction techniques that we use to feed a linear SVM classifier, i.e., we consider (i) LDA (with the code from Blei et al., 2003), (ii) principal component analysis (PCA), (iii) non-negative matrix factorization (NMF), (iv) standard sparse dictionary learning (denoted by SpDL) and (v) our sparse hierarchical approach (denoted by SpHDL). Both SpDL and SpHDL are optimized over 𝒟_1^+ and 𝒜 = ℝ_+^{p×n}, with the weights w_g equal to 1. We proceed as follows: given a random split into a training/test set of 1000/425 postings, and given a number of topics p (also the number of components for PCA, NMF, SpDL and SpHDL), we train an SVM classifier based on the low-dimensional representation of the postings. This is performed on the training set of 1000 postings, where the parameters λ ∈ {2^{−26}, . . . , 2^{−5}} and/or C_svm ∈ {4^{−3}, . . . , 4^{1}} are selected by 5-fold cross-validation. We report in Figure 5 the average classification scores on the test set of 425 postings, based on 10 random splits, for different numbers of topics. Unlike the experiment on the image patches, we consider only one type of tree structure, namely complete binary trees with depths in {1, . . . , 5}. The results of Figure 5 show that SpDL and SpHDL perform better than the other dimensionality reduction techniques on this task. As a baseline, the SVM classifier applied directly to the raw data (the 13312 words) obtains a score of 90.9 ± 1.1, which is better than all the tested methods, but without dimensionality reduction (as already reported by Blei et al., 2003). Moreover, the error bars indicate that, though nonconvex, SpDL and SpHDL do not seem to suffer much from instability issues. Even if SpDL and SpHDL perform similarly, SpHDL has the advantage of giving a topic mixture that is more interpretable in terms of hierarchy, which standard unstructured sparse coding cannot provide.
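Schematically, the evaluation pipeline for one dimensionality reduction method can be written as follows (our sketch using scikit-learn, which the paper does not mention; the C grid matches the range given above):

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def classify_topics(A_train, y_train, A_test, y_test):
    """Train a linear SVM on low-dimensional document representations
    (columns of A from any of the compared methods) and report test accuracy."""
    grid = {"C": [4.0 ** k for k in range(-3, 2)]}       # C_svm in {4^-3, ..., 4^1}
    svm = GridSearchCV(LinearSVC(), grid, cv=5)          # 5-fold cross-validation
    svm.fit(A_train.T, y_train)                          # documents as rows
    return svm.score(A_test.T, y_test)
```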
5. Discussion
We have shown in this paper that tree-structured sparse decomposition problems can be solved at the same computational cost as classical decompositions based on the ℓ1 norm. We have used this approach to learn dictionaries embedded in trees, with applications to the representation of natural image patches and text documents.

We believe that the connection established between sparse methods and probabilistic topic models should prove fruitful, as the two lines of work have focused on different aspects of the same unsupervised learning problem: our approach is based on convex optimization tools, and provides experimentally more stable data representations. Moreover, it can be easily extended with the same tools to other types of structures corresponding to other norms (Jenatton et al., 2009; Jacob et al., 2009). However, it is not able to learn elegantly and automatically model parameters such as the dictionary size or the tree topology, which Bayesian methods can. Finally, another interesting common line of research to pursue is the supervised design of dictionaries, which has been proven useful in both frameworks (Mairal et al., 2009; Blei & McAuliffe, 2008).
Acknowledgments
This paper was partially supported by grants from the
Agence Nationale de la Recherche (MGA Project) and
from the European Research Council (SIERRA Project).
References
Bach, F. High-dimensional non-linear variable selection
through hierarchical kernel learning. Technical report,
arXiv:0909.0844, 2009.
Beck, A. and Teboulle, M. A fast iterative shrinkage-
thresholding algorithm for linear inverse problems.
SIAM J. Imag. Sci., 2(1):183–202, 2009.
Bengio, S., Pereira, F., Singer, Y., and Strelow, D. Group
sparse coding. In Adv. NIPS, 2009.
Bengio, Y. Learning deep architectures for AI. Foundations
and Trends in Machine Learning, 2(1), 2009.
Bertsekas, D. P. Nonlinear programming. Athena Scientific
Belmont, 1999.
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation.
J. Mach. Learn. Res., 3:993–1022, 2003.
Blei, D. and McAuliffe, J. Supervised topic models. In
Adv. NIPS, 2008.
Blei, D., Griffiths, T., and Jordan, M. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 2010.
Boyd, S. P. and Vandenberghe, L. Convex optimization.
Cambridge University Press, 2004.
Buntine, W.L. Variational Extensions to EM and Multino-
mial PCA. In Proc. ECML, 2002.
Elad, M. and Aharon, M. Image denoising via sparse
and redundant representations over learned dictionaries.
IEEE Trans. Image Process., 15(12):3736–3745, 2006.
Jacob, L., Obozinski, G., and Vert, J.-P. Group Lasso with
overlap and graph Lasso. In Proc. ICML, 2009.
Jenatton, R., Audibert, J.-Y., and Bach, F. Structured vari-
able selection with sparsity-inducing norms. Technical
report, arXiv:0904.3523, 2009.
Jenatton, R., Obozinski, G., and Bach, F. Structured sparse
principal component analysis. In Proc. AISTATS, 2010.
Ji, S. and Ye, J. An accelerated gradient method for trace
norm minimization. In Proc. ICML, 2009.
Kavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun,
Y. Learning invariant features through topographic fil-
ter maps. In Proc. CVPR, 2009.
Kim, S. and Xing, E. P. Tree-guided group lasso for multi-
task regression with structured sparsity. Technical report,
arXiv:0909.1373, 2009.
Lee, D. D. and Seung, H. S. Learning the parts of objects by
non-negative matrix factorization. Nature, 401(6755):
788–791, 1999.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse
coding algorithms. In Adv. NIPS, 2007.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman,
A. Supervised Dictionary Learning. In Adv. NIPS, 2009.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online learn-
ing for matrix factorization and sparse coding. J. Mach.
Learn. Res., 11:19–60, 2010.
Nesterov, Y. Gradient methods for minimizing composite
objective function. Technical report, CORE, 2007.
Olshausen, B. A. and Field, D. J. Sparse coding with an
overcomplete basis set: A strategy employed by V1? Vi-
sion Research, 37:3311–3325, 1997.
Zhao, P., Rocha, G., and Yu, B. The composite absolute
penalties family for grouped and hierarchical variable se-
lection. Ann. Stat., 37(6A):3468–3497, 2009.
Zhu, J., Ahmed, A., and Xing, E. P. MedLDA: maximum
margin supervised topic models for regression and clas-
sification. In Proc. ICML, 2009.