Multitask Multiple Kernel Learning
Pratik Jawanpuria∗
J. Saketha Nath†
Abstract
This paper presents two novel formulations for learning
shared feature representations across multiple tasks. The
idea is to pose the problem as that of learning a shared
kernel, which is constructed from a given set of base kernels,
leading to improved generalization in all the tasks. The first
formulation employs a (l1, lp), p ≥ 2 mixed norm regularizer promoting sparse combinations of the base kernels and unequal weightings across tasks, enabling the formulation to work with unequally reliable tasks. While this convex formulation can be solved using a suitable mirror-descent algorithm, it may not learn shared feature representations which are sparse. The second formulation extends these ideas to learning sparse feature representations constructed from multiple base kernels and shared across multiple tasks. The sparse feature representation learnt by this formulation is essentially a direct product of low-dimensional subspaces lying in the induced feature spaces of a few base kernels. The formulation is posed as a (l1, lq), q ≥ 1 mixed Schatten-norm regularized problem. One main contribution of this paper is a novel mirror-descent based algorithm for solving this problem, which is not a standard setup studied in the optimization literature. The proposed formulations can also be understood as generalizations of the framework of multiple kernel learning to the case of multiple tasks and hence are suitable for various learning applications. Simulation results on real-world datasets show that the proposed formulations generalize better than the state-of-the-art. The results also illustrate the efficacy of the proposed mirror-descent based algorithms.

Keywords: multitask feature learning, multiple kernel learning, mirror-descent, Schatten-norm regularization

1 Introduction
The problem of learning a shared feature representation across multiple related tasks (multitask feature learning) is commonly encountered in many real-world applications. For instance, consider the application of object recognition, and consider each task as that of identifying the images containing a particular object. Note that images of different objects share a common set of features, perhaps those describing lines, shapes, textures etc., which are different from the input features describing the pixels. It is important to discover such shared feature representations, leading to improved recognition in all the tasks [1, 2, 3]. Also note that in such applications, the high-level features shared across tasks are far fewer than the low-level input features. Thus it is also natural to seek sparse feature representations which are shared across multiple tasks. This paper presents two novel formulations which simultaneously learn the shared feature representations as well as the corresponding optimal task predictors. In general, the former learns a non-sparse representation (henceforth denoted by MKMTFL¹); whereas the latter learns a sparse one (henceforth denoted by MKMTSFL²).

∗CSE, IIT Bombay, INDIA. Email: pratik.j@cse.iitb.ac.in
†CSE, IIT Bombay, INDIA. Email: saketh@cse.iitb.ac.in
The key idea is to pose the problem of multitask feature learning as that of learning a shared kernel, constructed by combining a given set of base kernels, in order to improve the generalization in all the tasks simultaneously. Hence the work can also be understood as an extension of the Multiple Kernel Learning (MKL) framework [4, 5, 6] to the case of multiple tasks. The primary objective in MKL is indeed to learn a kernel best suited for a given task by optimally combining the base kernels.

The proposed MKMTFL formulation employs a mixed (l1, lp), p ≥ 2 norm regularizer over the RKHS norms of the feature loadings corresponding to the given tasks and base kernels. The lp-norm regularizer is applied across tasks and promotes unequal weightings across tasks; by varying p one can achieve various schemes of unequal weightings for the tasks. This may be useful in handling tasks with unequal reliability [7]. The l1-norm regularizer is applied over the lp-norms, leading to learning the shared kernel as a sparse combination of the given base kernels. The formulation is solved by adapting the mirror-descent [8, 9, 10] based algorithm proposed in [11] to the present case. While this formulation, in general, learns non-sparse feature representations, in some applications like object recognition learning sparse feature representations may be useful.

¹The abbreviation denotes multiple kernel multitask feature learning.
²The abbreviation denotes multiple kernel multitask sparse feature learning.

Copyright © SIAM. Unauthorized reproduction of this article is prohibited.

In the recent past, several works have focused
on this important problem of learning sparse feature representations which are shared across multiple tasks [12, 13, 14, 15, 1, 16, 17]. A majority of these methods simultaneously learn a shared low-dimensional (sparse) feature representation as well as the corresponding optimal task predictors. Motivated from both computational as well as regularization perspectives, these methods restrict the search for the optimal feature space to low-dimensional subspaces of the input feature space (i.e., the final feature representation can be obtained from the input features by applying a suitable rotation and then neglecting unimportant features). Hence the quality of the final feature representation depends on the quality of the input features.

The proposed MKMTSFL formulation aims at reducing this risk of employing low quality input features by considering an enriched input space induced by multiple base kernels. MKMTSFL essentially looks for low-dimensional subspaces in the induced feature spaces of the base kernels which are important for all the tasks at hand, while insisting on employing as few base kernels as possible. The latter constraint helps in eliminating kernels (and the corresponding features) with overall low quality across the tasks. In the special case where the number of base kernels is unity, MKMTSFL reduces to the formulation in [1] (henceforth denoted by MTSFL), which is shown to achieve state-of-the-art performance on several benchmark multitask datasets and hence is considered as a baseline for comparison in the simulations (refer section 5).
The MKMTSFL formulation is posed as a (l1, lq), q ≥ 1 mixed Schatten-norm regularized problem. Such problems are non-standard in the literature and call for novel optimization methodologies. One main contribution of this paper is an efficient mirror-descent based algorithm for solving the MKMTSFL formulation. Mirror-Descent (MD) is similar in spirit to the Projected Gradient Descent (PGD) [18, 19] algorithm. While the key computational step in PGD is the projection onto the feasibility set, MD employs a suitable regularizer such that the projection problem is simple. MD has been studied in the context of two standard feasibility sets: i) the simplex; ii) the spectrahedron. However, the feasibility set with MKMTSFL is a (l1, lq), q ≥ 1 mixed Schatten-norm ball and is hence non-standard. In this paper we show that the entropy function can be employed as the regularizer in the context of MD, leading to an efficient algorithm for solving MKMTSFL.

Simulation results on real-world datasets clearly indicate that the proposed methods achieve better generalization than the state-of-the-art, and hence that employing multiple base kernels is beneficial in the case of multiple tasks. The results also show that, in the special case where the number of kernels is unity in the MKMTSFL formulation, the proposed mirror-descent based algorithm converges faster (with similar per-iteration complexity) than the alternate minimization algorithm in [1].
In order to make the paper self-contained, we provide a brief introduction to the generic scheme of the mirror-descent algorithm in the subsequent section. The MKMTFL formulation is presented in section 3, while the MKMTSFL formulation is presented in section 4; these sections also describe the mirror-descent based algorithms for solving the formulations. The results of the simulations are detailed in section 5. The paper concludes with a brief summary and directions for future work.

2 Mirror-Descent Algorithm
This section presents the Mirror-Descent (MD) algorithm, which is suitable for solving problems of the form min_{x ∈ X} f(x), where X is a convex compact set and f is a convex, Lipschitz continuous function on X. It is assumed that an oracle which can compute a subgradient (∇f) of f at any point in X is available (an oracle for computing f itself is not necessary). Interestingly, both the proposed formulations can indeed be posed as convex minimization problems over convex compact sets, and oracles for computing the subgradient of the objective exist. Hence the mirror-descent algorithm can be employed for solving them efficiently.

MD is close in spirit to the projected gradient descent algorithm, where the update rule is x^(l+1) = Π_X(x^(l) − s_l ∇f(x^(l))); here Π_X denotes the projection onto the set X, and x^(l), s_l are the iterate value and step-size in the l-th iteration respectively. Note that this update rule is equivalent to

x^(l+1) = argmin_{x ∈ X} x⊤∇f(x^(l)) + (1/(2 s_l)) ‖x − x^(l)‖₂²

The interpretation of this rule is: minimize a local linear approximation of the function while penalizing deviations from the current iterate (as the linear approximation is valid only locally). The step-size takes the role of the regularization parameter. The key idea in mirror-descent is to choose the regularization term penalizing deviations from the current iterate in such a way that this per-step optimization problem is easy to solve, leading to computationally efficient gradient descent procedures. For convergence guarantees to hold, the regularizer needs to be chosen as a Bregman divergence, i.e., (1/2)‖x − x^(l)‖₂² is replaced by some Bregman divergence term ω_{x^(l)}(x) = ω(x) − ω(x^(l)) − ∇ω(x^(l))⊤(x − x^(l)), where ω is the (strongly convex) generating function of the Bregman divergence employed. The step-sizes³ can be chosen as any divergent series, e.g., s_l = 1/l^p, 0 ≤ p ≤ 1. Note that in case ω is chosen as ω(x) = (1/2)‖x‖₂², the projected gradient descent algorithm is recovered. The per-step minimization problem mentioned above can now be rewritten in terms of ω as:

(2.1)  x^(l+1) = argmin_{x ∈ X} x⊤ζ^(l) + ω(x)

where ζ^(l) = s_l ∇f(x^(l)) − ∇ω(x^(l)). As mentioned above, the strongly convex function ω is chosen cleverly, based on X, such that (2.1) turns out to be an easy problem to solve. There are two well-known cases where the MD algorithm almost achieves the information-based optimal rates of convergence [10]: i) X is a simplex in R^d (i.e., X = { x ∈ R^d | x_i ≥ 0, Σ_{i=1}^d x_i = 1 }) and ω is chosen as the negative entropy function ω(x) = Σ_{i=1}^d x_i log(x_i); ii) X is a spectrahedron (i.e., the set of all symmetric positive semi-definite matrices with unit trace) and ω(x) = Trace(x log(x)). In fact, in the former case, the per-step problem (2.1) has an analytical solution:

(2.2)  x_i^(l+1) = exp{−ζ_i^(l)} / Σ_{j=1}^d exp{−ζ_j^(l)}

Also for this case, the number of iterations can be shown to grow as log(d), and hence is nearly independent of the dimensionality of the problem.

³Refer [10] for notes on choosing the step-sizes optimally.
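For concreteness, the entropic update (2.2) on the simplex can be sketched as follows; the quadratic toy objective and the step-size schedule here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def entropic_mirror_descent(grad_f, d, n_iters=500):
    """Mirror descent on the probability simplex with the negative-entropy
    generator; each step applies the closed-form update (2.2), i.e. a
    multiplicative-weights update followed by renormalization."""
    x = np.full(d, 1.0 / d)                 # start at the simplex centre
    for l in range(1, n_iters + 1):
        s = 1.0 / np.sqrt(l)                # a divergent step-size series
        w = x * np.exp(-s * grad_f(x))      # x_i <- x_i * exp(-s * grad_i)
        x = w / w.sum()                     # renormalize onto the simplex
    return x

# Toy convex objective f(x) = x^T A x restricted to the simplex.
A = np.diag([3.0, 1.0, 2.0])
x = entropic_mirror_descent(lambda v: 2.0 * A @ v, d=3)
```

With these choices the iterate stays strictly inside the simplex, so no explicit projection is ever needed; this is exactly why the entropy generator makes the per-step problem (2.1) trivial.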
3 Multiple Kernel Multitask Feature Learning

This section presents the MKMTFL formulation and the mirror-descent based algorithm for solving it. The derivations are presented for the case where each task is a binary classification problem; however, it is easy to extend the derivations to other learning tasks as well. Let D = {(x_ti, y_ti), i = 1,...,m_t, t = 1,...,T} be the training dataset, where x_ti represents the i-th example of the t-th task and y_ti is its label. Let K_j, j = 1,...,k be the given set of base kernels and let φ_j(·) represent the feature vector induced by the j-th kernel. Let the linear discriminator for the t-th task be Σ_{j=1}^k w_tj⊤ φ_j(x) − b_t = 0. Low empirical risk over each task would imply minimizing the following hinge loss:

Σ_{t=1}^T Σ_{i=1}^{m_t} max( 0, 1 − y_ti ( Σ_{j=1}^k w_tj⊤ φ_j(x_ti) − b_t ) ).

We employ the following mixed-norm based regularizer:

( Σ_{j=1}^k ( Σ_{t=1}^T (‖w_tj‖₂)^p )^{1/p} )²,  where p ≥ 2.

This regularizer employs a mixed norm over the RKHS norms (i.e., the ‖w_tj‖₂). More specifically, it involves an l1 norm across kernels and an lp norm across tasks, and hence promotes sparsity across kernels and non-sparse combinations across tasks. With this regularizer, the feature loadings for a particular kernel are encouraged to be either zero or non-zero across all the tasks (refer [20, 21] also). Hence the formulation does achieve a shared feature representation, as desired. Mathematically, the proposed MKMTFL formulation can be written as:

min_{w,b,ξ}  (1/2) ( Σ_{j=1}^k ( Σ_{t=1}^T ‖w_tj‖₂^p )^{1/p} )² + C Σ_{t=1}^T Σ_{i=1}^{m_t} ξ_ti
s.t.  y_ti ( Σ_{j=1}^k w_tj⊤ φ_j(x_ti) − b_t ) ≥ 1 − ξ_ti,  ξ_ti ≥ 0

The MKMTFL formulation can also be understood as an extension of the standard MKL formulation [6] to the case of multiple tasks. In fact, in the special case where the number of base kernels is unity, it reduces to the MKL formulation. Also, it can be understood as an adaptation of the composite absolute penalties family studied in [22, 23, 11] to the current problem. We now rewrite this formulation in a convenient form which can be efficiently solved using mirror-descent based algorithms. We introduce some more notation: let Δ_{d,r} = { z ≡ [z_1 ... z_d]⊤ | Σ_{i=1}^d z_i^r ≤ 1, z_i ≥ 0, i = 1,...,d } and, with a slight abuse of notation, let Δ_{d,1} = Δ_d. Next we note the following lemma [24]:
Lemma 3.1. Let a_i ≥ 0, i = 1,...,d and 1 ≤ r < ∞. Then, for Δ_{d,r} defined as before,

min_{η ∈ Δ_{d,r}} Σ_i a_i / η_i = ( Σ_{i=1}^d a_i^{r/(r+1)} )^{(r+1)/r}

and the minimum is attained at

η_i = a_i^{1/(r+1)} / ( Σ_{i=1}^d a_i^{r/(r+1)} )^{1/r}

with the convention that a/0 is 0 if a = 0 and is ∞ if a ≠ 0.
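The lemma is easy to verify numerically; the following sketch (with arbitrary illustrative values of a and r) checks both the optimal value and the feasibility of the stated minimizer.

```python
import numpy as np

# Check Lemma 3.1 for arbitrary a_i >= 0 and r >= 1.
a = np.array([0.3, 1.2, 0.0, 2.5])
r = 2.0

S = np.sum(a ** (r / (r + 1)))
closed_form = S ** ((r + 1) / r)                 # claimed minimum value

eta = a ** (1.0 / (r + 1)) / S ** (1.0 / r)      # claimed minimizer
# Convention of the lemma: a/0 := 0 when a = 0.
value = np.sum(np.divide(a, eta, out=np.zeros_like(a), where=eta > 0))

assert np.isclose(value, closed_form)            # the minimum is attained
assert np.isclose(np.sum(eta ** r), 1.0)         # eta lies in Delta_{d,r}
```

Sampling random feasible points of Δ_{d,r} gives objective values no smaller than closed_form, consistent with the claimed minimality.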
Using the result of the lemma (with r = 1) and introducing variables γ = [γ_1 ... γ_k]⊤, we have:

( Σ_{j=1}^k ( Σ_{t=1}^T (‖w_tj‖₂)^p )^{1/p} )² = min_{γ ∈ Δ_k} Σ_{j=1}^k ( Σ_{t=1}^T (‖w_tj‖₂)^p )^{2/p} / γ_j

Now introducing dual variables λ_j = [λ_j1 ... λ_jT]⊤, j = 1,...,k and using the notion of the dual norm [25], we obtain:

( Σ_{t=1}^T (‖w_tj‖₂)^p )^{2/p} = max_{λ_j ∈ Δ_{T,p̄}} Σ_{t=1}^T λ_jt ‖w_tj‖₂²,  where p̄ = p/(p−2).

With this, the objective in the MKMTFL formulation can now be written as:

min_{γ ∈ Δ_k} min_{w,b,ξ} max_{λ_j ∈ Δ_{T,p̄}} (1/2) Σ_{j=1}^k ( Σ_{t=1}^T λ_jt ‖w_tj‖₂² ) / γ_j + C Σ_{t=1}^T Σ_{i=1}^{m_t} ξ_ti
After interchanging the min. and max. using the min-max theorem [26] and writing the Lagrange dual of the resulting formulation w.r.t. the variables w, b, ξ alone, we obtain the following:

(3.3)  min_{γ ∈ Δ_k} max_{λ_j ∈ Δ_{T,p̄}} max_{α_t ∈ S_{m_t}(C)} g(λ, α; γ)

where

g(λ, α; γ) = Σ_{t=1}^T 1⊤α_t − (1/2) α_t⊤ Y_t ( Σ_{j=1}^k (γ_j / λ_jt) K_tj ) Y_t α_t

Here, α_t is the vector of Lagrange multipliers corresponding to the m_t classification constraints of the t-th task in the MKMTFL formulation, S_{m_t}(C) ≡ { α_t | 0 ≤ α_t ≤ C1, y_t⊤ α_t = 0 }, 0 and 1 denote appropriately sized vectors of zeros and ones respectively, y_t is the vector of labels of the t-th task training data points, Y_t is a diagonal matrix with entries y_t, and K_tj represents the gram matrix of the t-th task training data points w.r.t. the j-th kernel.

The partial dual in (3.3) provides further insights into the formulation. Firstly, (3.3) is equivalent to solving regular SVM [27] problems, one for each task, with the effective kernel Σ_{j=1}^k (γ_j / λ_jt) K_tj. The kernel weights (γ, λ) in the effective kernel are learned based on examples of all the tasks and, in particular, the γs are shared across all the tasks. Moreover, by construction, most of the γ_j are zero, leading to a sparse combination of kernels. In other words, the effective features employed are learnt using examples of all the tasks, whereas the individual task feature loadings may differ across tasks; this is consistent with the context of multitask feature learning. Note that in the special case p = 2, i.e., p̄ = ∞, all the λs must be unity at optimality. In other words, even the non-zero weights for the selected kernels are then shared across tasks. In fact, by varying p one can obtain various weighting schemes for the selected kernels, which can be beneficial in cases where the tasks are not equally reliable.

In the following we present an efficient mirror-descent based methodology for solving the MKMTFL formulation, inspired by the algorithm presented in [28]. Instead of solving the min-max problem in (3.3), we prefer to equivalently solve: min_{γ ∈ Δ_k} f(γ), where f(γ) is defined as max_{λ_j ∈ Δ_{T,p̄}} max_{α_t ∈ S_{m_t}(C)} g(λ, α; γ). It is easy to see that f(γ) is convex, and using Danskin's theorem [19] one can obtain ∇f provided f(γ) can be computed efficiently for a given γ. Since the feasibility set for this problem is a simplex, the mirror-descent scheme outlined in section 2 can be employed to solve it very efficiently. As noted above, since the feasibility set is a simplex,
the MD algorithm will almost achieve the optimal rate of convergence and, specifically, the number of iterations required will be nearly independent of k (the number of base kernels). Now, the evaluation of f(γ) is equivalent to the maximization of g(λ, α; γ) w.r.t. λ and α for the given γ. This maximization can be performed efficiently by alternately optimizing over the λ and α variables. For fixed values of λ, the maximization over α is equivalent to solving T regular SVM problems. For fixed values of α, the maximization w.r.t. λ is equivalent to: −Σ_{j=1}^k min_{λ_j ∈ Δ_{T,p̄}} Σ_{t=1}^T D_jt / λ_jt, where D_jt = (1/2) γ_j α_t⊤ Y_t K_tj Y_t α_t. From Lemma 3.1, we know that this problem has an analytical solution. The per-step computational complexity is dominated by the SVM computations and the calculation of the effective kernel: O(k Σ_{t=1}^T m_t²). As mentioned above, the number of iterations for the mirror-descent grows as log(k), and in our simulations we found that the alternating optimization typically converges in around 35 iterations. Hence the overall computational complexity is O(k log(k) Σ_{t=1}^T m_t²). This formulation is thus expected to be far more scalable than most of the existing MTFL methodologies [12, 13, 14, 1, 16, 17], which solve a Singular Value Decomposition (SVD) problem at every iteration. However, in general, unless the base kernels are carefully chosen, the optimal feature representation learnt by MKMTFL is non-sparse, whereas in some real-world multitask applications it may be desirable to learn sparse feature representations. Such a novel formulation, for learning sparse feature representations constructed from multiple base kernels and shared across multiple tasks, is presented in the subsequent section.
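For illustration, the analytical λ-update implied by Lemma 3.1 (applied row-wise with r = p̄) can be sketched as below; the array D holding the quantities D_jt is an assumed input computed from the current α, and p > 2 is assumed so that p̄ is finite (for p = 2, p̄ = ∞ and all λ_jt equal one).

```python
import numpy as np

def lambda_update(D, p):
    """Analytic maximizer over lambda_j in Delta_{T, pbar} for fixed alpha.

    Applies Lemma 3.1 with r = pbar = p / (p - 2) to each row of D:
    lambda_jt = D_jt^{1/(pbar+1)} / (sum_t D_jt^{pbar/(pbar+1)})^{1/pbar}.
    """
    pbar = p / (p - 2.0)
    num = D ** (1.0 / (pbar + 1.0))
    den = (D ** (pbar / (pbar + 1.0))).sum(axis=1, keepdims=True) ** (1.0 / pbar)
    return num / den

# Each row of the result is feasible: sum_t lambda_jt^pbar = 1.
D = np.array([[0.5, 1.0, 2.0], [0.1, 0.1, 0.4]])
lam = lambda_update(D, p=6.0)
```

By the lemma, the resulting objective value Σ_t D_jt/λ_jt for each row equals the closed form (Σ_t D_jt^{p̄/(p̄+1)})^{(p̄+1)/p̄}.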
4 Multiple Kernel Multitask Sparse Feature Learning

This section presents the details of the MKMTSFL formulation. As discussed in section 1, most of the existing methods for the shared learning of sparse feature representations look for low-dimensional subspaces in the input space which are beneficial to all the tasks. A shortcoming of this methodology, as noted above, is the strong dependency of the performance on the input features. One way to minimize the risk of employing low quality input features is to enable the formulations to work with multiple base kernels. A trivial way of generalizing the existing methodologies to work with multiple base kernels is to employ a kernel (and hence its induced feature space) which is a simple sum or average of all the base kernels. This is essentially the same as working with an enriched input space which is a concatenation of the individual feature spaces corresponding to the given base kernels [29]. This may be fine provided all the base kernels are "good". In a more realistic situation it is desirable to employ the "best" few base kernels and the corresponding low-dimensional subspaces. In the following we present a formalism which implements this idea. Also, the simulation results on real-world datasets, detailed in section 5, confirm that the proposed MKMTSFL formulation achieves better generalization than the trivial extension noted above.

We follow the notation introduced in section 3. As in the case of the existing multitask sparse feature learning algorithms, and in particular MTSFL, we construct new features as orthogonal transforms of the given features, i.e., L_j φ_j(x), where L_j is an orthogonal matrix which is to be learnt. Now the linear discriminator for the t-th task with the newly constructed features is: Σ_{j=1}^k w_tj⊤ L_j φ_j(x) − b_t = 0. Again, low empirical risk over each task would imply minimizing the following hinge loss:

Σ_{t=1}^T Σ_{i=1}^{m_t} max( 0, 1 − y_ti ( Σ_{j=1}^k w_tj⊤ L_j φ_j(x_ti) − b_t ) ).

Before describing the regularization term, we introduce some more notation: let the entries of w_tj be w_tjl, l = 1,...,d_j, where d_j is the dimensionality of the feature space induced by the j-th kernel. By w_·jl we denote the vector with entries w_tjl, t = 1,...,T. The regularization term we employ is

( Σ_{j=1}^k ( Σ_{l=1}^{d_j} ‖w_·jl‖₂ )^q )^{2/q},  q ∈ [1,2].

Though this regularizer looks similar to that in MKMTFL, there are a few fundamental differences: i) this regularizer employs an l1 norm over the feature loadings across tasks rather than an RKHS norm over the feature loadings for each task; hence the feature loadings will be sparse, and the features not selected are common across the tasks, so that a shared sparse feature representation is constructed; ii) the l1 norm over kernels in the former case is now generalized to an lq norm over kernels; since q ∈ [1,2], we obtain various schemes of sparsely combining the base kernels by varying q; iii) the lp norm over tasks is here restricted to the l2 norm. This is done to facilitate the kernel trick (as we shall understand later). In summary, the regularizer promotes the use of few kernels (i.e., few kernels' feature loadings are non-zero) and promotes the use of few learnt features in each kernel (i.e., achieves a sparse feature representation) across the tasks. Thus it is suitable for constructing a shared sparse feature representation across tasks, given multiple base kernels. Mathematically, the MKMTSFL formulation can be expressed as:

min_{w,b,ξ,L}  (1/2) ( Σ_{j=1}^k ( Σ_{l=1}^{d_j} ‖w_·jl‖₂ )^q )^{2/q} + C Σ_{t=1}^T Σ_{i=1}^{m_t} ξ_ti
s.t.  y_ti ( Σ_{j=1}^k w_tj⊤ L_j φ_j(x_ti) − b_t ) ≥ 1 − ξ_ti,  ξ_ti ≥ 0,  L_j ∈ O_{d_j}

where O_{d_j} represents the set of all orthogonal matrices of dimensionality d_j. Note that, in the special case k = 1, this formulation is the same as the MTSFL formulation. In the following text we rewrite this formulation in a form which is convenient to solve using an MD based algorithm.
Using Lemma 3.1 and introducing new variables λ = [λ_1 ... λ_k]⊤, we have:

( Σ_{j=1}^k ( Σ_{l=1}^{d_j} ‖w_·jl‖₂ )^q )^{2/q} = min_{λ ∈ Δ_{k,q̄}} Σ_{j=1}^k ( Σ_{l=1}^{d_j} ‖w_·jl‖₂ )² / λ_j

where q̄ = q/(2−q). Note that as q varies from 1 to 2, q̄ will vary from 1 to ∞. Again using Lemma 3.1 and introducing new variables γ_j = [γ_j1 ... γ_jd_j]⊤, j = 1,...,k, the regularizer can be written as:

min_{λ ∈ Δ_{k,q̄}} min_{γ_j ∈ Δ_{d_j}} Σ_{t=1}^T Σ_{j=1}^k Σ_{l=1}^{d_j} w_tjl² / (γ_jl λ_j)

Now we perform a change of variables: w_tjl / √(γ_jl λ_j) = w̄_tjl. Also, we define w̄_tj as the vector with entries w̄_tjl, l = 1,...,d_j. Using this, one can rewrite the MKMTSFL formulation as:

min_{λ,γ_j,L_j} Σ_{t=1}^T min_{w̄_t,b_t,ξ_t} (1/2) Σ_{j=1}^k w̄_tj⊤ w̄_tj + C Σ_{i=1}^{m_t} ξ_ti
s.t.  y_ti ( Σ_{j=1}^k w̄_tj⊤ Λ_j^{1/2} L_j φ_j(x_ti) − b_t ) ≥ 1 − ξ_ti,  ξ_ti ≥ 0,
      λ ∈ Δ_{k,q̄},  γ_j ∈ Δ_{d_j},  L_j ∈ O_{d_j}

where Λ_j is a diagonal matrix with entries λ_j γ_jl, l = 1,...,d_j ⁴. Now writing the Lagrange dual w.r.t. w̄_t, b_t, ξ_t leads to the following form:

min_{λ,γ_j,L_j} Σ_{t=1}^T max_{α_t} 1⊤α_t − (1/2) α_t⊤ Y_t ( Σ_{j=1}^k Φ_tj⊤ L_j⊤ Λ_j L_j Φ_tj ) Y_t α_t
s.t.  λ ∈ Δ_{k,q̄},  γ_j ∈ Δ_{d_j},  L_j ∈ O_{d_j},  α_t ∈ S_{m_t}(C)

where Φ_tj is the data matrix with columns φ_j(x_ti), i = 1,...,m_t. Denoting L_j⊤ Λ_j L_j by Q̄_j and eliminating the variables λ, γ, L_j leads to:

min_{Q̄} Σ_{t=1}^T max_{α_t ∈ S_{m_t}(C)} 1⊤α_t − (1/2) α_t⊤ Y_t ( Σ_{j=1}^k Φ_tj⊤ Q̄_j Φ_tj ) Y_t α_t
s.t.  Q̄_j ⪰ 0,  Σ_{j=1}^k (trace(Q̄_j))^q̄ ≤ 1

⁴Note that λ_j γ_jl being zero does not cause a problem, since then both w_tjl and w̄_tjl will be zero for all t = 1,...,T at optimality.

The difficulty in working with this formulation is that the explicit mappings φ_j are required. We now describe a way of overcoming this problem and efficiently kernelizing the formulation (refer [1] also). Let Φ_j ≡ [Φ_1j ... Φ_Tj] and let the compact SVD of Φ_j be U_j Σ_j V_j⊤ ⁵. Now, introduce new variables Q_j such that Q̄_j = U_j Q_j U_j⊤. Here, Q_j is a symmetric positive semi-definite matrix of size equal to the rank of Φ_j. Eliminating the variables Q̄_j, we can rewrite the above problem using Q_j as:

min_Q Σ_{t=1}^T max_{α_t ∈ S_{m_t}(C)} 1⊤α_t − (1/2) α_t⊤ Y_t ( Σ_{j=1}^k M_tj⊤ Q_j M_tj ) Y_t α_t
s.t.  Q_j ⪰ 0,  Σ_{j=1}^k (trace(Q_j))^q̄ ≤ 1

where M_tj = Σ_j^{−1} V_j⊤ Φ_j⊤ Φ_tj. Note that the calculation of M_tj does not require the kernel-induced features explicitly, and hence the formulation is kernelized.

The partial dual of MKMTSFL noted above provides more insights into the original formulation. Given the Q_js, the problem is equivalent to solving T SVM problems individually. The Q_js are learnt using training examples of all the tasks and are shared across the tasks. The trace constraints are a generalization of trace-norm regularization and hence promote low-ranked Q_js at optimality. Thus the formulation indeed constructs a shared sparse feature representation using multiple base kernels simultaneously. In the following text we describe an efficient MD based algorithm for solving this dual formulation.
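Before turning to that algorithm, the kernelized computation of M_tj above can be sketched as follows; the variable names are illustrative assumptions: K is the gram matrix of one base kernel over the pooled training points of all tasks, and task_indices lists each task's column indices within K.

```python
import numpy as np

def compute_M(K, task_indices, tol=1e-10):
    """Compute M_tj = Sigma^{-1} V^T Phi_j^T Phi_tj for one base kernel,
    using only its gram matrix K = Phi_j^T Phi_j over the pooled points.

    The compact SVD Phi_j = U Sigma V^T gives K = V Sigma^2 V^T, so V and
    Sigma come from an EVD of K, and Phi_j^T Phi_tj is just the column
    block of K belonging to task t."""
    evals, V = np.linalg.eigh(K)
    keep = evals > tol                   # compact part: positive eigenvalues
    sigma = np.sqrt(evals[keep])
    V = V[:, keep]
    return [(V / sigma).T @ K[:, idx] for idx in task_indices]

# Sanity check with explicit features (only to verify the kernelization):
# M_tj^T M_t'j must recover the corresponding block of K.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 8))        # 5-dim features, 8 pooled points
K = Phi.T @ Phi
M = compute_M(K, [list(range(4)), list(range(4, 8))])
```

The point of the sketch is that only K is touched; the explicit features appear solely in the verification.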
Instead of solving the min-max problem in the dual, we prefer to equivalently solve: min_{Q ∈ R(q̄)} f(Q), where Q is a block-diagonal matrix with entries Q_1,...,Q_k,

R(q̄) ≡ { Q | Q_j ⪰ 0 ∀ j = 1,...,k,  Σ_{j=1}^k (trace(Q_j))^q̄ ≤ 1 }

and f(Q) = max_{α_t ∈ S_{m_t}(C)} Σ_{t=1}^T 1⊤α_t − (1/2) trace(QB), where B is a block-diagonal matrix with entries B_j = Σ_{t=1}^T M_tj Y_t α_t α_t⊤ Y_t M_tj⊤. Note that this minimization problem is an instance of a mixed Schatten-norm regularized problem. Also, since f is a pointwise maximum over affine functions in Q, it is convex in Q. Moreover, the gradient of f w.r.t. Q, i.e., ∇f, can be obtained using Danskin's theorem: ∇f(Q^(l)) = −(1/2)B^(l), where B^(l) is the value of B obtained using the optimal α_t computed while evaluating f(Q^(l)). The evaluation of f(Q) is equivalent to solving T regular SVM problems. Hence the mirror-descent algorithm can be employed for solving this minimization problem.

⁵Since we perform a compact SVD, Σ_j is a square diagonal matrix of size equal to the rank of Φ_j (which is at most Σ_{t=1}^T m_t), and the numbers of columns of U_j and V_j are again equal to the rank.

In the following we show that the strongly convex function ω(x) = trace(x log(x)) can be employed as the Bregman divergence generating function in the context of mirror-descent for solving the mixed Schatten-norm regularized problem at hand. With this choice, the per-step optimization problem (2.1) turns out to be:

min_{Q ∈ R(q̄)} trace(ζ^(l) Q) + trace(Q log(Q))

where ζ^(l) = s_l ∇f(Q^(l)) − ∇ω(Q^(l)). We already noted how Danskin's theorem can be employed to obtain ∇f(Q^(l)). Also, ∇ω(Q^(l)) = log(Q^(l)) + I, where I is the identity matrix of appropriate size. Note that both Q and ζ^(l) share the same block-diagonal structure. In the following we argue that, at optimality, Q and ζ^(l) share the same eigenvectors (also refer [10] for the case of the spectrahedron geometry). Let the EVD (Eigen Value Decomposition) of ζ^(l) be ZΠZ⊤ (here Π is the diagonal matrix containing the eigenvalues). Passing from the variable Q to Θ according to Q = ZΘZ⊤, we can rewrite this problem as: min_{Θ ∈ R(q̄)} trace(ΠΘ) + trace(Θ log(Θ)). It is easy to see that the unique optimal solution of this (strongly convex) problem is a diagonal matrix: for every diagonal matrix D with entries ±1 and every feasible solution Θ, the quantity DΘD remains feasible and moreover achieves the same objective (since Π is diagonal). It follows that the optimal set of solutions must be invariant w.r.t. Θ → DΘD transformations, which is possible if and only if Θ is diagonal (else the uniqueness of the solution breaks down). If the entries of Π, Θ are π_jl, θ_jl, l = 1,...,r_j, j = 1,...,k (here r_j represents the dimension of the matrix Q_j) respectively, then the above problem is equivalent to:

(4.4)  min_θ Σ_{j=1}^k Σ_{l=1}^{r_j} ( θ_jl log(θ_jl) + θ_jl π_jl )
       s.t.  θ_jl ≥ 0 ∀ j, l,   Σ_{j=1}^k ( Σ_{l=1}^{r_j} θ_jl )^q̄ ≤ 1
Though this is a convex problem involving vectorial variables, the number of variables can be as large as k Σ_{t=1}^T m_t. Hence it is not wise to employ standard optimization toolboxes to solve this problem. Actually, one can reduce the number of variables to k by performing the following trick: introduce variables ρ_j, j = 1,...,k and rewrite problem (4.4) as:

(4.5)  min_ρ Σ_{j=1}^k min_{θ_j} Σ_{l=1}^{r_j} ( θ_jl log(θ_jl) + θ_jl π_jl )
       s.t.  θ_jl ≥ 0,  Σ_{l=1}^{r_j} θ_jl = ρ_j
             ρ_j ≥ 0,  Σ_{j=1}^k ρ_j^q̄ ≤ 1

The minimization problem w.r.t. θ_j in the above problem has an analytical solution: θ_jl = ρ_j π̄_jl, where π̄_jl = exp{−π_jl} / Σ_{l̄=1}^{r_j} exp{−π_jl̄}. Using this, one can eliminate θ_j from (4.5) and rewrite it as:

min_ρ Σ_{j=1}^k ( ρ_j log(ρ_j) + ρ_j Σ_{l=1}^{r_j} ( π_jl π̄_jl + π̄_jl log(π̄_jl) ) )
s.t.  ρ_j ≥ 0,  Σ_{j=1}^k ρ_j^q̄ ≤ 1

The size of this problem is k, and it can easily be managed by standard solvers like cvx⁶. Apart from the computational cost of cvx (which is negligible for k in the order of a few hundred), the dominant computations are: i) the EVD of ζ^(l), which involves EVDs of k matrices of sizes r_j, j = 1,...,k; ii) solving T regular SVMs⁷. This is still reasonable as, in the special case where the number of base kernels is unity, cvx need not be employed and the dominant computation remains the same as that in the alternating minimization algorithm in [1]. Also, in the case q = 1, cvx need not be employed, as the final problem again has an analytic solution.
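In the analytic case q = 1 (so q̄ = 1) the whole per-step problem can be solved in closed form; the following sketch (with illustrative eigenvalue blocks) combines the softmax solution for θ_j with a closed form for ρ. The paper states only that an analytic solution exists; the particular ρ-formula below is this sketch's own derivation, under the stated assumptions.

```python
import numpy as np

def md_step_q1(pi_blocks):
    """Closed-form solution of (4.4) when qbar = 1 (i.e. q = 1).

    pi_blocks: list of 1-d arrays holding the eigenvalues pi_{jl} of the
    blocks of zeta^(l).  Within block j the optimal theta_j equals
    rho_j * softmax(-pi_j); the outer problem over rho reduces to an
    entropy-plus-linear problem whose unconstrained minimizer
    rho_j = S_j / e (with S_j = sum_l exp(-pi_{jl})) is kept when it is
    feasible, and is otherwise rescaled onto the boundary sum_j rho_j = 1.
    """
    S = np.array([np.exp(-pi).sum() for pi in pi_blocks])
    rho = S / np.e if S.sum() <= np.e else S / S.sum()
    return rho, [r * np.exp(-pi) / s for r, s, pi in zip(rho, S, pi_blocks)]

pi_blocks = [np.array([0.5, 2.0]), np.array([1.0])]
rho, theta = md_step_q1(pi_blocks)
```

In the interior case the returned θ satisfies θ_jl = exp{−1−π_jl}, the unconstrained minimizer of (4.4), so its objective value is −Σ_{j,l} exp{−1−π_jl}.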
5 Numerical experiments

This section summarizes the results of simulations which illustrate the merits of the proposed algorithms. The experiments are aimed to show that: i) the proposed formulations achieve good generalization; ii) the proposed MD based algorithms are efficient in solving them. We begin with results comparing the generalization of the various formulations⁸:

MKMTFL  The multiple kernel multitask feature learning formulation presented in section 3. Three different values of p (the norm over tasks) were considered: 2, 6, 8.67.

MKMTSFL  The multiple kernel multitask sparse feature learning formulation presented in section 4. Three different values of q (the norm over kernels) were considered: 1.01, 1.5, 1.99.

MTSFL  The state-of-the-art multitask sparse feature learning formulation [1]. The original code provided by the authors⁹ was employed for solving the formulation.

MTSFL-avg  The straightforward extension of the MTSFL formulation to the case of multiple base kernels noted in section 4. It is the same as the MTSFL formulation but with the input kernel taken as the average of all the given base kernels.

SVM  The baseline formulation where each task is learnt using an individual SVM [27] model. In the case of binary classification tasks, we employed libsvm¹⁰ for solving the SVM problem.

MKL  The baseline formulation where each task is learnt using an individual MKL [6] formulation. Since this is a special case of MKMTFL with k = 1, it can be solved using the same algorithm.

⁶Available at http://cvxr.com/cvx/
⁷The computation of log(Q_j) can be done efficiently (avoiding an EVD) by bookkeeping the EVD of Q_j.
⁸Code for the proposed formulations is available at: http://www.cse.iitb.ac.in/saketh/research/MTFL.tgz
⁹Available at http://ttic.uchicago.edu/~argyriou/code/mtl_feat/mtl_feat.tar
The following datasets were employed in our com
parisons:
School A
dataset [1]11.
mance of students given their descriptions/past
record. Data for 15362 students from 139 schools
is available and each student is described using
28 features. The regression problem of predicting
performances in each school is considered as a
task. At random 15 examples in each task were
taken as training data and the rest as test data12.
benchmarkmultitask regression
The goal is to predict perfor
Sarcos A multiple-output regression dataset [30]^13. The objective is to predict inverse dynamics corresponding to the seven degrees-of-freedom of a SARCOS anthropomorphic robot arm based on 21 input features. The dataset comprises 48933 data points. Prediction of inverse dynamics for each degree-of-freedom is considered as a task. 2000 random examples were sampled in the case of each task; 15 of them were used as training examples while the rest were kept aside as test examples.
Parkinsons A multi-task regression dataset^14. The objective is to predict two Parkinson's disease symptom scores (motor UPDRS, total UPDRS) for patients based on 19 biomedical features. The original dataset comprises 5,875 recordings for 42 patients. The regression problem of predicting each symptom score for each patient is considered as a task. Thus the total number of tasks turns out to be 42×2 = 84. 15 random examples per task were used for training and the rest formed the test data.
Caltech A benchmark multi-class object categorization dataset [31]^15. A collection of images from 102 categories of objects like faces, watches, ants, etc., with 30 images per category. Each 1-vs-rest binary classification problem was considered as a task (no. of tasks = 102). The training-test splits and 25 base kernels are provided with the dataset.

^10 Code available at www.csie.ntu.edu.tw/~cjlin/libsvm/
^11 Available at http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar
^12 We employed a different training-test ratio than [1] in order to focus on the data-scarce regime.
^13 Available at http://www.gaussianprocess.org/gpml/data/
^14 Available at http://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring
^15 Available at http://mkl.ucsd.edu/dataset/ucsdmitcaltech101mkldataset

Copyright © SIAM. Unauthorized reproduction of this article is prohibited.

Table 1: Comparison of % explained variance with various methods (MTSFL, MTSFLavg, MKL, SVM; MKMTFL with p ∈ {2, 6, 8.67}; MKMTSFL with q ∈ {1.01, 1.5, 1.99}) on multi-task regression datasets; the best result per dataset is marked ∗

School:     9.18, 13.88, 12.35, 15.42, 8.97, 6.60, 4.53, 14.02∗, 13.86, 13.95
Sarcos:     82.58, 71.70, 23.2839, 40.1259, 86.2684, 26.1611, 69.9437, 26.1477∗, 25.6644, 25.52
Parkinsons: 249.32, 455.36, 57.81, 42.72, 14.16, 44.3521, 56.34, 45.37, 58.11, 59.27∗
Oxford A benchmark multi-class object categorization dataset [32]^16. A collection of images of 17 varieties of flowers, with 80 images per category. Each 1-vs-rest binary classification problem was considered as a task (no. of tasks = 17). Three predefined training-test splits, each consisting of 40 training, 20 validation and 20 test examples, and seven base kernels are provided with the dataset.
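The 1-vs-rest task construction used for the Caltech and Oxford datasets can be sketched as follows (a minimal illustration; the helper name and the ±1 label encoding are our own choices, not taken from the paper):

```python
import numpy as np

def one_vs_rest_tasks(y):
    """Split a multi-class label vector into one binary task per class:
    examples of class c get label +1, all other examples get -1."""
    return {c: np.where(y == c, 1, -1) for c in np.unique(y)}

# e.g. 17 flower categories give 17 binary tasks,
# 102 object categories give 102 binary tasks
tasks = one_vs_rest_tasks(np.array([0, 1, 2, 1]))
```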
In the case of SVM and MTSFL, a 3-fold nested cross-validation procedure is employed to tune the C and kernel parameters. For the proposed methods, MKL and MTSFLavg, a 3-fold cross-validation procedure is employed to tune the C parameter; for fairness of comparison, the base kernels are chosen to be exactly those provided to the other methods for nested cross-validation. The values of the C parameter considered for the regression datasets are in the set {5e−4, 5e−3, ..., 5e+2}, whereas for the classification datasets they are in the set {5e−1, 5, ..., 5e+5}. Also, one linear and six Gaussian kernels, with parameter values in the set {1e−2, 1e−1, ..., 1e+3}, are employed in the case
of the regression datasets. Following [1], % explained
variance [33] is employed as the measure of performance
for regression problems. The explained variance per task is computed as 1 minus the ratio of the mean squared error to the variance of the true outputs on the test dataset. Note that a simple regressor which predicts every output as the mean of the outputs in the dataset achieves an explained variance of zero on that dataset. Thus, if the explained variance is negative, a simple mean estimate is better than the corresponding regressor and hence the method is of no use. In general, the higher the explained variance, the better the method. However, in our case we have multiple tasks, and we compute the overall explained variance as the mean of the explained variances over the tasks. Hence we can no longer claim that a methodology which achieves an overall negative score is useless; but still, the higher the explained variance, the better the method. In the case of the object categorization datasets, we employ % accuracy as the measure of performance.

^16 Available at http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html
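The explained-variance computation described above can be sketched as follows (a minimal illustration; the function names are our own):

```python
import numpy as np

def explained_variance_pct(y_true, y_pred):
    """% explained variance: 100 * (1 - MSE / Var(y_true)).
    Predicting the mean of y_true for every example scores exactly 0."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = np.mean((y_true - y_pred) ** 2)
    return 100.0 * (1.0 - mse / np.var(y_true))

def overall_explained_variance(tasks):
    """Overall multi-task score: mean of the per-task explained
    variances (individual tasks may contribute negative values)."""
    return float(np.mean([explained_variance_pct(yt, yp) for yt, yp in tasks]))
```

Note that the constant mean predictor yields MSE equal to the variance, giving exactly the zero baseline mentioned above.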
The results on the regression (object categorization) datasets are summarized in table 1 (table 3). We report the mean % explained variance (% accuracy) achieved over 10 random splits (standard splits) for each dataset. The highest explained variance (accuracy) achieved on each dataset is highlighted. Also, if the improvement in explained variance (accuracy) is statistically significant (≥ 94% confidence with a paired t-test), the result is marked with a ∗. Table 1 clearly shows that the proposed MKMTSFL formulation outperforms every baseline, confirming that learning a shared feature representation is indeed beneficial. Moreover, the improvement over the state-of-the-art multi-task methodology is statistically significant. This illustrates the advantage of employing multiple kernels in multi-task applications. Note that MTSFLavg, which also employs multiple kernels (in a trivial manner), in fact achieves worse generalization than its single-kernel counterpart MTSFL; this shows the benefit of employing sparse combinations of kernels. The results also highlight the importance of solving the MKMTSFL formulation at various values of q (the norm for kernel regularization), and hence the utility of the MD-based algorithm, as no single value of q appears to be optimal across datasets. Automatic tuning of q calls for further investigation.
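The significance test used above can be sketched as follows (a hypothetical helper built on SciPy's `ttest_rel`; only the ≥ 94% confidence threshold comes from the text):

```python
from scipy.stats import ttest_rel

def significantly_better(scores_a, scores_b, confidence=0.94):
    """One-sided paired t-test over per-split scores: does method A
    beat method B with at least the given confidence?"""
    t, p_two_sided = ttest_rel(scores_a, scores_b)
    # convert the two-sided p-value into a one-sided one for "A > B"
    p_one_sided = p_two_sided / 2.0 if t > 0 else 1.0 - p_two_sided / 2.0
    return p_one_sided <= 1.0 - confidence
```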
From the results in table 1, the MKMTFL formulation appears to perform quite poorly, whereas MKMTSFL, which learns sparse feature representations, performs very well. However, MKMTFL can also be employed for learning sparse feature representations, provided suitable base kernels are employed. For instance, if base kernels are derived from each individual input
Table 2: Comparison of % explained variance with various methods on multi-task regression datasets with feature-wise kernels; a − indicates that the formulation failed to execute with the available resources

Datasets     MTSFL   MTSFLavg   MKL       SVM     MKMTFL                       MKMTSFL
                                                  p = 2    p = 6    p = 8.67   q = 1.5
School       9.76    3.07       4999.08   2.44    10.95    13.68    14.35∗     −
Sarcos       37.80   48.72      14.64     −       49.82    39.23    30.12      38.82
Parkinsons   35.98   60.66∗     53.59             19.58    15.88    418.86     −
feature, then MKMTFL may indeed learn sparse feature representations. In order to see whether the poor performance of MKMTFL is due to inherent limitations of MKMTFL or due to the restricted choice of base kernels, we repeated the simulations with base kernels including those derived from individual input features. Again, one linear and six Gaussian kernels, with the same Gaussian parameters as mentioned above, are employed; however, these seven base kernels are constructed using each feature, resulting in 7n base kernels overall, where n is the number of input features. These results are reported in table 2. Due to the large increase in the number of base kernels, some formulations failed to execute with the available resources. This failure is indicated with a − in the appropriate cell of the table.
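The feature-wise base-kernel construction can be sketched as follows (a minimal sketch; we assume the Gaussian kernel is parameterized as exp(−γ d²) with γ taken from the {1e−2, ..., 1e+3} grid, which is one plausible reading of the parameter set above):

```python
import numpy as np

def featurewise_base_kernels(X, gammas=(1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3)):
    """Build 7 base Gram matrices (1 linear + 6 Gaussian) from EACH
    input feature, giving 7n kernels for n input features."""
    kernels = []
    for j in range(X.shape[1]):
        f = X[:, [j]]                      # m x 1: the j-th feature alone
        kernels.append(f @ f.T)            # linear kernel on feature j
        sq = (f - f.T) ** 2                # m x m pairwise squared distances
        kernels.extend(np.exp(-g * sq) for g in gammas)
    return kernels
```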
Interestingly, the performance of MKMTFL improves greatly, and in fact on the School and Sarcos datasets it turned out to be the best performing method (with either setting of base kernels). In the case of the Parkinsons dataset, though the explained variance seems to be low, we observed that the mean squared error (mse) is the least among all the methods, and in fact the improvement in terms of mse is statistically significant. The results on object categorization are summarized in table 3. Since the training dataset sizes here are of the order of a few tens of thousands, the SVD-based algorithms failed to execute. Hence we report simulation results with the baselines and MKMTFL alone. The proposed MKMTFL outperforms the baselines, confirming that it is indeed capable of discovering efficient shared feature representations in the context of multiple tasks. To summarize, when a large number of base kernels is available, MKMTFL is the obvious choice from both the computational and the generalization perspective. When a restricted set of base kernels is available, MKMTSFL is the best option.
The results showing the efficacy of the proposed MD methodologies are summarized in figure 1. The first two plots in figure 1 show the scaling of MKMTFL with T and k respectively on the School dataset. The plots show that the method can easily scale to large numbers of base kernels as well as tasks.
Table 3: Comparison of performance (% accuracy) on various object categorization datasets

Dataset    MKL      SVM      MKMTFL
                             p = 2    p = 6    p = 8.67
Caltech    52.75    54.35    65.12    65.15    65.27∗
Oxford     83.73    81.67    85.29    85.98∗   85.88
The subsequent plots compare the convergence of the MD algorithm presented in section 4 for the special case k = 1 with the alternate minimization of [1] on the School dataset, with T = 20 and T = 139 respectively. It can be seen that the MD algorithm achieves faster convergence (with similar computational complexity), especially for large problems.
6 Conclusions
There are three main contributions in this paper: i) the MKMTFL formulation, which discovers shared feature representations by learning a common kernel constructed from multiple base kernels; simulations show that this method is scalable and achieves good generalization when provided with an appropriate set of base kernels; ii) the MKMTSFL formulation, which learns sparse feature representations constructed from multiple base kernels and shared across multiple tasks; simulations show that this method consistently achieves better generalization than the state-of-the-art and outperforms various baselines; iii) the MD-based algorithm for solving MKMTSFL. We showed that the negative entropy function can be employed as an efficient Bregman generating function in the context of mirror descent for solving some mixed Schatten-norm regularized problems.
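The role of the negative entropy as a Bregman generating function can be illustrated with the classical entropic mirror-descent (exponentiated-gradient) update on the probability simplex. This is a generic sketch, not the MKMTSFL solver itself; the step size and the toy objective are our own choices:

```python
import numpy as np

def entropic_mirror_descent(grad, x0, steps=2000, s=0.5):
    """Mirror descent over the simplex with the negative-entropy
    generating function: the prox step reduces to a multiplicative
    (exponentiated-gradient) update followed by renormalization."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        w = x * np.exp(-s * (g - g.min()))  # shift for numerical stability
        x = w / w.sum()
    return x

# toy objective: f(x) = ||x - c||^2 with c on the simplex, so x* = c
c = np.array([0.7, 0.2, 0.1])
x_star = entropic_mirror_descent(lambda x: 2.0 * (x - c), np.ones(3) / 3)
```

The multiplicative form of the update is what makes the negative entropy attractive here: iterates stay strictly inside the simplex without any explicit projection.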
Acknowledgments We would like to thank Prof. Chiranjib Bhattacharyya (IISc, Bangalore) for insightful discussions on this paper.
Figure 1: The first two plots present scaling results with MKMTFL. The next two compare the convergence of MD and alt. min. with T = 20 and T = 139 respectively. All comparisons are on the School dataset.

References

[1] A. Argyriou, T. Evgeniou, and A. Pontil. Convex Multi-task Feature Learning. Machine Learning, 73:243–272, 2008.
[2] B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio. Categorization by Learning and Combining Object Parts. In Advances in Neural Information Processing Systems, volume 14, pages 1239–1245, 2002.
[3] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 762–769, 2004.
[4] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research, 5:27–72, 2004.
[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In International Conference on Machine Learning, 2004.
[6] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
[7] Xiaolin Yang, Seyoung Kim, and Eric P. Xing. Heterogeneous Multitask Learning with Joint Sparsity Constraints. In Advances in Neural Information Processing Systems, 2009.
[8] A. Ben-Tal, T. Margalit, and A. Nemirovski. The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography. SIAM J. of Opt., 12(1):79–108, 2001.
[9] Amir Beck and Marc Teboulle. Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization. Operations Research Letters, 31:167–175, 2003.
[10] Arkadi Nemirovski. Lectures on Modern Convex Optimization (chp. 5.4-5.5). Available at http://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf, 2005.
[11] J. Saketha Nath, G. Dinesh, S. Raman, C. Bhattacharyya, A. Ben-Tal, and K. R. Ramakrishnan. On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation. In Advances in Neural Information Processing Systems, 2009.
[12] R. K. Ando and T. Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[13] Jianhui Chen, Lei Tang, Jun Liu, and Jieping Ye. A Convex Formulation for Learning Shared Structures from Multiple Tasks. In Proceedings of the International Conference on Machine Learning, 2009.
[14] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering Shared Structures in Multiclass Classification. In International Conference on Machine Learning, pages 17–24, 2007.
[15] Jian Zhang, Zoubin Ghahramani, and Yiming Yang. Flexible Latent Variable Models for Multi-Task Learning. Machine Learning, 73(3):221–242, 2008.
[16] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A Spectral Regularization Framework for Multi-task Structure Learning. In Advances in Neural Information Processing Systems, 2007.
[17] Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered Multi-Task Learning: A Convex Formulation. In Advances in Neural Information Processing Systems, 2008.
[18] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications. MPS/SIAM Series on Optimization, 1, 2001.
[19] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[20] F. Bach. Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning. In Advances in Neural Information Processing Systems (NIPS), 2008.
[21] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured Variable Selection with Sparsity-inducing Norms. Technical report, 2009.
[22] P. Zhao, G. Rocha, and B. Yu. Grouped and Hierarchical Model Selection through Composite Absolute Penalties. Annals of Statistics, 37(6A):3468–3497, 2009.
[23] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite Kernel Learning. In Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML), 2008.
[24] Charles A. Micchelli and Massimiliano Pontil. Learning the Kernel Function via Regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
[25] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[26] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[27] Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[28] J. Saketha Nath, G. Dinesh, S. Raman, C. Bhattacharyya, A. Ben-Tal, and K. R. Ramakrishnan. On the Algorithmics and Applications of a Mixed-norm Regularization based Kernel Learning Formulation. In Advances in Neural Information Processing Systems (NIPS), 2009.
[29] Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[30] Yu Zhang and Dit-Yan Yeung. Semi-Supervised Multi-Task Regression. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 617–631, 2009.
[31] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[32] M.-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[33] B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.