
Non-Sparse Regularization and Efficient Training with

Multiple Kernels

Marius Kloft

Ulf Brefeld

Sören Sonnenburg

Alexander Zien

Electrical Engineering and Computer Sciences

University of California at Berkeley

Technical Report No. UCB/EECS-2010-21

http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-21.html

February 24, 2010

Copyright © 2010, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission.

Acknowledgement

The authors wish to thank Pavel Laskov, Motoaki Kawanabe, Vojtech

Franc, Peter Gehler, Gunnar Raetsch, Peter Bartlett and Klaus-Robert

Mueller for fruitful discussions and helpful comments. This work was

supported in part by the German Bundesministerium fuer Bildung und

Forschung (BMBF) under the project REMIND (FKZ 01-IS07007A), by the

German Academic Exchange Service, and by the FP7-ICT Programme of

the European Community, under the PASCAL2 Network of Excellence,

ICT-216886. Soeren Sonnenburg acknowledges financial support by the

German Research Foundation (DFG) under the grant MU 987/6-1 and RA

1894/1-1.

Non-Sparse Regularization and Efficient Training with

Multiple Kernels

Marius Kloft∗ mkloft@cs.berkeley.edu

University of California

Computer Science Division

Berkeley, CA 94720-1758, USA

Ulf Brefeld brefeld@yahoo-inc.com

Yahoo! Research

Avinguda Diagonal 177

08018 Barcelona, Spain

Sören Sonnenburg∗ soeren.sonnenburg@tuebingen.mpg.de

Friedrich Miescher Laboratory

Max Planck Society

Spemannstr. 39, 72076 Tübingen, Germany

Alexander Zien zien@lifebiosystems.com

LIFE Biosystems GmbH

Poststraße 34

69115 Heidelberg, Germany

Abstract

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, like ℓp-norms with p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster than the commonly used wrapper approaches. An experiment on controlled artificial data sheds light on the appropriateness of sparse, non-sparse, and ℓ∞ MKL in various scenarios. Applications of ℓp-norm MKL to three hard real-world problems from computational biology show that non-sparse MKL achieves accuracies beyond the state-of-the-art. We conclude that our improvements finally make MKL fit for deployment in practical applications: MKL now has a good chance of improving the accuracy (over a plain sum kernel) at an affordable computational cost.

1. Introduction

Kernels allow one to decouple machine learning from the data representation. Finding an appropriate data representation via a kernel function immediately opens the door to a vast world of powerful machine learning models (e.g., Schölkopf and Smola, 2002) with many efficient and reliable off-the-shelf implementations. This has propelled the dissemination of machine learning techniques to a wide range of diverse application domains.

∗. Also at Machine Learning Group, Technische Universität Berlin, Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany.

Finding an appropriate data abstraction—or even engineering the best kernel—for the

problem at hand is not always trivial, though. Starting with cross-validation (Stone, 1974), which is probably the most prominent approach to general model selection, a great many approaches to selecting the right kernel(s) have been proposed in the literature.

Kernel target alignment (Cristianini et al., 2002) aims at learning the entries of a kernel matrix by using the outer product of the label vector as the ground truth. Chapelle et al. (2002) and Bousquet and Herrmann (2002) minimize estimates of the generalization error of support vector machines (SVMs) using a gradient descent algorithm over the set of parameters. Ong et al. (2005) study hyperkernels on the space of kernels, and alternative approaches include selecting kernels by DC programming (Argyriou et al., 2008) and semi-infinite programming (Özögür-Akyüz and Weber, 2008; Gehler and Nowozin, 2008). Although finding non-linear kernel mixtures (Varma and Babu, 2009) generally results in non-convex optimization problems, Cortes et al. (2009) show that convex relaxations may be obtained for special cases.

However, learning arbitrary kernel combinations is a problem too general to allow for a general optimal solution—by focusing on a restricted scenario, it is possible to achieve guaranteed optimality. In their seminal work, Lanckriet et al. (2004) consider training an SVM along with optimizing the linear combination of several positive semi-definite matrices, $K=\sum_{m=1}^{M}\theta_m K_m$, subject to the trace constraint $\mathrm{tr}(K)\le c$ and requiring a valid combined kernel $K\succeq 0$. This spawned the new field of multiple kernel learning (MKL), the automatic combination of several kernel functions. Lanckriet et al. (2004) show that their specific

combination of several kernel functions. Lanckriet et al. (2004) show that their speciﬁc

version of the MKL task can be reduced to a convex optimization problem, namely a semi-

deﬁnite programming (SDP) optimization problem. Though convex, however, the SDP

approach is computationally too expensive for practical applications. Thus much of the

subsequent research focused on devising eﬃcient optimization procedures for learning with

multiple kernels.

One conceptual milestone for developing MKL into a tool of practical utility is simply to constrain the mixing coefficients θ to be non-negative: by obviating the complex constraint $K\succeq 0$, this small restriction allows one to transform the optimization problem into a quadratically constrained program, hence drastically reducing the computational burden.

While the original MKL objective is stated and optimized in dual space, alternative formulations have been studied. For instance, Bach et al. (2004) found a corresponding primal problem, and Rubinstein (2005) decomposed the MKL problem into a min-max problem that can be optimized by mirror-prox algorithms (Nemirovski, 2004).

The min-max formulation has been independently proposed by Sonnenburg et al. (2005). They use it to recast MKL training as a semi-infinite linear program (SILP). Solving the latter with column generation (e.g., Nash and Sofer, 1996) amounts to repeatedly training an SVM on a mixture kernel while iteratively refining the mixture coefficients θ. This immediately lends itself to a convenient implementation by a wrapper approach. These algorithms directly benefit from efficient SVM optimization routines (cf., e.g., Fan et al., 2005; Joachims, 1999) and are now commonly deployed in recent MKL solvers (e.g., Rakotomamonjy et al., 2008; Xu et al., 2009), thereby allowing for large-scale multiple kernel learning training


(Sonnenburg et al., 2005, 2006a). However, the complete training of several SVMs can still be prohibitive for large data sets. For this reason, Sonnenburg et al. (2005) also proposed to interleave the SILP with the SVM training, which reduces the training time drastically. Alternative optimization schemes include level-set methods (Xu et al., 2009) and second-order approaches (Chapelle and Rakotomamonjy, 2008). Szafranski et al. (2008), Nath et al. (2009), and Bach (2009) study composite and hierarchical kernel learning approaches. Finally, Zien and Ong (2007) and Ji et al. (2009) provide extensions for multi-class and multi-label settings, respectively.

Today, there exist two major families of multiple kernel learning models, characterized either by Ivanov regularization (Ivanov et al., 2002) over the mixing coefficients (Rakotomamonjy et al., 2007; Zien and Ong, 2007) or by Tikhonov regularization (Tikhonov and Arsenin, 1977). In both cases, there may be an additional parameter controlling the regularization of the mixing coefficients (Varma and Ray, 2007).

All of the above-mentioned multiple kernel learning formulations promote sparse solutions in terms of the mixing coefficients. The desire for sparse mixtures originates in practical as well as theoretical reasons. First, sparse combinations are easier to interpret. Second, irrelevant (and possibly expensive) kernel functions do not need to be evaluated at testing time. Finally, sparseness also appears to be handy from a technical point of view, as the additional simplex constraint $\|\theta\|_1\le 1$ simplifies derivations and turns the problem into a linearly constrained program. Nevertheless, sparseness is not always beneficial in practice. Sparse MKL is frequently observed to be outperformed by a regular SVM using an unweighted-sum kernel $K=\sum_m K_m$.

Consequently, despite all the substantial progress in the field of MKL, there still remains an unsatisfied need for an approach that is really useful for practical applications: a model that has a good chance of improving the accuracy (over a plain sum kernel) together with an implementation that matches today's standards (i.e., that can be trained on 10,000s of data points in a reasonable time). In addition, since the field has produced several competing MKL formulations, it seems timely to consolidate the set of models.

In this article we argue that all of this is now achievable, at least when considering MKL restricted to non-negative mixture coefficients. On the theoretical side, we cast multiple kernel learning as a general regularized risk minimization problem for arbitrary convex loss functions, Hilbertian regularizers, and arbitrary norm penalties on θ. We first show that the above-mentioned Tikhonov- and Ivanov-regularized MKL variants are equivalent in the sense that they yield the same set of hypotheses. Then we derive a generalized dual and show that a variety of methods are special cases of our objective. Our resulting optimization problem subsumes state-of-the-art approaches to multiple kernel learning, covering sparse and non-sparse MKL by arbitrary ℓp-norm regularization (1 ≤ p ≤ ∞) on the mixing coefficients as well as the incorporation of prior knowledge by allowing for non-isotropic regularizers. As we demonstrate, the ℓp-norm regularization includes both important special cases (sparse ℓ1-norm and plain sum ℓ∞-norm) and offers the potential to elevate predictive accuracy over both of them.

With regard to the implementation, we introduce an appealing and efficient optimization strategy based on an exact closed-form update in the θ-step, rendering expensive semi-infinite and first- or second-order gradient methods unnecessary. By utilizing proven working-set optimization for SVMs, ℓp-norm MKL can now be trained highly efficiently for all p; in particular, we outpace other current ℓ1-norm MKL implementations. Moreover, our implementation employs kernel caching techniques, which enables training on tens of thousands of data points or thousands of kernels, respectively. In contrast, most competing MKL software requires all kernel matrices to be stored completely in memory, which restricts these methods to small data sets with limited numbers of kernels. Our implementation is freely available within the SHOGUN machine learning toolbox available from http://www.shogun-toolbox.org/.

Our claims are backed up by experiments on artificial data and on several real-world data sets representing diverse, relevant, and challenging problems from the application domain of bioinformatics. The artificial data enables us to investigate the relationship between properties of the true solution and the optimal choice of kernel mixture regularization. The real-world problems include the prediction of the subcellular localization of proteins, the (transcription) starts of genes, and the function of enzymes. The results demonstrate (i) that combining kernels is now tractable on large data sets, (ii) that it can provide cutting-edge classification accuracy, and (iii) that, depending on the task at hand, different kernel mixture regularizations are required for achieving optimal performance.

The remainder of this paper is structured as follows. We derive the generalized MKL in

Section 2 and discuss relations to existing approaches in Section 3. Section 4 introduces the

novel optimization strategy and shows the applicability of existing optimization techniques

to our generalized formulation. We report on our empirical results in Section 5. Section 6

concludes.

2. Generalized MKL

In this section we cast multiple kernel learning into a unified framework: we present a regularized loss minimization formulation with additional norm constraints on the kernel mixing coefficients. We show that it comprises many popular MKL variants currently discussed in the literature, including seemingly different ones.

We derive generalized dual optimization problems without making specific assumptions on the norm regularizers or the loss function, besides requiring that the latter be convex. Our formulation covers binary classification and regression tasks and can easily be extended to multi-class classification and structural learning settings using appropriate convex loss functions and joint kernel extensions. Prior knowledge on kernel mixtures and kernel asymmetries can be incorporated by non-isotropic norm regularizers.

2.1 Preliminaries

We begin with reviewing the classical supervised learning setup. Given a labeled sample $\mathcal D=\{(x_i,y_i)\}_{i=1,\dots,n}$, where the $x_i$ lie in some input space $\mathcal X$ and $y_i\in\mathcal Y\subset\mathbb R$, the goal is to find a hypothesis $f\in\mathcal H$ that generalizes well on new and unseen data. Regularized risk minimization returns a minimizer $f^*$,

$$ f^*\in\operatorname*{argmin}_{f}\; R_{\mathrm{emp}}(f)+\lambda\,\Omega(f), $$

where $R_{\mathrm{emp}}(f)=\frac1n\sum_{i=1}^{n}V(f(x_i),y_i)$ is the empirical risk of hypothesis $f$ w.r.t. a convex loss function $V:\mathbb R\times\mathcal Y\to\mathbb R$, $\Omega:\mathcal H\to\mathbb R$ is a regularizer, and $\lambda>0$ is a trade-off parameter.


We consider linear models of the form

$$ f_{\tilde w,b}(x)=\langle\tilde w,\psi(x)\rangle+b, \qquad (1) $$

together with a (possibly non-linear) mapping $\psi:\mathcal X\to\mathcal H$ to a Hilbert space $\mathcal H$ (e.g., Schölkopf et al., 1998; Müller et al., 2001) and constrain the regularization to be of the form $\Omega(f)=\frac12\|\tilde w\|_2^2$, which allows us to kernelize the resulting models and algorithms. We will later make use of kernel functions $K(x,x')=\langle\psi(x),\psi(x')\rangle_{\mathcal H}$ to compute inner products in $\mathcal H$.

2.2 Convex Risk Minimization with Multiple Kernels

When learning with multiple kernels, we are given $M$ different feature mappings $\psi_m:\mathcal X\to\mathcal H_m$, $m=1,\dots,M$, each giving rise to a reproducing kernel $K_m$ of $\mathcal H_m$. Convex approaches to multiple kernel learning consider linear kernel mixtures $K_\theta=\sum_m\theta_m K_m$, $\theta_m\ge0$. Compared to Eq. (1), the primal model for learning with multiple kernels is extended to

$$ f_{\tilde w,b,\theta}(x)=\sum_{m=1}^{M}\sqrt{\theta_m}\,\langle\tilde w_m,\psi_m(x)\rangle_{\mathcal H_m}+b=\langle\tilde w,\psi_\theta(x)\rangle_{\mathcal H}+b, \qquad (2) $$

where the parameter vector $\tilde w$ and the composite feature map $\psi_\theta$ have the block structures $\tilde w=(\tilde w_1^\top,\dots,\tilde w_M^\top)^\top$ and $\psi_\theta=\sqrt{\theta_1}\,\psi_1\times\dots\times\sqrt{\theta_M}\,\psi_M$, respectively.

In learning with multiple kernels we aim at minimizing the loss on the training data w.r.t. the optimal kernel mixture $\sum_m\theta_m K_m$ in addition to regularizing θ to avoid overfitting. Hence, in terms of regularized risk minimization, the optimization problem becomes

$$ \inf_{\tilde w,b,\theta:\theta\ge0}\;\frac1n\sum_{i=1}^{n}V\!\left(\sum_{m=1}^{M}\sqrt{\theta_m}\,\langle\tilde w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b,\;y_i\right)+\frac\lambda2\sum_{m=1}^{M}\|\tilde w_m\|^2_{\mathcal H_m}+\tilde\mu\,\tilde\Omega[\theta], \qquad (3) $$

for $\tilde\mu>0$. Note that the objective value of Eq. (3) is an upper bound on the training error.

Previous approaches to multiple kernel learning employ regularizers of the form $\tilde\Omega(\theta)=\|\theta\|_1$ to promote sparse kernel mixtures. By contrast, we propose to use convex regularizers of the form $\tilde\Omega(\theta)=\|\theta\|^2$, where $\|\cdot\|$ is an arbitrary norm in $\mathbb R^M$, possibly allowing for non-sparse solutions and the incorporation of prior knowledge. The non-convexity arising from the $\sqrt{\theta_m}\,\tilde w_m$ product in the loss term of Eq. (3) is not inherent and can be resolved by substituting $w_m\leftarrow\sqrt{\theta_m}\,\tilde w_m$. Furthermore, the regularization parameter and the sample size can be decoupled by introducing $\tilde C=\frac{1}{n\lambda}$ (and adjusting $\mu\leftarrow\frac{\tilde\mu}{\lambda}$), which has favorable scaling properties in practice. We obtain the following convex optimization problem (Boyd and Vandenberghe, 2004), which has also been considered by Varma and Ray (2007) for the hinge loss and an ℓ1-norm regularizer:

$$ \inf_{w,b,\theta:\theta\ge0}\;\tilde C\sum_{i=1}^{n}V\!\left(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b,\;y_i\right)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}+\mu\|\theta\|^2, \qquad (4) $$

where we use the convention that $\frac t0=0$ if $t=0$ and $\infty$ otherwise.


An alternative approach has been studied by Rakotomamonjy et al. (2007) and Zien and Ong (2007), again using the hinge loss and the ℓ1-norm. They upper-bound the value of the regularizer, $\|\theta\|_1\le1$, and incorporate the latter as an additional constraint into the optimization problem. For $C>0$, they arrive at the following problem, which is the primary object of investigation in this paper.

Primal MKL Optimization Problem

$$ \inf_{w,b,\theta:\theta\ge0}\; C\sum_{i=1}^{n}V\!\left(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b,\;y_i\right)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m} \qquad (\mathrm P) $$
$$ \text{s.t.}\quad \|\theta\|^2\le1. $$

Our first contribution shows that, despite the additional regularization parameter, the Tikhonov regularization in (4) and the Ivanov regularization in Optimization Problem (P) are equivalent, in the sense that they yield the same binary classification function.

Theorem 1 Let $\|\cdot\|$ be a norm on $\mathbb R^M$ and $V$ a convex loss function. Suppose for the optimal $w^*$ in Optimization Problem (P) it holds that $w^*\neq\mathbf 0$. Then, for each pair $(\tilde C,\mu)$ there exists $C>0$ such that for each optimal solution $(w,b,\theta)$ of Eq. (4) using $(\tilde C,\mu)$, we have that $(w,b,\kappa\,\theta)$ is also an optimal solution of Optimization Problem (P) using $C$, and vice versa, where $\kappa>0$ is a multiplicative constant.

For the proof we need Prop. 8, which justifies switching from Ivanov to Tikhonov regularization, and back, if the regularizer is tight. We refer to Appendix A for the formulation and proof of that proposition.

Proof of Theorem 1 Let $(\tilde C,\mu)>0$. In order to apply Prop. 8 to (4), we start by showing that condition (35) in Prop. 8 is satisfied, i.e., that the regularizer is tight.

Suppose, on the contrary, that Optimization Problem (P) yields the same infimum regardless of whether we require

$$ \|\theta\|^2\le1, \qquad (5) $$

or not. This implies that in the optimal point we have $\sum_{m=1}^{M}\frac{\|w^*_m\|_2^2}{\theta^*_m}=0$, hence

$$ \frac{\|w^*_m\|_2^2}{\theta^*_m}=0 \quad \forall m. \qquad (6) $$

Since all norms on $\mathbb R^M$ are equivalent (cf., e.g., Rudin, 1991), there exists an $L<\infty$ such that $\|\theta^*\|_\infty\le L\|\theta^*\|$. In particular, we have $\|\theta^*\|_\infty<\infty$, from which we conclude by (6) that $w^*_m=\mathbf 0$ holds for all $m$, which contradicts our assumption.

Hence, Prop. 8 can be applied, which yields that (4) is equivalent to

$$ \inf_{w,b,\theta}\;\tilde C\sum_{i=1}^{n}V\!\left(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle+b,\;y_i\right)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|_2^2}{\theta_m} \quad\text{s.t.}\ \|\theta\|^2\le\tau, $$


for some $\tau>0$. Consider the optimal solution $(w^\star,b^\star,\theta^\star)$ corresponding to a given parametrization $(\tilde C,\tau)$. For any $\lambda>0$, the bijective transformation $(\tilde C,\tau)\mapsto(\lambda^{-1/2}\tilde C,\lambda\tau)$ will yield $(w^\star,b^\star,\lambda^{1/2}\theta^\star)$ as the optimal solution. Applying this transformation with $\lambda:=1/\tau$ and setting $C=\tilde C\tau^{1/2}$ as well as $\kappa=\tau^{-1/2}$ yields Optimization Problem (P), which was to be shown. ∎

Zien and Ong (2007) also showed that the MKL optimization problems of Bach et al. (2004), Sonnenburg et al. (2006a), and their own formulation are equivalent. As a main implication of Theorem 1, and by using the result of Zien and Ong, it follows that the optimization problem of Varma and Ray (2007) lies in the same equivalence class as those of Bach et al. (2004), Sonnenburg et al. (2006a), Rakotomamonjy et al. (2007), and Zien and Ong (2007). In addition, our result shows the coupling between the trade-off parameter $C$ and the regularization parameter $\mu$ in Eq. (4): tweaking one also changes the other, and vice versa. Theorem 1 implies that optimizing $C$ in Optimization Problem (P) implicitly searches the regularization path for the parameter $\mu$ of Eq. (4). In the remainder, we will therefore focus on the formulation in Optimization Problem (P), as a single parameter is preferable in terms of model selection.

2.3 Convex MKL in Dual Space

In this section we study the generalized MKL approach of the previous section in the dual space. Let us begin with rewriting Optimization Problem (P) by expanding the decision values into slack variables as follows:

$$ \inf_{w,b,t,\theta:\theta\ge0}\; C\sum_{i=1}^{n}V(t_i,y_i)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m} \qquad (7) $$
$$ \text{s.t.}\quad \forall i:\ \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b=t_i; \qquad \|\theta\|^2\le1, $$

where $\|\cdot\|$ is an arbitrary norm in $\mathbb R^M$ and $\|\cdot\|_{\mathcal H_m}$ denotes the Hilbertian norm of $\mathcal H_m$. Applying Lagrange's theorem re-incorporates the constraints into the objective by introducing Lagrangian multipliers $\alpha\in\mathbb R^n$ and $\beta\in\mathbb R_+$.¹ The Lagrangian saddle point problem is then given by

$$ \sup_{\alpha,\beta:\beta\ge0}\;\inf_{w,b,t,\theta\ge0}\; C\sum_{i=1}^{n}V(t_i,y_i)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}-\sum_{i=1}^{n}\alpha_i\!\left(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b-t_i\right)+\beta\left(\frac12\|\theta\|^2-\frac12\right). \qquad (8) $$

1. Note that α is variable over the whole range of $\mathbb R^n$ since it incorporates an equality constraint.


Denoting the Lagrangian by $\mathcal L$ and setting its first partial derivatives with respect to $w$ and $b$ to 0 reveals the optimality conditions

$$ \mathbf 1^\top\alpha=0; \qquad (9a) $$
$$ \forall m=1,\dots,M:\quad w_m=\theta_m\sum_{i=1}^{n}\alpha_i\,\psi_m(x_i). \qquad (9b) $$

Resubstituting the above equations yields

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0,\ \beta:\beta\ge0}\;\inf_{t,\theta\ge0}\; C\sum_{i=1}^{n}\bigl(V(t_i,y_i)+\alpha_i t_i\bigr)-\frac12\sum_{m=1}^{M}\theta_m\,\alpha^\top K_m\alpha+\beta\left(\frac12\|\theta\|^2-\frac12\right), $$

which can also be written in terms of unconstrained θ because, without loss of generality, the supremum with respect to θ is trivially attained for non-negative $\theta\ge0$. We arrive at

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0,\ \beta\ge0}\; -C\sum_{i=1}^{n}\sup_{t_i}\left(-\frac{\alpha_i}{C}\,t_i-V(t_i,y_i)\right)-\beta\sup_{\theta}\left(\frac{1}{2\beta}\sum_{m=1}^{M}\theta_m\,\alpha^\top K_m\alpha-\frac12\|\theta\|^2\right)-\frac12\beta. $$

As a consequence, we may now express the Lagrangian as²

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0,\ \beta\ge0}\; -C\sum_{i=1}^{n}V^*\!\left(-\frac{\alpha_i}{C},y_i\right)-\frac{1}{2\beta}\left\|\left(\tfrac12\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_*^2-\frac12\beta, \qquad (10) $$

where $h^*(x)=\sup_{u}\,x^\top u-h(u)$ denotes the Fenchel-Legendre conjugate of a function $h$ and $\|\cdot\|_*$ denotes the dual norm, i.e., the norm defined via the identity $\frac12\|\cdot\|_*^2:=\left(\frac12\|\cdot\|^2\right)^*$. In the following, we call $V^*$ the dual loss. Eq. (10) now has to be maximized with respect to the dual variables $\alpha,\beta$, subject to $\mathbf 1^\top\alpha=0$ and $\beta\ge0$. Let us ignore for a moment the non-negativity constraint on β and solve $\partial\mathcal L/\partial\beta=0$ for the unbounded β. Setting the partial derivative to zero allows us to express the optimal β as

$$ \beta=\left\|\left(\tfrac12\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_*. \qquad (11) $$

Obviously, at optimality we always have $\beta\ge0$. We thus discard the corresponding constraint from the optimization problem, and plugging Eq. (11) into Eq. (10) results in the following dual optimization problem, which now solely depends on α:

Dual MKL Optimization Problem

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0}\; -C\sum_{i=1}^{n}V^*\!\left(-\frac{\alpha_i}{C},y_i\right)-\left\|\left(\tfrac12\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_*. \qquad (\mathrm D) $$

2. We employ the notation $s=(s_1,\dots,s_M)^\top=(s_m)_{m=1}^{M}$ for $s\in\mathbb R^M$.


The above dual generalizes multiple kernel learning to arbitrary convex loss functions and norms. Note that if the loss function is continuous, the supremum is also a maximum. The threshold $b$ can be recovered from the solution by applying the KKT conditions.

The above dual can be characterized as follows. We start by noting that the expression in Optimization Problem (D) is a composition of two terms: first, the left-hand term, which depends on the conjugate loss function $V^*$, and second, the right-hand term, which depends on the conjugate norm. The right-hand term can be interpreted as a regularizer on the quadratic terms that, according to the chosen norm, smoothens the solutions. Hence we have a clean decomposition of the dual into a loss term (in terms of the dual loss) and a regularizer (in terms of the dual norm). For a specific choice of a pair $(V,\|\cdot\|)$ we can immediately recover the corresponding dual by computing the pair of conjugates $(V^*,\|\cdot\|_*)$. In the next section, this is illustrated by means of well-known loss functions and regularizers.

3. Instantiations of the Model

In this section we show that existing MKL-based learners are subsumed by the generalized

formulation in Optimization Problem (D).

3.1 Support Vector Machines with Unweighted-Sum Kernels

First we note that the support vector machine with an unweighted-sum kernel can be recovered as a special case of our model. To see this, we consider the regularized risk minimization problem using the hinge loss function $V(t,y)=\max(0,1-ty)$ and the regularizer $\|\theta\|_\infty$. We can then obtain the corresponding dual in terms of Fenchel-Legendre conjugate functions as follows.

We first note that the dual loss of the hinge loss is $V^*(t,y)=\frac ty$ if $-1\le\frac ty\le0$ and $\infty$ elsewise (Rifkin and Lippert, 2007). Hence, for each $i$ the term $V^*\!\left(-\frac{\alpha_i}{C},y_i\right)$ of the generalized dual, i.e., Optimization Problem (D), translates to $-\frac{\alpha_i}{C y_i}$, provided that $0\le\frac{\alpha_i}{y_i}\le C$. Employing a variable substitution of the form $\alpha_i^{\mathrm{new}}=\frac{\alpha_i}{y_i}$, Optimization Problem (D) translates to

$$ \max_{\alpha}\; \mathbf 1^\top\alpha-\frac12\left\|\left(\alpha^\top YK_mY\alpha\right)_{m=1}^{M}\right\|_*, \quad\text{s.t.}\ y^\top\alpha=0 \text{ and } 0\le\alpha\le C\mathbf 1, \qquad (12) $$

where we denote $Y=\operatorname{diag}(y)$. The primal ℓ∞-norm penalty $\|\theta\|_\infty$ is dual to $\|\theta\|_1$; hence, via the identity $\|\cdot\|_*=\|\cdot\|_1$, the right-hand side of the last equation translates to $\frac12\sum_{m=1}^{M}\alpha^\top YK_mY\alpha$. Combined with (12), this leads to the dual

$$ \sup_{\alpha}\; \mathbf 1^\top\alpha-\frac12\sum_{m=1}^{M}\alpha^\top YK_mY\alpha, \quad\text{s.t.}\ y^\top\alpha=0 \text{ and } 0\le\alpha\le C\mathbf 1, $$

which is precisely an SVM with an unweighted-sum kernel.
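As a concrete illustration (our sketch, not part of the original text), this special case reduces to a single SVM call on the summed Gram matrix; scikit-learn is assumed here as an off-the-shelf solver:

```python
import numpy as np
from sklearn.svm import SVC

def sum_kernel_svm(Ks, y, C=1.0):
    """l_infinity-norm MKL special case: an SVM on the unweighted-sum kernel."""
    K = np.sum(Ks, axis=0)  # K = sum_m K_m; Ks: list of (n x n) Gram matrices
    return SVC(C=C, kernel="precomputed").fit(K, y)
```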

3.2 QCQP MKL of Lanckriet et al. (2004)

A common approach in multiple kernel learning is to employ regularizers of the form

$$ \Omega(\theta)=\|\theta\|_1. \qquad (13) $$

These so-called ℓ1-norm regularizers are specific instances of sparsity-inducing regularizers. The obtained kernel mixtures are often sparse and hence equip the MKL problem with the favor of interpretable solutions. Sparse MKL is a special case of our framework; to see this, note that the dual norm of (13) is $\|\cdot\|_\infty$. Recalling the definition of an ℓp-norm, the right-hand side of Optimization Problem (D) translates to $\max_{m\in\{1,\dots,M\}}\frac12\alpha^\top YK_mY\alpha$. The maximum can subsequently be expanded into a slack variable ξ, resulting in

$$ \sup_{\alpha,\xi}\; \mathbf 1^\top\alpha-\xi $$
$$ \text{s.t.}\quad \forall m:\ \frac12\alpha^\top YK_mY\alpha\le\xi;\quad y^\top\alpha=0;\quad 0\le\alpha\le C\mathbf 1, $$

which is the original QCQP formulation of MKL, first given by Lanckriet et al. (2004).
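For small problems this QCQP can be prototyped almost verbatim with a modeling language; the following sketch (ours, using CVXPY and assuming the Gram matrices are positive semi-definite) is illustrative rather than the solver used in the paper:

```python
import cvxpy as cp
import numpy as np

def l1_mkl_qcqp(Ks, y, C=1.0):
    """QCQP formulation of l1-norm MKL (Lanckriet et al., 2004), via CVXPY."""
    n = len(y)
    Y = np.diag(y)
    a, xi = cp.Variable(n), cp.Variable()
    cons = [a >= 0, a <= C, y @ a == 0]
    # one quadratic constraint per kernel; assumes each Y K Y is PSD
    cons += [0.5 * cp.quad_form(a, Y @ K @ Y) <= xi for K in Ks]
    cp.Problem(cp.Maximize(cp.sum(a) - xi), cons).solve()
    return a.value
```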

3.3 ℓp-Norm MKL

The generalized MKL also allows for robust kernel mixtures by employing an ℓp-norm constraint with $p>1$, rather than an ℓ1-norm constraint, on the mixing coefficients (Kloft et al., 2009a). The following identity holds:

$$ \left(\tfrac12\|\cdot\|_p^2\right)^{*}=\tfrac12\|\cdot\|_q^2, \quad\text{where}\ \tfrac1p+\tfrac1q=1, $$

and we obtain for the dual norm of the ℓp-norm: $\|\cdot\|_*=\|\cdot\|_q$. This leads to the dual problem

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0}\; -C\sum_{i=1}^{n}V^*\!\left(-\frac{\alpha_i}{C},y_i\right)-\frac12\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_q. $$

In the special case of hinge loss minimization, we obtain the optimization problem

$$ \sup_{\alpha}\; \mathbf 1^\top\alpha-\frac12\left\|\left(\alpha^\top YK_mY\alpha\right)_{m=1}^{M}\right\|_q, \quad\text{s.t.}\ y^\top\alpha=0 \text{ and } 0\le\alpha\le C\mathbf 1. $$

It is worth mentioning that the optimality conditions yield the proportionality

$$ \theta^*_m\;\propto\;\left(\alpha^{*\top}K_m\alpha^*\right)^{\frac{2}{p-1}}, $$

as we will show in Sect. 4.1.
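To make the dual concrete, here is a small sketch (our addition) that evaluates the hinge-loss ℓp-MKL dual objective for given dual variables:

```python
import numpy as np

def lp_mkl_dual_objective(alphas, Ks, y, p):
    """Evaluate  1'a - 0.5 * || (a' Y K_m Y a)_m ||_q  with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    ya = alphas * y                        # Y @ alpha with Y = diag(y)
    s = np.array([ya @ K @ ya for K in Ks])
    return np.sum(alphas) - 0.5 * np.sum(s ** q) ** (1.0 / q)
```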

3.4 A Smooth Variant of Group Lasso

Yuan and Lin (2006) studied the following optimization problem for the special case $\mathcal H_m=\mathbb R^{d_m}$ and $\psi_m=\mathrm{id}_{\mathbb R^{d_m}}$, also known as group lasso,

$$ \min_{w,b}\;\frac C2\sum_{i=1}^{n}\left(y_i-\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}\right)^{2}+\frac12\sum_{m=1}^{M}\|w_m\|_{\mathcal H_m}. \qquad (14) $$


The above problem has been solved by active set methods in the primal (Roth and Fischer, 2008). We sketch an alternative approach based on dual optimization. First, we note that Eq. (14) can be equivalently expressed as (Micchelli and Pontil, 2005a)

$$ \inf_{w,b,\theta:\theta\ge0}\;\frac C2\sum_{i=1}^{n}\left(y_i-\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}\right)^{2}+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}, \quad\text{s.t.}\ \|\theta\|_1^2\le1. $$

The dual loss of $V(t,y)=\frac12(y-t)^2$ is $V^*(t,y)=\frac12 t^2+ty$, and the corresponding group lasso dual can be written as

$$ \max_{\alpha}\; y^\top\alpha-\frac{1}{2C}\|\alpha\|_2^2-\frac12\left\|\left(\alpha^\top YK_mY\alpha\right)_{m=1}^{M}\right\|_\infty, \qquad (15) $$

which can be expanded into the following QCQP:

$$ \sup_{\alpha,\xi}\; y^\top\alpha-\frac{1}{2C}\|\alpha\|_2^2-\xi \qquad (16) $$
$$ \text{s.t.}\quad \forall m:\ \frac12\alpha^\top YK_mY\alpha\le\xi. $$

For small $n$, the latter formulation can be handled efficiently by QCQP solvers. However, the quadratic constraints caused by the non-smooth ℓ∞-norm in the objective are still computationally too demanding. As a remedy, we propose a smooth and unconstrained variant based on ℓp-norms ($p>1$), given by

$$ \max_{\alpha}\; y^\top\alpha-\frac{1}{2C}\|\alpha\|_2^2-\frac12\left\|\left(\alpha^\top YK_mY\alpha\right)_{m=1}^{M}\right\|_p, $$

which can be solved very efficiently by limited-memory quasi-Newton descent methods (Liu and Nocedal, 1989).
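The following minimal sketch (ours, not part of the original implementation) illustrates this smooth variant with SciPy's L-BFGS-B optimizer; the kernel list `Ks`, targets `y`, and the default `p = 2` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def smooth_group_lasso_dual(Ks, y, C=1.0, p=2.0):
    """Maximize  y'a - ||a||^2/(2C) - 0.5*||(a'Q_m a)_m||_p  with Q_m = Y K_m Y."""
    Y = np.diag(y)
    Qs = [Y @ K @ Y for K in Ks]
    n = len(y)

    def neg_obj_and_grad(a):
        s = np.array([a @ Q @ a for Q in Qs])   # per-kernel quadratic terms
        s = np.maximum(s, 1e-12)                # guard the p-norm gradient at 0
        norm_p = np.sum(s ** p) ** (1.0 / p)
        obj = y @ a - (a @ a) / (2 * C) - 0.5 * norm_p
        # d/da [0.5*||s||_p] = ||s||_p^(1-p) * sum_m s_m^(p-1) Q_m a
        grad_norm = norm_p ** (1 - p) * sum(sm ** (p - 1) * (Q @ a)
                                            for sm, Q in zip(s, Qs))
        grad = y - a / C - grad_norm
        return -obj, -grad                      # minimize the negative objective

    res = minimize(neg_obj_and_grad, np.zeros(n), jac=True, method="L-BFGS-B")
    return res.x
```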

3.5 Density Level-Set Estimation

Density level-set estimators are frequently used for anomaly/novelty detection tasks (Markou and Singh, 2003a,b). Kernel approaches such as one-class SVMs (Schölkopf et al., 2001) and support vector domain descriptions (Tax and Duin, 1999) have been extended to MKL settings by Sonnenburg et al. (2006a) and Kloft et al. (2008), respectively. One-class MKL can be cast into our framework by employing loss functions of the form $V(t)=\max(0,1-t)$. This gives rise to the primal

$$ \inf_{w,b,\theta:\theta\ge0}\; C\sum_{i=1}^{n}\max\!\left(0,\;1-\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}\right)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}, \quad\text{s.t.}\ \|\theta\|^2\le1. $$

Noting that the dual loss is $V^*(t)=t$ if $-1\le t\le0$ and $\infty$ elsewise, we obtain the following generalized dual:

$$ \sup_{\alpha}\; \mathbf 1^\top\alpha-\frac12\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_q, \quad\text{s.t.}\ 0\le\alpha\le C\mathbf 1, $$

which has been studied by Sonnenburg et al. (2006a) for the ℓ1-norm and by Kloft et al. (2009b) for ℓp-norms.


3.6 Non-Isotropic Norms

In practice, it is often desirable for an expert to incorporate prior knowledge about the problem domain. For instance, an expert could have given an estimate of the interactions within the set of kernels considered, e.g., in the form of an $M\times M$ matrix $E$. Alternatively, it might be known in advance that a subset of the employed kernels is inferior to the remaining kernels; for instance, such knowledge could result from previous experiments in the considered application field. Those scenarios can easily be handled within our framework by considering non-isotropic regularizers of the form

$$ \|\theta\|_E=\sqrt{\theta^\top E\,\theta} \quad\text{with}\quad E\succ0. $$

The dual norm is again defined via $\frac12\|\cdot\|_*^2:=\left(\frac12\|\cdot\|_E^2\right)^*$, and the following easy-to-verify identity,

$$ \left(\tfrac12\|\cdot\|_E^2\right)^{*}=\tfrac12\|\cdot\|_F^2, $$

with matrix inverse $F=E^{-1}$, leads to the dual

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0}\; -C\sum_{i=1}^{n}V^*\!\left(-\frac{\alpha_i}{C},y_i\right)-\frac12\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_{E^{-1}}, $$

which is the desired non-isotropic MKL problem.
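As a small illustration (ours), the $E^{-1}$-norm regularization term of this dual can be evaluated directly, assuming $E$ is symmetric positive definite:

```python
import numpy as np

def nonisotropic_dual_term(alphas, Ks, E):
    """Evaluate 0.5 * ||(a' K_m a)_m||_{E^{-1}} with ||s||_{E^{-1}} = sqrt(s' E^{-1} s)."""
    s = np.array([alphas @ K @ alphas for K in Ks])
    return 0.5 * np.sqrt(s @ np.linalg.solve(E, s))
```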

4. Efficient Optimization Strategies

The dual as given in Optimization Problem (D) does not lend itself to efficient large-scale optimization in a straightforward fashion, for instance by direct application of standard approaches like gradient descent. Instead, it is beneficial to exploit the structure of the MKL cost function by alternating between optimizing w.r.t. the mixings θ and w.r.t. the remaining variables. Most recent MKL solvers (e.g., Rakotomamonjy et al., 2008; Xu et al., 2009; Varma and Babu, 2009) do so by setting up a two-layer optimization procedure: a master problem, which is parameterized only by θ and independent of α, is solved to determine the kernel mixture; to solve this master problem, a slave problem is solved repeatedly, which amounts to training a standard SVM on a mixture kernel. Importantly, for the slave problem the mixture coefficients are fixed, such that conventional, efficient SVM optimizers can be recycled. Consequently, these two-layer procedures are commonly implemented as wrapper approaches. Albeit appearing advantageous, wrapper methods suffer from a few shortcomings: (i) due to kernel cache limitations, the kernel matrices have to be pre-computed and stored, or many kernel computations have to be carried out repeatedly, inducing heavy wastage of either memory or time; (ii) the slave problem is always optimized to the end (and many convergence proofs seem to require this), although most of the computational time is spent on non-optimal mixtures. Certainly, suboptimal slave solutions would already suffice to improve a far-from-optimal θ in the master problem. Due to these problems, MKL is prohibitive when learning with a multitude of kernels and on large-scale data sets as commonly encountered in many data-intensive real-world applications such as bioinformatics, web mining, databases, and computer security. Therefore


all optimization approaches presented in this paper implement a true decomposition of the

MKL problem into smaller subproblems (Platt, 1999; Joachims, 1999; Fan et al., 2005) by

establishing a wrapper-like scheme within the decomposition algorithm.

Our algorithms are embedded into the large-scale framework of Sonnenburg et al. (2006a) and extend it to the optimization of non-sparse kernel mixtures induced by an ℓp-norm penalty. Our first strategy alternates between minimizing the primal problem (7) w.r.t. θ and incomplete optimization w.r.t. all other variables, which, however, is performed in terms of the dual variables α. For the second strategy, we devise a convex semi-infinite program (SIP), which we solve by column generation with nested sequential quadratically constrained linear programming (SQCLP). In both cases, optimization w.r.t. α is performed by chunking optimization with minor iterations. The first, "direct" approach can be applied without a general-purpose QCQP solver. We show convergence of both algorithms: for the "direct" algorithm in Prop. 5, and for the SQCLP in Prop. 6. All algorithms are implemented in the SHOGUN machine learning toolbox, which is freely available from http://www.shogun-toolbox.org/.

4.1 An Analytical Method

In this section we present a simple and efficient optimization strategy for multiple kernel learning. To derive the new algorithm, we first revisit the primal problem, i.e.,

$$ \inf_{w,b,\theta:\theta\ge0}\; C\sum_{i=1}^{n}V\!\left(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b,\;y_i\right)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}, \quad\text{s.t.}\ \|\theta\|^2\le1. \qquad (\mathrm P) $$

In order to obtain an efficient optimization strategy, we divide the variables in the above optimization problem into two groups: $(w,b)$ on one hand and θ on the other. In the following we derive an algorithm that alternatingly operates on those two groups via block coordinate descent, also known as the non-linear Gauss-Seidel method. Thereby the optimization w.r.t. θ is carried out analytically, and the $(w,b)$-step is computed in the dual, if needed.

The basic idea of our first approach is that for a given, fixed set of primal variables $(w,b)$, the optimal θ in the primal problem (P) can be calculated analytically. In the subsequent derivations we exemplarily employ non-sparse norms of the form $\|\theta\|_p=\left(\sum_{m=1}^{M}\theta_m^p\right)^{1/p}$, $1<p<\infty$, but the reasoning—including the convergence guarantees—holds for arbitrary continuously differentiable and strictly convex norms.³

The following proposition gives an analytic update formula for θ given the fixed remaining variables $(w,b)$; it will become the core of our proposed algorithm.

Proposition 2 Let $V$ be a convex loss function and $p>1$. Given fixed $(w,b)$, the optimal solution of Optimization Problem (P) is attained for

$$ \theta^*_m=\frac{\|w^*_m\|_{\mathcal H_m}^{\frac{2}{p+1}}}{\left(\sum_{m'=1}^{M}\|w^*_{m'}\|_{\mathcal H_{m'}}^{\frac{2p}{p+1}}\right)^{1/p}}, \quad\forall m=1,\dots,M. \qquad (17) $$

3. Lemma 26 in Micchelli and Pontil (2005b) indicates that the result could even be extended to an infinite number of kernels.


Proof We start the derivation by equivalently translating Optimization Problem (P) via Theorem 1 into

$$ \inf_{w,b,\theta:\theta\ge0}\;\tilde C\sum_{i=1}^{n}V\!\left(\sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b,\;y_i\right)+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}+\mu\|\theta\|_p^2. \qquad (18) $$

Setting the partial derivatives w.r.t. θ to zero, we obtain the following condition on the optimality of θ:

$$ -\frac{\|w_m\|^2_{\mathcal H_m}}{2\theta_m^2}+\beta\cdot\frac{\partial\,\frac12\|\theta\|_p^2}{\partial\theta_m}=0, \quad\forall m=1,\dots,M, \qquad (19) $$

with non-zero β (it holds that $\beta>0$ by the strict convexity of $\|\cdot\|$). The first derivative of the ℓp-norm with respect to the mixing coefficients can be expressed as

$$ \frac{\partial\,\frac12\|\theta\|_p^2}{\partial\theta_m}=\theta_m^{p-1}\,\|\theta\|_p^{2-p}, $$

and hence Eq. (19) translates into the following optimality condition,

$$ \theta^*_m=\zeta\,\|w^*_m\|_{\mathcal H_m}^{\frac{2}{p+1}}, \quad\forall m=1,\dots,M, \qquad (20) $$

with a suitable constant ζ. By the strict convexity of $\|\cdot\|$, the constraint $\|\theta\|_p^2\le1$ in Optimization Problem (P) is attained at the upper bound, and hence we have $\|\theta^*\|_p=1$ for an optimal θ*. Hence, ζ can be computed as $\zeta=\left(\sum_{m=1}^{M}\|w^*_m\|_{\mathcal H_m}^{2p/(p+1)}\right)^{-1/p}$. Combined with (20), this results in the claimed formula (17). ∎
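For illustration, the update (17) amounts to the following small routine (a sketch added by us; `w_norms` holds the values $\|w_m\|_{\mathcal H_m}$):

```python
import numpy as np

def analytic_theta_update(w_norms, p):
    """Eq. (17): theta_m = ||w_m||^(2/(p+1)) / (sum_m' ||w_m'||^(2p/(p+1)))^(1/p)."""
    w_norms = np.asarray(w_norms, dtype=float)
    theta = w_norms ** (2.0 / (p + 1))
    theta /= np.sum(w_norms ** (2.0 * p / (p + 1))) ** (1.0 / p)
    return theta  # satisfies ||theta||_p == 1
```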

In the more interesting case, we will perform the above update in the dual, thereby operating on the dual variables α:

Corollary 3 Let $V$ be a convex loss function and $p>1$. Given fixed dual variables α, as specified in Sect. 2.3, the optimal solution of Optimization Problem (P) is attained for

$$ \theta^*_m=\frac{\left(\alpha^\top K_m\alpha\right)^{\frac{2}{p-1}}}{\left(\sum_{m'=1}^{M}\left(\alpha^\top K_{m'}\alpha\right)^{\frac{2p}{p-1}}\right)^{1/p}}, \quad\forall m=1,\dots,M. \qquad (21) $$

Note that if we deploy the hinge loss, then we operate on the variables $\alpha_i^{\mathrm{new}}=\alpha_i y_i$ (cf. Sect. 3.1).

Proof According to Eq. (9b), the dual variables α are specified in terms of $w_m$ by

$$ w^*_m=\theta^*_m\sum_{i=1}^{n}\alpha^*_i\,\psi_m(x_i). $$

Plugging the above primal-dual relation into Eq. (20) and normalizing appropriately, we obtain the desired dual update formula for θ. ∎


Second, we consider how to optimize Optimization Problem (P) w.r.t. the remaining variables $(w_m,b)$ for a given set of mixing coefficients θ. Since optimization is often considerably easier in the dual space, we fix θ and build the partial Lagrangian of Optimization Problem (P) w.r.t. all other primal variables $w,b$. The resulting dual problem is of the form

$$ \sup_{\alpha:\mathbf 1^\top\alpha=0}\; -C\sum_{i=1}^{n}V^*\!\left(-\frac{\alpha_i}{C},y_i\right)-\frac12\sum_{m=1}^{M}\theta_m\,\alpha^\top K_m\alpha. \qquad (22) $$

We now have all the ingredients for an efficient ℓp-norm algorithm based on alternatingly solving an SVM w.r.t. the actual mixture θ and computing the analytical update according to Eq. (17). A simple wrapper algorithm is stated in Alg. 1.

Algorithm 1 Simple ℓ_{p>1}-norm MKL wrapper-based training algorithm. The analytical updates of θ and the SVM computations are carried out alternatingly.

1: input: feasible α and θ
2: while optimality conditions are not satisfied do
3:   solve Eq. (22) (e.g., an SVM) w.r.t. α
4:   obtain updated θ according to Eq. (21)
5: end while
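Below is a minimal, self-contained sketch of this wrapper (our illustration, not the SHOGUN implementation). It assumes scikit-learn's SVC as the single-kernel solver on precomputed Gram matrices `Ks` and labels `y` in {−1, +1}, recovers $\|w_m\|$ from the dual solution via Eq. (9b), and reuses `analytic_theta_update` from the sketch after Prop. 2:

```python
import numpy as np
from sklearn.svm import SVC

def lp_mkl_wrapper(Ks, y, C=1.0, p=2.0, n_iter=50, tol=1e-5):
    """Algorithm 1 (sketch): alternate an SVM alpha-step on the mixture kernel
    with the analytic theta-step of Eq. (17)."""
    M = len(Ks)
    theta = np.full(M, (1.0 / M) ** (1.0 / p))       # uniform start, ||theta||_p = 1
    for _ in range(n_iter):
        K_mix = sum(th * K for th, K in zip(theta, Ks))
        svm = SVC(C=C, kernel="precomputed").fit(K_mix, y)
        sv, coef = svm.support_, svm.dual_coef_.ravel()  # coef_i = y_i * alpha_i
        # ||w_m||^2 = theta_m^2 * (alpha y)' K_m (alpha y), cf. Eq. (9b)
        w_norms = np.array([th * np.sqrt(max(coef @ K[np.ix_(sv, sv)] @ coef, 0.0))
                            for th, K in zip(theta, Ks)])
        theta_new = analytic_theta_update(np.maximum(w_norms, 1e-12), p)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new, svm
        theta = theta_new
    return theta, svm
```

Stopping on the change in θ is a simplification of this sketch; the actual implementation monitors KKT violations, as described below.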

A disadvantage of the above wrapper approach is that it deploys a full-blown kernel matrix. Instead, we propose to interleave the SVM optimization of SVMlight with the θ- and α-steps at training time. We have implemented this so-called interleaved algorithm in SHOGUN for the hinge loss, thereby promoting sparse solutions in α. This allows us to operate solely on a small number of active variables.⁴ The resulting interleaved optimization method is shown in Algorithm 2. Lines 3-5 are standard in chunking-based SVM solvers and are carried out by SVMlight. Lines 6-8 compute (parts of) the SVM objective values for each kernel independently. Finally, lines 10 to 14 solve the analytical θ-step. The algorithm terminates when the maximal KKT violation (cf. Joachims, 1999) falls below a predetermined precision $\varepsilon_{\mathrm{svm}}$ and the normalized maximal constraint violation satisfies $|1-\frac{S^t}{\lambda}|<\varepsilon_{\mathrm{mkl}}$ for the MKL-step.

In the following, we exploit the primal view of the above algorithm as a non-linear Gauss-Seidel method to prove convergence. We first need the following useful result about convergence of the non-linear Gauss-Seidel method in general.

Proposition 4 (Bertsekas, 1999) Let $\mathcal X=\bigotimes_{m=1}^{M}\mathcal X_m$ be the Cartesian product of closed convex sets $\mathcal X_m\subseteq\mathbb R^{d_m}$, and let $f:\mathcal X\to\mathbb R$ be a continuously differentiable function. Define the non-linear Gauss-Seidel method recursively by letting $x^0\in\mathcal X$ be any feasible point and

$$ x_m^{k+1}=\operatorname*{argmin}_{\xi\in\mathcal X_m}\,f\!\left(x_1^{k+1},\dots,x_{m-1}^{k+1},\,\xi,\,x_{m+1}^{k},\dots,x_M^{k}\right), \quad\forall m=1,\dots,M. \qquad (23) $$

4. In practice, it turns out that the kernel matrix of active variables usually is only about of size 40×40, even when we deal with tens of thousands of examples.


Algorithm 2 ℓp-norm MKL chunking-based training algorithm via analytical update. The kernel weighting θ and the SVM α are optimized interleavingly. The accuracy parameter ε and the subproblem size Q are assumed to be given to the algorithm.

1: g_{m,i} = 0, ĝ_i = 0, α_i = 0, θ_m = (1/M)^{1/p} for m = 1,…,M and i = 1,…,n
2: for t = 1, 2, … and while SVM and MKL optimality conditions are not satisfied do
3:   Select Q suboptimal variables α_{i1},…,α_{iQ} based on the gradient ĝ and α; store α^{old} = α
4:   Solve the SVM dual with respect to the selected variables and update α
5:   Update the gradient g_{m,i} ← g_{m,i} + Σ_{q=1}^{Q} (α_{iq} − α^{old}_{iq}) y_{iq} k_m(x_{iq}, x_i) for all m = 1,…,M and i = 1,…,n
6:   for m = 1,…,M do
7:     S^t_m = ½ Σ_i g_{m,i} α_i y_i
8:   end for
9:   if |1 − S^t/λ| ≥ ε_mkl then
10:    while MKL optimality conditions are not satisfied do
11:      for m = 1,…,M do
12:        θ_m = (S^t_m)^{1/(p+1)} / (Σ_{m'=1}^{M} (S^t_{m'})^{p/(p+1)})^{1/p}
13:      end for
14:    end while
15:  end if
16:  ĝ_i = Σ_m θ_m g_{m,i} for all i = 1,…,n
17: end for

Suppose that for each $m$ and $x\in\mathcal X$, the minimum

$$ \min_{\xi\in\mathcal X_m} f(x_1,\dots,x_{m-1},\xi,x_{m+1},\dots,x_M) \qquad (24) $$

is uniquely attained. Then every limit point of the sequence $\{x^k\}_{k\in\mathbb N}$ is a stationary point.

The proof can be found in Bertsekas (1999), pp. 268-269. The next proposition establishes convergence of the proposed ℓp-norm MKL training algorithm.

Proposition 5 Let $V$ be the hinge loss and $p>1$. Let the kernel matrices $K_1,\dots,K_M$ be positive definite. Then every limit point of Algorithm 1 is a globally optimal point of Optimization Problem (P). Moreover, if the SVM computation is solved exactly in each iteration, then the same holds true for Algorithm 2.

Proof If we ignore the numerical speed-ups, then Algorithms 1 and 2 coincide for the hinge loss. Hence, it suffices to show that the wrapper algorithm converges.

To this aim, we have to transform Optimization Problem (P) into a form that satisfies the requirements of Prop. 4. We start by expanding Optimization Problem (P) into

$$ \min_{w,b,\xi,\theta}\; C\sum_{i=1}^{n}\xi_i+\frac12\sum_{m=1}^{M}\frac{\|w_m\|^2_{\mathcal H_m}}{\theta_m}, $$
$$ \text{s.t.}\quad \forall i:\ \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathcal H_m}+b\ge1-\xi_i;\quad \xi\ge0;\quad \|\theta\|_p^2\le1;\quad \theta\ge0, $$


thereby extending the second block of variables, $(w,b)$, into $(w,b,\xi)$. Moreover, we note that after an application of the representer theorem⁵ (Kimeldorf and Wahba, 1971) we may without loss of generality assume $\mathcal H_m=\mathbb R^n$.

In the problem's current form, the possibility of $\theta_m=0$ while $w_m\neq0$ renders the objective function non-differentiable, and it can take on infinite values. This hinders the application of Prop. 4. Fortunately, at the optimum we always have $\theta^*_m>0$ for all $m$, which can be verified from Eq. (21) using the positive definiteness of the kernel matrices $K_m$. We can therefore substitute the constraint $\theta\ge0$ by $\theta>0$. In order to maintain the closedness of the feasible set, we subsequently apply a bijective coordinate transformation $\phi:\mathbb R^M_+\to\mathbb R^M$ with $\phi_m(\theta_m)=\log(\theta_m)$, resulting in the following equivalent problem,

$$ \inf_{w,b,\xi,\theta}\; C\sum_{i=1}^{n}\xi_i+\frac12\sum_{m=1}^{M}\exp(-\theta_m)\,\|w_m\|^2_{\mathbb R^n}, $$
$$ \text{s.t.}\quad \forall i:\ \sum_{m=1}^{M}\langle w_m,\psi_m(x_i)\rangle_{\mathbb R^n}+b\ge1-\xi_i;\quad \xi\ge0;\quad \|\exp(\theta)\|_p^2\le1, $$

where we employ the notation $\exp(\theta)=(\exp(\theta_1),\dots,\exp(\theta_M))^\top$.

Applying the Gauss-Seidel method in Eq. (23) to the base problem (P) and to the reparametrized problem yields the same sequence of solutions $\{(w,b,\theta)^k\}_{k\in\mathbb N_0}$. Fortunately, the above problem now allows us to apply Prop. 4 to the two blocks of coordinates $\theta\in\mathcal X_1$ and $(w,b,\xi)\in\mathcal X_2$: the objective is continuously differentiable and the sets $\mathcal X_i$ are closed and convex. To see the latter, note that $\|\cdot\|_p^2\circ\exp$ is a convex function, since $\|\cdot\|_p^2$ is convex and non-decreasing in each argument on the positive orthant (cf., e.g., Section 3.2.4 of Boyd and Vandenberghe, 2004). Moreover, the minima in Eq. (23) are uniquely attained: the $(w,b)$-step amounts to solving an SVM on a positive definite kernel mixture, and the analytical θ-step clearly yields unique solutions as well.

Hence, we conclude that every limit point of the sequence $\{(w,b,\theta)^k\}_{k\in\mathbb N}$ is a stationary point of Optimization Problem (P). For a convex problem, this is equivalent to such a limit point being globally optimal. ∎

In practice, we face two problems. First, the standard Hilbert space setup necessarily implies that $\|w_m\|\ge0$ for all $m$. In practice, however, this assumption may be violated, either due to numerical imprecision or because an indefinite "kernel" function is used. However, for any $\|w_m\|\le0$ it also follows that $\theta^\star_m=0$, as long as at least one strictly positive $\|w_{m'}\|>0$ exists, because for any $\lambda<0$ we have $\lim_{h\to0,h>0}\frac{\lambda}{h}=-\infty$. Thus, for any $m$ with $\|w_m\|\le0$ we can immediately set the corresponding mixing coefficient $\theta^\star_m$ to zero. The remaining θ are then computed according to Eq. (17), and convergence will be achieved as long as at least one strictly positive $\|w_{m'}\|>0$ exists in each iteration.

Second, in practice, the SVM problem will only be solved to finite precision, which may lead to convergence problems. Moreover, we actually want to improve α only a little before recomputing θ, since computing a high-precision solution can be wasteful, as indicated by the superior performance of the interleaved algorithms (cf. Sect. 5.5). This helps to avoid spending a lot of α-optimization (SVM training) on a suboptimal mixture θ. Fortunately, we can overcome the potential convergence problem by ensuring that the primal objective decreases within each α-step; then the alternating optimization is guaranteed to converge. In practice, this is enforced by computing the SVM to a higher precision if needed. However, in our computational experiments we found that this precaution is not even necessary: even without it, the algorithm converged in all cases that we tried (cf. Section 5).

5. Note that the coordinate transformation into $\mathbb R^n$ can be constructively given in terms of the empirical kernel map (Schölkopf et al., 1999).

Finally, we would like to point out that the proposed block coordinate descent approach lends itself more naturally to combination with primal SVM optimizers such as that of Chapelle (2006), LibLinear (Fan et al., 2008), or Ocas (Franc and Sonnenburg, 2008). Especially for linear kernels this is extremely appealing.

4.2 Cutting Planes

In order to obtain an alternative optimization strategy, we fix θ in the primal MKL optimization problem (P) and build the partial Lagrangian w.r.t. all other primal variables $w,b$. The resulting dual problem is a min-max problem of the form

$$ \inf_{\theta:\theta\ge0,\,\|\theta\|^2\le1}\;\sup_{\alpha:\mathbf 1^\top\alpha=0}\; -C\sum_{i=1}^{n}V^*\!\left(-\frac{\alpha_i}{C},y_i\right)-\frac12\sum_{m=1}^{M}\theta_m\,\alpha^\top K_m\alpha. \qquad (25) $$

We focus on the hinge loss, i.e., $V(t,y)=\max(0,1-ty)$, and non-sparse norms of the form $\|\theta\|_p=\left(\sum_{m=1}^{M}\theta_m^p\right)^{1/p}$ (nevertheless, the following reasoning holds for every twice differentiable norm). Thus, employing a variable substitution of the form $\alpha_i^{\mathrm{new}}=\alpha_i y_i$, Eq. (25) translates into

$$ \min_{\theta}\max_{\alpha}\; \mathbf 1^\top\alpha-\frac12\,\alpha^\top\!\left(\sum_{m=1}^{M}\theta_m Q_m\right)\!\alpha $$
$$ \text{s.t.}\quad 0\le\alpha\le C\mathbf 1;\quad y^\top\alpha=0;\quad \theta\ge0;\quad \|\theta\|_p^2\le1, $$

where $Q_m=YK_mY$ for $1\le m\le M$ and $Y=\operatorname{diag}(y)$.

where Qj=Y KjYfor 1 ≤j≤mand Y= diag(y). The above optimization problem is a

saddle point problem and can be solved by alternating αand θoptimization step. While

the former can simply be carried out by a support vector machine for a ﬁxed mixture θ, the

latter has been optimized for p= 1 by reduced gradients (Rakotomamonjy et al., 2007).

We take a different approach and translate the min-max problem into an equivalent semi-infinite program (SIP) as follows. Denote the value of the target function by $t(\alpha,\theta)$ and suppose $\alpha^*$ is optimal. Then, according to the max-min inequality (Boyd and Vandenberghe, 2004, p. 115), we have $t(\alpha^*,\theta)\ge t(\alpha,\theta)$ for all α and θ. Hence, we can equivalently minimize an upper bound η on the optimal value and arrive at the following semi-infinite program,

$$ \min_{\eta,\theta}\;\eta \quad\text{s.t.}\quad \forall\alpha\in\mathcal A:\ \eta\ge\mathbf 1^\top\alpha-\frac12\,\alpha^\top\!\left(\sum_{m=1}^{M}\theta_m Q_m\right)\!\alpha;\qquad \theta\ge0;\ \|\theta\|_p^2\le1, \qquad (\mathrm{SIP}) $$

where $\mathcal A=\{\alpha\in\mathbb R^n\mid 0\le\alpha\le C\mathbf 1,\ y^\top\alpha=0\}$.

Algorithm 3 Chunking-based ℓp-norm MKL cutting plane training algorithm. It simultaneously optimizes the variables α and the kernel weighting θ. The accuracy parameter ε and the subproblem size Q are assumed to be given to the algorithm. For simplicity, a few speed-up tricks are not shown, e.g., hot-starts of the SVM and the QCQP solver.

1: g_{m,i} = 0, ĝ_i = 0, α_i = 0, θ_m = (1/M)^{1/p} for m = 1,…,M and i = 1,…,n
2: for t = 1, 2, … and while SVM and MKL optimality conditions are not satisfied do
3:   Select Q suboptimal variables α_{i1},…,α_{iQ} based on the gradient ĝ and α; store α^{old} = α
4:   Solve the SVM dual with respect to the selected variables and update α
5:   Update the gradient g_{m,i} ← g_{m,i} + Σ_{q=1}^{Q} (α_{iq} − α^{old}_{iq}) y_{iq} k_m(x_{iq}, x_i) for all m = 1,…,M and i = 1,…,n
6:   for m = 1,…,M do
7:     S^t_m = ½ Σ_i g_{m,i} α_i y_i
8:   end for
9:   L^t = Σ_i α_i,   S^t = Σ_m θ_m S^t_m
10:  if |1 − S^t/λ| ≥ ε_mkl then
11:    while MKL optimality conditions are not satisfied do
12:      θ^{old} = θ
13:      (θ, λ) ← argmax λ
14:        w.r.t. θ ∈ ℝ^M, λ ∈ ℝ
15:        s.t. 0 ≤ θ ≤ 1,
16:          (p(p−1)/2) Σ_m (θ^{old}_m)^{p−2} θ_m² − Σ_m p(p−2) (θ^{old}_m)^{p−1} θ_m ≤ p(3−p)/2, and
17:          Σ_m θ_m S^r_m − L^r ≥ λ for r = 1,…,t
18:      θ ← θ / ‖θ‖_p
19:      Remove inactive constraints
20:    end while
21:  end if
22:  ĝ_i = Σ_m θ_m g_{m,i} for all i = 1,…,n
23: end for

Sonnenburg et al. (2006a) optimize the above SIP for $p=1$ with interleaving cutting plane algorithms: the solution of a quadratic program (here the regular SVM) generates the most strongly violated constraint for the actual mixture θ; the optimal $(\theta^*,\eta)$ is then identified by solving a linear program with respect to the set of active constraints; the optimal mixture is in turn used for computing a new constraint, and so on.

Unfortunately, for $p>1$, a non-linearity is introduced by requiring $\|\theta\|_p^2\le1$, and such a constraint is unlikely to be found in standard optimization toolboxes, which often handle only linear and quadratic constraints. As a remedy, we propose to approximate the constraint


$\|\theta\|_p^p\le1$ by a sequence of second-order Taylor expansions⁶

$$ \|\theta\|_p^p \;\approx\; \|\tilde\theta\|_p^p + p\left(\tilde\theta^{p-1}\right)^{\!\top}\!\left(\theta-\tilde\theta\right) + \frac{p(p-1)}{2}\left(\theta-\tilde\theta\right)^{\!\top}\operatorname{diag}\!\left(\tilde\theta^{p-2}\right)\left(\theta-\tilde\theta\right) $$
$$ = 1 + \frac{p(p-3)}{2} - \sum_{m=1}^{M} p(p-2)\,\tilde\theta_m^{\,p-1}\theta_m + \frac{p(p-1)}{2}\sum_{m=1}^{M}\tilde\theta_m^{\,p-2}\theta_m^2, $$

where $\theta^p$ is defined element-wise, that is, $\theta^p:=(\theta_1^p,\dots,\theta_M^p)$. The sequence $(\theta^0,\theta^1,\dots)$ is initialized with a uniform mixture satisfying $\|\theta^0\|_p^p=1$ as a starting point. Successively, $\theta^{t+1}$ is computed using $\tilde\theta=\theta^t$. Note that the Hessian of the quadratic term in the approximation is diagonal, strictly positive, and very well conditioned, wherefore the resulting quadratically constrained problem can be solved efficiently. In fact, since there is only one quadratic constraint, its complexity should rather be compared to that of a considerably easier quadratic program. Moreover, in order to ensure convergence, we enhance the resulting sequential quadratically constrained quadratic programming by projection steps onto the boundary of the feasible set, as given in line 18. Finally, note that this approach can be further sped up by additional level-set projections in the θ-optimization phase, similar to Xu et al. (2009). In our case, the level-set projection is a convex quadratic problem with ℓp-norm constraints and can again be approximated by a successive sequence of second-order Taylor expansions.
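As a quick numeric sanity check of this expansion (our addition, not part of the original text):

```python
import numpy as np

def taylor_norm_pp(theta, theta_tilde, p):
    """Second-order Taylor expansion of ||theta||_p^p around theta_tilde
    (assumes ||theta_tilde||_p = 1, as maintained by the algorithm)."""
    d = theta - theta_tilde
    return (1.0 + p * theta_tilde ** (p - 1) @ d
            + 0.5 * p * (p - 1) * (theta_tilde ** (p - 2) * d) @ d)

rng = np.random.default_rng(0)
p, M = 3.0, 5
t0 = rng.random(M); t0 /= np.sum(t0 ** p) ** (1 / p)   # ||t0||_p = 1
t = t0 + 0.01 * rng.standard_normal(M)                  # small perturbation
print(np.sum(t ** p), taylor_norm_pp(t, t0, p))         # nearly equal
```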

Algorithm 3 outlines the interleaved α-θ MKL training algorithm. Lines 3-5 are standard in chunking-based SVM solvers and are carried out by SVMlight. Lines 6-9 compute (parts of) the SVM objective values for each kernel independently. Finally, lines 11 to 19 solve a sequence of semi-infinite programs with the ℓp-norm constraint approximated by a sequence of second-order constraints. The algorithm terminates when the maximal KKT violation (see Joachims, 1999) falls below a predetermined precision $\varepsilon_{\mathrm{svm}}$ and the normalized maximal constraint violation satisfies $|1-\frac{S^t}{\lambda}|<\varepsilon_{\mathrm{mkl}}$ for the MKL step. The following proposition shows the convergence of the semi-infinite programming loop in Algorithm 3.

Proposition 6 Let the kernel matrices $K_1,\dots,K_M$ be positive definite and $p>1$. Suppose that the SVM computation is solved exactly in each iteration. Moreover, suppose there exists an optimal limit point of the nested sequence of QCQPs. Then the sequence generated by Algorithm 3 has at least one point of accumulation that solves Optimization Problem (P).

Proof By assumption, the SVM is solved to infinite precision in each MKL step, which simplifies our analysis in that the numerical details in Algorithm 3 can be ignored. We conclude that the outer loop of Alg. 3 amounts to a cutting-plane algorithm for solving the semi-infinite program (SIP). It is well known (Sonnenburg et al., 2006a) that this algorithm converges, in the sense that there exists at least one point of accumulation that solves the primal problem (P). For example, this can be seen by viewing the cutting plane algorithm as a special instance of the class of so-called exchange methods and subsequently applying Theorem 7.2 in Hettich and Kortanek (1993). A difference to the analysis in Sonnenburg et al. (2006a) is the ℓ_{p>1}-norm constraint in our algorithm. However, according to our assumption that the nonlinear subprogram is solved correctly, a quick inspection of the preliminaries of the latter theorem reveals that they remain fulfilled when introducing an ℓp-norm constraint.

6. We also tried first-order Taylor expansions, whereby our algorithm basically boils down to the renowned sequential quadratic programming, but this empirically turned out to be inferior. Intuitively, second-order expansions work best when the approximated function is almost quadratic, as is the case here.

In order to complete our convergence analysis, it remains to show that the inner loop (lines 11-19), that is, the sequence of QCQPs, converges to an optimal point. Existing analyses of this so-called sequential quadratically constrained quadratic programming (SQCQP) can be divided into two classes. The first class establishes local convergence, i.e., convergence in an open neighborhood of the optimal point, at a rate of $O(n^2)$, under relatively mild smoothness and constraint qualification assumptions (Fernández and Solodov, 2008; Anitescu, 2002), where Anitescu (2002) additionally requires quadratic growth of the nonlinear constraints. These analyses basically guarantee local convergence of the nested sequences of QCQPs in our ℓp-norm training algorithm, for all $p\in(1,\infty)$ (Fernández and Solodov, 2008) and for $p\ge2$ (Anitescu, 2002), respectively.

A second class of papers additionally establishes global convergence (e.g., Solodov, 2004; Fukushima et al., 2002), at the price of more restrictive assumptions. Moreover, in order to ensure feasibility of the subproblems when the actual iterate is too far away from the true solution, a modification of the algorithmic protocol is needed. This is usually dealt with by performing a subsequent line search and downweighting the quadratic term by a multiplicative adaptive constant $D_i\in[0,1]$. Unfortunately, the latter involves a complicated procedure to tune $D_i$ (Fukushima et al., 2002, p. 7). Employing the above modifications, the analysis in Fukushima et al. (2002) together with our Prop. 6 would guarantee the convergence of our Alg. 3.

However, due to the special form of our SQCQP, we chose to forgo these comfortable convergence guarantees and to proceed with a much simpler and more efficient strategy, which renders both the expensive line search and the tuning of the constant $D_i$ unnecessary. The idea of our method is that the projection of θ onto the boundary of the feasible set, given by line 18 in Alg. 3, can be performed analytically. This projection ensures the feasibility of the QCQP subproblems. Note that, in general, such a projection can be as expensive as a QCQP step, which is why projection-type algorithms for solving SQCQPs have, to the best of our knowledge, not yet been studied in the optimization literature.

Although the projection procedure is appealingly simple and—as we found empirically—seemingly shares nice convergence properties (the sequence of SQCQPs converged optimally in all cases we tried, usually after 3-4 iterations), it unfortunately prohibits exploitation of existing analyses for global convergence. However, the discussions in Fukushima et al. (2002) and Solodov (2004) identify the reason for occasional divergence of the vanilla SQCQP as the infeasibility of the subproblems. In contrast, our projection algorithm always ensures the feasibility of the subproblems. Based on the superior empirical results and the discussions in Fukushima et al. (2002) and Solodov (2004), we therefore conjecture that our algorithm converges. The theoretical analysis of this new class of so-called SQCQP projection algorithms is beyond the scope of this paper.


4.3 Technical Considerations

4.3.1 Implementation Details

We have implemented the analytic and the cutting plane algorithms as well as a Newton method (cf. Kloft et al., 2009a) within the SHOGUN toolbox^7 for regression, one-class classification, and two-class classification tasks. In addition, one can choose the optimization scheme, i.e., decide whether the interleaved optimization algorithm or the wrapper algorithm should be applied. In all approaches, any of the SVMs contained in SHOGUN can be used.

In the more conventional family of approaches, the so-called wrapper algorithms, an optimization scheme on θ wraps around a single kernel SVM. Effectively this results in alternately solving for α and θ. For the outer optimization (i.e., that on θ), SHOGUN offers the three choices listed above. The semi-infinite program (SIP) uses a traditional SVM to generate new violated constraints and thus requires a single kernel SVM. A linear program (for p = 1) or a sequence of quadratically constrained linear programs (for p > 1) is solved via GLPK^8 or IBM ILOG CPLEX^9. Alternatively, either an analytic or a Newton update step (for ℓp-norms with p > 1) can be performed, obviating the need for additional mathematical programming software.

The second, much faster approach performs interleaved optimization and thus requires a modification of the core SVM optimization algorithm. It is currently integrated into the chunking-based SVRlight and SVMlight. To reduce the implementation effort, we implement a single function perform_mkl_step($\sum\alpha$, obj$_m$) with arguments $\sum\alpha = \sum_{i=1}^n \alpha_i$ and $\mathrm{obj}_m = \frac{1}{2}\alpha^\top K_m \alpha$, i.e., the current linear α-term and the SVM objective for each kernel. In the interleaved optimization case this function is called as a callback (after each chunking step or a couple of SMO steps); otherwise it is called by the wrapper algorithm (after each SVM optimization to full precision). A sketch of such a step is given below.
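The following minimal Python sketch illustrates what such a step can compute for the analytic ℓp update of Section 4.1. It assumes the relation $\|w_m\|^2 = 2\,\theta_m^2\,\mathrm{obj}_m$ and the update rule $\theta_m \propto \|w_m\|^{2/(p+1)}$; the function signature and the array-based interface are ours for illustration, not SHOGUN's actual C++ API.

```python
import numpy as np

def perform_mkl_step(theta, obj, p):
    # theta: current kernel weights, shape (M,)
    # obj:   per-kernel SVM objectives obj_m = 0.5 * alpha^T K_m alpha
    # p:     norm parameter, p > 1
    # Assumed relation (cf. Section 4.1): ||w_m||^2 = 2 * theta_m^2 * obj_m
    w_norm2 = 2.0 * theta**2 * obj
    # Analytic optimum: theta_m proportional to ||w_m||^(2/(p+1)),
    # rescaled so that ||theta||_p = 1.
    theta_new = w_norm2 ** (1.0 / (p + 1))
    return theta_new / np.sum(theta_new ** p) ** (1.0 / p)
```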

Recovering Regression and One-Class Classification. It should be noted that one-class classification is trivially implemented by using $\sum\alpha = 0$, while support vector regression (SVR) is typically performed by internally translating the SVR problem into a standard SVM classification problem with twice the number of examples, each example appearing once positively and once negatively labeled, with corresponding $\alpha$ and $\alpha^*$. Thus one needs direct access to $\alpha^*$ and computes $\sum\alpha = -\sum_{i=1}^n (\alpha_i + \alpha_i^*)\varepsilon - \sum_{i=1}^n (\alpha_i - \alpha_i^*)y_i$ (cf. Sonnenburg et al., 2006a). Since this requires a modification of the core SVM solver, we implemented SVR only for interleaved optimization and SVMlight.

Efficiency Considerations and Kernel Caching. Note that the choice of the kernel cache size becomes crucial when applying MKL to large scale learning applications.^10 While for the wrapper algorithm only a single kernel SVM needs to be solved, and thus a single large kernel cache should be used, the story is different for interleaved optimization. Since one must keep track of the several partial MKL objectives obj_m, which requires access to individual kernel rows, the same cache size should be used for all sub-kernels.

7. http://www.shogun-toolbox.org.

8. http://www.gnu.org/software/glpk/.

9. http://www.ibm.com/software/integration/optimization/cplex/.

10. Large scale in the sense that the data cannot be stored in memory or the computation reaches a maintainable limit. In the case of MKL this can be due to both a large sample size and a high number of kernels.


4.3.2 Kernel Normalization

The normalization of kernels is as important for MKL as the normalization of features is for training regularized linear or single-kernel models. This is due to the bias introduced by the regularization: optimal feature and kernel weights are required to be small. This is easier to achieve for features (or entire feature spaces, as implied by kernels) that are scaled to be of large magnitude, while downscaling them would require a correspondingly upscaled weight for representing the same predictive model. Upscaling (downscaling) features is thus equivalent to modifying the regularizer such that it penalizes those features less (more). As is common practice, we here use isotropic regularizers that, moreover, penalize all dimensions uniformly. This implies that the kernels have to be normalized in a sensible way in order to represent an “uninformative prior” as to which kernels are useful.

There exist several approaches to kernel normalization, of which we use two in the computational experiments below. They are fundamentally different. The first generalizes the common practice of standardizing features to entire kernels, thereby directly implementing the spirit of the discussion above. In contrast, the second normalization approach carries the rescaling of data points over to the world of kernels. Nevertheless, it can have a beneficial effect on the scaling of kernels, as we argue below.

Multiplicative Normalization. As done in Ong and Zien (2008), we multiplicatively normalize the kernels to have uniform variance of data points in feature space. Formally, we find a positive rescaling $\lambda_m$ of the kernel, such that the rescaled kernel $\tilde{k}_m(\cdot,\cdot) = \lambda_m k_m(\cdot,\cdot)$ and the corresponding feature map $\tilde{\Phi}_m(\cdot) = \sqrt{\lambda_m}\,\Phi_m(\cdot)$ satisfy
$$1 \;\overset{!}{=}\; \frac{1}{n}\sum_{i=1}^n \left\| \tilde{\Phi}_m(x_i) - \tilde{\Phi}_m(\bar{x}) \right\|^2 \;=\; \frac{1}{n}\sum_{i=1}^n \tilde{k}_m(x_i, x_i) \;-\; \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \tilde{k}_m(x_i, x_j)$$
for each $m = 1, \dots, M$, where $\tilde{\Phi}_m(\bar{x}) := \frac{1}{n}\sum_{i=1}^n \tilde{\Phi}_m(x_i)$ is the empirical mean of the data in feature space. The final normalization rule is
$$k(x, \bar{x}) \;\longmapsto\; \frac{k(x, \bar{x})}{\frac{1}{n}\sum_{i=1}^n k(x_i, x_i) - \frac{1}{n^2}\sum_{i,j=1}^n k(x_i, x_j)}. \tag{26}$$
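For concreteness, here is a minimal Python sketch of rule (26) operating on a precomputed kernel matrix; the function name is ours, and the code assumes the full $n \times n$ matrix fits in memory.

```python
import numpy as np

def multiplicative_normalize(K):
    # Divide the kernel matrix by the empirical variance of the data in
    # feature space, cf. Eq. (26):
    #   var = mean_i k(x_i, x_i) - mean_{i,j} k(x_i, x_j)
    n = K.shape[0]
    variance = np.trace(K) / n - K.sum() / n**2
    return K / variance
```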

Spherical Normalization. Frequently, kernels are normalized according to
$$k(x, \bar{x}) \;\longmapsto\; \frac{k(x, \bar{x})}{\sqrt{k(x, x)\,k(\bar{x}, \bar{x})}}. \tag{27}$$

After this operation, $\|\Phi(x)\|^2 = k(x, x) = 1$ holds for each data point $x$; this means that each data point is rescaled to lie on the unit sphere in feature space. Still, this may also have an effect on the scale of the features: in case the kernel is centered (i.e., the average of the data points lies at the origin), the rescaled kernel satisfies the above goal that the points have unit variance (around their mean). Thus spherical normalization may be seen as an approximation to the above multiplicative normalization and may be used as a substitute for it. Note, however, that it changes the data points themselves by eliminating length information; whether this is desired or not depends on the learning task at hand. Finally, note that both normalizations ensure that the optimal value of C is not far from 1.
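A corresponding sketch of rule (27), again on a precomputed kernel matrix and with a function name of our choosing:

```python
import numpy as np

def spherical_normalize(K):
    # Rescale every data point onto the unit sphere in feature space,
    # cf. Eq. (27): k(x, x') / sqrt(k(x, x) * k(x', x')).
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```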


4.4 Relation to Block-Norm Formulation and Limitations of Our Framework

In this section we first show a connection of ℓp-norm MKL to a formulation based on block norms, and then point out a limitation of our framework. To this end, let us recall the primal MKL problem (P) and consider the special case of ℓp-norm MKL given by
$$\inf_{w,b,\theta:\theta\geq 0}\; C\sum_{i=1}^n V\left(\sum_{m=1}^M \langle w_m, \psi_m(x_i)\rangle_{H_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^M \frac{\|w_m\|_{H_m}^2}{\theta_m}, \quad \text{s.t.}\;\; \|\theta\|_p^2 \leq 1. \tag{28}$$

The subsequent proposition shows that Optimization Problem (P) can equivalently be translated into the following mixed-norm formulation,
$$\inf_{w,b}\; \tilde{C}\sum_{i=1}^n V\left(\sum_{m=1}^M \langle w_m, \psi_m(x_i)\rangle_{H_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^M \|w_m\|_{H_m}^q, \tag{29}$$
where $q = \frac{2p}{p+1}$ and $\tilde{C}$ is a constant. For $q = 1$ this formulation has been studied by Bach et al. (2004).

Proposition 7 Let $p > 1$ and let $V$ be a convex loss function. Optimization Problems (28) and (29) are equivalent, i.e., for each $C$ there exists a $\tilde{C} > 0$ such that for each optimal solution $(w^*, b^*, \theta^*)$ of OP (28) using $C$, the pair $(w^*, b^*)$ is also optimal in OP (29) using $\tilde{C}$, and vice versa.

Proof We begin by applying Theorem 1 to rephrase Optimization Problem (P) as
$$\inf_{w,b,\theta:\theta\geq 0}\; C\sum_{i=1}^n V\left(\sum_{m=1}^M \langle w_m, \psi_m(x_i)\rangle_{H_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^M \frac{\|w_m\|_{H_m}^2}{\theta_m} + \mu\|\theta\|_p^2.$$
Setting the partial derivatives w.r.t. θ to zero, we obtain the following equation at optimality:
$$-\frac{\|w_m\|_{H_m}^2}{2\theta_m^2} + \beta\cdot\theta_m^{p-1}\|\theta\|_p^{2-p} = 0, \quad \forall\, m = 1, \dots, M. \tag{30}$$
Hence, Eq. (30) translates into the following optimality condition on $w$ and $\theta$:
$$\theta_m^* = \zeta\,\|w_m^*\|_{H_m}^{\frac{2}{p+1}}, \quad \forall\, m = 1, \dots, M,$$
with a suitable constant ζ. Plugging the above equation into Optimization Problem (P) yields
$$\inf_{w,b}\; C\sum_{i=1}^n V\left(\sum_{m=1}^M \langle w_m, \psi_m(x_i)\rangle_{H_m} + b,\; y_i\right) + \frac{1}{2\zeta}\sum_{m=1}^M \|w_m\|_{H_m}^{\frac{2p}{p+1}}. \tag{31}$$
Defining $q := \frac{2p}{p+1}$ and $\tilde{C} := \zeta C$ results in (29), which was to be shown.

Now let us take a closer look at the parameter range of q. It is easy to see that when we vary p in the real interval [1,∞], then q is limited to the range [1,2]. This raises the question of whether we can derive an efficient wrapper-based optimization strategy for the case q > 2. A framework by Aflalo et al. (2010) covers the case q ≥ 2, although their method aims at hierarchical kernel learning. Note that the case q ≤ 2, and hence ℓp-norm MKL, is not covered by their approach.

We briefly sketch the analysis of Aflalo et al. (2010) and discuss a potential simplification of their algorithm for the special case of ℓ_{q>2} block-norm MKL. We start by noting that one can show that for q ≥ 2, Eq. (29) is equivalent to
$$\sup_{\theta:\theta\geq 0,\;\|\theta\|_p^2\leq 1}\;\inf_{w,b}\; \tilde{C}\sum_{i=1}^n V\left(\sum_{m=1}^M \langle w_m, \psi_m(x_i)\rangle_{H_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^M \theta_m\|w_m\|_{H_m}^2, \tag{32}$$
where $p = \frac{q/2}{q/2 - 1}$. Note that despite its similarity to ℓp-norm MKL, the above problem differs from it significantly for two reasons. Firstly, obvious differences, such as the mixing coefficients θ appearing in the numerator and the consequent maximization w.r.t. θ, render the above problem a min-max problem. Secondly, note that by varying p in the interval [1,∞], the whole range of q in the interval [2,∞] can be obtained, which explains why this method is complementary to ours, where q ranges in [1,2].

Using the hinge loss, Eq. (32) can be partially dualized for fixed θ, resulting in a convex optimization problem (Boyd and Vandenberghe, 2004, p. 76),
$$\max_{\alpha,\theta}\;\; \mathbf{1}^\top\alpha - \frac{1}{2}\,\alpha^\top\left(\sum_{m=1}^M \frac{Q_m}{\theta_m}\right)\alpha \tag{33}$$
$$\text{s.t.}\;\; 0 \leq \alpha \leq C\mathbf{1};\quad y^\top\alpha = 0;\quad \theta \geq 0;\quad \|\theta\|_p^2 \leq 1,$$
where, as in the previous sections, we denote $Q_m = Y K_m Y$ and $Y = \mathrm{diag}(y)$. Aflalo et al. (2010), originally aiming at hierarchical kernel learning, proposed to optimize (33) by a mirror descent algorithm (Beck and Teboulle, 2003). However, for the special case of q > 2 block-norm MKL, which we consider here, a simple block gradient procedure based on an analytical update of θ, similar to the one presented in Section 4.1, is sufficient; a sketch is given below. We omit the derivations, which are analogous to those presented in Section 4.1.
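To illustrate, the following sketch gives such an analytic θ-step. The update formula is our own derivation under the assumption that, for fixed α, maximizing $-\frac{1}{2}\sum_m (\alpha^\top Q_m \alpha)/\theta_m$ over the ℓp ball yields $\theta_m \propto (\alpha^\top Q_m \alpha)^{1/(p+1)}$; the function name is hypothetical.

```python
import numpy as np

def block_norm_theta_update(s, p):
    # s: per-kernel quadratic terms s_m = alpha^T Q_m alpha, shape (M,)
    # For fixed alpha, maximize -0.5 * sum_m s_m / theta_m subject to
    # ||theta||_p^2 <= 1, theta >= 0. Stationarity gives theta_m
    # proportional to s_m^(1/(p+1)); rescale so that ||theta||_p = 1.
    theta = s ** (1.0 / (p + 1))
    return theta / np.sum(theta ** p) ** (1.0 / p)
```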

5. Computational Experiments

In this section we study non-sparse MKL in terms of computational efficiency and predictive accuracy. Throughout all our experiments, both ℓp-norm MKL implementations presented in Sections 4.1 and 4.2 perform comparably. In the case p = 1 we apply the method of Sonnenburg et al. (2006a), as it is recovered as a special case of our cutting plane strategy. We write ℓ∞-norm MKL for a regular SVM with the unweighted-sum kernel $K = \sum_m K_m$.

We first study a toy problem in Section 5.1, where we have full control over the distribution of the relevant information, in order to shed light on the appropriateness of sparse, non-sparse, and ℓ∞-MKL. We then report on real-world problems from the bioinformatics domain, namely protein subcellular localization (Section 5.2), finding transcription start sites of RNA Polymerase II binding genes in genomic DNA sequences (Section 5.3), and reconstructing metabolic gene networks (Section 5.4).


Complementarily, we would like to mention empirical results of other researchers who have been experimenting with non-sparse MKL. Cortes et al. (2009) apply ℓ2-norm MKL to regression tasks on Reuters and various sentiment analysis datasets, and Yu et al. (2009) study ℓ2-norm MKL on two real-world genomic data sets for clinical decision support in cancer diagnosis and disease-relevant gene prioritization, respectively. Yan et al. (2009) apply ℓ2-norm MKL to image and video classification tasks. All of these papers report an improvement of ℓ2-norm MKL over sparse MKL and the unweighted-sum kernel SVM. Nakajima et al. (2009) study ℓp-norm MKL for multi-label image categorization and show an improvement of non-sparse MKL over ℓ1/∞-norm MKL.

5.1 Measuring the Impact of Data Sparsity — Toy Experiment

The goal of this section is to study the relationship between the level of sparsity of the true underlying function to be learnt and the norm p chosen in the model. It is natural to conjecture that the optimal choice of p directly corresponds to the true level of sparsity. Apart from verifying this conjecture, we are also interested in the effects of a suboptimal choice of p. To this end, we constructed several artificial data sets in which we vary the degree of sparsity in the true kernel mixture coefficients. We move in several steps from having all weight focused on a single kernel (the highest level of sparsity) to uniform weights (the least sparse scenario possible). We then study the statistical performance of ℓp-norm MKL for different values of p that cover the entire range [1,∞].

We generate an n-element balanced sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ from two $d = 50$-dimensional isotropic Gaussian distributions with equal covariance matrices $C = I_{d\times d}$. The two Gaussians are placed at opposing means w.r.t. the origin, $\mu_1 = \frac{\rho}{\|\theta\|_2}\theta$ and $\mu_2 = -\frac{\rho}{\|\theta\|_2}\theta$. Thereby θ is a binary vector, i.e., $\theta_i \in \{0,1\}$, encoding the true underlying data sparsity as follows. Zero components $\theta_i = 0$ clearly imply identical means of the two class distributions in the i-th feature set; hence the latter does not carry any discriminating information. In summary, the fraction of zero components, $\nu(\theta) = 1 - \frac{1}{d}\sum_{i=1}^d \theta_i$, is a measure of the feature sparsity of the learning problem.
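As an illustration, a minimal sketch of this sampling scheme (function name and random-number handling are ours):

```python
import numpy as np

def sample_toy_data(n, theta, rho, seed=0):
    # Balanced sample from two isotropic Gaussians with identity
    # covariance and opposing means +/- rho * theta / ||theta||_2.
    rng = np.random.default_rng(seed)
    mu = rho * theta / np.linalg.norm(theta, 2)
    y = np.repeat([1, -1], n // 2)
    X = rng.standard_normal((n, theta.size)) + np.outer(y, mu)
    return X, y

# Example: nu(theta) = 0.6, i.e., 30 of the d = 50 features are noise.
theta = (np.arange(50) >= 30).astype(float)
X, y = sample_toy_data(n=100, theta=theta, rho=1.75)
```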

For several values of ν we generate $m = 250$ data sets $\mathcal{D}_1, \dots, \mathcal{D}_m$, fixing $\rho = 1.75$. Then each feature is input to a linear kernel and the resulting kernel matrices are multiplicatively normalized as described in Section 4.3.2. Hence, the ν