
Non-Sparse Regularization for Multiple Kernel Learning

Marius Kloft∗  mkloft@cs.berkeley.edu

University of California

Computer Science Division

Berkeley, CA 94720-1758, USA

Ulf Brefeld brefeld@yahoo-inc.com

Yahoo! Research

Avinguda Diagonal 177

08018 Barcelona, Spain

Sören Sonnenburg∗  soeren.sonnenburg@tuebingen.mpg.de

Friedrich Miescher Laboratory

Max Planck Society

Spemannstr. 39, 72076 Tübingen, Germany

Alexander Zien zien@lifebiosystems.com

LIFE Biosystems GmbH

Poststraße 34

69115 Heidelberg, Germany

Abstract

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this $\ell_1$-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, such as $\ell_p$-norms with $p > 1$. Empirically, we demonstrate that the interleaved optimization strategies are much faster than the commonly used wrapper approaches. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse, and $\ell_\infty$-norm MKL in various scenarios. Empirical applications of $\ell_p$-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.

Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds

∗. Also at Machine Learning Group, Technische Universität Berlin, Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany.

1. Introduction

Kernels make it possible to decouple machine learning from data representations. Finding an appropriate data representation via a kernel function immediately opens the door to a vast world of powerful machine learning models (e.g., Schölkopf and Smola, 2002) with many efficient and reliable off-the-shelf implementations. This has propelled the dissemination of machine learning techniques to a wide range of diverse application domains.

Finding an appropriate data abstraction—or even engineering the best kernel—for the

problem at hand is not always trivial, though. Starting with cross-validation (Stone, 1974),

which is probably the most prominent approach to general model selection, a great many

approaches to selecting the right kernel(s) have been deployed in the literature.

Kernel target alignment (Cristianini et al., 2002; Cortes et al., 2010b) aims at learning the entries of a kernel matrix by using the outer product of the label vector as the ground truth. Chapelle et al. (2002) and Bousquet and Herrmann (2002) minimize estimates of the generalization error of support vector machines (SVMs) using a gradient descent algorithm over the set of parameters. Ong et al. (2005) study hyperkernels on the space of kernels, and alternative approaches include selecting kernels by DC programming (Argyriou et al., 2008) and semi-infinite programming (Özögür-Akyüz and Weber, 2008; Gehler and Nowozin, 2008). Although finding non-linear kernel mixtures (Gönen and Alpaydin, 2008; Varma and Babu, 2009) generally results in non-convex optimization problems, Cortes et al. (2009b) show that convex relaxations may be obtained for special cases.

However, learning arbitrary kernel combinations is a problem too general to allow for a generally optimal solution—by focusing on a restricted scenario, it is possible to achieve guaranteed optimality. In their seminal work, Lanckriet et al. (2004) consider training an SVM along with optimizing the linear combination of several positive semi-definite matrices, $K = \sum_{m=1}^{M} \theta_m K_m$, subject to the trace constraint $\mathrm{tr}(K) \leq c$ and requiring a valid combined kernel $K \succeq 0$. This spawned the new field of multiple kernel learning (MKL), the automatic combination of several kernel functions. Lanckriet et al. (2004) show that their specific version of the MKL task can be reduced to a convex optimization problem, namely a semi-definite programming (SDP) problem. Though convex, the SDP approach is computationally too expensive for practical applications. Thus much of the subsequent research focuses on devising more efficient optimization procedures.

One conceptual milestone for developing MKL into a tool of practical utility is simply to constrain the mixing coefficients $\theta$ to be non-negative: by obviating the complex constraint $K \succeq 0$, this small restriction allows one to transform the optimization problem into a quadratically constrained program, hence drastically reducing the computational burden.

While the original MKL objective is stated and optimized in dual space, alternative formu-

lations have been studied. For instance, Bach et al. (2004) found a corresponding primal

problem, and Rubinstein (2005) decomposed the MKL problem into a min-max problem

that can be optimized by mirror-prox algorithms (Nemirovski, 2004). The min-max formu-

lation has been independently proposed by Sonnenburg et al. (2005). They use it to recast

MKL training as a semi-inﬁnite linear program. Solving the latter with column generation

(e.g., Nash and Sofer, 1996) amounts to repeatedly training an SVM on a mixture kernel

while iteratively reﬁning the mixture coeﬃcients θ. This immediately lends itself to a con-

venient implementation by a wrapper approach. These wrapper algorithms directly beneﬁt

from eﬃcient SVM optimization routines (cf., e.g., Fan et al., 2005; Joachims, 1999) and are

now commonly deployed in recent MKL solvers (e.g., Rakotomamonjy et al., 2008; Xu et al.,

2009), thereby allowing for large-scale training (Sonnenburg et al., 2005, 2006a). However,

the complete training of several SVMs can still be prohibitive for large data sets. For this


reason, Sonnenburg et al. (2005) also propose to interleave the SILP with the SVM training

which reduces the training time drastically. Alternative optimization schemes include level-

set methods (Xu et al., 2009) and second order approaches (Chapelle and Rakotomamonjy,

2008). Szafranski et al. (2010), Nath et al. (2009), and Bach (2009) study composite and

hierarchical kernel learning approaches. Finally, Zien and Ong (2007) and Ji et al. (2009)

provide extensions for multi-class and multi-label settings, respectively.

Today, there exist two major families of multiple kernel learning models. The first is characterized by Ivanov regularization (Ivanov et al., 2002) over the mixing coefficients (Rakotomamonjy et al., 2007; Zien and Ong, 2007). The second employs Tikhonov regularization (Tikhonov and Arsenin, 1977), where an additional parameter controls the regularization of the mixing coefficients (Varma and Ray, 2007).

All the above mentioned multiple kernel learning formulations promote sparse solutions in terms of the mixing coefficients. The desire for sparse mixtures originates in practical as well as theoretical reasons. First, sparse combinations are easier to interpret. Second, irrelevant (and possibly expensive) kernel functions do not need to be evaluated at testing time. Finally, sparseness also appears handy from a technical point of view, as the additional simplex constraint $\|\theta\|_1 \leq 1$ simplifies derivations and turns the problem into a linearly constrained program. Nevertheless, sparseness is not always beneficial in practice, and sparse MKL is frequently observed to be outperformed by a regular SVM using an unweighted-sum kernel $K = \sum_m K_m$ (Cortes et al., 2008).

Consequently, despite all the substantial progress in the ﬁeld of MKL, there still remains

an unsatisﬁed need for an approach that is really useful for practical applications: a model

that has a good chance of improving the accuracy (over a plain sum kernel) together with

an implementation that matches today’s standards (i.e., that can be trained on 10,000s of

data points in a reasonable time). In addition, since the field has spawned several competing MKL formulations, it seems timely to consolidate the set of models. In this article we argue

that all of this is now achievable.

1.1 Outline of the Presented Achievements

On the theoretical side, we cast multiple kernel learning as a general regularized risk mini-

mization problem for arbitrary convex loss functions, Hilbertian regularizers, and arbitrary

norm-penalties on θ. We ﬁrst show that the above mentioned Tikhonov and Ivanov regu-

larized MKL variants are equivalent in the sense that they yield the same set of hypotheses.

Then we derive a dual representation and show that a variety of methods are special cases of

our objective. Our optimization problem subsumes state-of-the-art approaches to multiple

kernel learning, covering sparse and non-sparse MKL by arbitrary p-norm regularization

($1 \leq p \leq \infty$) on the mixing coefficients as well as the incorporation of prior knowledge

by allowing for non-isotropic regularizers. As we demonstrate, the p-norm regularization

includes both important special cases (sparse 1-norm and plain sum ∞-norm) and oﬀers

the potential to elevate predictive accuracy over both of them.

With regard to the implementation, we introduce an appealing and efficient optimization strategy which is based on an exact, closed-form update in the $\theta$-step, hence rendering expensive semi-infinite and first- or second-order gradient methods unnecessary. By utilizing proven working set optimization for SVMs, $p$-norm MKL can now be trained highly


efficiently for all $p$; in particular, we outpace other current 1-norm MKL implementations. Moreover, our implementation employs kernel caching techniques, which enables training on tens of thousands of data points or thousands of kernels, respectively. In contrast, most

competing MKL software require all kernel matrices to be stored completely in memory,

which restricts these methods to small data sets with limited numbers of kernels. Our im-

plementation is freely available within the SHOGUN machine learning toolbox at

http://www.shogun-toolbox.org/.

Our claims are backed up by experiments on artiﬁcial data and on a couple of real world

data sets representing diverse, relevant, and challenging problems from the application domain of bioinformatics. Experiments on artificial data enable us to investigate the relationship

between properties of the true solution and the optimal choice of kernel mixture regular-

ization. The real world problems include the prediction of the subcellular localization of

proteins, the (transcription) starts of genes, and the function of enzymes. The results

demonstrate (i) that combining kernels is now tractable on large data sets, (ii) that it can

provide cutting edge classiﬁcation accuracy, and (iii) that depending on the task at hand,

diﬀerent kernel mixture regularizations are required for achieving optimal performance.

In Appendix A we present a ﬁrst theoretical analysis of non-sparse MKL. We introduce

a novel $\ell_1$-to-$\ell_p$ conversion technique and use it to derive generalization bounds. Based on

these, we perform a case study to compare a particular sparse with a non-sparse scenario.

A basic version of this work appeared in NIPS 2009 (Kloft et al., 2009a). The present

article additionally oﬀers a more general and complete derivation of the main optimization

problem, exemplary applications thereof, a simple algorithm based on a closed-form solution,

technical details of the implementation, a theoretical analysis, and additional experimental

results. Parts of Appendix A are based on Kloft et al. (2010); the present analysis, however,

extends the previous publication by a novel conversion technique, an illustrative case study,

and an improved presentation.

Since its initial publication in Kloft et al. (2008), Cortes et al. (2009a), and Kloft et al.

(2009a), non-sparse MKL has been subsequently applied, extended, and further analyzed by

several researchers: Varma and Babu (2009) derive a projected gradient-based optimization method for $\ell_2$-norm MKL. Yu et al. (2010) present a more general dual view of $\ell_2$-norm MKL and show advantages of $\ell_2$-norm over an unweighted-sum kernel SVM on six bioinformatics data sets. Cortes et al. (2010a) provide generalization bounds for $\ell_1$- and $\ell_{p\leq 2}$-norm MKL. The analytical optimization method presented in this paper was independently and in parallel discovered by Xu et al. (2010) and has also been studied in Roth and Fischer (2007) and Ying et al. (2009) for $\ell_1$-norm MKL, and in Szafranski et al. (2010) and Nath et al. (2009) for composite kernel learning on small and medium scales.

The remainder is structured as follows. We derive non-sparse MKL in Section 2 and

discuss relations to existing approaches in Section 3. Section 4 introduces the novel opti-

mization strategy and its implementation. We report on our empirical results in Section 5.

Section 6 concludes.

2. Multiple Kernel Learning – A Regularization View

In this section we cast multiple kernel learning into a uniﬁed framework: we present a

regularized loss minimization formulation with additional norm constraints on the kernel


mixing coeﬃcients. We show that it comprises many popular MKL variants currently

discussed in the literature, including seemingly diﬀerent ones.

We derive generalized dual optimization problems without making speciﬁc assumptions

on the norm regularizers or the loss function, besides requiring the latter to be convex. Our formu-

lation covers binary classiﬁcation and regression tasks and can easily be extended to multi-

class classiﬁcation and structural learning settings using appropriate convex loss functions

and joint kernel extensions. Prior knowledge on kernel mixtures and kernel asymmetries

can be incorporated by non-isotropic norm regularizers.

2.1 Preliminaries

We begin with reviewing the classical supervised learning setup. Given a labeled sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\ldots,n}$, where the $x_i$ lie in some input space $\mathcal{X}$ and $y_i \in \mathcal{Y} \subset \mathbb{R}$, the goal is to find a hypothesis $h \in H$ that generalizes well on new and unseen data. Regularized risk minimization returns a minimizer $h^*$,

$$h^* \in \operatorname{argmin}_h \; R_{\mathrm{emp}}(h) + \lambda\,\Omega(h),$$

where $R_{\mathrm{emp}}(h) = \frac{1}{n}\sum_{i=1}^{n} V(h(x_i), y_i)$ is the empirical risk of hypothesis $h$ w.r.t. a convex loss function $V : \mathbb{R}\times\mathcal{Y}\to\mathbb{R}$, $\Omega : H\to\mathbb{R}$ is a regularizer, and $\lambda > 0$ is a trade-off parameter. We consider linear models of the form

$$h_{\tilde{w},b}(x) = \langle \tilde{w}, \psi(x)\rangle + b, \qquad (1)$$

together with a (possibly non-linear) mapping $\psi : \mathcal{X}\to\mathcal{H}$ to a Hilbert space $\mathcal{H}$ (e.g., Schölkopf et al., 1998; Müller et al., 2001) and constrain the regularization to be of the form $\Omega(h) = \frac{1}{2}\|\tilde{w}\|_2^2$, which allows us to kernelize the resulting models and algorithms. We will later make use of kernel functions $k(x, x') = \langle\psi(x), \psi(x')\rangle_{\mathcal{H}}$ to compute inner products in $\mathcal{H}$.

2.2 Regularized Risk Minimization with Multiple Kernels

When learning with multiple kernels, we are given $M$ different feature mappings $\psi_m : \mathcal{X}\to\mathcal{H}_m$, $m = 1, \ldots, M$, each giving rise to a reproducing kernel $k_m$ of $\mathcal{H}_m$. Convex approaches to multiple kernel learning consider linear kernel mixtures $k_\theta = \sum_m \theta_m k_m$, $\theta_m \geq 0$. Compared to Eq. (1), the primal model for learning with multiple kernels is extended to

$$h_{\tilde{w},b,\theta}(x) = \sum_{m=1}^{M} \sqrt{\theta_m}\,\langle \tilde{w}_m, \psi_m(x)\rangle_{\mathcal{H}_m} + b = \langle \tilde{w}, \psi_\theta(x)\rangle_{\mathcal{H}} + b, \qquad (2)$$

where the parameter vector $\tilde{w}$ and the composite feature map $\psi_\theta$ have a block structure $\tilde{w} = (\tilde{w}_1^\top, \ldots, \tilde{w}_M^\top)^\top$ and $\psi_\theta = \sqrt{\theta_1}\,\psi_1 \times \ldots \times \sqrt{\theta_M}\,\psi_M$, respectively.

In learning with multiple kernels we aim at minimizing the loss on the training data w.r.t. the optimal kernel mixture $\sum_{m=1}^{M} \theta_m k_m$ in addition to regularizing $\theta$ to avoid overfitting. Hence, in terms of regularized risk minimization, the optimization problem becomes

$$\inf_{\tilde{w},b,\theta:\theta\geq 0} \; \frac{1}{n}\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \sqrt{\theta_m}\,\langle \tilde{w}_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{\lambda}{2}\sum_{m=1}^{M} \|\tilde{w}_m\|_{\mathcal{H}_m}^2 + \tilde{\mu}\,\tilde{\Omega}[\theta], \qquad (3)$$


for $\tilde{\mu} > 0$. Note that the objective value of Eq. (3) is an upper bound on the training error. Previous approaches to multiple kernel learning employ regularizers of the form $\tilde{\Omega}(\theta) = \|\theta\|_1$ to promote sparse kernel mixtures. By contrast, we propose to use convex regularizers of the form $\tilde{\Omega}(\theta) = \|\theta\|^2$, where $\|\cdot\|$ is an arbitrary norm in $\mathbb{R}^M$, possibly allowing for non-sparse solutions and the incorporation of prior knowledge. The non-convexity arising from the $\sqrt{\theta_m}\,\tilde{w}_m$ product in the loss term of Eq. (3) is not inherent and can be resolved by substituting $w_m \leftarrow \sqrt{\theta_m}\,\tilde{w}_m$. Furthermore, the regularization parameter and the sample size can be decoupled by introducing $\tilde{C} = \frac{1}{n\lambda}$ (and adjusting $\mu \leftarrow \frac{\tilde{\mu}}{\lambda}$), which has favorable scaling properties in practice. We obtain the following convex optimization problem (Boyd and Vandenberghe, 2004), which has also been considered by Varma and Ray (2007) for hinge loss and an $\ell_1$-norm regularizer:

$$\inf_{w,b,\theta:\theta\geq 0} \; \tilde{C}\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m} + \mu\|\theta\|^2, \qquad (4)$$

where we use the convention that $\frac{t}{0} = 0$ if $t = 0$ and $\infty$ otherwise.

An alternative approach has been studied by Rakotomamonjy et al. (2007) and Zien and Ong (2007), again using hinge loss and the $\ell_1$-norm. They upper bound the value of the regularizer, $\|\theta\|_1 \leq 1$, and incorporate the latter as an additional constraint into the optimization problem. For $C > 0$, they arrive at the following problem, which is the primary object of investigation in this paper.

Primal MKL Optimization Problem

$$\inf_{w,b,\theta:\theta\geq 0} \; C\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m} \qquad \text{(P)}$$
$$\text{s.t.} \quad \|\theta\|^2 \leq 1.$$

It is important to note here that, while the Tikhonov regularization in (4) has two regularization parameters ($\tilde{C}$ and $\mu$), the above Ivanov regularization (P) has only one ($C$ only). Our first contribution shows that, despite the additional regularization parameter, both MKL variants are equivalent, in the sense that traversing the regularization paths yields the same binary classification functions.

Theorem 1. Let $\|\cdot\|$ be a norm on $\mathbb{R}^M$ and $V$ a convex loss function. Suppose for the optimal $w^*$ in Optimization Problem (P) it holds that $w^* \neq 0$. Then, for each pair $(\tilde{C}, \mu)$ there exists $C > 0$ such that for each optimal solution $(w, b, \theta)$ of Eq. (4) using $(\tilde{C}, \mu)$, we have that $(w, b, \kappa\theta)$ is also an optimal solution of Optimization Problem (P) using $C$, and vice versa, where $\kappa > 0$ is a multiplicative constant.

For the proof we need Prop. 11, which justifies switching from Ivanov to Tikhonov regularization, and back, if the regularizer is tight. We refer to Appendix B for the proposition and its proof.


Proof of Theorem 1. Let $(\tilde{C}, \mu) > 0$. In order to apply Prop. 11 to (4), we show that condition (37) in Prop. 11 is satisfied, i.e., that the regularizer is tight.

Suppose on the contrary that Optimization Problem (P) yields the same infimum regardless of whether we require

$$\|\theta\|^2 \leq 1, \qquad (5)$$

or not. Then this implies that in the optimal point we have $\sum_{m=1}^{M} \frac{\|w_m^*\|_2^2}{\theta_m^*} = 0$, hence

$$\frac{\|w_m^*\|_2^2}{\theta_m^*} = 0, \quad \forall m = 1, \ldots, M. \qquad (6)$$

Since all norms on $\mathbb{R}^M$ are equivalent (e.g., Rudin, 1991), there exists an $L < \infty$ such that $\|\theta^*\|_\infty \leq L\|\theta^*\|$. In particular, we have $\|\theta^*\|_\infty < \infty$, from which we conclude by (6) that $w_m = 0$ holds for all $m$, which contradicts our assumption.

Hence, Prop. 11 can be applied,¹ which yields that (4) is equivalent to

$$\inf_{w,b,\theta} \; \tilde{C}\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_2^2}{\theta_m}, \qquad \text{s.t.} \quad \|\theta\|^2 \leq \tau,$$

for some $\tau > 0$. Consider the optimal solution $(w^\star, b^\star, \theta^\star)$ corresponding to a given parametrization $(\tilde{C}, \tau)$. For any $\lambda > 0$, the bijective transformation $(\tilde{C}, \tau) \mapsto (\lambda^{-1/2}\tilde{C}, \lambda\tau)$ will yield $(w^\star, b^\star, \lambda^{1/2}\theta^\star)$ as optimal solution. Applying the transformation with $\lambda := 1/\tau$ and setting $C = \tilde{C}\tau^{1/2}$ as well as $\kappa = \tau^{-1/2}$ yields Optimization Problem (P), which was to be shown.

1. Note that after a coordinate transformation, we can assume that $\mathcal{H}$ is finite dimensional (see Schölkopf et al., 1999).

Zien and Ong (2007) also show that the MKL optimization problems by Bach et al. (2004), Sonnenburg et al. (2006a), and their own formulation are equivalent. As a main implication of Theorem 1 and by using the result of Zien and Ong, it follows that the optimization problem of Varma and Ray (2007) lies in the same equivalence class as those of Bach et al. (2004), Sonnenburg et al. (2006a), Rakotomamonjy et al. (2007), and Zien and Ong (2007). In addition, our result shows the coupling between the trade-off parameter $C$ and the regularization parameter $\mu$ in Eq. (4): tweaking one also changes the other and vice versa. Theorem 1 implies that optimizing $C$ in Optimization Problem (P) implicitly searches the regularization path for the parameter $\mu$ of Eq. (4). In the remainder, we will therefore focus on the formulation in Optimization Problem (P), as a single parameter is preferable in terms of model selection.

2.3 MKL in Dual Space

In this section we study the generalized MKL approach of the previous section in the dual space. Let us begin with rewriting Optimization Problem (P) by expanding the decision values into slack variables as follows:

$$\inf_{w,b,t,\theta:\theta\geq 0} \; C\sum_{i=1}^{n} V(t_i, y_i) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m} \qquad (7)$$
$$\text{s.t.} \quad \forall i: \; \sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b = t_i; \qquad \|\theta\|^2 \leq 1,$$

where $\|\cdot\|$ is an arbitrary norm in $\mathbb{R}^M$ and $\|\cdot\|_{\mathcal{H}_m}$ denotes the Hilbertian norm of $\mathcal{H}_m$. Applying Lagrange's theorem re-incorporates the constraints into the objective by introducing Lagrangian multipliers $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}_+$.² The Lagrangian saddle point problem is then given by

$$\sup_{\alpha,\,\beta:\beta\geq 0} \; \inf_{w,b,t,\theta\geq 0} \; C\sum_{i=1}^{n} V(t_i, y_i) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m} \qquad (8)$$
$$\qquad\qquad - \sum_{i=1}^{n} \alpha_i\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b - t_i\right) + \beta\!\left(\frac{1}{2}\|\theta\|^2 - \frac{1}{2}\right).$$

Denoting the Lagrangian by $\mathcal{L}$ and setting its first partial derivatives with respect to $w$ and $b$ to 0 reveals the optimality conditions

$$\mathbf{1}^\top\alpha = 0; \qquad (9a)$$
$$w_m = \theta_m \sum_{i=1}^{n} \alpha_i\,\psi_m(x_i), \quad \forall m = 1, \ldots, M. \qquad (9b)$$

Resubstituting the above equations yields

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0,\;\beta:\beta\geq 0} \; \inf_{t,\,\theta\geq 0} \; C\sum_{i=1}^{n}\left( V(t_i, y_i) + \frac{\alpha_i}{C}\,t_i \right) - \frac{1}{2}\sum_{m=1}^{M} \theta_m\,\alpha^\top K_m\alpha + \beta\!\left(\frac{1}{2}\|\theta\|^2 - \frac{1}{2}\right),$$

which can also be written in terms of unconstrained $\theta$, because the supremum with respect to $\theta$ is attained for non-negative $\theta \geq 0$. We arrive at

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0,\;\beta\geq 0} \; -C\sum_{i=1}^{n} \sup_{t_i}\left( -\frac{\alpha_i}{C}\,t_i - V(t_i, y_i) \right) - \beta\,\sup_{\theta}\left( \frac{1}{2\beta}\sum_{m=1}^{M} \theta_m\,\alpha^\top K_m\alpha - \frac{1}{2}\|\theta\|^2 \right) - \frac{1}{2}\beta.$$

As a consequence, we now may express the Lagrangian as³

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0,\;\beta\geq 0} \; -C\sum_{i=1}^{n} V^*\!\left(-\frac{\alpha_i}{C},\, y_i\right) - \frac{1}{2\beta}\left\|\left(\frac{1}{2}\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_*^2 - \frac{1}{2}\beta, \qquad (10)$$

where $h^*(x) = \sup_{u} x^\top u - h(u)$ denotes the Fenchel-Legendre conjugate of a function $h$, and $\|\cdot\|_*$ denotes the dual norm, i.e., the norm defined via the identity $\frac{1}{2}\|\cdot\|_*^2 := \left(\frac{1}{2}\|\cdot\|^2\right)^*$.

2. Note that, in contrast to the standard SVM dual derivation, here $\alpha$ is a variable that ranges over all of $\mathbb{R}^n$, as it incorporates an equality constraint.
3. We employ the notation $s = (s_1, \ldots, s_M)^\top = (s_m)_{m=1}^{M}$ for $s \in \mathbb{R}^M$.

In the following, we call $V^*$ the dual loss. Eq. (10) now has to be maximized with respect to the dual variables $\alpha, \beta$, subject to $\mathbf{1}^\top\alpha = 0$ and $\beta \geq 0$. Let us for a moment ignore the non-negativity constraint on $\beta$ and solve $\partial\mathcal{L}/\partial\beta = 0$ for the unbounded $\beta$. Setting the partial derivative to zero allows us to express the optimal $\beta$ as

$$\beta = \left\|\left(\tfrac{1}{2}\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_*. \qquad (11)$$

Obviously, at optimality, we always have $\beta \geq 0$. We thus discard the corresponding constraint from the optimization problem; plugging Eq. (11) into Eq. (10) results in the following dual optimization problem, which now solely depends on $\alpha$:

Dual MKL Optimization Problem

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0} \; -C\sum_{i=1}^{n} V^*\!\left(-\frac{\alpha_i}{C},\, y_i\right) - \frac{1}{2}\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_*. \qquad \text{(D)}$$

The above dual generalizes multiple kernel learning to arbitrary convex loss functions

and norms.⁴ Note that if the loss function is continuous (e.g., hinge loss), the supremum is also a maximum. The threshold $b$ can be recovered from the solution by applying the KKT

conditions.

The above dual can be characterized as follows. We start by noting that the expression in

Optimization Problem (D) is a composition of two terms, ﬁrst, the left hand side term, which

depends on the conjugate loss function V∗, and, second, the right hand side term which

depends on the conjugate norm. The right hand side can be interpreted as a regularizer on

the quadratic terms that, according to the chosen norm, smoothens the solutions. Hence

we have a decomposition of the dual into a loss term (in terms of the dual loss) and a

regularizer (in terms of the dual norm). For a speciﬁc choice of a pair (V, k · k) we can

immediately recover the corresponding dual by computing the pair of conjugates (V∗,k·k∗)

(for a comprehensive list of dual losses see Rifkin and Lippert, 2007, Table 3). In the next

section, this is illustrated by means of well-known loss functions and regularizers.

At this point we would like to highlight some properties of Optimization Problem (D)

that arise due to our dualization technique. While approaches that first apply the representer theorem and then optimize in the primal, such as Chapelle (2006), can also employ general loss functions, the resulting loss terms depend on all optimization variables.

By contrast, in our formulation the dual loss terms are of a much simpler structure and they

only depend on a single optimization variable αi. A similar dualization technique yielding

singly-valued dual loss terms is presented in Rifkin and Lippert (2007); it is based on Fenchel

duality and limited to strictly positive deﬁnite kernel matrices. Our technique, which uses

Lagrangian duality, extends the latter by allowing for positive semi-deﬁnite kernel matrices.

4. We can even employ non-convex losses and still the dual will be a convex problem; however, it might

suﬀer from a duality gap.


3. Instantiations of the Model

In this section we show that existing MKL-based learners are subsumed by the generalized

formulation in Optimization Problem (D).

3.1 Support Vector Machines with Unweighted-Sum Kernels

First we note that the support vector machine with an unweighted-sum kernel can be recovered as a special case of our model. To see this, we consider the regularized risk minimization problem using the hinge loss function $V(t, y) = \max(0, 1 - ty)$ and the regularizer $\|\theta\|_\infty$. We then can obtain the corresponding dual in terms of Fenchel-Legendre conjugate functions as follows.

We first note that the dual loss of the hinge loss is $V^*(t, y) = \frac{t}{y}$ if $-1 \leq \frac{t}{y} \leq 0$ and $\infty$ elsewise (Rifkin and Lippert, 2007, Table 3). Hence, for each $i$ the term $V^*\!\left(-\frac{\alpha_i}{C}, y_i\right)$ of the generalized dual, i.e., Optimization Problem (D), translates to $-\frac{\alpha_i}{C y_i}$, provided that $0 \leq \frac{\alpha_i}{y_i} \leq C$. Employing a variable substitution of the form $\alpha_i^{\mathrm{new}} = \frac{\alpha_i}{y_i}$, Optimization Problem (D) translates to

$$\max_{\alpha} \; \mathbf{1}^\top\alpha - \frac{1}{2}\left\|\left(\alpha^\top Y K_m Y\alpha\right)_{m=1}^{M}\right\|_*, \quad \text{s.t.} \; y^\top\alpha = 0 \;\text{ and }\; 0 \leq \alpha \leq C\mathbf{1}, \qquad (12)$$

where we denote $Y = \mathrm{diag}(y)$. The primal $\ell_\infty$-norm penalty $\|\theta\|_\infty$ is dual to $\|\theta\|_1$, hence, via the identity $\|\cdot\|_* = \|\cdot\|_1$, the right hand side of the last equation translates to $\sum_{m=1}^{M} \alpha^\top Y K_m Y\alpha$. Combined with (12) this leads to the dual

$$\sup_{\alpha} \; \mathbf{1}^\top\alpha - \frac{1}{2}\sum_{m=1}^{M} \alpha^\top Y K_m Y\alpha, \quad \text{s.t.} \; y^\top\alpha = 0 \;\text{ and }\; 0 \leq \alpha \leq C\mathbf{1},$$

which is precisely an SVM with an unweighted-sum kernel.
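As a quick numerical illustration of this equivalence (ours, not part of the original derivation), summing precomputed kernel matrices and training a standard SVM on the sum corresponds to $\ell_\infty$-norm MKL with uniform weights. The following minimal sketch uses scikit-learn's SVC with a precomputed kernel; the toy data and the RBF bandwidths are arbitrary placeholder choices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy data (placeholder); labels in {-1, +1}
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = np.sign(X[:, 0] + 0.5 * rng.randn(200))

# M base kernels: RBF kernels with different bandwidths (assumed choice)
gammas = [0.01, 0.1, 1.0]
kernels = [rbf_kernel(X, X, gamma=g) for g in gammas]

# Unweighted-sum kernel K = sum_m K_m (i.e., theta_m = 1 for all m)
K_sum = sum(kernels)

# A standard SVM on the sum kernel recovers l_infinity-norm MKL
svm = SVC(C=1.0, kernel="precomputed")
svm.fit(K_sum, y)
print("training accuracy:", svm.score(K_sum, y))
```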

3.2 QCQP MKL of Lanckriet et al. (2004)

A common approach in multiple kernel learning is to employ regularizers of the form

$$\Omega(\theta) = \|\theta\|_1. \qquad (13)$$

This so-called $\ell_1$-norm regularizer is a specific instance of sparsity-inducing regularizers. The obtained kernel mixtures usually have a considerably large fraction of zero entries, and hence equip the MKL problem with the appeal of interpretable solutions. Sparse MKL is a special case of our framework; to see this, note that the conjugate of (13) is $\|\cdot\|_\infty$. Recalling the definition of an $\ell_p$-norm, the right hand side of Optimization Problem (D) translates to $\max_{m\in\{1,\ldots,M\}} \alpha^\top Y K_m Y\alpha$. The maximum can subsequently be expanded into a slack variable $\xi$, resulting in

$$\sup_{\alpha,\xi} \; \mathbf{1}^\top\alpha - \xi$$
$$\text{s.t.} \quad \forall m: \; \frac{1}{2}\alpha^\top Y K_m Y\alpha \leq \xi; \qquad y^\top\alpha = 0; \qquad 0 \leq \alpha \leq C\mathbf{1},$$

which is the original QCQP formulation of MKL, first given by Lanckriet et al. (2004).


3.3 $\ell_p$-Norm MKL

Our MKL formulation also allows for robust kernel mixtures by employing an $\ell_p$-norm constraint with $p > 1$, rather than an $\ell_1$-norm constraint, on the mixing coefficients (Kloft et al., 2009a). The following identity holds:

$$\left(\tfrac{1}{2}\|\cdot\|_p^2\right)^* = \tfrac{1}{2}\|\cdot\|_{p^*}^2,$$

where $p^* := \frac{p}{p-1}$ is the conjugate exponent of $p$, and we obtain for the dual norm of the $\ell_p$-norm: $\|\cdot\|_* = \|\cdot\|_{p^*}$. This leads to the dual problem

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0} \; -C\sum_{i=1}^{n} V^*\!\left(-\frac{\alpha_i}{C},\, y_i\right) - \frac{1}{2}\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_{p^*}.$$

In the special case of hinge loss minimization, we obtain the optimization problem

$$\sup_{\alpha} \; \mathbf{1}^\top\alpha - \frac{1}{2}\left\|\left(\alpha^\top Y K_m Y\alpha\right)_{m=1}^{M}\right\|_{p^*}, \quad \text{s.t.} \; y^\top\alpha = 0 \;\text{ and }\; 0 \leq \alpha \leq C\mathbf{1}.$$
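To make the role of the conjugate exponent concrete, the following small sketch (ours, for illustration; it assumes $p > 1$) evaluates the $\ell_{p^*}$-norm term of the hinge-loss dual above for a given $\alpha$; variable names are placeholders.

```python
import numpy as np

def lp_mkl_dual_objective(alpha, y, kernels, p):
    """Hinge-loss l_p-norm MKL dual value:
    1'alpha - 0.5 * || (alpha' Y K_m Y alpha)_m ||_{p*},  with p* = p/(p-1)."""
    p_star = p / (p - 1.0)        # conjugate exponent of p
    Y_alpha = y * alpha           # equals Y @ alpha for Y = diag(y)
    quad = np.array([Y_alpha @ K @ Y_alpha for K in kernels])
    return alpha.sum() - 0.5 * np.linalg.norm(quad, ord=p_star)
```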

3.4 A Smooth Variant of Group Lasso

Yuan and Lin (2006) studied the following optimization problem for the special case $\mathcal{H}_m = \mathbb{R}^{d_m}$ and $\psi_m = \mathrm{id}_{\mathbb{R}^{d_m}}$, also known as group lasso,

$$\min_{w} \; \frac{C}{2}\sum_{i=1}^{n}\left( y_i - \sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} \right)^2 + \frac{1}{2}\sum_{m=1}^{M} \|w_m\|_{\mathcal{H}_m}. \qquad (14)$$

The above problem has been solved by active set methods in the primal (Roth and Fischer, 2008). We sketch an alternative approach based on dual optimization. First, we note that Eq. (14) can be equivalently expressed as (Micchelli and Pontil, 2005, Lemma 26)

$$\inf_{w,\theta:\theta\geq 0} \; \frac{C}{2}\sum_{i=1}^{n}\left( y_i - \sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} \right)^2 + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m}, \quad \text{s.t.} \; \|\theta\|_1^2 \leq 1.$$

The dual of $V(t, y) = \frac{1}{2}(y - t)^2$ is $V^*(t, y) = \frac{1}{2}t^2 + ty$, and thus the corresponding group lasso dual can be written as

$$\max_{\alpha} \; y^\top\alpha - \frac{1}{2C}\|\alpha\|_2^2 - \frac{1}{2}\left\|\left(\alpha^\top Y K_m Y\alpha\right)_{m=1}^{M}\right\|_\infty, \qquad (15)$$

which can be expanded into the following QCQP:

$$\sup_{\alpha,\xi} \; y^\top\alpha - \frac{1}{2C}\|\alpha\|_2^2 - \xi \qquad (16)$$
$$\text{s.t.} \quad \forall m: \; \frac{1}{2}\alpha^\top Y K_m Y\alpha \leq \xi.$$

For small $n$, the latter formulation can be handled efficiently by QCQP solvers. However, the quadratic constraints caused by the non-smooth $\ell_\infty$-norm in the objective are still computationally too demanding. As a remedy, we propose the following unconstrained variant based on $\ell_p$-norms ($1 < p < \infty$), given by

$$\max_{\alpha} \; y^\top\alpha - \frac{1}{2C}\|\alpha\|_2^2 - \frac{1}{2}\left\|\left(\alpha^\top Y K_m Y\alpha\right)_{m=1}^{M}\right\|_{p^*}.$$

It is straightforward to verify that the above objective function is differentiable in any $\alpha \in \mathbb{R}^n$ (in particular, notice that the $\ell_p$-norm function is differentiable for $1 < p < \infty$), and hence the above optimization problem can be solved very efficiently by, for example, limited memory quasi-Newton descent methods (Liu and Nocedal, 1989).
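The following sketch (ours, not from the paper) illustrates how the smooth $\ell_p$ variant above could be minimized with a limited-memory quasi-Newton method; it relies on SciPy's L-BFGS-B solver and numerical gradients for brevity, whereas a careful implementation would supply the analytic gradient. Data, kernels, and parameter values are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def neg_dual(alpha, y, kernels, C, p_star):
    """Negative of the smooth group-lasso dual:
    -( y'alpha - ||alpha||^2/(2C) - 0.5*||(alpha' Y K_m Y alpha)_m||_{p*} )."""
    Ya = y * alpha
    quad = np.array([Ya @ K @ Ya for K in kernels])
    return -(y @ alpha - alpha @ alpha / (2 * C)
             - 0.5 * np.linalg.norm(quad, ord=p_star))

# Placeholder problem: n points, M random PSD kernel matrices
rng = np.random.RandomState(0)
n, M = 100, 3
y = rng.randn(n)                        # regression targets
B = [rng.randn(n, n) for _ in range(M)]
kernels = [b @ b.T / n for b in B]

p = 2.0
p_star = p / (p - 1.0)
res = minimize(neg_dual, x0=np.zeros(n), args=(y, kernels, 1.0, p_star),
               method="L-BFGS-B")
print("dual objective:", -res.fun)
```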

3.5 Density Level-Set Estimation

Density level-set estimators are frequently used for anomaly/novelty detection tasks (Markou and Singh, 2003a,b). Kernel approaches, such as one-class SVMs (Schölkopf et al., 2001) and Support Vector Domain Descriptions (Tax and Duin, 1999), can be cast into our MKL framework by employing loss functions of the form $V(t) = \max(0, 1 - t)$. This gives rise to the primal

$$\inf_{w,\theta:\theta\geq 0} \; C\sum_{i=1}^{n} \max\!\left(0,\; 1 - \sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m}\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m}, \quad \text{s.t.} \; \|\theta\|^2 \leq 1.$$

Noting that the dual loss is $V^*(t) = t$ if $-1 \leq t \leq 0$ and $\infty$ elsewise, we obtain the following generalized dual:

$$\sup_{\alpha} \; \mathbf{1}^\top\alpha - \frac{1}{2}\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_{p^*}, \quad \text{s.t.} \; 0 \leq \alpha \leq C\mathbf{1},$$

which has been studied by Sonnenburg et al. (2006a) and Rakotomamonjy et al. (2008) for the $\ell_1$-norm, and by Kloft et al. (2009b) for $\ell_p$-norms.

3.6 Non-Isotropic Norms

In practice, it is often desirable for an expert to incorporate prior knowledge about the problem domain. For instance, an expert could provide estimates of the interactions of kernels $\{K_1, \ldots, K_M\}$ in the form of an $M \times M$ matrix $E$. Alternatively, $E$ could be obtained by computing pairwise kernel alignments $E_{ij} = \frac{\langle K_i, K_j\rangle}{\|K_i\|\,\|K_j\|}$ given a dot product on the space of kernels such as the Frobenius dot product (Ong et al., 2005). In a third scenario, $E$ could be a diagonal matrix encoding the a priori importance of kernels—it might be known from pilot studies that a subset of the employed kernels is inferior to the remaining ones.

All those scenarios can be easily handled within our framework by considering non-isotropic regularizers of the form⁵

$$\|\theta\|_{E^{-1}} = \sqrt{\theta^\top E^{-1}\theta} \quad \text{with} \quad E \succ 0,$$

5. This idea is inspired by the Mahalanobis distance (Mahalanobis, 1936).

where $E^{-1}$ is the matrix inverse of $E$. The dual norm is again defined via $\frac{1}{2}\|\cdot\|_*^2 := \left(\frac{1}{2}\|\cdot\|_{E^{-1}}^2\right)^*$, and the following easy-to-verify identity,

$$\left(\frac{1}{2}\|\cdot\|_{E^{-1}}^2\right)^* = \frac{1}{2}\|\cdot\|_{E}^2,$$

leads to the dual

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0} \; -C\sum_{i=1}^{n} V^*\!\left(-\frac{\alpha_i}{C},\, y_i\right) - \frac{1}{2}\left\|\left(\alpha^\top K_m\alpha\right)_{m=1}^{M}\right\|_{E},$$

which is the desired non-isotropic MKL problem.
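As an illustration of how such a matrix $E$ might be assembled in practice (our sketch, not prescribed by the paper), the Frobenius-based pairwise kernel alignments mentioned above can be computed directly from the Gram matrices:

```python
import numpy as np

def alignment_matrix(kernels):
    """Pairwise kernel alignments E_ij = <K_i, K_j>_F / (||K_i||_F ||K_j||_F),
    one possible choice for the matrix E in the non-isotropic regularizer."""
    M = len(kernels)
    E = np.eye(M)
    for i in range(M):
        for j in range(i + 1, M):
            num = np.sum(kernels[i] * kernels[j])        # Frobenius inner product
            den = np.linalg.norm(kernels[i]) * np.linalg.norm(kernels[j])
            E[i, j] = E[j, i] = num / den
    return E
```

Whether the resulting $E$ is well-conditioned enough to invert in the regularizer $\|\theta\|_{E^{-1}}$ should of course be checked for the kernels at hand.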

4. Optimization Strategies

The dual as given in Optimization Problem (D) does not lend itself to eﬃcient large-scale

optimization in a straight-forward fashion, for instance by direct application of standard

approaches like gradient descent. Instead, it is beneﬁcial to exploit the structure of the

MKL cost function by alternating between optimizing w.r.t. the mixings $\theta$ and w.r.t. the remaining variables. Most recent MKL solvers (e.g., Rakotomamonjy et al., 2008; Xu et al.,

2009; Nath et al., 2009) do so by setting up a two-layer optimization procedure: a master

problem, which is parameterized only by θ, is solved to determine the kernel mixture; to

solve this master problem, repeatedly a slave problem is solved which amounts to train-

ing a standard SVM on a mixture kernel. Importantly, for the slave problem, the mixture

coeﬃcients are ﬁxed, such that conventional, eﬃcient SVM optimizers can be recycled. Con-

sequently these two-layer procedures are commonly implemented as wrapper approaches.

Albeit appearing advantageous, wrapper methods suﬀer from two shortcomings: (i) Due to

kernel cache limitations, the kernel matrices have to be pre-computed and stored or many

kernel computations have to be carried out repeatedly, inducing heavy wastage of either

memory or time. (ii) The slave problem is always optimized to the end (and many con-

vergence proofs seem to require this), although most of the computational time is spent on non-optimal mixtures. Certainly, suboptimal slave solutions would already suffice to improve a far-from-optimal $\theta$ in the master problem.

Due to these problems, MKL is prohibitive when learning with a multitude of kernels

and on large-scale data sets as commonly encountered in many data-intense real world

applications such as bioinformatics, web mining, databases, and computer security. The

optimization approach presented in this paper decomposes the MKL problem into smaller

subproblems (Platt, 1999; Joachims, 1999; Fan et al., 2005) by establishing a wrapper-like

scheme within the decomposition algorithm.

Our algorithm is embedded into the large-scale framework of Sonnenburg et al. (2006a)

and extends it to the optimization of non-sparse kernel mixtures induced by an $\ell_p$-norm penalty. Our strategy alternates between minimizing the primal problem (7) w.r.t. $\theta$ via a simple analytical update formula and incompletely optimizing w.r.t. all other variables, which, however, is performed in terms of the dual variables $\alpha$. Optimization w.r.t. $\alpha$ is performed by chunking optimizations with minor iterations. Convergence of our algorithm

is proven under typical technical regularity assumptions.


4.1 A Simple Wrapper Approach Based on an Analytical Update

We ﬁrst present an easy-to-implement wrapper version of our optimization approach to

multiple kernel learning. The interleaved decomposition algorithm is deferred to the next

section. To derive the new algorithm, we ﬁrst revisit the primal problem, i.e.

$$\inf_{w,b,\theta:\theta\geq 0} \; C\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m}, \quad \text{s.t.} \; \|\theta\|^2 \leq 1. \qquad \text{(P)}$$

In order to obtain an efficient optimization strategy, we divide the variables in the above OP into two groups, $(w, b)$ on the one hand and $\theta$ on the other. In the following we will derive an algorithm which alternatingly operates on those two groups via a block coordinate descent algorithm, also known as the non-linear block Gauss-Seidel method. Thereby the optimization w.r.t. $\theta$ will be carried out analytically and the $(w, b)$-step will be computed in the dual, if needed.

The basic idea of our first approach is that for a given, fixed set of primal variables $(w, b)$, the optimal $\theta$ in the primal problem (P) can be calculated analytically. In the subsequent derivations we employ non-sparse norms of the form $\|\theta\|_p = \left(\sum_{m=1}^{M} \theta_m^p\right)^{1/p}$, $1 < p < \infty$.⁶ The following proposition gives an analytic update formula for $\theta$ given fixed remaining variables $(w, b)$ and will become the core of our proposed algorithm.

Proposition 2. Let $V$ be a convex loss function and let $p > 1$. Given fixed (possibly suboptimal) $w \neq 0$ and $b$, the minimal $\theta$ in Optimization Problem (P) is attained for

$$\theta_m = \frac{\|w_m\|_{\mathcal{H}_m}^{\frac{2}{p+1}}}{\left(\sum_{m'=1}^{M} \|w_{m'}\|_{\mathcal{H}_{m'}}^{\frac{2p}{p+1}}\right)^{1/p}}, \quad \forall m = 1, \ldots, M. \qquad (17)$$

Proof.⁷ We start the derivation by equivalently translating Optimization Problem (P) via Theorem 1 into

$$\inf_{w,b,\theta:\theta\geq 0} \; \tilde{C}\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m} + \frac{\mu}{2}\|\theta\|_p^2, \qquad (18)$$

with $\mu > 0$. Suppose we are given fixed $(w, b)$; then setting the partial derivatives of the above objective w.r.t. $\theta$ to zero yields the following condition on the optimality of $\theta$:

$$-\frac{\|w_m\|_{\mathcal{H}_m}^2}{2\theta_m^2} + \mu\cdot\frac{\partial\,\frac{1}{2}\|\theta\|_p^2}{\partial\theta_m} = 0, \quad \forall m = 1, \ldots, M. \qquad (19)$$

The first derivative of the $\ell_p$-norm with respect to the mixing coefficients can be expressed as

$$\frac{\partial\,\frac{1}{2}\|\theta\|_p^2}{\partial\theta_m} = \theta_m^{p-1}\,\|\theta\|_p^{2-p},$$

and hence Eq. (19) translates into the following optimality condition:

$$\exists\zeta \;\; \forall m = 1, \ldots, M: \quad \theta_m = \zeta\,\|w_m\|_{\mathcal{H}_m}^{\frac{2}{p+1}}. \qquad (20)$$

Because $w \neq 0$, using the same argument as in the proof of Theorem 1, the constraint $\|\theta\|_p^2 \leq 1$ in (18) is at the upper bound, i.e., $\|\theta\|_p = 1$ holds for an optimal $\theta$. Inserting (20) into the latter equation leads to $\zeta = \left(\sum_{m=1}^{M} \|w_m\|_{\mathcal{H}_m}^{2p/(p+1)}\right)^{-1/p}$. Resubstitution into (20) yields the claimed formula (17).

6. While the reasoning also holds for weighted $\ell_p$-norms, the extension to more general norms, such as the ones described in Section 3.6, is left for future work.
7. We remark that a more general result can be obtained by an alternative proof using Hölder's inequality (see Lemma 26 in Micchelli and Pontil, 2005).

Second, we consider how to optimize Optimization Problem (P) w.r.t. the remaining variables $(w, b)$ for a given set of mixing coefficients $\theta$. Since optimization often is considerably easier in the dual space, we fix $\theta$ and build the partial Lagrangian of Optimization Problem (P) w.r.t. all other primal variables $w$, $b$. The resulting dual problem is of the form (detailed derivations omitted)

$$\sup_{\alpha:\,\mathbf{1}^\top\alpha=0} \; -C\sum_{i=1}^{n} V^*\!\left(-\frac{\alpha_i}{C},\, y_i\right) - \frac{1}{2}\sum_{m=1}^{M} \theta_m\,\alpha^\top K_m\alpha, \qquad (21)$$

and the KKT conditions yield $w_m = \theta_m\sum_{i=1}^{n} \alpha_i\,\psi_m(x_i)$ in the optimal point, hence

$$\|w_m\|^2 = \theta_m^2\,\alpha^\top K_m\alpha, \quad \forall m = 1, \ldots, M. \qquad (22)$$

We now have all ingredients (i.e., Eqs. (17), (21)–(22)) to formulate a simple macro-wrapper algorithm for $\ell_p$-norm MKL training:

Algorithm 1 Simple $\ell_{p>1}$-norm MKL wrapper-based training algorithm. The analytical updates of $\theta$ and the SVM computations are optimized alternatingly.

1: input: feasible $\alpha$ and $\theta$
2: while optimality conditions are not satisfied do
3:   Compute $\alpha$ according to Eq. (21) (e.g., SVM)
4:   Compute $\|w_m\|^2$ for all $m = 1, \ldots, M$ according to Eq. (22)
5:   Update $\theta$ according to Eq. (17)
6: end while

The above algorithm alternatingly solves a convex risk minimization machine (e.g., an SVM) w.r.t. the current mixture $\theta$ (Eq. (21)) and subsequently computes the analytical update according to Eqs. (17) and (22). It can, for example, be stopped based on changes of the objective function or of the duality gap within subsequent iterations.
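To make the wrapper concrete, here is a minimal sketch of Algorithm 1 (ours, for illustration) for the hinge loss, using scikit-learn's SVC as the inner solver on the weighted sum kernel; the stopping rule, data layout, and kernels are placeholder choices and not the SHOGUN implementation.

```python
import numpy as np
from sklearn.svm import SVC

def lp_mkl_wrapper(kernels, y, C=1.0, p=2.0, n_iter=50, tol=1e-5):
    """Simple l_p-norm MKL wrapper (Algorithm 1 sketch, hinge loss).
    kernels: list of M precomputed (n x n) Gram matrices, y: labels in {-1,+1}."""
    M = len(kernels)
    theta = np.full(M, (1.0 / M) ** (1.0 / p))   # feasible start, ||theta||_p = 1
    for _ in range(n_iter):
        # alpha-step: SVM on the current mixture kernel (Eq. 21)
        K = sum(t * Km for t, Km in zip(theta, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        # recover signed alpha on support vectors, compute ||w_m||^2 (Eq. 22)
        sv, a = svm.support_, svm.dual_coef_.ravel()   # a already carries the labels
        w_norm2 = np.array([theta[m] ** 2 * a @ kernels[m][np.ix_(sv, sv)] @ a
                            for m in range(M)])
        # analytical theta-step (Eq. 17)
        theta_new = w_norm2 ** (1.0 / (p + 1))
        theta_new /= np.sum(w_norm2 ** (p / (p + 1))) ** (1.0 / p)
        if np.max(np.abs(theta_new - theta)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta, svm
```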

4.2 Towards Large-Scale MKL—Interleaving SVM and MKL Optimization

However, a disadvantage of the above wrapper approach is still that it deploys a full-blown kernel matrix. We thus propose to interleave the SVM optimization of SVMlight with the $\theta$- and $\alpha$-steps at training time. We have implemented this so-called interleaved algorithm in SHOGUN for the hinge loss, thereby promoting sparse solutions in $\alpha$. This allows us to operate solely on a small number of active variables.⁸

8. In practice, it turns out that the kernel matrix of active variables is typically of size about 40 × 40, even when we deal with tens of thousands of examples.

The resulting interleaved optimization


method is shown in Algorithm 2. Lines 3–5 are standard in chunking-based SVM solvers and are carried out by SVMlight (note that $Q$ is chosen as described in Joachims (1999)). Lines 6–7 compute SVM objective values. Finally, the analytical $\theta$-step is carried out in Line 9. The algorithm terminates if the maximal KKT violation (cf. Joachims, 1999) falls below a predetermined precision $\varepsilon$ and if the normalized maximal constraint violation $\left|1 - \frac{\omega}{\omega^{\mathrm{old}}}\right| < \varepsilon_{\mathrm{mkl}}$ holds for the MKL step, where $\omega$ denotes the MKL objective function value (Line 8).

Algorithm 2 $\ell_p$-Norm MKL chunking-based training algorithm via analytical update. Kernel weighting $\theta$ and (signed) SVM $\alpha$ are optimized interleavingly. The accuracy parameter $\varepsilon$ and the subproblem size $Q$ are assumed to be given to the algorithm.

1: Initialize: $g_{m,i} = \hat{g}_i = \alpha_i = 0$, $\forall i = 1, \ldots, n$; $L = S = -\infty$; $\theta_m = \sqrt[p]{1/M}$, $\forall m = 1, \ldots, M$
2: iterate
3:   Select $Q$ variables $\alpha_{i_1}, \ldots, \alpha_{i_Q}$ based on the gradient $\hat{g}$ of (21) w.r.t. $\alpha$
4:   Store $\alpha^{\mathrm{old}} = \alpha$ and then update $\alpha$ according to (21) with respect to the selected variables
5:   Update gradient $g_{m,i} \leftarrow g_{m,i} + \sum_{q=1}^{Q} (\alpha_{i_q} - \alpha^{\mathrm{old}}_{i_q})\,k_m(x_{i_q}, x_i)$, $\forall m = 1, \ldots, M$, $i = 1, \ldots, n$
6:   Compute the quadratic terms $S_m = \frac{1}{2}\sum_i g_{m,i}\alpha_i$ and $q_m = 2\theta_m^2 S_m$, $\forall m = 1, \ldots, M$
7:   $L^{\mathrm{old}} = L$, $L = \sum_i y_i\alpha_i$, $S^{\mathrm{old}} = S$, $S = \sum_m \theta_m S_m$
8:   if $\left|1 - \frac{L - S}{L^{\mathrm{old}} - S^{\mathrm{old}}}\right| \geq \varepsilon$
9:     $\theta_m = (q_m)^{1/(p+1)} \Big/ \left(\sum_{m'=1}^{M} (q_{m'})^{p/(p+1)}\right)^{1/p}$, $\forall m = 1, \ldots, M$
10:  else
11:    break
12:  end if
13:  $\hat{g}_i = \sum_m \theta_m g_{m,i}$ for all $i = 1, \ldots, n$

4.3 Convergence Proof for p > 1

In the following, we exploit the primal view of the above algorithm as a nonlinear block Gauss-Seidel method to prove convergence of our algorithms. We first need the following useful result about convergence of the nonlinear block Gauss-Seidel method in general.

Proposition 3 (Bertsekas, 1999, Prop. 2.7.1). Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_M$ be the Cartesian product of closed convex sets $\mathcal{X}_m \subset \mathbb{R}^{d_m}$, and let $f : \mathcal{X}\to\mathbb{R}$ be a continuously differentiable function. Define the nonlinear block Gauss-Seidel method recursively by letting $x^0 \in \mathcal{X}$ be any feasible point and setting

$$x_m^{k+1} = \operatorname*{argmin}_{\xi\in\mathcal{X}_m} \; f\!\left(x_1^{k+1}, \cdots, x_{m-1}^{k+1}, \xi, x_{m+1}^{k}, \cdots, x_M^{k}\right), \quad \forall m = 1, \ldots, M. \qquad (23)$$

Suppose that for each $m$ and $x \in \mathcal{X}$, the minimum

$$\min_{\xi\in\mathcal{X}_m} f(x_1, \cdots, x_{m-1}, \xi, x_{m+1}, \cdots, x_M) \qquad (24)$$

is uniquely attained. Then every limit point of the sequence $\{x^k\}_{k\in\mathbb{N}}$ is a stationary point.

The proof can be found in Bertsekas (1999), pp. 268–269. The next theorem basically establishes convergence of the proposed $\ell_p$-norm MKL training algorithm.


Theorem 4. Let $V$ be the hinge loss and let $p > 1$. Let the kernel matrices $K_1, \ldots, K_M$ be positive definite. Then every limit point of Algorithm 1 is a globally optimal point of Optimization Problem (P). Moreover, suppose that the SVM computation is solved exactly in each iteration; then the same holds true for Algorithm 2.

Proof. If we ignore the numerical speed-ups, then Algorithms 1 and 2 coincide for the hinge loss. Hence, it suffices to show that the wrapper algorithm converges.

To this aim, we have to transform Optimization Problem (P) into a form such that the requirements for the application of Prop. 3 are fulfilled. We start by expanding Optimization Problem (P) into

$$\min_{w,b,\xi,\theta} \; C\sum_{i=1}^{n} \xi_i + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m},$$
$$\text{s.t.} \quad \forall i: \; \sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b \geq 1 - \xi_i; \quad \xi \geq 0; \quad \|\theta\|_p^2 \leq 1; \quad \theta \geq 0,$$

thereby extending the second block of variables, $(w, b)$, into $(w, b, \xi)$. Moreover, we note that after an application of the representer theorem⁹ (Kimeldorf and Wahba, 1971) we may without loss of generality assume $\mathcal{H}_m = \mathbb{R}^n$.

In the problem's current form, the possibility of $\theta_m = 0$ while $w_m \neq 0$ renders the objective function nondifferentiable. This hinders the application of Prop. 3. Fortunately, it follows from Prop. 2 (note that $K_m \succ 0$ implies $w \neq 0$) that this case is impossible. We therefore can substitute the constraint $\theta \geq 0$ by $\theta > 0$ for all $m$. In order to maintain the closedness of the feasible set, we subsequently apply a bijective coordinate transformation $\phi : \mathbb{R}_+^M\to\mathbb{R}^M$ with $\theta_m^{\mathrm{new}} = \phi_m(\theta_m) = \log(\theta_m)$, resulting in the following equivalent problem:

$$\inf_{w,b,\xi,\theta} \; C\sum_{i=1}^{n} \xi_i + \frac{1}{2}\sum_{m=1}^{M} \exp(-\theta_m)\,\|w_m\|_{\mathbb{R}^n}^2,$$
$$\text{s.t.} \quad \forall i: \; \sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathbb{R}^n} + b \geq 1 - \xi_i; \quad \xi \geq 0; \quad \|\exp(\theta)\|_p^2 \leq 1,$$

where we employ the notation $\exp(\theta) = (\exp(\theta_1), \cdots, \exp(\theta_M))^\top$.

Applying the Gauss-Seidel method in Eq. (23) to the base problem (P) and to the reparametrized problem yields the same sequence of solutions $\{(w, b, \theta)^k\}_{k\in\mathbb{N}_0}$. The above problem now allows us to apply Prop. 3 for the two blocks of coordinates $\theta \in \mathcal{X}_1$ and $(w, b, \xi) \in \mathcal{X}_2$: the objective is continuously differentiable and the sets $\mathcal{X}_1$ and $\mathcal{X}_2$ are closed and convex. To see the latter, note that $\|\cdot\|_p^2 \circ \exp$ is a convex function, since $\|\cdot\|_p^2$ is convex and non-decreasing in each argument (cf., e.g., Section 3.2.4 in Boyd and Vandenberghe, 2004). Moreover, the minima in Eq. (23) are uniquely attained: the $(w, b)$-step amounts to solving an SVM on a positive definite kernel mixture, and the analytical $\theta$-step clearly yields unique solutions as well.

Hence, we conclude that every limit point of the sequence $\{(w, b, \theta)^k\}_{k\in\mathbb{N}}$ is a stationary point of Optimization Problem (P). For a convex problem, this is equivalent to such a limit point being globally optimal.

9. Note that the coordinate transformation into $\mathbb{R}^n$ can be explicitly given in terms of the empirical kernel map (Schölkopf et al., 1999).

In practice, we are facing two problems. First, the standard Hilbert space setup necessarily implies that $\|w_m\| \geq 0$ for all $m$. In practice, however, this assumption may often be violated, either due to numerical imprecision or because of using an indefinite "kernel" function. However, for any $\|w_m\| \leq 0$ it also follows that $\theta_m^\star = 0$, as long as at least one strictly positive $\|w_{m'}\| > 0$ exists. This is because for any $\lambda < 0$ we have $\lim_{h\to 0,\,h>0} \frac{\lambda}{h} = -\infty$. Thus, for any $m$ with $\|w_m\| \leq 0$, we can immediately set the corresponding mixing coefficients $\theta_m^\star$ to zero. The remaining $\theta$ are then computed according to the update formula (17), and convergence will be achieved as long as at least one strictly positive $\|w_{m'}\| > 0$ exists in each iteration.

Second, in practice, the SVM problem will only be solved with ﬁnite precision, which

may lead to convergence problems. Moreover, we actually want to improve $\alpha$ only a little before recomputing $\theta$, since computing a high-precision solution can be wasteful, as indicated by the superior performance of the interleaved algorithms (cf. Sect. 5.5). This helps to avoid spending a lot of $\alpha$-optimization (SVM training) on a suboptimal mixture $\theta$. Fortunately, we can overcome the potential convergence problem by ensuring that the primal objective decreases within each $\alpha$-step. This is enforced in practice by computing the SVM with a higher precision if needed. However, in our computational experiments we

ﬁnd that this precaution is not even necessary: even without it, the algorithm converges in

all cases that we tried (cf. Section 5).

Finally, we would like to point out that the proposed block coordinate descent approach

lends itself more naturally to combination with primal SVM optimizers like that of Chapelle (2006),

LibLinear (Fan et al., 2008) or Ocas (Franc and Sonnenburg, 2008). Especially for linear

kernels this is extremely appealing.

4.4 Technical Considerations

4.4.1 Implementation Details

We have implemented the analytic optimization algorithm described in the previous Section,

as well as the cutting plane and Newton algorithms by Kloft et al. (2009a), within the

SHOGUN toolbox (Sonnenburg et al., 2010) for regression, one-class classiﬁcation, and

two-class classiﬁcation tasks. In addition one can choose the optimization scheme, i.e.,

decide whether the interleaved optimization algorithm or the wrapper algorithm should be

applied. In all approaches any of the SVMs contained in SHOGUN can be used. Our

implementation can be downloaded from http://www.shogun-toolbox.org.

In the more conventional family of approaches, the wrapper algorithms, an optimization

scheme on $\theta$ wraps around a single-kernel SVM. Effectively this results in alternatingly solving for $\alpha$ and $\theta$. For the outer optimization (i.e., that on $\theta$) SHOGUN offers the three

choices listed above. The semi-inﬁnite program (SIP) uses a traditional SVM to generate

new violated constraints and thus requires a single-kernel SVM. A linear program (for $p = 1$) or a sequence of quadratically constrained linear programs (for $p > 1$) is solved via GLPK¹⁰ or IBM ILOG CPLEX¹¹. Alternatively, either an analytic or a Newton update step (for $\ell_p$-norms with $p > 1$) can be performed, obviating the need for additional mathematical programming software.

10. http://www.gnu.org/software/glpk/.
11. http://www.ibm.com/software/integration/optimization/cplex/.

The second, much faster approach performs interleaved optimization and thus requires modification of the core SVM optimization algorithm. It is currently integrated into the chunking-based SVRlight and SVMlight. To reduce the implementation effort, we implement a single function perform_mkl_step($\sum\alpha$, obj$_m$), which has the arguments $\sum\alpha = \sum_{i=1}^{n}\alpha_i$ and obj$_m = \frac{1}{2}\alpha^\top K_m\alpha$, i.e., the current linear $\alpha$-term and the SVM objectives for each kernel. This function is either, in the interleaved optimization case, called as a callback function (after each chunking step or a couple of SMO steps), or it is called by the wrapper algorithm (after each SVM optimization to full precision).

Recovering Regression and One-Class Classification. It should be noted that one-class classification is trivially implemented using $\sum\alpha = 0$, while support vector regression (SVR) is typically performed by internally translating the SVR problem into a standard SVM classification problem with twice the number of examples, once positively and once negatively labeled, with corresponding $\alpha$ and $\alpha^*$. Thus one needs direct access to $\alpha^*$ and computes $\sum\alpha = -\sum_{i=1}^{n}(\alpha_i + \alpha_i^*)\varepsilon - \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)y_i$ (cf. Sonnenburg et al., 2006a). Since this requires modification of the core SVM solver, we implemented SVR only for interleaved optimization and SVMlight.

Eﬃciency Considerations and Kernel Caching. Note that the choice of the size of

the kernel cache becomes crucial when applying MKL to large scale learning applications.12

While for the wrapper algorithms only a single kernel SVM needs to be solved and thus a

single large kernel cache should be used, the story is diﬀerent for interleaved optimization.

Since one must keep track of the several partial MKL objectives objm, requiring access to

individual kernel rows, the same cache size should be used for all sub-kernels.

12. Large scale in the sense that the data cannot be stored in memory or the computation reaches a maintainable limit. In the case of MKL this can be due to either a large sample size or a high number of kernels.

4.4.2 Kernel Normalization

The normalization of kernels is as important for MKL as the normalization of features is

for training regularized linear or single-kernel models. This is due to the bias introduced by the regularization: optimal feature/kernel weights are requested to be small. This is

easier to achieve for features (or entire feature spaces, as implied by kernels) that are scaled

to be of large magnitude, while downscaling them would require a correspondingly upscaled

weight for representing the same predictive model. Upscaling (downscaling) features is

thus equivalent to modifying regularizers such that they penalize those features less (more).

As is common practice, we here use isotropic regularizers, which penalize all dimensions

uniformly. This implies that the kernels have to be normalized in a sensible way in order

to represent an “uninformative prior” as to which kernels are useful.

There exist several approaches to kernel normalization, of which we use two in the com-

putational experiments below. They are fundamentally different. The first one generalizes the common practice of standardizing features to entire kernels, thereby directly imple-

menting the spirit of the discussion above. In contrast, the second normalization approach

rescales the data points to unit norm in feature space. Nevertheless it can have a beneﬁcial

eﬀect on the scaling of kernels, as we argue below.

Multiplicative Normalization. As done in Ong and Zien (2008), we multiplicatively normalize the kernels to have uniform variance of data points in feature space. Formally, we find a positive rescaling $\rho_m$ of the kernel such that the rescaled kernel $\tilde{k}_m(\cdot,\cdot) = \rho_m k_m(\cdot,\cdot)$ and the corresponding feature map $\tilde{\Phi}_m(\cdot) = \sqrt{\rho_m}\,\Phi_m(\cdot)$ satisfy

$$\frac{1}{n}\sum_{i=1}^{n} \left\|\tilde{\Phi}_m(x_i) - \tilde{\Phi}_m(\bar{x})\right\|^2 = 1$$

for each $m = 1, \ldots, M$, where $\tilde{\Phi}_m(\bar{x}) := \frac{1}{n}\sum_{i=1}^{n}\tilde{\Phi}_m(x_i)$ is the empirical mean of the data in feature space. The above equation can equivalently be expressed in terms of kernel functions as

$$\frac{1}{n}\sum_{i=1}^{n} \tilde{k}_m(x_i, x_i) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \tilde{k}_m(x_i, x_j) = 1,$$

so that the final normalization rule is

$$k(x, \bar{x}) \;\longmapsto\; \frac{k(x, \bar{x})}{\frac{1}{n}\sum_{i=1}^{n} k(x_i, x_i) - \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j)}. \qquad (25)$$

Note that in case the kernel is centered (i.e., the empirical mean of the data points lies on the origin), the above rule simplifies to $k(x, \bar{x}) \longmapsto k(x, \bar{x})/\frac{1}{n}\mathrm{tr}(K)$, where $\mathrm{tr}(K) := \sum_{i=1}^{n} k(x_i, x_i)$ is the trace of the kernel matrix $K$.

Spherical Normalization. Frequently, kernels are normalized according to

$$k(x, \bar{x}) \;\longmapsto\; \frac{k(x, \bar{x})}{\sqrt{k(x, x)\,k(\bar{x}, \bar{x})}}. \qquad (26)$$

After this operation, $\|x\| = k(x, x) = 1$ holds for each data point $x$; this means that each data point is rescaled to lie on the unit sphere. Still, this also may have an effect on the scale of the features: a spherically normalized and centered kernel is also always multiplicatively normalized, because the multiplicative normalization rule becomes $k(x, \bar{x}) \longmapsto k(x, \bar{x})/\frac{1}{n}\mathrm{tr}(K) = k(x, \bar{x})/1$.

Thus the spherical normalization may be seen as an approximation to the above multiplicative normalization and may be used as a substitute for it. Note, however, that it changes the data points themselves by eliminating length information; whether this is desired or not depends on the learning task at hand. Finally, note that both normalizations achieve that the optimal value of $C$ is not far from 1.
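For completeness, here is a small numpy sketch (ours) of the two normalization rules applied to a precomputed Gram matrix; following Eq. (25) and Eq. (26), the denominators are estimated from the training kernel matrix.

```python
import numpy as np

def multiplicative_normalize(K):
    """Rescale K so the data have unit variance in feature space (Eq. 25):
    divide by (1/n) tr(K) - (1/n^2) sum_ij K_ij."""
    n = K.shape[0]
    scale = np.trace(K) / n - K.sum() / n**2
    return K / scale

def spherical_normalize(K):
    """Rescale each point to unit norm in feature space (Eq. 26):
    K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```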


4.5 Limitations and Extensions of our Framework

In this section, we show the connection of $\ell_p$-norm MKL to a formulation based on block norms, point out limitations, and sketch extensions of our framework. To this aim let us recall the primal MKL problem (P) and consider the special case of $\ell_p$-norm MKL given by

$$\inf_{w,b,\theta:\theta\geq 0} \; C\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_{\mathcal{H}_m}^2}{\theta_m}, \quad \text{s.t.} \; \|\theta\|_p^2 \leq 1. \qquad (27)$$

The subsequent proposition shows that (27) can equivalently be translated into the following mixed-norm formulation,

$$\inf_{w,b} \; \tilde{C}\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \|w_m\|_{\mathcal{H}_m}^q, \qquad (28)$$

where $q = \frac{2p}{p+1}$ and $\tilde{C}$ is a constant. This has been studied by Bach et al. (2004) for $q = 1$ and by Szafranski et al. (2008) for hierarchical penalization.

Proposition 5. Let $p > 1$, let $V$ be a convex loss function, and define $q := \frac{2p}{p+1}$ (i.e., $p = \frac{q}{2-q}$). Optimization Problems (27) and (28) are equivalent, i.e., for each $C$ there exists a $\tilde{C} > 0$ such that for each optimal solution $(w^*, b^*, \theta^*)$ of OP (27) using $C$, we have that $(w^*, b^*)$ is also optimal in OP (28) using $\tilde{C}$, and vice versa.

Proof. From Prop. 2 it follows that for any fixed $w$ in (27), the $w$-optimal $\theta$ satisfies

$$\exists\zeta: \quad \theta_m = \zeta\,\|w_m\|_{\mathcal{H}_m}^{\frac{2}{p+1}}, \quad \forall m = 1, \ldots, M.$$

Plugging the above equation into (27) yields

$$\inf_{w,b} \; C\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2\zeta}\sum_{m=1}^{M} \|w_m\|_{\mathcal{H}_m}^{\frac{2p}{p+1}}. \qquad (29)$$

Defining $q := \frac{2p}{p+1}$ and $\tilde{C} := \zeta C$ results in (28).

Now, let us take a closer look at the parameter range of $q$. It is easy to see that when we vary $p$ in the real interval $[1, \infty]$, then $q$ is limited to range in $[1, 2]$. So in other words, the methodology presented in this paper only covers the $1 \leq q \leq 2$ block-norm case. However, from an algorithmic perspective our framework can easily be extended to the $q > 2$ case: although originally aiming at the more sophisticated case of hierarchical kernel learning, Aflalo et al. (2009) showed in particular that, for $q \geq 2$, Eq. (28) is equivalent to

$$\sup_{\theta:\theta\geq 0,\,\|\theta\|_r^2\leq 1} \; \inf_{w,b} \; \tilde{C}\sum_{i=1}^{n} V\!\left(\sum_{m=1}^{M} \langle w_m, \psi_m(x_i)\rangle_{\mathcal{H}_m} + b,\; y_i\right) + \frac{1}{2}\sum_{m=1}^{M} \theta_m\|w_m\|_{\mathcal{H}_m}^2, \qquad (30)$$

where $r := \frac{q}{q-2}$. Note the difference to $\ell_p$-norm MKL: the mixing coefficients $\theta$ appear in the numerator, and by varying $r$ in the interval $[1, \infty]$, the range of $q$ in the interval $[2, \infty]$ can be obtained, which explains why this method is complementary to ours, where $q$ ranges in $[1, 2]$.

It is straightforward to show that for every fixed (possibly suboptimal) pair (w, b) the optimal θ is given by

\[
\theta_m = \frac{\|w_m\|_{H_m}^{\frac{2}{r-1}}}{\left(\sum_{m'=1}^M \|w_{m'}\|_{H_{m'}}^{\frac{2r}{r-1}}\right)^{1/r}}, \quad \forall m = 1,\ldots,M.
\]

The proof is analogous to that of Prop. 2, and the above analytical update formula can be used to derive a block coordinate descent algorithm that is analogous to ours. In our framework, the mixings θ, however, appear in the denominator of the objective function of Optimization Problem (P). Therefore, the corresponding update formula in our framework is

\[
\theta_m = \frac{\|w_m\|_{H_m}^{\frac{-2}{r-1}}}{\left(\sum_{m'=1}^M \|w_{m'}\|_{H_{m'}}^{\frac{-2r}{r-1}}\right)^{1/r}}, \quad \forall m = 1,\ldots,M. \qquad (31)
\]

This shows that we can simply optimize 2 < q ≤ ∞ block-norm MKL within our computational framework, using the update formula (31).
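To make the update concrete, the following numpy sketch implements formula (31); the function name is ours, w_norms is assumed to hold the current block norms ‖w_m‖_{H_m}, and a finite q > 2 is assumed so that r > 1.

```python
import numpy as np

def theta_update_blocknorm(w_norms, q):
    """Analytic update of the mixing coefficients theta for 2 < q < inf block-norm MKL, cf. Eq. (31)."""
    w_norms = np.asarray(w_norms, dtype=float)  # current ||w_m||_{H_m}; assumed strictly positive
    r = q / (q - 2.0)                           # r := q / (q - 2)
    theta = w_norms ** (-2.0 / (r - 1.0))
    theta /= np.sum(w_norms ** (-2.0 * r / (r - 1.0))) ** (1.0 / r)
    return theta
```

In a block coordinate descent, this update would simply alternate with an SVM step that recomputes w for the current θ.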

5. Computational Experiments

In this section we study non-sparse MKL in terms of computational eﬃciency and predictive

accuracy. We apply the method of Sonnenburg et al. (2006a) in the case of p = 1. We write ℓ∞-norm MKL for a regular SVM with the unweighted-sum kernel K = Σ_m K_m.

We ﬁrst study a toy problem in Section 5.1 where we have full control over the distribu-

tion of the relevant information in order to shed light on the appropriateness of sparse, non-

sparse, and `∞-MKL. We report on real-world problems from bioinformatics, namely protein

subcellular localization (Section 5.2), ﬁnding transcription start sites of RNA Polymerase II

binding genes in genomic DNA sequences (Section 5.3), and reconstructing metabolic gene

networks (Section 5.4).

5.1 Measuring the Impact of Data Sparsity—Toy Experiment

The goal of this section is to study the relationship between the level of sparsity of the true underlying function to be learned and the chosen norm p in the model. Intuitively, we might expect that the optimal choice of p directly corresponds to the true level of sparsity. Apart from verifying this conjecture, we are also interested in the effects of a suboptimal choice of p. To this aim we constructed several artificial data sets in which we vary the degree of sparsity in the true kernel mixture coefficients. We go from having all weight focussed on a single kernel (the highest level of sparsity) to uniform weights (the least sparse scenario possible) in several steps. We then study the statistical performance of ℓp-norm MKL for different values of p that cover the entire range [1,∞].

We generate an n-element balanced sample D = {(x_i, y_i)}_{i=1}^n from two d = 50-dimensional isotropic Gaussian distributions with equal covariance matrices C = I_{d×d} and equal, but opposite, means μ_1 = (ρ/‖θ‖_2)·θ and μ_2 = −μ_1. Thereby θ is a binary vector, i.e., ∀i: θ_i ∈ {0,1}, encoding the true underlying data sparsity as follows. Zero components θ_i = 0 clearly imply identical means of the two classes' distributions in the i-th feature set; hence the latter does not carry any discriminating information. In summary, the fraction of zero components, ν(θ) = 1 − (1/d) Σ_{i=1}^d θ_i, is a measure for the feature sparsity of the learning problem.

Figure 1: Illustration of the toy experiment for θ = (1,0)^⊤, showing a relevant and an irrelevant feature with class means μ_1 and μ_2.

For several values of ν we generate m = 250 data sets D_1,...,D_m fixing ρ = 1.75. Then, each feature is input to a linear kernel and the resulting kernel matrices are multiplicatively normalized as described in Section 4.4.2. Hence, ν(θ) gives the fraction of noise kernels in the working kernel set. Then, classification models are computed by training ℓp-norm MKL for p = 1, 4/3, 2, 4, ∞ on each D_i. Soft margin parameters C are tuned on independent 10,000-elemental validation sets by grid search over C ∈ 10^{[−4,−3.5,...,0]} (optimal Cs are attained in the interior of the grid). The relative duality gaps were optimized up to a precision of 10^{−3}. We report on test errors evaluated on 10,000-elemental independent test sets and pure mean ℓ2 model errors of the computed kernel mixtures, that is, ME(θ̂) = ‖ζ(θ̂) − ζ(θ)‖_2, where ζ(x) = x/‖x‖_2.
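A compact numpy sketch of this data generation (the helper name and the random seed are ours; parameters as above):

```python
import numpy as np

def make_toy_data(n, theta, rho=1.75, rng=None):
    """Draw a balanced n-element sample from two Gaussians N(+mu, I) and N(-mu, I),
    with mu = rho * theta / ||theta||_2 and theta a binary vector of informative features."""
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta, dtype=float)
    mu = rho * theta / np.linalg.norm(theta)
    y = np.repeat([1, -1], n // 2)
    X = rng.standard_normal((n, theta.size)) + np.outer(y, mu)  # identity covariance, mean y_i * mu
    return X, y

# e.g. nu(theta) = 82% noise kernels: 9 informative out of d = 50 features
theta = np.zeros(50)
theta[:9] = 1
X, y = make_toy_data(50, theta)
```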

The results are shown in Fig. 2 for n = 50 and n = 800, where the figures on the left show the test errors and the ones on the right the model errors ME(θ̂). Regarding the latter, the model errors reflect the corresponding test errors for n = 50. This observation can be explained by statistical learning theory: the minimizer of the empirical risk is unstable for small sample sizes, and the model selection results in a strongly regularized hypothesis, leading to the observed agreement between test error and model error.

Unsurprisingly, ℓ1 performs best and reaches the Bayes error in the sparse scenario, where only a single kernel carries the whole discriminative information of the learning problem. However, in the other scenarios it mostly performs worse than the other MKL variants. This is remarkable because the underlying ground truth, i.e. the vector θ, is sparse in all but the uniform scenario. In other words, selecting this data set may imply a bias towards the ℓ1-norm. In contrast, the vanilla SVM using an unweighted-sum kernel performs best when all kernels are equally informative; however, its performance does not approach the Bayes error rate. This is because it corresponds to an ℓ2,2-block-norm regularization (see Sect. 4.5), but for a truly uniform regularization an ℓ∞-block-norm penalty (as employed in Nath et al., 2009) would be needed. This indicates a limitation of our framework; it shall, however, be kept in mind that such a uniform scenario might be quite artificial.

Figure 2: Results of the artificial experiment for sample sizes of n = 50 (top) and n = 800 (bottom) training instances in terms of test errors (left) and mean ℓ2 model errors ME(θ̂) (right), plotted against ν(θ), the fraction of noise kernels.

The non-sparse ℓ4- and ℓ2-norm MKL variants perform best in the balanced scenarios, i.e., when the noise level is ranging in the interval 64%-92%. Intuitively, the non-sparse ℓ4-norm MKL is the most robust MKL variant, achieving a test error of less than 10% in all scenarios. Tuning the sparsity parameter p for each experiment, ℓp-norm MKL achieves the lowest test error across all scenarios.

When the sample size is increased to n = 800 training instances, test errors decrease significantly. Nevertheless, we still observe differences of up to 1% test error between the best (ℓ∞-norm MKL) and worst (ℓ1-norm MKL) prediction model in the two most non-sparse scenarios. Note that all ℓp-norm MKL variants perform well in the sparse scenarios. In contrast with the test errors, the mean model errors depicted in Figure 2 (bottom, right) are relatively high. Similarly to the above reasoning, this discrepancy can be explained by the minimizer of the empirical risk becoming stable when increasing the sample size (see the theoretical analysis in Appendix A, where we show that the speed at which the minimizer becomes stable is O(1/√n)). Again, ℓp-norm MKL achieves the smallest test error for all scenarios for appropriately chosen p, and for a fixed p across all experiments, the non-sparse ℓ4-norm MKL performs the most robustly.

In summary, the choice of the norm parameter p is important for small sample sizes, whereas its impact decreases with an increase of the training data. As expected, sparse MKL performs best in sparse scenarios, while non-sparse MKL performs best in moderate or non-sparse scenarios, and for uniform scenarios the unweighted-sum kernel SVM performs best. When the norm parameter is tuned appropriately, ℓp-norm MKL proves robust in all scenarios.

5.2 Protein Subcellular Localization—a Sparse Scenario

The prediction of the subcellular localization of proteins is one of the rare empirical success

stories of `1-norm-regularized MKL (Ong and Zien, 2008; Zien and Ong, 2007): after deﬁning

69 kernels that capture diverse aspects of protein sequences, `1-norm-MKL could raise

the predictive accuracy signiﬁcantly above that of the unweighted sum of kernels, and

thereby also improve on established prediction systems for this problem. This has been

demonstrated on 4 data sets, corresponding to 4 diﬀerent sets of organisms (plants, non-

plant eukaryotes, Gram-positive and Gram-negative bacteria) with diﬀering sets of relevant

localizations. In this section, we investigate the performance of non-sparse MKL on the

same 4 data sets.

We downloaded the kernel matrices of all 4 data sets.13 The kernel matrices are multiplicatively normalized as described in Section 4.4.2. The experimental setup used here is related to that of Ong and Zien (2008), although it deviates from it in several details. For each data set, we perform the following steps for each of the 30 predefined splits into training set and test set (downloaded from the same URL): We consider norms p ∈ {1, 32/31, 16/15, 8/7, 4/3, 2, 4, 8, ∞} and regularization constants C ∈ {1/32, 1/8, 1/2, 1, 2, 4, 8, 32, 128}. For each parameter setting (p, C), we train ℓp-norm MKL using a 1-vs-rest strategy on the training set. The predictions on the test set are then evaluated w.r.t. the average (over the classes) MCC (Matthews correlation coefficient). As we are only interested in the influence of the norm on the performance, we forgo proper cross-validation (the so-obtained systematic error affects all norms equally). Instead, for each of the 30 data splits and for each p, the value of C that yields the highest MCC is selected. Thus we obtain an optimized C and MCC value for each combination of data set, split, and norm p. For each norm, the final MCC value is obtained by averaging over the data sets and splits (i.e., C is selected to be optimal for each data set and split).
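For reference, a small numpy sketch of the Matthews correlation coefficient used as evaluation measure (our own helper; in the 1-vs-rest setting the reported value is the mean of this quantity over the classes):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {-1, +1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```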

The results, shown in Table 1, indicate that indeed, with proper choice of a non-sparse

regularizer, the accuracy of `1-norm can be recovered. On the other hand, non-sparse MKL

can approximate the ℓ1-norm arbitrarily closely, and thereby approach the same results.

However, even when 1-norm is clearly superior to ∞-norm, as for these 4 data sets, it is

possible that intermediate norms perform even better. As the table shows, this is indeed

the case for the PSORT data sets, albeit only slightly and not signiﬁcantly so.

13. Available from http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc/

Table 1: Results for Protein Subcellular Localization. For each of the 4 data sets (rows) and each considered norm (columns), we present a measure of prediction error together with its standard error. As measure of prediction error we use 1 minus the average MCC, displayed as percentage.

ℓp-norm        1      32/31  16/15  8/7    4/3    2      4      8      16     ∞
plant          8.18   8.22   8.20   8.21   8.43   9.47   11.00  11.61  11.91  11.85
  std. err.   ±0.47  ±0.45  ±0.43  ±0.42  ±0.42  ±0.43  ±0.47  ±0.49  ±0.55  ±0.60
nonpl          8.97   9.01   9.08   9.19   9.24   9.43   9.77   10.05  10.23  10.33
  std. err.   ±0.26  ±0.25  ±0.26  ±0.27  ±0.29  ±0.32  ±0.32  ±0.32  ±0.32  ±0.31
psortNeg       9.99   9.91   9.87   10.01  10.13  11.01  12.20  12.73  13.04  13.33
  std. err.   ±0.35  ±0.34  ±0.34  ±0.34  ±0.33  ±0.32  ±0.32  ±0.34  ±0.33  ±0.35
psortPos       13.07  13.01  13.41  13.17  13.25  14.68  15.55  16.43  17.36  17.63
  std. err.   ±0.66  ±0.63  ±0.67  ±0.62  ±0.61  ±0.67  ±0.72  ±0.81  ±0.83  ±0.80

We briefly mention that the superior performance of ℓp≈1-norm MKL in this setup is not surprising. There are four sets of 16 kernels each, in which each kernel picks up very similar information: they only differ in the number and placement of gaps in all substrings of length 5 of a given part of the protein sequence. The situation is roughly analogous to considering (inhomogeneous) polynomial kernels of different degrees on the same data vectors. This means that they carry large parts of overlapping information. By construction, some kernels (those with fewer gaps) in principle also have access to more information (similar to higher-degree polynomials including low-degree polynomials). Further, Ong and Zien (2008) studied single-kernel SVMs for each kernel individually and found that in most cases the 16 kernels from the same subset perform very similarly. This means that each set of 16 kernels is highly redundant and the excluded parts of information are not very discriminative. This renders a non-sparse kernel mixture ineffective. We conclude that ℓ1-norm must be the best prediction model.

5.3 Gene Start Recognition—a Weighted Non-Sparse Scenario

This experiment aims at detecting transcription start sites (TSS) of RNA Polymerase II

binding genes in genomic DNA sequences. Accurate detection of the transcription start site

is crucial to identify genes and their promoter regions and can be regarded as a ﬁrst step in

deciphering the key regulatory elements in the promoter region that determine transcription.

Transcription start site ﬁnders exploit the fact that the features of promoter regions

and the transcription start sites are diﬀerent from the features of other genomic DNA

(Bajic et al., 2004). Many such detectors thereby rely on a combination of feature sets

which makes the learning task appealing for MKL. For our experiments we use the data set

from Sonnenburg et al. (2006b) which contains a curated set of 8,508 TSS annotated genes

utilizing dbTSS version 4 (Suzuki et al., 2002) and refseq genes. These are translated into

positive training instances by extracting windows of size [−1000,+1000] around the TSS.

Similar to Bajic et al. (2004), 85,042 negative instances are generated from the interior of

the gene using the same window size.

Following Sonnenburg et al. (2006b), we employ ﬁve diﬀerent kernels representing the

TSS signal (weighted degree with shift), the promoter (spectrum), the 1st exon (spectrum),

angles (linear), and energies (linear). Optimal kernel parameters are determined by model selection in Sonnenburg et al. (2006b). The kernel matrices are spherically normalized as described in Section 4.4.2. We reserve 13,000 and 20,000 randomly drawn instances for validation and test sets, respectively, and use the remaining 60,000 as the training pool. Soft margin parameters C are tuned on the validation set by grid search over C ∈ 2^{[−2,−1,...,5]} (optimal Cs are attained in the interior of the grid). Figure 3 shows test errors for varying training set sizes drawn from the pool; training sets of the same size are disjoint. Error bars indicate standard errors of repetitions for small training set sizes.

Figure 3: (left) Area under ROC curve (AUC) on test data for TSS recognition as a function of the training set size. Notice the tiny bars indicating standard errors w.r.t. repetitions on disjoint training sets. (right) Corresponding kernel mixtures for n = 5k, 20k, and 60k. For p = 1 consistent sparse solutions are obtained, while the optimal p = 2 distributes weights over the weighted degree and the two spectrum kernels, in good agreement with Sonnenburg et al. (2006b).

Regardless of the sample size, ℓ1-norm MKL is significantly outperformed by the sum-kernel. On the contrary, non-sparse MKL achieves significantly higher AUC values than the ℓ∞-norm MKL for sample sizes up to 20k. The scenario is well suited for ℓ2-norm

MKL which performs best. Finally, for 60k training instances, all methods but `1-norm

MKL yield the same performance. Again, the superior performance of non-sparse MKL is

remarkable, and of signiﬁcance for the application domain: the method using the unweighted

sum of kernels (Sonnenburg et al., 2006b) has recently been conﬁrmed to be leading in a

comparison of 19 state-of-the-art promoter prediction programs (Abeel et al., 2009), and

our experiments suggest that its accuracy can be further elevated by non-sparse MKL.

We give a brief explanation of the reason for optimality of a non-sparse `p-norm in

the above experiments. It has been shown by Sonnenburg et al. (2006b) that there are

three highly and two moderately informative kernels. We brieﬂy recall those results by

reporting on the AUC performances obtained from training a single-kernel SVM on each

kernel individually: TSS signal 0.89, promoter 0.86, 1st exon 0.84, angles 0.55, and energies

0.74, for ﬁxed sample size n= 2000. While non-sparse MKL distributes the weights over

all kernels (see Fig. 3), sparse MKL focuses on the best kernel. However, the superior

performance of non-sparse MKL means that dropping the remaining kernels is detrimental,

indicating that they may carry additional discriminative information.


Figure 4: Pairwise alignments of the kernel matrices are shown for the gene start recognition experiment. From left to right, the ordering of the kernel matrices is TSS signal, promoter, 1st exon, angles, and energies. The first three kernels are highly correlated, as expected from their high AUC performances (AUC=0.84-0.89), and the angle kernel correlates decently (AUC=0.55). Surprisingly, the energy kernel correlates only weakly, despite a decent AUC of 0.74.

To investigate this hypothesis we computed the pairwise alignments14 of the centered kernel matrices, i.e., A(i,j) = ⟨K_i, K_j⟩_F / (‖K_i‖_F ‖K_j‖_F), with respect to the Frobenius dot product (e.g., Golub and van Loan, 1996). The computed alignments are shown in Fig. 4. One can observe that the three relevant kernels are highly aligned, as expected, since they are correlated via the labels.
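A numpy sketch of this computation, i.e., centering in feature space followed by Frobenius-normalized inner products (helper names ours):

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: K <- H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def pairwise_alignments(kernels):
    """A(i, j) = <K_i, K_j>_F / (||K_i||_F ||K_j||_F) for a list of centered kernel matrices."""
    Ks = [center_kernel(K) for K in kernels]
    M = len(Ks)
    A = np.eye(M)
    for i in range(M):
        for j in range(i + 1, M):
            a = np.sum(Ks[i] * Ks[j]) / (np.linalg.norm(Ks[i]) * np.linalg.norm(Ks[j]))
            A[i, j] = A[j, i] = a
    return A
```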

However, the energy kernel shows only a slight correlation with the remaining kernels,

which is surprisingly little compared to its single kernel performance (AUC=0.74). We

conclude that this kernel carries complementary and orthogonal information about the

learning problem and should thus be included in the resulting kernel mixture. This is

precisely what is done by non-sparse MKL, as can be seen in Fig. 3(right), and the reason

for the empirical success of non-sparse MKL on this data set.

5.4 Reconstruction of Metabolic Gene Network—a Uniformly Non-Sparse

Scenario

In this section, we apply non-sparse MKL to a problem originally studied by Yamanishi

et al. (2005). Given 668 enzymes of the yeast Saccharomyces cerevisiae and 2782 functional

relationships extracted from the KEGG database (Kanehisa et al., 2004), the task is to

predict functional relationships for unknown enzymes. We employ the experimental setup

of Bleakley et al. (2007) who phrase the task as graph-based edge prediction with local

models by learning a model for each of the 668 enzymes. They provided kernel matrices

capturing expression data (EXP), cellular localization (LOC), and the phylogenetic profile (PHY); additionally we use the integration of the former 3 kernels (INT), which matches our definition of an unweighted-sum kernel.

14. The alignments can be interpreted as empirical estimates of the Pearson correlation of the kernels (Cristianini et al., 2002).

Table 2: Results for the reconstruction of a metabolic gene network. Results by Bleakley et al. (2007) for single kernel SVMs are shown in brackets.

                                 AUC ± stderr
EXP                              71.69 ± 1.1  (69.3 ± 1.9)
LOC                              58.35 ± 0.7  (56.0 ± 3.3)
PHY                              73.35 ± 1.9  (67.8 ± 2.1)
INT (∞-norm MKL)                 82.94 ± 1.1  (82.1 ± 2.2)
1-norm MKL                       75.08 ± 1.4
4/3-norm MKL                     78.14 ± 1.6
2-norm MKL                       80.12 ± 1.8
4-norm MKL                       81.58 ± 1.9
8-norm MKL                       81.99 ± 2.0
10-norm MKL                      82.02 ± 2.0
Recombined and product kernels
1-norm MKL                       79.05 ± 0.5
4/3-norm MKL                     80.92 ± 0.6
2-norm MKL                       81.95 ± 0.6
4-norm MKL                       83.13 ± 0.6

Following Bleakley et al. (2007), we employ a 5-fold cross validation; in each fold we

train on average 534 enzyme-based models; however, in contrast to Bleakley et al. (2007)

we omit enzymes reacting with only one or two others to guarantee well-deﬁned problem

settings. As Table 2 shows, this results in slightly better AUC values for single kernel SVMs

where the results by Bleakley et al. (2007) are shown in brackets.

As already observed (Bleakley et al., 2007), the unweighted-sum kernel SVM performs

best. Although its solution is well approximated by non-sparse MKL using large values of p,

`p-norm MKL is not able to improve on this p=∞result. Increasing the number of kernels

by including recombined and product kernels does improve the results obtained by MKL for

small values of p, but the maximal AUC values are not statistically signiﬁcantly diﬀerent

from those of `∞-norm MKL. We conjecture that the performance of the unweighted-sum

kernel SVM can be explained by all three kernels performing well invidually. Their corre-

lation is only moderate, as shown in Fig. 5, suggesting that they contain complementary

information. Hence, downweighting one of those three orthogonal kernels leads to a decrease

in performance, as observed in our experiments. This explains why `∞-norm MKL is the

best prediction model in this experiment.


Figure 5: Pairwise alignments of the kernel matrices are shown for the metabolic gene network ex-

periment. From left to right, the ordering of the kernel matrices is EXP, LOC, and PHY.

One can see that all kernel matrices are equally correlated. Generally, the alignments are

relatively low, suggesting that combining all kernels with equal weights is beneﬁcial.

5.5 Execution Time

In this section we demonstrate the eﬃciency of our implementations of non-sparse MKL.

We experiment on the MNIST data set15, where the task is to separate odd vs. even digits.

The digits in this n= 60,000-elemental data set are of size 28x28 leading to d= 784

dimensional examples. We compare our analytical solver for non-sparse MKL (Section 4.1–

4.2) with the state-of-the art for `1-norm MKL, namely SimpleMKL16 (Rakotomamonjy

et al., 2008), HessianMKL17 (Chapelle and Rakotomamonjy, 2008), SILP-based wrapper,

and SILP-based chunking optimization (Sonnenburg et al., 2006a). We also experiment

with the analytical method for p= 1, although convergence is only guaranteed by our

Theorem 4 for p > 1. We also compare to the semi-inﬁnite program (SIP) approach to

`p-norm MKL presented in Kloft et al. (2009a). 18 In addition, we solve standard SVMs19

using the unweighted-sum kernel (`∞-norm MKL) as baseline.

We experiment with MKL using precomputed kernels (excluding the kernel computation

time from the timings) and MKL based on on-the-ﬂy computed kernel matrices measur-

ing training time including kernel computations. Naturally, runtimes of on-the-ﬂy methods

should be expected to be higher than the ones of the precomputed counterparts. We opti-

15. This data set is available from http://yann.lecun.com/exdb/mnist/.

16. We obtained an implementation from http://asi.insa-rouen.fr/enseignants/~arakotom/code/.

17. We obtained an implementation from http://olivier.chapelle.cc/ams/hessmkl.tgz.

18. The Newton method presented in the same paper performed similarly most of the time but sometimes

had convergence problems, especially when p≈1 and thus was excluded from the presentation.

19. We use SVMlight as SVM-solver.


mize all methods up to a precision of 10^{-3} for the outer SVM-ε and 10^{-5} for the "inner" SIP precision, and compute relative duality gaps. To provide a fair stopping criterion to SimpleMKL and HessianMKL, we set their stopping criteria to the relative duality gap of their ℓ1-norm SILP counterpart. SVM trade-off parameters are set to C = 1 for all methods.

Scalability of the Algorithms w.r.t. Sample Size Figure 6 (top) displays the results

for varying sample sizes and 50 precomputed or on-the-ﬂy computed Gaussian kernels with

bandwidths 2σ² ∈ 1.2^{0,...,49}. Error bars indicate standard error over 5 repetitions. As

expected, the SVM with the unweighted-sum kernel using precomputed kernel matrices is

the fastest method. The classical MKL wrapper based methods, SimpleMKL and the SILP

wrapper, are the slowest; they are even slower than methods that compute kernels on-the-

ﬂy. Note that the on-the-ﬂy methods naturally have higher runtimes because they do not

proﬁt from precomputed kernel matrices.

Notably, when considering 50 kernel matrices of size 8,000 times 8,000 (memory require-

ments about 24GB for double precision numbers), SimpleMKL is the slowest method: it is

more than 120 times slower than the `1-norm SILP solver from Sonnenburg et al. (2006a).

This is because SimpleMKL suﬀers from having to train an SVM to full precision for each

gradient evaluation. In contrast, kernel caching and interleaved optimization still allow

to train our algorithm on kernel matrices of size 20000 ×20000, which would usually not

completely ﬁt into memory since they require about 149GB.

Non-sparse MKL scales similarly as `1-norm SILP for both optimization strategies, the

analytic optimization and the sequence of SIPs. Naturally, the generalized SIPs are slightly

slower than the SILP variant, since they solve an additional series of Taylor expansions

within each θ-step. HessianMKL ranks in between on-the-ﬂy and non-sparse interleaved

methods.

Scalability of the Algorithms w.r.t. the Number of Kernels Figure 6 (bottom)

shows the results for varying the number of precomputed and on-the-ﬂy computed RBF

kernels for a ﬁxed sample size of 1000. The bandwidths of the kernels are scaled such that

for M kernels 2σ² ∈ 1.2^{0,...,M−1}. As expected, the SVM with the unweighted-sum kernel

is hardly aﬀected by this setup, taking an essentially constant training time. The `1-norm

MKL by Sonnenburg et al. (2006a) handles the increasing number of kernels best and is the

fastest MKL method. Non-sparse approaches to MKL show reasonable run-times, being

just slightly slower. Thereby the analytical methods are somewhat faster than the SIP

approaches. The sparse analytical method performs worse than its non-sparse counterpart;

this might be related to the fact that convergence of the analytical method is only guaranteed

for p > 1. The wrapper methods again perform worst.

However, in contrast to the previous experiment, SimpleMKL becomes more eﬃcient

with increasing number of kernels. We conjecture that this is in part owing to the sparsity of the best solution, which accommodates the ℓ1-norm model of SimpleMKL. But the capacity

of SimpleMKL remains limited due to memory restrictions of the hardware. For example,

for storing 1,000 kernel matrices for 1,000 data points, about 7.4GB of memory are required.

On the other hand, our interleaved optimizers which allow for eﬀective caching can easily

cope with 10,000 kernels of the same size (74GB). HessianMKL is considerably faster than

SimpleMKL but slower than the non-sparse interleaved methods and the SILP. Similar to

SimpleMKL, it becomes more efficient with increasing number of kernels but eventually runs out of memory.

Figure 6: Execution times of SVM and ℓp-norm MKL based on interleaved optimization via analytical optimization and semi-infinite programming (SIP), respectively, and wrapper-based optimization via the SimpleMKL wrapper and the SIP wrapper. Top: training with a fixed number of 50 kernels and varying training set size. Bottom: 1000 examples and varying numbers of kernels. Notice the tiny error bars and that these are log-log plots.

Overall, our proposed interleaved analytic and cutting plane based optimization strate-

gies achieve a speedup of up to one and two orders of magnitude over HessianMKL and

SimpleMKL, respectively. Using efficient kernel caching, they allow for truly large-scale

multiple kernel learning well beyond the limits imposed by having to precompute and store

the complete kernel matrices. Finally, we note that performing MKL with 1,000 precom-

puted kernel matrices of size 1,000 times 1,000 requires less than 3 minutes for the SILP.

This suggests that focussing future research efforts on improving the accuracy of MKL

models may pay oﬀ more than further accelerating the optimization algorithm.

6. Conclusion

We translated multiple kernel learning into a regularized risk minimization problem for

arbitrary convex loss functions, Hilbertian regularizers, and arbitrary-norm penalties on

the mixing coeﬃcients. Our formulation can be motivated by both Tikhonov and Ivanov

regularization approaches, the latter one having an additional regularization parameter.

Applied to previous MKL research, our framework provides a unifying view and shows that

so far seemingly diﬀerent MKL approaches are in fact equivalent.

Furthermore, we presented a general dual formulation of multiple kernel learning that

subsumes many existing algorithms. We devised an eﬃcient optimization scheme for non-

sparse `p-norm MKL with p≥1, based on an analytic update for the mixing coeﬃcients,

and interleaved with chunking-based SVM training to allow for application at large scales.

It is an open question whether our algorithmic approach extends to more general norms.

Our implementations are freely available and included in the SHOGUN toolbox. The execu-

tion times of our algorithms revealed that the interleaved optimization vastly outperforms

commonly used wrapper approaches. Our results and the scalability of our MKL approach

pave the way for other real-world applications of multiple kernel learning.

In order to empirically validate our `p-norm MKL model, we applied it to artiﬁcially

generated data and real-world problems from computational biology. For the controlled

toy experiment, where we simulated various levels of sparsity, `p-norm MKL achieved a

low test error in all scenarios for scenario-wise tuned parameter p. Moreover, we studied

three real-world problems showing that the choice of the norm is crucial for state-of-the-art

performance. For the TSS recognition, non-sparse MKL raised the bar in predictive per-

formance, while for the other two tasks either sparse MKL or the unweighted-sum mixture

performed best. In those cases the best solution can be arbitrarily closely approximated by

`p-norm MKL with 1 < p < ∞. Hence it seems natural that we observed non-sparse MKL

to be never worse than an unweighted-sum kernel or a sparse MKL approach. Moreover,

empirical evidence from our experiments along with others suggests that the popular `1-

norm MKL is more prone to bad solutions than higher norms, despite appealing guarantees

like the model selection consistency (Bach, 2008).

A ﬁrst step towards a learning-theoretical understanding of this empirical behaviour

may be the convergence analysis undertaken in the appendix of this paper. It is shown

that in a sparse scenario `1-norm MKL converges faster than non-sparse MKL due to a bias

that is well-tailored to the ground truth. In their current form the bounds seem to

suggest that furthermore, in all other cases, `1-norm MKL is at least as good as non-sparse

MKL. However this would be inconsistent with both the no-free-lunch theorem and our

empirical results, which indicate that there exist scenarios in which non-sparse models are

advantageous. We conjecture that the non-sparse bounds are not yet tight and need further

improvement, for which the results in Appendix A may serve as a starting point.20

A related—and obtruding!—question is whether the optimality of the parameter pcan

retrospectively be explained or, more proﬁtably, even be estimated in advance. Clearly,

cross-validation based model selection over the choice of pwill inevitably tell us which cases

call for sparse or non-sparse models. The analyses of our real-world applications suggests

that both the correlation amongst the kernels with each other and their correlation with

the target (i.e., the amount of discriminative information that they carry) play a role in

the distinction of sparse from non-sparse scenarios. However, the exploration of theoretical

explanations is beyond the scope of this work. Nevertheless, we remark that even completely

redundant but uncorrelated kernels may improve the predictive performance of a model, as

averaging over several of them can reduce the variance of the predictions (cf., e.g., Guyon

and Elisseeﬀ, 2003, Sect. 3.1). Intuitively speaking, we observe clearly that in some cases

all features, even though they may contain redundant information, should be kept, since

putting their contributions to zero worsens prediction, i.e. all of them are informative to

our MKL models.

Finally, we would like to note that it may be worthwhile to rethink the current strong

preference for sparse models in the scientiﬁc community. Already weak connectivity in

a causal graphical model may be suﬃcient for all variables to be required for optimal

predictions (i.e., to have non-zero coeﬃcients), and even the prevalence of sparsity in causal

ﬂows is being questioned (e.g., for the social sciences Gelman (2010) argues that ”There

are (almost) no true zeros”). A main reason for favoring sparsity may be the presumed

interpretability of sparse models. This is not the topic and goal of this article; however

we remark that in general the identiﬁed model is sensitive to kernel normalization, and

in particular in the presence of strongly correlated kernels the results may be somewhat

arbitrary, putting their interpretation in doubt. However, in the context of this work the

predictive accuracy is of focal interest, and in this respect we demonstrate that non-sparse

models may improve quite impressively over sparse ones.

Acknowledgments

The authors wish to thank Vojtech Franc, Peter Gehler, Pavel Laskov, Motoaki Kawan-

abe, and Gunnar R¨atsch for stimulating discussions, and Chris Hinrichs and Klaus-Robert

M¨uller for helpful comments on the manuscript. We acknowledge Peter L. Bartlett and

Ulrich R¨uckert for contributions to parts of an earlier version of the theoretical analysis

that appeared at ECML 2010. We thank the anonymous reviewers for comments and sug-

gestions that helped to improve the manuscript. This work was supported in part by the

German Bundesministerium für Bildung und Forschung (BMBF) under the project REMIND (FKZ 01-IS07007A), and by the FP7-ICT program of the European Community, under the PASCAL2 Network of Excellence, ICT-216886. Sören Sonnenburg acknowledges financial support by the German Research Foundation (DFG) under the grant MU 987/6-1 and RA 1894/1-1, and Marius Kloft acknowledges a scholarship by the German Academic Exchange Service (DAAD).

20. We conjecture that the ℓp>1-bounds are off by a logarithmic factor, because our proof technique (ℓ1-to-ℓp conversion) introduces a slight bias towards ℓ1-norm.

Appendix A. Theoretical Analysis

In this section we present a theoretical analysis of `p-norm MKL, based on Rademacher

complexities.21 We prove a theorem that converts any Rademacher-based generalization

bound on `1-norm MKL into a generalization bound for `p-norm MKL (and even more

generally: arbitrary-norm MKL). Remarkably this `1-to-`pconversion is obtained almost

without any eﬀort: by a simple 5-line proof. The proof idea is based on Kloft et al. (2010).22

We remark that an `p-norm MKL bound was already given in Cortes et al. (2010a), but

ﬁrst their bound is only valid for the special cases p=n/(n−1) for n= 1,2, . . ., and second

it is not tight for all p, as it diverges to inﬁnity when p > 1 and papproaches one. By

contrast, beside a rather unsubstantial log(M)-factor, our result matches the best known

lower bounds, when papproaches one.

Let us start by deﬁning the hypothesis set that we want to investigate. Following Cortes

et al. (2010a), we consider the following hypothesis class for p∈[1,∞]:

\[
H^p_M := \left\{ h: \mathcal{X} \to \mathbb{R} \;\middle|\; h(x) = \sum_{m=1}^M \sqrt{\theta_m}\,\langle w_m, \psi_m(x)\rangle_{H_m},\; \|w\|_{\mathcal H} \leq 1,\; \|\theta\|_p \leq 1 \right\}.
\]

Solving our primal MKL problem (P) corresponds to empirical risk minimization in the

above hypothesis class. We are thus interested in bounding the generalization error of the

above class w.r.t. an i.i.d. sample (x1, y1), ..., (xn, yn)∈ X × {−1,1}from an arbitrary

distribution P=PX×PY. In order to do so, we compute the Rademacher complexity,

\[
R(H^p_M) := \mathbb{E}\, \sup_{h \in H^p_M} \frac{1}{n}\sum_{i=1}^n \sigma_i\, h(x_i),
\]

where σ1,...,σn are independent Rademacher variables (i.e. they take the values −1 or +1, each with probability 0.5) and E is the expectation operator that removes the dependency on all random variables, i.e. σi, xi, and yi (i = 1,...,n). If the Rademacher complexity is known, there is a large body of results which can be used to bound the generalization error (e.g., Koltchinskii and Panchenko, 2002; Bartlett and Mendelson, 2002).
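To make this quantity concrete, the following numpy sketch estimates the empirical Rademacher complexity of H^p_M by Monte Carlo for the feature-wise linear kernels ψ_m(x) = x_m of the toy experiment. It uses a closed form of the inner supremum, sup_h (1/n) Σ_i σ_i h(x_i) = ‖(‖u_1‖², ..., ‖u_M‖²)‖_{p*}^{1/2} with u_m = (1/n) Σ_i σ_i ψ_m(x_i), which follows from the Cauchy–Schwarz inequality and ℓp-duality; this derivation and the helper name are ours, not part of the text.

```python
import numpy as np

def empirical_rademacher(X, p, n_mc=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity of H^p_M for
    one-dimensional linear feature maps psi_m(x) = x_m (columns of X)."""
    rng = rng or np.random.default_rng(0)
    n, M = X.shape
    if np.isinf(p):
        p_star = 1.0
    elif p == 1:
        p_star = np.inf
    else:
        p_star = p / (p - 1.0)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        u_sq = (X.T @ sigma / n) ** 2                         # ||u_m||^2, one entry per kernel
        vals.append(np.linalg.norm(u_sq, ord=p_star) ** 0.5)  # value of the inner supremum
    return float(np.mean(vals))
```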

We now show a simple `1-to-`pconversion technique for the Rademacher complexity,

which is the main result of this section:

Theorem 6 (ℓ1-to-ℓp Conversion). For any sample of size n and p ∈ [1,∞], the Rademacher complexity of the hypothesis set H^p_M can be bounded as follows:

\[
R(H^p_M) \leq \sqrt{M^{1/p^*}}\; R(H^1_M),
\]

where p^* := p/(p−1) is the conjugated exponent of p.

21. An excellent introduction to statistical learning theory, which equips the reader with the needed basics for this section, is given in Bousquet et al. (2004).
22. We acknowledge the contribution of Ulrich Rückert.

Proof. By Hölder's inequality (e.g., Steele, 2004), we have

\[
\forall \theta \in \mathbb{R}^M: \quad \|\theta\|_1 = \mathbf{1}^\top\theta \leq \|\mathbf{1}\|_{p^*}\,\|\theta\|_p = M^{1/p^*}\,\|\theta\|_p. \qquad (32)
\]

Hence,

\[
\begin{aligned}
R(H^p_M) &\overset{\text{Def.}}{=} \mathbb{E} \sup_{w:\|w\|_{\mathcal H}\leq 1,\;\theta:\|\theta\|_p\leq 1} \frac{1}{n}\sum_{i=1}^n \sigma_i \sum_{m=1}^M \sqrt{\theta_m}\,\langle w_m,\psi_m(x_i)\rangle_{H_m} \\
&\overset{(32)}{\leq} \mathbb{E} \sup_{w:\|w\|_{\mathcal H}\leq 1,\;\theta:\|\theta\|_1\leq M^{1/p^*}} \frac{1}{n}\sum_{i=1}^n \sigma_i \sum_{m=1}^M \sqrt{\theta_m}\,\langle w_m,\psi_m(x_i)\rangle_{H_m} \\
&= \mathbb{E} \sup_{w:\|w\|_{\mathcal H}\leq 1,\;\theta:\|\theta\|_1\leq 1} \frac{1}{n}\sum_{i=1}^n \sigma_i \sum_{m=1}^M \sqrt{\theta_m M^{1/p^*}}\,\langle w_m,\psi_m(x_i)\rangle_{H_m} \\
&\overset{\text{Def.}}{=} \sqrt{M^{1/p^*}}\; R(H^1_M).
\end{aligned}
\]

Remark 7. More generally we have that for any norm ‖·‖⋆ on R^M, because all norms on R^M are equivalent (e.g., Rudin, 1991), there exists a c⋆ ∈ R such that

\[
R(H^p_M) \leq c_\star\, R(H^\star_M).
\]

This means the conversion technique extends to arbitrary norms: for any given norm ‖·‖⋆, we can convert any bound on R(H^p_M) into a bound on the Rademacher complexity R(H^⋆_M) of the hypothesis set induced by ‖·‖⋆.

A nice thing about the above bound is that we can make use of any existing bound on the Rademacher complexity of H^1_M in order to obtain a generalization bound for H^p_M. This fact is illustrated in the following. For example, the tightest result bounding R(H^1_M) known so far is:

Theorem 8 (Cortes et al. (2010a)). Let M > 1 and assume that k_m(x,x) ≤ R² for all x ∈ X and m = 1,...,M. Then, for any sample of size n, the Rademacher complexity of the hypothesis set H^1_M can be bounded as follows (where c := 23/22):

\[
R(H^1_M) \leq \sqrt{\frac{c\,e\,\lceil\log M\rceil\,R^2}{n}}.
\]

The above result directly leads to an O(√log M) bound on the generalization error and thus substantially improves on a series of loose results given within the past years (see Cortes et al., 2010a, and references therein). We can use the above result (or any other similar result23) to obtain a bound for H^p_M:

Corollary (of the previous two theorems). Let M > 1 and assume that k_m(x,x) ≤ R² for all x ∈ X and m = 1,...,M. Then, for any sample of size n, the Rademacher complexity of the hypothesis set H^p_M can be bounded as follows:

\[
\forall p \in [1,\ldots,\infty]: \quad R(H^p_M) \leq \sqrt{\frac{c\,e\,M^{1/p^*}\lceil\log M\rceil\,R^2}{n}},
\]

where p^* := p/(p−1) is the conjugated exponent of p and c := 23/22.

23. The point here is that we could use any ℓ1-bound; for example, the bounds of Kakade et al. (2009) and Kloft et al. (2010) have the same favorable O(log M) rate; in particular, whenever a new ℓ1-bound is proven, we can plug it into our conversion technique to obtain a new bound.

It is instructive to compare the above bound, which we obtained by our ℓ1-to-ℓp conversion technique, with the one given in Cortes et al. (2010a), which is R(H^p_M) ≤ √(ce p* M^{1/p*} R²/n) for any p ∈ [1,...,∞] such that p* is an integer. First, we observe that for p = 2 the bounds' rates almost coincide: they only differ by a log M factor, which is unsubstantial due to the presence of a polynomial term that dominates the asymptotics. Second, we observe that for small p (close to one), the p*-factor in the Cortes bound leads to considerably high constants. When p approaches one, it even diverges to infinity. In contrast, our bound converges to R(H^p_M) ≤ √(ce⌈log M⌉R²/n) when p approaches one, which is precisely the tight 1-norm bound of Thm. 8. Finally, it is also interesting to consider the case p ≥ 2 (which is not covered by the Cortes et al. (2010a) bound): if we let p → ∞, we obtain R(H^p_M) ≤ √(ceM⌈log M⌉R²/n). Besides the unsubstantial log M factor, our so-obtained O(√(M ln M)) bound matches the well-known O(√M) lower bounds based on the VC-dimension (e.g., Devroye et al., 1996, Section 14).

We now make use of the above analysis of the Rademacher complexity to bound the

generalization error. There are many results in the literature that can be employed to this

aim. Ours is based on Thm. 7 in Bartlett and Mendelson (2002):

Corollary 9. Let M > 1 and p ∈ [1,...,∞]. Assume that k_m(x,x) ≤ R² for all x ∈ X and m = 1,...,M. Assume the loss V: R → [0,1] is Lipschitz with constant L and V(t) ≥ 1 for all t ≤ 0. Set p* := p/(p−1) and c := 23/22. Then, the following holds with probability larger than 1 − δ over samples of size n for all classifiers h ∈ H^p_M:

\[
R(h) \leq \hat R(h) + 2L\sqrt{\frac{c\,e\,M^{1/p^*}\lceil\log M\rceil R^2}{n}} + \sqrt{\frac{\ln(2/\delta)}{2n}}, \qquad (33)
\]

where R(h) = P(y h(x) ≤ 0) is the expected risk w.r.t. the 0-1 loss and R̂(h) = (1/n) Σ_{i=1}^n V(y_i h(x_i)) is the empirical risk w.r.t. the loss V.

The above result is formulated for general Lipschitz loss functions. Since the margin loss V(t) = min{1, [1 − t/γ]_+} is Lipschitz with constant 1/γ and upper bounds the 0-1 loss, it fulfills the assumptions of the above corollary. Hence, we immediately obtain the following radius-margin bound (see also Koltchinskii and Panchenko, 2002):

Corollary 10 (ℓp-norm MKL Radius-Margin Bound). Fix the margin γ > 0. Let M > 1 and p ∈ [1,...,∞]. Assume that k_m(x,x) ≤ R² for all x ∈ X and m = 1,...,M. Set p* := p/(p−1) and c := 23/22. Then, the following holds with probability larger than 1 − δ over samples of size n for all classifiers h ∈ H^p_M:

\[
R(h) \leq \hat R(h) + \frac{2R}{\gamma}\sqrt{\frac{c\,e\,M^{1/p^*}\lceil\log M\rceil}{n}} + \sqrt{\frac{\ln(2/\delta)}{2n}}, \qquad (34)
\]

where R(h) = P(y h(x) ≤ 0) is the expected risk w.r.t. the 0-1 loss and R̂(h) = (1/n) Σ_{i=1}^n min{1, [1 − y_i h(x_i)/γ]_+} is the empirical risk w.r.t. the margin loss.

Finally, we would like to point out that, for reasons stated in Remark 7, our `1-to-`p

conversion technique lets us easily extend the above bounds to norms diﬀerent than `p.

This includes, for example, block norms and sums of block norms as used in elastic-net

regularization (see Kloft et al., 2010, for such bounds), but also non-isotropic norms such

as weighted `p-norms.

A.1 Case-based Analysis of a Sparse and a Non-Sparse Scenario

From the results given in the last section it seems that it is beneﬁcial to use a sparsity-

inducing `1-norm penalty when learning with multiple kernels. This however somewhat

contradicts our empirical evaluation, which indicated that the optimal norm parameter p

depends on the true underlying sparsity of the problem. Indeed, as we show below, a reﬁned

theoretical analysis supports this intuitive claim. We show that if the underlying truth is

uniformly non-sparse, then a priori there is no p-norm which is more promising than another

one. On the other hand, we illustrate that in a sparse scenario, the sparsity-inducing `1-

norm indeed can be beneﬁcial.

We start by reparametrizing our hypothesis set based on block norms: by Prop. 5 it holds that

\[
H^p_M = \left\{ h: \mathcal{X} \to \mathbb{R} \;\middle|\; h(x) = \sum_{m=1}^M \langle w_m, \psi_m(x)\rangle_{H_m},\; \|w\|_{2,q} \leq 1,\; q := 2p/(p+1) \right\},
\]

where ‖w‖_{2,q} := (Σ_{m=1}^M ‖w_m‖^q_{H_m})^{1/q} is the ℓ2,q-block norm. This means we can equivalently parametrize our hypothesis set in terms of block norms. Second, let us generalize the set by introducing an additional parameter C as follows:

\[
C H^p_M := \left\{ h: \mathcal{X} \to \mathbb{R} \;\middle|\; h(x) = \sum_{m=1}^M \langle w_m, \psi_m(x)\rangle_{H_m},\; \|w\|_{2,q} \leq C,\; q := 2p/(p+1) \right\}.
\]

Clearly, CH^p_M = H^p_M for C = 1, which explains why the parametrization via C is more general. It is straightforward to verify that R(CH^p_M) = C·R(H^p_M) for any C. Hence, under the assumptions of Corollary 9, we have

\[
R(h) \leq \hat R(h) + 2L\sqrt{\frac{c\,e\,M^{1/p^*}\lceil\log M\rceil R^2 C^2}{n}} + \sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (35)
\]

We will exploit the above bound in the following two illustrative examples.


Figure 7: Illustration of the two analyzed cases: a uniformly non-sparse (Example 1, left) and a sparse (Example 2, right) scenario.

Example 1. Let the input space be X = R^M, and the feature map be ψ_m(x) = x_m for all m = 1,...,M and x = (x_1,...,x_M) ∈ X (in other words, ψ_m is the projection on the m-th feature). Assume that the Bayes-optimal classifier is given by

\[
w_{\text{Bayes}} = (1,\ldots,1)^\top \in \mathbb{R}^M.
\]

This means the best classifier possible is uniformly non-sparse (see Fig. 7, left). Clearly, it can be advantageous to work with a hypothesis set that is rich enough to contain the Bayes classifier, i.e. (1,...,1)^⊤ ∈ CH^p_M. In our example, this is the case if and only if ‖(1,...,1)^⊤‖_{2p/(p+1)} ≤ C, which itself is equivalent to M^{(p+1)/(2p)} ≤ C. The bound (35) attains its minimal value under the latter constraint for M^{(p+1)/(2p)} = C. Resubstitution into the bound yields

\[
R(h) \leq \hat R(h) + 2L\sqrt{\frac{c\,e\,M^{2}\lceil\log M\rceil R^2}{n}} + \sqrt{\frac{\ln(2/\delta)}{2n}}.
\]

Interestingly, the obtained bound does not depend on the norm parameter p at all! This means that in this particular (non-sparse) example all p-norm MKL variants yield the same generalization bound. There is thus no theoretical evidence which norm to prefer a priori.

Example 2. In this second example we consider the same input space and kernels as before. But this time we assume a sparse Bayes-optimal classifier (see Fig. 7, right),

\[
w_{\text{Bayes}} = (1,0,\ldots,0)^\top \in \mathbb{R}^M.
\]

As in the previous example, in order for w_Bayes to be in the hypothesis set, we have to require ‖(1,0,...,0)^⊤‖_{2p/(p+1)} ≤ C. But this time this simply solves to C ≥ 1, which is independent of the norm parameter p. Thus, inserting C = 1 in the bound (35), we obtain

\[
R(h) \leq \hat R(h) + 2L\sqrt{\frac{c\,e\,M^{1/p^*}\lceil\log M\rceil R^2}{n}} + \sqrt{\frac{\ln(2/\delta)}{2n}},
\]

which is precisely the bound of Corollary 9. It is minimized for p = 1; thus, in this particular sparse example, the bound is considerably smaller for sparse MKL, especially if the number of kernels is high compared to the sample size. This is also intuitive: if the underlying truth is sparse, we expect a sparsity-inducing norm to match the ground truth well.

We conclude from the previous two examples that the optimal norm parameter pdepends

on the underlying ground truth: if it is sparse, then choosing a sparse regularization is

beneﬁcial; otherwise, any norm pcan perform well. I.e., without any domain knowledge

there is no norm that a priori should be preferred. Remarkably, this still holds when we

increase the number of kernels. This is somewhat contrary to anecdotal reports, which claim

that sparsity-inducing norms are beneﬁcial in high (kernel) dimensions. This is because

those analyses implicitly assume the ground truth to be sparse. The present paper, however,

clearly shows that we might encounter a non-sparse ground truth in practical applications

(see experimental section).

Appendix B. Switching between Tikhonov and Ivanov Regularization

In this appendix, we show a useful result that justiﬁes switching from Tikhonov to Ivanov

regularization and vice versa, if the bound on the regularizing constraint is tight. It is the

key ingredient of the proof of Theorem 1. We state the result for arbitrary convex functions,

so that it can be applied beyond the multiple kernel learning framework of this paper.

Proposition 11. Let D ⊂ R^d be a convex set, and let f, g: D → R be convex functions. Consider the convex optimization tasks

\[
\min_{x \in D}\; f(x) + \sigma g(x), \qquad (36a)
\]
\[
\min_{x \in D:\, g(x) \leq \tau}\; f(x). \qquad (36b)
\]

Assume that the minima exist and that a constraint qualification holds in (36b) which gives rise to strong duality, e.g., that Slater's condition is satisfied. Furthermore assume that the constraint is active at the optimal point, i.e.

\[
\inf_{x \in D} f(x) < \inf_{x \in D:\, g(x) \leq \tau} f(x). \qquad (37)
\]

Then we have that for each σ > 0 there exists τ > 0 (and vice versa) such that OP (36a) is equivalent to OP (36b), i.e., each optimal solution of one is an optimal solution of the other, and vice versa.

Proof.

(a) Let σ > 0 and let x* be the optimum of (36a). We have to show that there exists a τ > 0 such that x* is optimal in (36b). We set τ = g(x*). Suppose x* is not optimal in (36b), i.e., there exists x̃ ∈ D with g(x̃) ≤ τ such that f(x̃) < f(x*). Then we have

\[
f(\tilde x) + \sigma g(\tilde x) < f(x^*) + \sigma\tau,
\]

which by τ = g(x*) translates to

\[
f(\tilde x) + \sigma g(\tilde x) < f(x^*) + \sigma g(x^*).
\]

This contradicts the optimality of x* in (36a), and hence shows that x* is optimal in (36b), which was to be shown.

(b) Vice versa, let τ > 0 and let x* be optimal in (36b). The Lagrangian of (36b) is given by

\[
L(\sigma) = f(x) + \sigma(g(x) - \tau), \quad \sigma \geq 0.
\]

By strong duality x* is optimal in the saddle point problem

\[
\sigma^* := \operatorname*{argmax}_{\sigma \geq 0}\; \min_{x \in D}\; f(x) + \sigma(g(x) - \tau),
\]

and by the strong max-min property (cf. Boyd and Vandenberghe, 2004, p. 238) we may exchange the order of maximization and minimization. Hence x* is optimal in

\[
\min_{x \in D}\; f(x) + \sigma^*(g(x) - \tau). \qquad (38)
\]

Removing the constant term −σ*τ, and setting σ = σ*, we have that x* is optimal in (36a), which was to be shown. Moreover, by (37) we have that

\[
x^* \neq \operatorname*{argmin}_{x \in D} f(x),
\]

and hence we see from Eq. (38) that σ* > 0, which completes the proof of the proposition.
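As a quick numerical illustration of the proposition (a toy example of our own, not from the text), take D = R, f(x) = (x − 2)² and g(x) = x²: for a given σ the Tikhonov solution is x* = 2/(1 + σ), and setting τ = g(x*) makes the Ivanov problem return the same point.

```python
import numpy as np

# Toy check of Proposition 11 with f(x) = (x - 2)^2 and g(x) = x^2 on a fine grid.
f = lambda x: (x - 2.0) ** 2
g = lambda x: x ** 2

xs = np.linspace(-5.0, 5.0, 200001)
sigma = 0.5

# Tikhonov: minimize f + sigma * g  (closed form: x = 2 / (1 + sigma))
x_tik = xs[np.argmin(f(xs) + sigma * g(xs))]

# Ivanov with tau = g(x_tik): minimize f subject to g(x) <= tau
tau = g(x_tik)
feasible = xs[g(xs) <= tau]
x_iva = feasible[np.argmin(f(feasible))]

print(x_tik, x_iva)  # both approximately 2 / (1 + 0.5) = 1.333
```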

References

T. Abeel, Y. V. de Peer, and Y. Saeys. Towards a gold standard for promoter prediction

evaluation. Bioinformatics, 2009.

J. Aﬂalo, A. Ben-Tal, C. Bhattacharyya, J. S. Nath, and S. Raman. Variable sparsity

kernel learning—algorithms and applications. Journal of Machine Learning Research,

2009. Submitted 12/2009. Preprint: http://mllab.csa.iisc.ernet.in/vskl.html.

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine

Learning, 73(3):243–272, 2008.

F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In

D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Infor-

mation Processing Systems 21, pages 105–112, 2009.

F. R. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn.

Res., 9:1179–1225, 2008.

F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality,

and the SMO algorithm. In Proc. 21st ICML. ACM, 2004.


V. B. Bajic, S. L. Tan, Y. Suzuki, and S. Sugano. Promoter prediction analysis on the

whole human genome. Nature Biotechnology, 22(11):1467–1473, 2004.

P. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and

structural results. Journal of Machine Learning Research, 3:463–482, Nov. 2002.

D. Bertsekas. Nonlinear Programming, Second Edition. Athena Scientiﬁc, Belmont, MA,

1999.

K. Bleakley, G. Biau, and J.-P. Vert. Supervised reconstruction of biological networks with

local models. Bioinformatics, 23:i57–i65, 2007.

O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In

Advances in Neural Information Processing Systems, 2002.

O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In

O. Bousquet, U. von Luxburg, and G. R¨atsch, editors, Advanced Lectures on Machine

Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer

Berlin / Heidelberg, 2004.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambrigde University Press, Cam-

bridge, UK, 2004.

O. Chapelle. Training a support vector machine in the primal. Neural Computation, 2006.

O. Chapelle and A. Rakotomamonjy. Second order optimization of kernel parameters.

In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal

Kernels, 2008.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for

support vector machines. Machine Learning, 46(1):131–159, 2002.

C. Cortes, A. Gretton, G. Lanckriet, M. Mohri, and A. Rostamizadeh. Proceedings of

the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

URL http://www.cs.nyu.edu/learning_kernels.

C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Pro-

ceedings of the International Conference on Uncertainty in Artiﬁcial Intelligence, 2009a.

C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels.

In Y. Bengio, D. Schuurmans, J. Laﬀerty, C. K. I. Williams, and A. Culotta, editors,

Advances in Neural Information Processing Systems 22, pages 396–404, 2009b.

C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In

Proceedings, 27th ICML, 2010a.

C. Cortes, M. Mohri, and A. Rostamizadeh. Two-stage learning kernel algorithms. In

Proceedings of the 27th Conference on Machine Learning (ICML 2010), 2010b.

N. Cristianini, J. Kandola, A. Elisseeﬀ, and J. Shawe-Taylor. On kernel-target alignment.

In Advances in Neural Information Processing Systems, 2002.


L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.

R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using the second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005.

V. Franc and S. Sonnenburg. OCAS optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Machine Learning Conference. ACM Press, 2008. URL http://ida.first.fraunhofer.de/~franc/ocas/html/index.html.

P. Gehler and S. Nowozin. Infinite kernel learning. In Proceedings of the NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

A. Gelman. Causality and statistical learning. American Journal of Sociology, 0, 2010.

G. Golub and C. van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, London, 3rd edition, 1996.

M. Gönen and E. Alpaydin. Localized multiple kernel learning. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 352–359, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: http://doi.acm.org/10.1145/1390156.1390201.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182, 2003. ISSN 1532-4435.

V. Ivanov, V. Vasin, and V. Tanana. Theory of Linear Ill-Posed Problems and its Application. VSP, Zeist, 2002.

S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems, 2009.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.

S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Applications of strong convexity–strong smoothness duality to learning with matrices. CoRR, abs/0910.0610, 2009.

M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic Acids Res, 32:D277–D280, 2004.

G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.

M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse multiple kernel learning. In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Kernels, December 2008.

M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate ℓp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 997–1005. MIT Press, 2009a.

M. Kloft, S. Nakajima, and U. Brefeld. Feature selection for density level-sets. In W. L. Buntine, M. Grobelnik, D. Mladenic, and J. Shawe-Taylor, editors, Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), pages 692–704, 2009b.

M. Kloft, U. Rückert, and P. L. Bartlett. A unifying view of multiple kernel learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2010. To appear. ArXiv preprint: http://arxiv.org/abs/1005.0437.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30:1–50, 2002.

G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. JMLR, 5:27–72, 2004.

D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.

P. C. Mahalanobis. On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, volume 2, no. 1, April 1936.

M. Markou and S. Singh. Novelty detection: a review – part 1: statistical approaches. Signal Processing, 83:2481–2497, 2003a.

M. Markou and S. Singh. Novelty detection: a review – part 2: neural network based approaches. Signal Processing, 83:2499–2521, 2003b.

C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, May 2001.

S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill, New York, NY, 1996.

J. S. Nath, G. Dinesh, S. Ramanand, C. Bhattacharyya, A. Ben-Tal, and K. R. Ramakrishnan. On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 844–852, 2009.

A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15:229–251, 2004.

C. S. Ong and A. Zien. An automated combination of kernels for predicting protein subcellular localization. In Proc. of the 8th Workshop on Algorithms in Bioinformatics, 2008.

C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.

S. Özögür-Akyüz and G. Weber. Learning with infinitely many kernels via semi-infinite programming. In Proceedings of the Euro Mini Conference on Continuous Optimization and Knowledge Based Technologies, 2008.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.

A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775–782, 2007.

A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

R. M. Rifkin and R. A. Lippert. Value regularization and Fenchel duality. J. Mach. Learn. Res., 8:441–479, 2007.

V. Roth and B. Fischer. Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8(Suppl 2):S12, 2007. ISSN 1471-2105. URL http://www.biomedcentral.com/1471-2105/8/S2/S12.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), volume 307, pages 848–855. ACM, 2008.

E. Rubinstein. Support vector machines via advanced optimization techniques. Master's thesis, Faculty of Electrical Engineering, Technion, November 2005.

W. Rudin. Functional Analysis. McGraw-Hill, 1991.

B. Schölkopf and A. Smola.