
Learning Non-Linear Combinations of Kernels

Corinna Cortes

Google Research

76 Ninth Ave

New York, NY 10011

corinna@google.com

Mehryar Mohri

Courant Institute and Google

251 Mercer Street

New York, NY 10012

mohri@cims.nyu.edu

Afshin Rostamizadeh

Courant Institute and Google

251 Mercer Street

New York, NY 10012

rostami@cs.nyu.edu

Abstract

This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

1 Introduction

Learning algorithms based on kernels have been used with much success in a variety of tasks [17, 19]. Classification algorithms such as support vector machines (SVMs) [6, 10], regression algorithms, e.g., kernel ridge regression and support vector regression (SVR) [16, 22], and general dimensionality reduction algorithms such as kernel PCA (KPCA) [18] all benefit from kernel methods. Positive definite symmetric (PDS) kernel functions implicitly specify an inner product in a high-dimensional Hilbert space where large-margin solutions are sought. So long as the kernel function used is PDS, convergence of the training algorithm is guaranteed.

However, in the typical use of these kernel method algorithms, the choice of the PDS kernel, which

is crucial to improved performance, is left to the user. A less demanding alternative is to require

the user to instead specify a family of kernels and to use the training data to select the most suitable

kernel out of that family. This is commonly referred to as the problem of learning kernels.

There is a large recent body of literature addressing various aspects of this problem, including deriving efficient solutions to the optimization problems it generates and providing a better theoretical analysis of the problem both in classification and regression [1, 8, 9, 11, 13, 15, 21]. With the exception of a few publications considering infinite-dimensional kernel families such as hyperkernels [14] or general convex classes of kernels [2], the great majority of analyses and algorithmic results focus on learning finite linear combinations of base kernels as originally considered by [12]. However, despite the substantial progress made in the theoretical understanding and the design of efficient algorithms for the problem of learning such linear combinations of kernels, no method seems to reliably give improvements over baseline methods. For example, the learned linear combination does not consistently outperform either the uniform combination of base kernels or simply the best single base kernel (see, for example, the UCI dataset experiments in [9, 12]; see also the NIPS 2008 workshop). This suggests exploring other non-linear families of kernels to obtain consistent and significant performance improvements.

Non-linear combinations of kernels have been recently considered by [23]. However, here too,

experimental results have not demonstrated a consistent performance improvement for the general


learning task. Another method, hierarchical multiple kernel learning [3], considers learning a linear combination of an exponential number of linear kernels, which can be efficiently represented as a product of sums. Thus, this method can also be classified as learning a non-linear combination of kernels. However, in [3] the base kernels are restricted to concatenation kernels, where the base kernels apply to disjoint subspaces. For this approach the authors provide an effective and efficient algorithm, and some performance improvement is actually observed for regression problems in very high dimensions.

This paper studies the general problem of learning kernels based on a polynomial combination of

base kernels. We analyze that problem in the case of regression using the kernel ridge regression

(KRR) algorithm. We show how to simplify its optimization problem from a minimax problem

to a simpler minimization problem and prove that the global solution of the optimization problem

always lies on the boundary. We give a projection-based gradient descent algorithm for solving this

minimization problem that is shown empirically to converge in few iterations. Furthermore, we give

a necessary and sufficient condition for this algorithm to reach a global optimum. Finally, we report

the results of extensive experiments with this algorithm using several publicly available datasets

demonstrating the effectiveness of our technique.

The paper is structured as follows. In Section 2, we introduce the non-linear family of kernels

considered. Section 3 discusses the learning problem, formulates the optimization problem, and

presents our solution. In Section 4, we study the performance of our algorithm for learning non-

linear combinations of kernels in regression (NKRR) on several publicly available datasets.

2 Kernel Family

This section introduces and discusses the family of kernels we consider for our learning kernel problem. Let K_1, ..., K_p be a finite set of kernels that we combine to define more complex kernels. We refer to these kernels as base kernels. In much of the previous work on learning kernels, the family of kernels considered is that of linear or convex combinations of some base kernels. Here, we consider polynomial combinations of higher degree d ≥ 1 of the base kernels with non-negative coefficients, of the form:

K_µ = Σ_{0 ≤ k_1+···+k_p ≤ d, k_i ≥ 0, i ∈ [1,p]} µ_{k_1···k_p} K_1^{k_1} ··· K_p^{k_p},   µ_{k_1···k_p} ≥ 0.   (1)

Any kernel function K_µ of this form is PDS, since products and sums of PDS kernels are PDS [4]. Note that K_µ is in fact a linear combination of the PDS kernels K_1^{k_1} ··· K_p^{k_p}. However, the number of coefficients µ_{k_1···k_p} is in O(p^d), which may be too large for a reliable estimation from a sample of size m. Instead, we can assume that for some subset I of all p-tuples (k_1, ..., k_p), µ_{k_1···k_p} can be written as a product of non-negative coefficients µ_1, ..., µ_p: µ_{k_1···k_p} = µ_1^{k_1} ··· µ_p^{k_p}. Then, the general form of the polynomial combinations we consider becomes

K = Σ_{(k_1,...,k_p)∈I} µ_1^{k_1} ··· µ_p^{k_p} K_1^{k_1} ··· K_p^{k_p} + Σ_{(k_1,...,k_p)∈J} µ_{k_1···k_p} K_1^{k_1} ··· K_p^{k_p},   (2)

where J denotes the complement of the subset I. The total number of free parameters is then reduced to p + |J|. The choice of the set I and its size depends on the sample size m and possible prior knowledge about relevant kernel combinations. The second sum of equation (2) defining our general family of kernels represents a linear combination of PDS kernels. In the following, we focus on kernels that have the form of the first sum and that are thus non-linear in the parameters µ_1, ..., µ_p. More specifically, we consider kernels K_µ defined by

K_µ = Σ_{k_1+···+k_p=d} µ_1^{k_1} ··· µ_p^{k_p} K_1^{k_1} ··· K_p^{k_p},   (3)

where µ = (µ_1, ..., µ_p)^⊤ ∈ R^p. For the ease of presentation, our analysis is given for the case d = 2, where the quadratic kernel can be given the following simpler expression:

K_µ = Σ_{k,l=1}^{p} µ_k µ_l K_k K_l.   (4)

But the extension to higher-degree polynomials is straightforward, and our experiments include results for degrees d up to 4.
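As a concrete illustration of equation (4), note that over a sample the Gram matrix of the quadratic combination is the elementwise (Hadamard) square of the weighted sum of the base Gram matrices. A minimal NumPy sketch (the function name and the toy kernels in the usage below are our own, not from the paper):

```python
import numpy as np

def quadratic_kernel_matrix(base_mats, mu):
    """Gram matrix of the quadratic (d = 2) combination of Eq. (4):
    K_mu = sum_{k,l} mu_k mu_l (K_k ∘ K_l), with ∘ the Hadamard product
    of base Gram matrices.  The double sum factorizes as the elementwise
    square of the weighted sum sum_k mu_k K_k."""
    W = sum(mk * K for mk, K in zip(mu, base_mats))
    return W * W  # elementwise square equals the double sum over k, l
```

Because each K_k ∘ K_l is PSD (Schur product theorem) and the coefficients µ_k µ_l are non-negative, the resulting matrix is PSD, matching the remark after equation (1).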


3 Algorithm for Learning Non-Linear Kernel Combinations

3.1 Optimization Problem

We consider a standard regression problem where the learner receives a training sample of size m, S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y ⊆ R the label space. The family of hypotheses H_µ out of which the learner selects a hypothesis is the reproducing kernel Hilbert space (RKHS) associated to a PDS kernel function K_µ: X × X → R as defined in the previous section. Unlike standard kernel-based regression algorithms, however, here both the parameter vector µ defining the kernel K_µ and the hypothesis are learned using the training sample S.

The learning kernel algorithm we consider is derived from kernel ridge regression (KRR). Let y = [y_1, ..., y_m]^⊤ ∈ R^m denote the vector of training labels and let K_µ denote the Gram matrix of the kernel K_µ for the sample S: [K_µ]_{i,j} = K_µ(x_i, x_j), for all i, j ∈ [1, m]. The standard KRR dual optimization problem for a fixed kernel matrix K_µ is given in terms of the Lagrange multipliers α ∈ R^m by [16]:

max_{α ∈ R^m} −α^⊤(K_µ + λI)α + 2α^⊤y.   (5)

The related problem of learning the kernel K_µ concomitantly can be formulated as the following min-max optimization problem [9]:

min_{µ ∈ M} max_{α ∈ R^m} −α^⊤(K_µ + λI)α + 2α^⊤y,   (6)

where M is a positive, bounded, and convex set. The positivity of µ ensures that K_µ is positive semi-definite (PSD) and its boundedness forms a regularization controlling the norm of µ.¹ Two natural choices for the set M are the norm-1 and norm-2 bounded sets,

M_1 = {µ | µ ⪰ 0 ∧ ‖µ − µ_0‖_1 ≤ Λ},   (7)
M_2 = {µ | µ ⪰ 0 ∧ ‖µ − µ_0‖_2 ≤ Λ}.   (8)

These definitions include an offset parameter µ_0 for the weights µ. Some natural choices for µ_0 are µ_0 = 0, or µ_0/‖µ_0‖ = 1. Note that here, since the objective function is not linear in µ, the norm-1-type regularization may not lead to a sparse solution.

3.2 Algorithm Formulation

For learning linear combinations of kernels, a typical technique consists of applying the minimax

theorem to permute the min and max operators, which can lead to optimization problems com-

putationally more efficient to solve [8,12]. However, in the non-linear case we are studying, this

technique is unfortunately not applicable.

Instead, our method for learning non-linear kernels and solving the min-max problem in equation (6) consists of first directly solving the inner maximization problem. In the case of KRR, for any fixed µ the optimum is given by

α = (K_µ + λI)^{−1} y.   (9)

Plugging the optimal expression of α into the min-max optimization yields the following equivalent minimization in terms of µ only:

min_{µ ∈ M} F(µ) = y^⊤(K_µ + λI)^{−1} y.   (10)
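Equations (9) and (10) amount to a single linear solve for any fixed µ. A small NumPy sketch (the function name is our own; we assume the quadratic kernel of equation (4)):

```python
import numpy as np

def krr_dual_objective(base_mats, mu, y, lam):
    """Evaluate F(mu) = y^T (K_mu + lam*I)^{-1} y of Eq. (10), together
    with the optimal dual vector alpha = (K_mu + lam*I)^{-1} y of Eq. (9),
    where K_mu is the quadratic (d = 2) combination of Eq. (4)."""
    K_mu = sum(mk * K for mk, K in zip(mu, base_mats)) ** 2  # elementwise square
    alpha = np.linalg.solve(K_mu + lam * np.eye(len(y)), y)
    return y @ alpha, alpha
```

Note that F(µ) = y^⊤α, so the objective comes for free once α is computed.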

We refer to this optimization as the NKRR problem. Although the original min-max problem has

been reduced to a simpler minimization problem, the function F is not convex in general as illus-

trated by Figure 1. For small values of µ, concave regions are observed. Thus, standard interior-

point or gradient methods are not guaranteed to be successful at finding a global optimum.

In the following, we give an analysis which shows that under certain conditions it is however possible to guarantee the convergence of a gradient-descent type algorithm to a global minimum.

Algorithm 1 illustrates a general gradient descent algorithm for the norm-2 bounded setting which

projects µ back to the feasible set M2after each gradient step (projecting to M1is very similar).

¹To clarify the difference between similar acronyms: a PDS function corresponds to a PSD matrix [4].



Figure 1: Example plots for F defined over two linear base kernels generated from the first two

features of the sonar dataset. From left to right λ = 1,10,100. For larger values of λ it is clear that

there are in fact concave regions of the function near 0.

Algorithm 1 Projection-based Gradient Descent Algorithm

Input: µ_init ∈ M_2, η ∈ [0,1], ε > 0, K_k, k ∈ [1,p]
µ′ ← µ_init
repeat
    µ ← µ′
    µ′ ← µ − η∇F(µ)
    ∀k, µ′_k ← max(0, µ′_k)
    normalize µ′ s.t. ‖µ′ − µ_0‖ = Λ
until ‖µ′ − µ‖ < ε
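A minimal NumPy sketch of Algorithm 1 for the norm-2 bounded set M_2, assuming the quadratic (d = 2) kernel of equation (4) and the gradient formula ∂F/∂µ_k = −2α^⊤U_kα derived in Section 3.3. The function name, fixed step size, and stopping defaults are our own illustrative choices, and for simplicity the sketch assumes µ_0 = 0 so that the projection preserves non-negativity:

```python
import numpy as np

def nkrr_gradient_descent(base_mats, y, lam, Lambda, mu0=None,
                          eta=0.1, eps=1e-6, max_iter=1000):
    """Sketch of Algorithm 1 for the norm-2 bounded set M2, with the
    quadratic kernel K_mu = (sum_k mu_k K_k) ∘ (sum_k mu_k K_k) of Eq. (4).
    Gradient (Prop. 1): dF/dmu_k = -2 a^T U_k a, U_k = (sum_r mu_r K_r) ∘ K_k."""
    p, m = len(base_mats), len(y)
    mu0 = np.zeros(p) if mu0 is None else np.asarray(mu0, dtype=float)
    mu_new = mu0 + (Lambda / np.sqrt(p)) * np.ones(p)   # feasible initial point
    for _ in range(max_iter):
        mu = mu_new
        W = sum(mk * K for mk, K in zip(mu, base_mats))      # sum_r mu_r K_r
        alpha = np.linalg.solve(W * W + lam * np.eye(m), y)  # Eq. (9)
        grad = np.array([-2.0 * alpha @ ((W * K) @ alpha) for K in base_mats])
        mu_new = mu - eta * grad                  # gradient step
        mu_new = np.maximum(mu_new, 0.0)          # thresholding (in fact inactive)
        # project back onto the boundary ||mu' - mu0||_2 = Lambda
        mu_new = mu0 + Lambda * (mu_new - mu0) / np.linalg.norm(mu_new - mu0)
        if np.linalg.norm(mu_new - mu) < eps:
            break
    return mu_new
```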

In Algorithm 1 we have fixed the step size η; however, it can also be adjusted at each iteration via a line-search. Furthermore, as shown later, the thresholding step that forces µ′ to be positive is unnecessary since ∇F is never positive.

Note that Algorithm 1 is simpler than the wrapper method proposed by [20]. Because of the closed

form expression (10), we do not alternate between solving for the dual variables and performing a

gradient step in the kernel parameters. We only need to optimize with respect to the kernel parame-

ters.

3.3 Algorithm Properties

We first explicitly calculate the gradient of the objective function for the optimization problem (10).

In what follows, ◦ denotes the Hadamard (pointwise) product between matrices.

Proposition 1. For any k ∈ [1,p], the partial derivative of F: µ → y^⊤(K_µ + λI)^{−1}y with respect to µ_k is given by

∂F/∂µ_k = −2α^⊤ U_k α,   (11)

where U_k = (Σ_{r=1}^{p} µ_r K_r) ◦ K_k.

Proof. In view of the identity ∇_M Tr(y^⊤M^{−1}y) = −M^{−⊤} yy^⊤ M^{−⊤}, we can write:

∂F/∂µ_k = Tr[ ∂(y^⊤(K_µ + λI)^{−1}y)/∂(K_µ + λI) · ∂(K_µ + λI)/∂µ_k ]
        = −Tr[ (K_µ + λI)^{−1} yy^⊤ (K_µ + λI)^{−1} · 2 Σ_{r=1}^{p} (µ_r K_r) ◦ K_k ]
        = −2 y^⊤(K_µ + λI)^{−1} [ Σ_{r=1}^{p} (µ_r K_r) ◦ K_k ] (K_µ + λI)^{−1} y = −2α^⊤ U_k α,

where the second equality uses ∂K_µ/∂µ_k = 2 Σ_{r=1}^{p} (µ_r K_r) ◦ K_k, which follows from equation (4). ∎
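Proposition 1 is easy to check numerically against a finite-difference approximation of F. A sketch (toy data, kernels, and the function name are our own):

```python
import numpy as np

def grad_F(base_mats, mu, y, lam):
    """Gradient of F(mu) = y^T (K_mu + lam*I)^{-1} y via Proposition 1:
    dF/dmu_k = -2 alpha^T U_k alpha with U_k = (sum_r mu_r K_r) ∘ K_k,
    for the quadratic (d = 2) kernel of Eq. (4)."""
    W = sum(mk * K for mk, K in zip(mu, base_mats))
    alpha = np.linalg.solve(W * W + lam * np.eye(len(y)), y)
    return np.array([-2.0 * alpha @ ((W * K) @ alpha) for K in base_mats])
```

Since each U_k is PSD for µ ⪰ 0, every component of the returned gradient is non-positive, as claimed in the text.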


The matrix U_k just defined in Proposition 1 is always PSD: for µ ⪰ 0 it is the Hadamard product of two PSD matrices, hence PSD by the Schur product theorem. Thus ∂F/∂µ_k ≤ 0 for all k ∈ [1,p], that is, ∇F ≤ 0. As already mentioned, this fact makes the thresholding step in Algorithm 1 unnecessary. We now provide guarantees for convergence to a global optimum. We shall assume that λ is strictly positive: λ > 0.

Proposition 2. Any stationary point µ⋆ of the function F: µ → y^⊤(K_µ + λI)^{−1}y necessarily maximizes F:

F(µ⋆) = max_µ F(µ) = ‖y‖²/λ.   (12)

Proof. In view of the expression of the gradient given by Proposition 1, at any point µ⋆,

µ⋆^⊤ ∇F(µ⋆) = −2α^⊤ ( Σ_{k=1}^{p} µ⋆_k U_k ) α = −2α^⊤ K_{µ⋆} α.   (13)

By definition, if µ⋆ is a stationary point, then ∇F(µ⋆) = 0, which implies µ⋆^⊤∇F(µ⋆) = 0. Thus, α^⊤K_{µ⋆}α = 0, which, since K_{µ⋆} is PSD, implies K_{µ⋆}α = 0, that is

K_{µ⋆}(K_{µ⋆} + λI)^{−1}y = 0 ⇔ (K_{µ⋆} + λI − λI)(K_{µ⋆} + λI)^{−1}y = 0   (14)
⇔ y − λ(K_{µ⋆} + λI)^{−1}y = 0   (15)
⇔ (K_{µ⋆} + λI)^{−1}y = y/λ.   (16)

Thus, for any such stationary point µ⋆, F(µ⋆) = y^⊤(K_{µ⋆} + λI)^{−1}y = y^⊤y/λ, which is clearly a maximum: since K_µ ⪰ 0 for all feasible µ, (K_µ + λI)^{−1} ⪯ λ^{−1}I and hence F(µ) ≤ ‖y‖²/λ. ∎
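The maximal value ‖y‖²/λ appearing in Proposition 2 is an upper bound on F over the entire feasible set, which is easy to confirm numerically on toy data (the base kernels and data below are arbitrary illustrative choices):

```python
import numpy as np

# Toy check of the bound behind Proposition 2: for any mu >= 0, the
# quadratic kernel matrix K_mu of Eq. (4) is PSD, so
# (K_mu + lam*I)^{-1} <= I/lam and F(mu) <= ||y||^2 / lam.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
base_mats = [X @ X.T, (X @ X.T + 1.0) ** 2]
y = rng.normal(size=8)
lam = 0.1
for _ in range(50):
    mu = rng.uniform(0.0, 2.0, size=2)
    W = mu[0] * base_mats[0] + mu[1] * base_mats[1]
    F = y @ np.linalg.solve(W * W + lam * np.eye(8), y)
    assert 0.0 < F <= y @ y / lam
```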

We next show that there cannot be an interior stationary point, and thus no local minimum strictly within the feasible set, unless the function is constant.

Proposition 3. If any point µ⋆> 0 is a stationary point of F : µ → y⊤(Kµ+ λI)−1y, then the

function is necessarily constant.

Proof. Assume that µ⋆ > 0 is a stationary point. Then, by Proposition 2, F(µ⋆) = y^⊤(K_{µ⋆} + λI)^{−1}y = y^⊤y/λ, which implies that y is an eigenvector of (K_{µ⋆} + λI)^{−1} with eigenvalue λ^{−1}. Equivalently, y is an eigenvector of K_{µ⋆} + λI with eigenvalue λ, which is equivalent to y ∈ null(K_{µ⋆}). Thus,

y^⊤ K_{µ⋆} y = Σ_{k,l=1}^{p} µ⋆_k µ⋆_l [ Σ_{r,s=1}^{m} y_r y_s K_k(x_r, x_s) K_l(x_r, x_s) ] = 0,   (17)

where we denote by (∗) the bracketed inner sum. Since the product of PDS functions is also PDS, each term (∗) must be non-negative. Furthermore, since by assumption µ⋆_k > 0 for all k ∈ [1,p], it must be the case that every term (∗) is equal to zero. Thus, equation (17) is equal to zero for all µ and the function F is equal to the constant ‖y‖²/λ. ∎

The previous propositions are sufficient to show that the gradient descent algorithm will not become

stuck at a local minimum while searching the interior of a convex set M and, furthermore, they

indicate that the optimum is found at the boundary.

The following proposition gives a necessary and sufficient condition for the convexity of F on a convex region C. If the boundary region defined by ‖µ − µ_0‖ = Λ is contained in this convex region, then Algorithm 1 is guaranteed to converge to a global optimum. Let u ∈ R^p represent an arbitrary direction of µ in C. We simplify the analysis of convexity in the following derivation by separating the terms that depend on K_µ from those depending on K_u, which arise when showing the positive semi-definiteness of the Hessian, i.e., u^⊤∇²F u ≥ 0. We denote by ⊗ the Kronecker product of two matrices.

Proposition 4. The function F: µ → y^⊤(K_µ + λI)^{−1}y is convex over the convex set C iff the following condition holds for all µ ∈ C and all u:

?M,N −?1?F ≥ 0,


(18)