Learning Non-Linear Combinations of Kernels
Abstract

This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how this minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in a few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
1 Introduction

Classification algorithms such as support vector machines (SVMs) [6, 10], regression algorithms, e.g., kernel ridge regression and support vector regression (SVR) [16, 22], and general dimensionality reduction algorithms such as kernel PCA (KPCA) [18] all benefit from kernel methods. Positive definite symmetric (PDS) kernel functions implicitly specify an inner product in a high-dimensional Hilbert space where large-margin solutions are sought. So long as the kernel function used is PDS, convergence of the training algorithm is guaranteed.
However, in the typical use of these kernel method algorithms, the choice of the PDS kernel, which
is crucial to improved performance, is left to the user. A less demanding alternative is to require
the user to instead specify a family of kernels and to use the training data to select the most suitable
kernel out of that family. This is commonly referred to as the problem of learning kernels.
There is a large recent body of literature addressing various aspects of this problem, including deriving efficient solutions to the optimization problems it generates and providing a better theoretical analysis of the problem both in classification and regression [1, 8, 9, 11, 13, 15, 21]. With the exception of a few publications considering infinite-dimensional kernel families such as hyperkernels [14] or general convex classes of kernels [2, 13], the great majority of analyses and algorithmic results focus on learning finite linear combinations of base kernels as originally considered by [12]. However, despite the substantial progress made in the theoretical understanding and the design of efficient algorithms for the problem of learning such linear combinations of kernels, no method seems to reliably give improvements over baseline methods. For example, the learned linear combination does not consistently outperform either the uniform combination of base kernels or simply the best single base kernel (see, for example, the UCI dataset experiments in [9, 12]; see also the NIPS 2008 workshop).
This suggests exploring other non-linear families of kernels to obtain consistent and significant performance improvements.

Non-linear combinations of kernels have been recently considered by [23]. However, here too, experimental results have not demonstrated a consistent performance improvement for the general learning task. Another method, hierarchical multiple kernel learning, considers learning a linear combination of an exponential number of linear kernels, which can be efficiently represented as a product of sums. Thus, this method can also be classified as learning a non-linear combination of kernels. However, in [3] the base kernels are restricted to concatenation kernels, where the base kernels apply to disjoint subspaces. For this approach the authors provide an effective and efficient algorithm, and some performance improvement is actually observed for regression problems in very high dimensions.
This paper studies the general problem of learning kernels based on a polynomial combination of
base kernels. We analyze that problem in the case of regression using the kernel ridge regression
(KRR) algorithm. We show how to simplify its optimization problem from a minimax problem
to a simpler minimization problem and prove that the global solution of the optimization problem
always lies on the boundary. We give a projection-based gradient descent algorithm for solving this
minimization problem that is shown empirically to converge in a few iterations. Furthermore, we give
a necessary and sufficient condition for this algorithm to reach a global optimum. Finally, we report
the results of extensive experiments with this algorithm using several publicly available datasets
demonstrating the effectiveness of our technique.
The paper is structured as follows. In Section 2, we introduce the non-linear family of kernels
considered. Section 3 discusses the learning problem, formulates the optimization problem, and
presents our solution. In Section 4, we study the performance of our algorithm for learning non-
linear combinations of kernels in regression (NKRR) on several publicly available datasets.
2 Kernel Family
This section introduces and discusses the family of kernels we consider for our learning kernel
problem. Let $K_1, \ldots, K_p$ be a finite set of kernels that we combine to define more complex kernels. We refer to these kernels as base kernels. In much of the previous work on learning kernels, the family of kernels considered is that of linear or convex combinations of some base kernels. Here, we consider polynomial combinations of higher degree $d \geq 1$ of the base kernels with non-negative coefficients of the form:
$$K_\mu = \sum_{0 \leq k_1 + \cdots + k_p \leq d,\; k_i \geq 0} \mu_{k_1 \cdots k_p} \, K_1^{k_1} \cdots K_p^{k_p}, \qquad \mu_{k_1 \cdots k_p} \geq 0. \tag{1}$$
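As a concrete illustration of this family, the following minimal Python sketch builds the Gram matrix of such a polynomial combination from base Gram matrices; a product of kernels corresponds to an elementwise (Hadamard) product of their Gram matrices. The function name, the dictionary representation of the coefficients, and the enumeration strategy are our own illustrative choices, not part of the paper.

```python
import itertools
import numpy as np

def polynomial_combination(base_grams, coeffs, d):
    """Gram matrix of K_mu = sum_{0 <= k1+...+kp <= d} mu_{k1...kp} K_1^{k1} ... K_p^{kp} (eq. (1)).

    base_grams: list of p base Gram matrices (m x m); kernel products are elementwise.
    coeffs: dict mapping an exponent tuple (k1, ..., kp) to its non-negative coefficient.
    """
    p = len(base_grams)
    m = base_grams[0].shape[0]
    K_mu = np.zeros((m, m))
    # Enumerate all exponent tuples with k_1 + ... + k_p <= d (there are O(p^d) of them).
    for ks in itertools.product(range(d + 1), repeat=p):
        if sum(ks) > d:
            continue
        mu = coeffs.get(ks, 0.0)
        if mu == 0.0:
            continue
        monomial = np.ones((m, m))
        for K, k in zip(base_grams, ks):
            monomial *= K ** k  # elementwise power = repeated kernel product
        K_mu += mu * monomial
    return K_mu
```

For instance, with $p = 2$ and $d = 2$ the enumeration covers the six tuples $(0,0), (1,0), (0,1), (2,0), (1,1), (0,2)$, which matches the $O(p^d)$ coefficient count discussed next.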
Any kernel function $K_\mu$ of this form is PDS since products and sums of PDS kernels are PDS [4]. Note that $K_\mu$ is in fact a linear combination of the PDS kernels $K_1^{k_1} \cdots K_p^{k_p}$. However, the number of coefficients $\mu_{k_1 \cdots k_p}$ is in $O(p^d)$, which may be too large for a reliable estimation from a sample of size $m$. Instead, we can assume that for some subset $I$ of all $p$-tuples $(k_1, \ldots, k_p)$, $\mu_{k_1 \cdots k_p}$ can be written as a product of non-negative coefficients $\mu_1, \ldots, \mu_p$: $\mu_{k_1 \cdots k_p} = \mu_1^{k_1} \cdots \mu_p^{k_p}$. Then, the general form of the polynomial combinations we consider becomes
$$K_\mu = \sum_{(k_1, \ldots, k_p) \in I} \mu_1^{k_1} \cdots \mu_p^{k_p} \, K_1^{k_1} \cdots K_p^{k_p} \; + \sum_{(k_1, \ldots, k_p) \in J} \mu_{k_1 \cdots k_p} \, K_1^{k_1} \cdots K_p^{k_p}, \tag{2}$$
where $J$ denotes the complement of the subset $I$. The total number of free parameters is then reduced to $p + |J|$. The choice of the set $I$ and its size depends on the sample size $m$ and possible
prior knowledge about relevant kernel combinations. The second sum of equation (2) defining our
general family of kernels represents a linear combination of PDS kernels. In the following, we
focus on kernels that have the form of the first sum and that are thus non-linear in the parameters
$\mu_1, \ldots, \mu_p$. More specifically, we consider kernels $K_\mu$ defined by
$$K_\mu = \sum_{(k_1, \ldots, k_p) \in I} \mu_1^{k_1} \cdots \mu_p^{k_p} \, K_1^{k_1} \cdots K_p^{k_p}, \tag{3}$$
where $\mu = (\mu_1, \ldots, \mu_p)^\top \in \mathbb{R}^p$. For ease of presentation, our analysis is given for the case $d = 2$, where the quadratic kernel can be given the following simpler expression:
$$K_\mu = \sum_{k, l = 1}^{p} \mu_k \mu_l \, K_k K_l. \tag{4}$$
However, the extension to higher-degree polynomials is straightforward, and our experiments include results for degrees $d$ up to 4.
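Since kernel products are pointwise, the quadratic combination in equation (4) satisfies $\sum_{k,l} \mu_k \mu_l K_k(x,y) K_l(x,y) = \big(\sum_k \mu_k K_k(x,y)\big)^2$, so its Gram matrix is simply the elementwise square of the weighted sum of base Gram matrices. The short sketch below illustrates this; it is a minimal example assuming NumPy Gram matrices, and the helper name and the random rank-one base kernels are ours, not from the paper.

```python
import numpy as np

def quadratic_combination(base_grams, mu):
    """Gram matrix of K_mu = sum_{k,l} mu_k mu_l K_k K_l (eq. (4), the d = 2 case)."""
    linear = sum(m_k * K_k for m_k, K_k in zip(mu, base_grams))
    return linear ** 2  # elementwise square of the linear combination

# Hypothetical usage with p = 3 rank-one linear base kernels on a small sample.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
base_grams = [np.outer(X[:, j], X[:, j]) for j in range(3)]  # one base kernel per feature
mu = np.array([0.5, 1.0, 0.25])
K_mu = quadratic_combination(base_grams, mu)
```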
References

[1] A. Argyriou, R. Hauser, C. Micchelli, and M. Pontil. A DC-programming algorithm for kernel selection. In International Conference on Machine Learning, 2006.
[2] A. Argyriou, C. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In Conference on Learning Theory, 2005.
[3] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008.
[4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag: Berlin-New York, 1984.
[5] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Association for Computational Linguistics, 2007.
[6] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Conference on Learning Theory, 1992.
[7] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3), 2002.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning sequence kernels. In Machine Learning for Signal Processing, 2008.
[9] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Uncertainty in Artificial Intelligence, 2009.
[10] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20(3), 1995.
[11] T. Jebara. Multi-task feature and kernel selection for SVMs. In International Conference on Machine Learning, 2004.
[12] G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 2004.
[13] C. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6, 2005.
[14] C. S. Ong, A. Smola, and R. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 2005.
[15] A. Rakotomamonjy, F. Bach, Y. Grandvalet, and S. Canu. SimpleMKL. Journal of Machine Learning Research, 9, 2008.
[16] C. Saunders, A. Gammerman, and V. Vovk. Ridge Regression Learning Algorithm in Dual Variables. In International Conference on Machine Learning, 1998.
[17] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.
[18] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1998.
[19] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[20] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 2006.
[21] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In Conference on Learning Theory, 2006.
[22] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[23] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In International Conference on Machine Learning, 2009.