Learning with Non-Positive Kernels

Cheng Soon Ong cheng.ong@anu.edu.au
Computer Sciences Laboratory, RSISE, Australian National University, 0200 ACT, Australia

Xavier Mary xavier.mary@ensae.fr
ENSAE-CREST-LS, 3 avenue Pierre Larousse, 92240 Malakoff, France

Stéphane Canu scanu@insa-rouen.fr
Laboratoire PSI FRE CNRS 2645 - INSA de Rouen, B.P. 08, 76131 Mont-Saint-Aignan Cedex, France

Alexander J. Smola alex.smola@anu.edu.au
RSISE and NICTA Australia, Australian National University, 0200 ACT, Australia

Keywords: Indefinite Kernels, Reproducing Kernel Kreĭn Space, Representer Theorem, Rademacher Average, Non-convex Optimization, Ill-posed Problems

Abstract

In this paper we show that many kernel methods can be adapted to deal with indefinite kernels, that is, kernels which are not positive semidefinite. They do not satisfy Mercer's condition and they induce associated functional spaces called Reproducing Kernel Kreĭn Spaces (RKKS), a generalization of Reproducing Kernel Hilbert Spaces (RKHS). Machine learning in RKKS shares many "nice" properties of learning in RKHS, such as orthogonality and projection. However, since the kernels are indefinite, we can no longer minimize the loss; instead we stabilize it. We show a general representer theorem for constrained stabilization and prove generalization bounds by computing the Rademacher averages of the kernel class. We list several examples of indefinite kernels and investigate regularization methods to solve spline interpolation. Some preliminary experiments with indefinite kernels for spline smoothing are reported for truncated spectral factorization, Landweber-Fridman iterations, and MR-II.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

1. Why Non-Positive Kernels?

Almost all current research on kernel methods in machine learning focuses on functions k(x, x') which are positive semidefinite. That is, it focuses on kernels which satisfy Mercer's condition and which consequently can be seen as scalar products in some Hilbert space. See (Vapnik, 1998; Schölkopf & Smola, 2002; Wahba, 1990) for details.

The purpose of this article is to point out that there is a much larger class of kernel functions available, which do not necessarily correspond to an RKHS but which nonetheless can be used for machine learning. Such kernels are known as indefinite kernels, as the scalar product matrix may contain a mix of positive and negative eigenvalues. There are several motivations for studying indefinite kernels:

• Testing Mercer's condition for a given kernel can be a challenging task which may well lie beyond the abilities of a practitioner.

• Sometimes functions which can be proven not to satisfy Mercer's condition may nonetheless be of interest. One such instance is the hyperbolic tangent kernel k(x, x') = tanh(⟨x, x'⟩ − 1) of Neural Networks, which is indefinite for any range of parameters or dimensions (Smola et al., 2000).

• There have been promising empirical reports on the use of indefinite kernels (Lin & Lin, 2003).

• In H∞ control applications and discrimination, the cost function can be formulated as the difference between two quadratic norms (Haasdonk, 2003; Hassibi et al., 1999), corresponding to an indefinite inner product.

• RKKS theory (concerning function spaces arising from indefinite kernels) has become a rather active area in interpolation and approximation theory.

• In recent work on learning the kernel, such as (Ong & Smola, 2003), the solution is a linear combination of positive semidefinite kernels. However, an arbitrary linear combination of positive kernels is not necessarily positive semidefinite (Mary, 2003). While the elements of the associated vector space of kernels can always be defined as the difference between two positive kernels, what is the functional space associated with such a kernel?

We will discuss the above issues using topological spaces similar to Hilbert spaces except for the fact that the inner product is no longer necessarily positive. Section 2 defines RKKS and some properties required in the subsequent derivations. We also give some examples of indefinite kernels and describe their spectrum. Section 3 extends Rademacher-type generalization error bounds to learning with indefinite kernels. Section 4 shows that we can obtain a theorem similar to the representer theorem in RKHS. However, we note that there may be practical problems. Section 5 describes how to approximate the solution of the interpolation problem using the spectrum of the kernel as well as iterative methods. It also shows preliminary results on spline regularization.

2. Reproducing Kernel Kreĭn Spaces

Kreĭn spaces are indefinite inner product spaces endowed with a Hilbertian topology, yet their inner product is no longer positive. Before we delve into definitions and state basic properties of Kreĭn spaces, we give an example:

Example 1 (4 dimensional space-time) Indefinite spaces were first introduced by Minkowski for the solution of problems in special relativity. There the inner product in space-time (x, y, z, t) is given by

⟨(x, y, z, t), (x', y', z', t')⟩ = xx' + yy' + zz' − tt'.

Clearly it is not positive. The vector v = (1, 1, 1, √3) belongs to the cone of so-called neutral vectors which satisfy ⟨v, v⟩ = 0 (in coordinates x² + y² + z² − t² = 0). In special relativity this cone is also called the "light cone," as it corresponds to the propagation of light from a point event.
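A quick numerical check of Example 1 (a minimal sketch; the coordinate order (x, y, z, t) and the use of NumPy are illustration-only assumptions):

```python
import numpy as np

# Minkowski inner product on space-time vectors (x, y, z, t):
# <u, v> = x x' + y y' + z z' - t t', i.e. signature (+, +, +, -).
J = np.diag([1.0, 1.0, 1.0, -1.0])

def minkowski_inner(u, v):
    return u @ J @ v

v = np.array([1.0, 1.0, 1.0, np.sqrt(3.0)])
print(minkowski_inner(v, v))   # ~0: v is a neutral vector on the light cone
```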

2.1. Kreĭn spaces

The above example shows that there are several differences between Kreĭn spaces and Hilbert spaces. We now define Kreĭn spaces formally. More detailed expositions can be found in (Bognár, 1974; Azizov & Iokhvidov, 1989). The key difference is the fact that the inner products are indefinite.

Definition 1 (Inner product) Let K be a vector space on the scalar field.¹ An inner product ⟨·,·⟩_K on K is a bilinear form where for all f, g, h ∈ K and α ∈ R:

• ⟨f, g⟩_K = ⟨g, f⟩_K
• ⟨αf + g, h⟩_K = α⟨f, h⟩_K + ⟨g, h⟩_K
• ⟨f, g⟩_K = 0 for all g ∈ K implies f = 0

An inner product is said to be positive if for all f ∈ K we have ⟨f, f⟩_K ≥ 0. It is negative if for all f ∈ K we have ⟨f, f⟩_K ≤ 0. Otherwise it is called indefinite.

A vector space K endowed with the inner product ⟨·,·⟩_K is called an inner product space. Two vectors f, g of an inner product space are said to be orthogonal if ⟨f, g⟩_K = 0. Given an inner product, we can define the associated space.

Definition 2 (Kreĭn space) An inner product space (K, ⟨·,·⟩_K) is a Kreĭn space if there exist two Hilbert spaces H+, H− spanning K such that

• All f ∈ K can be decomposed into f = f+ + f−, where f+ ∈ H+ and f− ∈ H−.
• For all f, g ∈ K: ⟨f, g⟩_K = ⟨f+, g+⟩_{H+} − ⟨f−, g−⟩_{H−}

This suggests that there is an "associated" Hilbert space, where the difference in scalar products is replaced by a sum:

Definition 3 (Associated Hilbert Space) Let K be a Kreĭn space with decomposition into Hilbert spaces H+ and H−. Then we denote by K̄ the associated Hilbert space defined by

K̄ = H+ ⊕ H−   hence   ⟨f, g⟩_K̄ = ⟨f+, g+⟩_{H+} + ⟨f−, g−⟩_{H−}.

Likewise we can introduce the symbol ⊖ to indicate that

K = H+ ⊖ H−   hence   ⟨f, g⟩_K = ⟨f+, g+⟩_{H+} − ⟨f−, g−⟩_{H−}.

Note that K̄ is the smallest Hilbert space majorizing the Kreĭn space K, and one defines the strong topology on K as the Hilbertian topology of K̄. The topology does not depend on the decomposition chosen. Clearly |⟨f, f⟩_K| ≤ ‖f‖²_K̄ for all f ∈ K.

¹ Like Hilbert spaces, Kreĭn spaces can be defined on R or C. We use R in this paper.

K is said to be Pontryagin if it admits a decomposition with finite dimensional H−, and Minkowski if K itself is finite dimensional. We will see how Pontryagin spaces arise naturally when dealing with conditionally positive definite kernels (see Section 2.4).

For estimation we need to introduce Kreĭn spaces of functions. Let X be the learning domain, and R^X the set of functions from X to R. The evaluation functional tells us the value of a function at a certain point, and we shall see that the RKKS is a subset of R^X where this functional is continuous.

Definition 4 (Evaluation functional) T_x : K → R where f ↦ T_x f = f(x).

Definition 5 (RKKS) A Kreĭn space (K, ⟨·,·⟩_K) is a Reproducing Kernel Kreĭn Space (Alpay, 2001, Chapter 7) if K ⊂ R^X and the evaluation functional is continuous on K endowed with its strong topology (that is, via K̄).

2.2. From Kreĭn spaces to Kernels

We prove an analog to the Moore-Aronszajn theorem (Wahba, 1990), which tells us that for every kernel there is an associated Kreĭn space, and for every RKKS there is a unique kernel.

Proposition 6 (Reproducing Kernel) Let K be an RKKS with K = H+ ⊖ H−. Then

1. H+ and H− are RKHS (with kernels k+ and k−),
2. There is a unique symmetric k(x, x') with k(x, ·) ∈ K such that for all f ∈ K, ⟨f, k(x, ·)⟩_K = f(x),
3. k = k+ − k−.

Proof Since K is an RKKS, the evaluation functional is continuous with respect to the strong topology. Hence the associated Hilbert space K̄ is an RKHS. It follows that H+ and H−, as Hilbertian subspaces of an RKHS, are RKHS themselves, with kernels k+ and k− respectively. Let f = f+ + f−. Then T_x(f) is given by

T_x(f) = T_x(f+) + T_x(f−)
       = ⟨f+, k+(x, ·)⟩_{H+} − ⟨f−, −k−(x, ·)⟩_{H−}
       = ⟨f, k+(x, ·) − k−(x, ·)⟩_K.

In both lines we exploited the orthogonality of H+ with H−. Clearly k := k+ − k− is symmetric. Moreover it is unique since the inner product is non-degenerate.

2.3. From Kernels to Kreĭn spaces

Let k be a symmetric real valued function on X².

Proposition 7 The following are equivalent (Mary, 2003, Theorem 2.28):

• There exists (at least) one RKKS with kernel k.
• k admits a positive decomposition, that is, there exist two positive kernels k+ and k− such that k = k+ − k−.
• k is dominated by some positive kernel p (that is, p − k is a positive kernel).

There is no bijection but a surjection between the set of RKKS and the set of generalized kernels defined in the vector space generated out of the cone of positive kernels.
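Proposition 7 is easy to illustrate on a finite sample: the spectral decomposition of an indefinite Gram matrix yields a positive decomposition K = K+ − K−, with K+ and K− built from the positive and negative eigenvalues respectively. The sketch below is an illustration only; the choice of the tanh kernel (from Section 1) and all parameter values are arbitrary assumptions.

```python
import numpy as np

def tanh_kernel(X, Y):
    # Hyperbolic tangent kernel k(x, x') = tanh(<x, x'> - 1); indefinite in general.
    return np.tanh(X @ Y.T - 1.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = tanh_kernel(X, X)

# Spectral split into positive and negative parts: K = K_plus - K_minus.
mu, U = np.linalg.eigh(K)
K_plus = (U * np.clip(mu, 0.0, None)) @ U.T
K_minus = (U * np.clip(-mu, 0.0, None)) @ U.T

print(np.allclose(K, K_plus - K_minus))            # True: positive decomposition
K_bar = K_plus + K_minus                           # Gram matrix of the associated RKHS
print(np.linalg.eigvalsh(K_bar).min() >= -1e-10)   # True: K_bar is positive semidefinite
```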

2.4. Examples and Spectral Properties

We collect several examples of indefinite kernels in Table 1 and plot a 2 dimensional example as well as 20 of the eigenvalues with the largest absolute value. We investigate the spectrum of radial kernels using the Hankel transform.

The Fourier transform allows one to find the eigenvalue decomposition of kernels of the form k(x, x') = κ(x − x') by computing the Fourier transform of κ. For x ∈ R^n we have

F[κ](‖ω‖) = ‖ω‖^{−ν} H_ν[r^ν κ(r)](‖ω‖),   where ν = n/2 − 1

and H_ν is the Hankel transform of order ν. Table 1 depicts the spectra of these kernels. Negative values in the Hankel transform correspond to H−, positive ones to H+. Likewise the decomposition of k(x, x') = κ(⟨x, x'⟩) in terms of associated Legendre polynomials allows one to identify the positive and negative parts of the Kreĭn space, as the Legendre polynomials commute with the rotation group.

One common class of translation invariant kernels which are not positive definite are the so-called conditionally positive definite (cpd) kernels. A cpd kernel of order p leads to a positive semidefinite matrix on the subspace of coefficients orthogonal to polynomials of order up to p − 1. Moreover, on the subspace of (p−1) degree polynomials, the inner product is typically negative definite. This means that there is a space of polynomials of degree up to order p − 1 (which constitutes an up to \binom{n+p-2}{p-1}-dimensional subspace) with negative inner product. In other words, we are dealing with a Pontryagin space.
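As a concrete check of conditional positive definiteness (an illustration under assumed choices: the kernel −‖s − t‖ is cpd of order 1, and the data are random), the Gram matrix is positive semidefinite on coefficient vectors orthogonal to constants, while on the constant vector the inner product is negative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(20, 1))
K = -np.abs(x - x.T)                     # k(s, t) = -|s - t|, cpd of order 1

# Restrict to the subspace orthogonal to constant coefficient vectors.
P = np.eye(20) - np.ones((20, 20)) / 20
print(np.linalg.eigvalsh(P @ K @ P).min() >= -1e-10)   # True: psd on that subspace

c = np.ones(20)
print(c @ K @ c < 0)                     # True: negative on the polynomial (constant) part
```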

The standard procedure to use such kernels is to project out the negative component, replace the latter by a suitably smoothed estimate in the polynomial subspace, and treat the remaining subspace as any RKHS (Wahba, 1990). Using Kreĭn spaces we can use these kernels directly, without the need to deal with the polynomial parts separately.

Table 1. Examples of indefinite kernels. Column 2 shows the 2D surface of the kernel with respect to the origin, column 3 shows plots of the 20 eigenvalues with largest magnitude on uniformly spaced data from the interval [−2, 2], column 4 shows plots of the Fourier spectra. [The plots are not reproduced here; only the kernel definitions are listed.]

Epanechnikov kernel:  (1 − ‖s − t‖²/σ)^p,  for ‖s − t‖²/σ ≤ 1
Gaussian combination:  exp(−‖s − t‖²/σ₁) + exp(−‖s − t‖²/σ₂) − exp(−‖s − t‖²/σ₃)
Multiquadric kernel:  √(‖s − t‖²/σ + c²)
Thin plate spline:  (‖s − t‖/σ)^{2p} ln(‖s − t‖²/σ)
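The mixed spectra reported in Table 1 can be reproduced numerically. The following sketch builds the Gram matrix of the Gaussian combination kernel on uniformly spaced points in [−2, 2]; the values σ₁ = 0.8, σ₂ = 1.2, σ₃ = 10 are those used later in Section 5.3, while the grid size is an arbitrary assumption.

```python
import numpy as np

def gaussian_combination(s, t, s1=0.8, s2=1.2, s3=10.0):
    d2 = (s[:, None] - t[None, :]) ** 2
    return np.exp(-d2 / s1) + np.exp(-d2 / s2) - np.exp(-d2 / s3)

x = np.linspace(-2, 2, 100)
K = gaussian_combination(x, x)
mu = np.linalg.eigvalsh(K)
print(mu.min(), mu.max())   # eigenvalues of both signs: the Gram matrix is indefinite
```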

3. Generalization Bounds via Rademacher Average

An important issue regarding learning algorithms is their ability to generalize (to give relevant predictions). This property is obtained when the learning process considered shows a uniform convergence behavior. In (Mendelson, 2003) such a result is demonstrated in the case of RKHS through the control of the Rademacher average of the class of functions considered. Here we present an adaptation of this proof to the case of Kreĭn spaces. We begin by setting up the functional framework for the result.

Let k be a kernel defined on a set X and choose a decomposition k = k+ − k− where k+ and k− are both positive kernels. This decomposition of the kernel can be associated with the RKHS K̄ defined by its positive kernel k̄ = k+ + k−, whose Hilbertian topology defines the strong topology of K. We will then consider the set B_K defined as follows:

B_K = { f ∈ K : ‖f+‖² + ‖f−‖² = ‖f‖² ≤ 1 }

Note that in a Kreĭn space the norm of a function is the associated Hilbertian norm, and usually ‖f‖² ≠ ⟨f, f⟩_K but always ⟨f, f⟩_K ≤ ‖f‖².

The Rademacher average of a class of functions F with respect to a measure µ is defined as follows. Let x_1, ..., x_m ∈ X be i.i.d. random variables sampled according to µ. Let ε_i for i = 1, ..., m be Rademacher random variables, that is, variables taking values in {−1, +1} with equal probability.

Definition 8 (Rademacher Average) The Rademacher average R_m(F) of a set of functions F (w.r.t. µ) is defined as

R_m(F) = E_µ E_ε [ (1/√m) sup_{f∈F} Σ_{i=1}^m ε_i f(x_i) ]

Using the Rademacher average as an estimate of the "size" of a function class, we can obtain generalization error bounds, also called uniform convergence or sample complexity bounds (Mendelson, 2003, Corollary 3): for any ε > 0 and δ > 0, there is an absolute constant C such that if

m > (C/ε²) max{ R_m²(J(B_K)), log(1/δ) },

then

Pr[ sup_{f∈B_K} | (1/m) Σ_{i=1}^m J(f(X_i)) − E J(f) | ≥ ε ] ≤ δ,

where J(f(x)) denotes the quadratic loss defined as in (Mendelson, 2003). To get the expected result we have to show that the Rademacher average is bounded by a constant independent of the sample size m. To control the Rademacher average, we first give a lemma regarding the topology of Kreĭn spaces, putting emphasis on both the difference from and the close relationship with the Hilbertian case.

Lemma 9 For all g ∈ K:

sup_{f∈B_K} ⟨f(·), g(·)⟩_K = ‖g‖

Proof It is trivial if g = 0. For all g ∈ K, g ≠ 0, let h = g/‖g‖. By construction ‖h‖ = 1. Then

sup_{f∈B_K} ⟨f(·), g(·)⟩_K = ‖g‖ sup_{f∈B_K} ⟨f(·), h(·)⟩_K
                           = ‖g‖ sup_{f∈B_K} ( ⟨f+, h+⟩_{H+} − ⟨f−, h−⟩_{H−} )
                           = ‖g‖ ( ⟨h+, h+⟩_{H+} + ⟨h−, h−⟩_{H−} )
                           = ‖g‖

where the supremum is attained at f = h+ − h−, which belongs to B_K.

In the unit ball of an RKKS, the Rademacher average with respect to the probability measure µ behaves the same way as that of its associated RKHS.

Proposition 10 (Rademacher Average) Let K̄ be the Gram matrix of the kernel k̄ at the points x_1, ..., x_m. If, with respect to the measure µ on X, x ↦ k̄(x, x) ∈ L₁(X, µ), then

R_m(B_K) ≤ M^{1/2}   with   M = (1/m) E_µ tr K̄ = ∫_X k̄(x, x) dµ(x)

The proof works just as in the Hilbertian case (Mendelson, 2003, Theorem 16) with the application of Lemma 9. As a second slight difference, we choose to express the bound as a function of the L₁(X, µ) norm of the kernel instead of going through its spectral representation. This is simpler since, for instance, for the unnormalized Gaussian kernel k(x, y) = exp(−(x − y)²) on X = R we have M = 1 regardless of the measure µ considered. Since we are back in the Hilbertian context, (Mendelson, 2003, Corollary 4) applies with Hilbert replaced by Kreĭn, providing a uniform convergence result as expected.
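The quantities appearing in Proposition 10 can be checked by simulation. By Lemma 9, the supremum over B_K in Definition 8 equals ‖Σ_i ε_i k(x_i, ·)‖ in the associated Hilbert space, i.e. √(ε⊤ K̄ ε) with K̄ the Gram matrix of k̄. The sketch below estimates R_m(B_K) by Monte Carlo for the Gaussian combination kernel and compares it with √M; using the spectral split of the sample Gram matrix as a stand-in for k̄, as well as all parameter values, are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_combination(s, t, s1=0.8, s2=1.2, s3=10.0):
    d2 = (s[:, None] - t[None, :]) ** 2
    return np.exp(-d2 / s1) + np.exp(-d2 / s2) - np.exp(-d2 / s3)

m = 200
x = rng.uniform(-2, 2, size=m)                 # sample from mu = U[-2, 2]
K = gaussian_combination(x, x)

# Gram matrix of the associated positive kernel k_bar = k_+ + k_-.
mu_eig, U = np.linalg.eigh(K)
K_bar = (U * np.abs(mu_eig)) @ U.T

# R_m(B_K) = E_eps [ sqrt(eps' K_bar eps) / sqrt(m) ]   (via Lemma 9)
eps = rng.choice([-1.0, 1.0], size=(2000, m))
R_m = np.mean(np.sqrt(np.einsum('ij,jk,ik->i', eps, K_bar, eps))) / np.sqrt(m)

M = np.trace(K_bar) / m                        # empirical version of M
print(R_m, np.sqrt(M), R_m <= np.sqrt(M))      # the bound of Proposition 10 holds
```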

4. Machine Learning in RKKS

In order to perform machine learning, we need to be able to optimize over a class of functions, and also to be able to prove that the solution exists and is unique. Instead of minimizing over a class of functionals as in an RKHS, we look for the stationary point. This is motivated by the fact that in an RKHS, minimization of the cost functional can be seen as a projection problem. The equivalent projection problem in RKKS gives us the stationary point of the cost functional.

4.1. Representer Theorem

The analysis of the learning problem in an RKKS gives representer theorems similar to the Hilbertian case (Schölkopf et al., 2001). The key difference is that the problem of minimizing a regularized risk functional becomes one of finding the stationary point of a similar functional. Moreover, the solution need not be unique any more. The proof technique, however, is rather similar. The main differences are that a) we deal with a constrained optimization problem directly and b) the Gateaux derivative has to vanish due to the nondegeneracy of the inner product. In the following, we define the training data X := (x_1, ..., x_m) drawn from the learning domain X.

Theorem 11 Let K be an RKKS with kernel k. Denote by L{f, X} a continuous convex loss functional depending on f ∈ K only via its evaluations f(x_i) with x_i ∈ X, let Ω(⟨f, f⟩) be a continuous stabilizer with strictly monotonic Ω : R → R, and let C{f, X} be a continuous functional imposing a set of constraints on f, that is, C : K × X^m → R^n. Then if the optimization problem

stabilize_{f∈K}  L{f, X} + Ω(⟨f, f⟩_K)      (1)
subject to  C{f, X} ≤ d

has a saddle point f*, it admits the expansion

f* = Σ_i α_i k(x_i, ·)   where x_i ∈ X and α_i ∈ R.      (2)

Proof The first order conditions for a solution of (1) imply that the Gateaux derivative of the Lagrangian

L{f, λ} = L{f, X} + Ω(⟨f, f⟩_K) + λ⊤(C{f, X} − d)

needs to vanish. By the nondegeneracy of the inner product, ⟨f, g⟩_K = 0 for all g ∈ K implies f = 0. Next observe that the functional subdifferential of L{f, λ} with respect to f satisfies (Rockafellar, 1996)

∂_f L{f, λ} = Σ_{i=1}^m ∂_{f(x_i)}[ L{f, X} + λ⊤ C{f, X} ] k(x_i, ·) + 2 f ∂_{⟨f,f⟩} Ω(⟨f, f⟩_K).      (3)

Here ∂ is understood to be the subdifferential with respect to the argument wherever the function is not differentiable; since L and C depend on f only via the evaluations f(x_i), the subdifferential always exists with respect to [f(x_1), ..., f(x_m)]⊤. Since for stationarity the variational derivative needs to vanish, we have 0 ∈ ∂_f L{f, λ} and consequently f = Σ_i α_i k(x_i, ·) for some α_i ∈ ∂_{f(x_i)}[L{f, X} + λ⊤ C{f, X}]. This proves the claim.

Theorem 12 (Semiparametric Extension) The same result holds if the optimization is carried out over f + g, where f ∈ K and g is a parametric addition to f. Again f lies in the span of the k(x_i, ·).

Proof [sketch only] In the Lagrange function the partial derivative with respect to f needs to vanish just as in (3). This is only possible if f is contained in the span of kernel functions on the data.

4.2. Application to general spline smoothing

We consider the general spline smoothing problem as presented in (Wahba, 1990), except that we are considering Kreĭn spaces. The general spline smoothing problem is to find the function stabilizing (that is, finding the stationary point of) the following criterion:

J_m(f) = (1/m) Σ_{i=1}^m (y_i − f(x_i))² + λ⟨f, f⟩_K.      (4)

The form of the solution of equation (4) is given by the representer theorem, which says that the solution (if it exists) is the solution of the linear equation

(K + λI)α = y,

where K_ij = k(x_i, x_j) is the Gram matrix.

The general spline smoothing problem can be viewed as applying Tikhonov regularization to the interpolation problem. However, since the matrix K is indefinite, it may have negative eigenvalues. For values of the regularization parameter λ for which −λ is an eigenvalue of the Gram matrix K, the matrix K + λI is singular. Note that in the case where K is positive this does not occur. Hence, solving the Tikhonov regularization problem directly may not be successful. Instead, we use the subspace expansion from Theorem 11 directly.
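A small numerical experiment makes the difficulty concrete (an illustration only; the Gaussian combination kernel, the sinc targets, and all parameter values are assumed): the linear system (K + λI)α = y is solvable for generic λ, but its conditioning explodes whenever −λ approaches an eigenvalue of the indefinite Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_combination(s, t, s1=0.8, s2=1.2, s3=10.0):
    d2 = (s[:, None] - t[None, :]) ** 2
    return np.exp(-d2 / s1) + np.exp(-d2 / s2) - np.exp(-d2 / s3)

x = np.linspace(-3, 3, 60)
y = np.sinc(x) + 0.1 * rng.normal(size=x.size)
K = gaussian_combination(x, x)
m = len(y)

# Stationary point of (1/m) sum_i (y_i - f(x_i))^2 + lam <f, f>_K
# with f = sum_i alpha_i k(x_i, .):  (K + lam I) alpha = y.
alpha = np.linalg.solve(K + 0.1 * np.eye(m), y)
f_hat = K @ alpha                              # fitted values at the training points

mu = np.linalg.eigvalsh(K)
bad_lam = -mu[0] + 1e-9                        # mu[0] is the most negative eigenvalue
print(np.linalg.cond(K + 0.1 * np.eye(m)))     # moderate
print(np.linalg.cond(K + bad_lam * np.eye(m))) # enormous: K + lam I nearly singular
```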

5. Algorithms for Kreĭn space Regularization

Tikhonov regularization restricts the solution of the interpolation error (1/m) Σ_{i=1}^m (y_i − f(x_i))² to a ball of radius 1/λ in ⟨f, f⟩_K; hence it projects the solution of the equation onto the ball. To avoid the problems of a singular (K + λI), the approach we take here is to set λ = 0 and to find an approximation to the solution in a small subspace of the possible solution space. That is, we are solving the following optimization problem:

stabilize_{f∈K}  (1/m) Σ_{i=1}^m (y_i − f(x_i))²      (5)
subject to  f ∈ L ⊂ span{k(x_i, ·)}.

We describe several different ways of choosing the subspace L. Defining T : K → R^m to be the evaluation functional (Definition 4) on the training points, we can express the interpolation problem f(x_i) = y_i, given the training data (x_1, y_1), ..., (x_m, y_m) ∈ (X × R)^m, as the linear system T f = y, where f ∈ K and K is an RKKS. Define T* : R^m → K to be the adjoint operator of T, such that ⟨T f, y⟩ = ⟨f, T* y⟩. Note that since T operates on elements of a Kreĭn space, T T* = K is indefinite.
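For finite data these operators have a concrete matrix form which the following subsections use implicitly: writing f = Σ_j c_j k(x_j, ·), the evaluation operator sends the coefficient vector c to Kc, its adjoint sends u to the function with coefficients u, and hence TT* = K. A small numerical check (an illustrative sketch with assumed data and kernel):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 40
x = np.linspace(-2, 2, m)
d2 = (x[:, None] - x[None, :]) ** 2
K = np.exp(-d2 / 0.8) + np.exp(-d2 / 1.2) - np.exp(-d2 / 10.0)   # indefinite Gram matrix

c = rng.normal(size=m)           # f = sum_j c_j k(x_j, .)
u = rng.normal(size=m)

Tf = K @ c                       # T f = (f(x_1), ..., f(x_m))
# <T f, u> = (K c)' u  and  <f, T* u>_K = sum_{i,j} c_i u_j k(x_i, x_j) = c' K u.
print(np.allclose(Tf @ u, c @ (K @ u)))    # True: T* is the adjoint of T
print(np.linalg.eigvalsh(K).min() < 0)     # True: T T* = K is indefinite
```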

5.1. Truncated Spectral Factorization

We perform regularization by controlling the spectrum of K. We can obtain the eigenvalue decomposition of K, K u_i = µ_i u_i, where u_1, ..., u_m are the orthonormal eigenvectors of K and the µ_i are the associated nonzero eigenvalues (assume K is regular). Let

v_i = (sign(µ_i)/√|µ_i|) T* u_i;

then the v_i are the orthogonal eigenvectors of T*T. The solution of T f = y (if it exists) is given by

f = Σ_{i=1}^m (⟨y, u_i⟩/µ_i) T* u_i

Intuitively, we associate eigenvalues with large absolute values with the underlying function, while eigenvalues close to zero correspond to signal noise. The Truncated Spectral Factorization (TSF) (Engl & Kügler, 2003) method is obtained by setting all the eigenvalues of small magnitude to zero. This means that the solution is in the subspace

L = span{ T* u_i : |µ_i| > λ }
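In coefficient space, TSF keeps only those eigen-directions of K whose eigenvalues exceed the cut-off in magnitude. A minimal sketch (the cut-off, data, and kernel parameters are assumptions made for illustration):

```python
import numpy as np

def tsf_coefficients(K, y, cutoff):
    # Truncated spectral factorization: keep eigenvalues with |mu_i| > cutoff.
    # f = sum_i <y, u_i> / mu_i * T* u_i, i.e. alpha = sum_i (u_i' y / mu_i) u_i.
    mu, U = np.linalg.eigh(K)
    keep = np.abs(mu) > cutoff
    return U[:, keep] @ ((U[:, keep].T @ y) / mu[keep])

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 80)
y = np.sinc(x) + 0.1 * rng.normal(size=x.size)
d2 = (x[:, None] - x[None, :]) ** 2
K = np.exp(-d2 / 0.8) + np.exp(-d2 / 1.2) - np.exp(-d2 / 10.0)

alpha = tsf_coefficients(K, y, cutoff=1e-3)
f_hat = K @ alpha                # regularized fit evaluated at the training points
```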

5.2. Iterative Methods

Iterative methods can be used to minimize the squared error J(f) := (1/2)‖T f − y‖². Since J(f) is convex, we can perform gradient descent. Since ∇_f J(f) = T*T f − T* y, we have the iterative definition

f_{k+1} = f_k − λ(T*T f_k − T* y),

which results in Landweber-Fridman (LF) iteration (Hanke & Hansen, 1993). The solution subspace in this case is the polynomial subspace

L = span{ (I − λ T*T)^k T* y }   for 1 ≤ k ≤ m.
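In the same coefficient representation the LF iteration reads α_{k+1} = α_k − λ(Kα_k − y), with early stopping acting as the regularizer. A minimal sketch (the step-size rule, iteration count, and data are assumptions):

```python
import numpy as np

def landweber_fridman(K, y, step, n_iter):
    # f_{k+1} = f_k - step (T*T f_k - T* y); with f = sum_j alpha_j k(x_j, .)
    # this is alpha_{k+1} = alpha_k - step (K alpha_k - y).  Stopping after a
    # small number of iterations plays the role of regularization.
    alpha = np.zeros_like(y)
    for _ in range(n_iter):
        alpha = alpha - step * (K @ alpha - y)
    return alpha

rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 80)
y = np.sinc(x) + 0.1 * rng.normal(size=x.size)
d2 = (x[:, None] - x[None, :]) ** 2
K = np.exp(-d2 / 0.8) + np.exp(-d2 / 1.2) - np.exp(-d2 / 10.0)

step = 1.0 / np.abs(np.linalg.eigvalsh(K)).max()   # assumed step-size choice
alpha = landweber_fridman(K, y, step, n_iter=30)
f_hat = K @ alpha
```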

A more efficient method, which utilizes Krylov subspaces, is MR-II (Hanke, 1995). MR-II, which generalizes conjugate gradient methods to indefinite kernels, searches for the minimizer of ‖Kα − y‖ within the Krylov subspace

L = span{ K r_0, K² r_0, ..., K^{k−2} r_0 },

where r_0 = y − K α_0. The algorithm is shown in Figure 1. The convergence proof and regularization behavior can be found in (Hanke, 1995).

r_0 = y − S x_0; r_1 = r_0; x_1 = x_0;
v_{−1} = 0; v_0 = S r_0; w_{−1} = 0; w_0 = S v_0;
β = ‖w_0‖; v_0 = v_0/β; w_0 = w_0/β;
k = 1;
while (not stop) do
    ρ = ⟨r_k, w_{k−1}⟩; α = ⟨w_{k−1}, S w_{k−1}⟩;
    x_{k+1} = x_k + ρ v_{k−1}; r_{k+1} = r_k − ρ w_{k−1};
    v_k = w_{k−1} − α v_{k−1} − β v_{k−2};
    w_k = S w_{k−1} − α w_{k−1} − β w_{k−2};
    β = ‖w_k‖; v_k = v_k/β; w_k = w_k/β;
    k = k + 1;
end while

Figure 1. Algorithm: MR-II. Note that there is only one matrix-vector product in each iteration. Since a matrix-vector product is O(m), the total number of operations is just O(km), where k is the number of iterations.

5.3. Illustration with Toy Problem

We apply TSF, LF, and MR-II to the spline approximation of sinc(x) and cos(exp(x)). The experiments were performed using 100 random restarts. The results using a Gaussian combination kernel are shown in Figure 2. The aim of these experiments is to show that we can solve the regression problem using iterative methods. The three methods perform equally well on the toy data, based on visually inspecting the approximation. TSF requires the explicit computation of the largest eigenvalues, and hence would not be suitable for large problems. LF has been previously shown to have slow convergence (Hanke & Hansen, 1993), requiring a large number of iterations. MR-II has the benefits of being an iterative method and also has faster convergence. The results above required 30 iterations for LF, but only 8 for MR-II.
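For completeness, a direct Python transcription of the recursion in Figure 1 (a sketch only: the fixed iteration budget, the breakdown safeguard, and the toy data are assumptions):

```python
import numpy as np

def mr2(S, y, n_iter):
    # MR-II following Figure 1: S symmetric (possibly indefinite), approximately
    # solve S x = y within the Krylov subspace generated from S r0.
    x = np.zeros_like(y)
    r = y - S @ x
    v_prev, w_prev = np.zeros_like(y), np.zeros_like(y)
    v, w = S @ r, S @ (S @ r)
    beta = np.linalg.norm(w)
    v, w = v / beta, w / beta
    for _ in range(n_iter):
        rho = r @ w
        a = w @ (S @ w)
        x = x + rho * v
        r = r - rho * w
        v_new = w - a * v - beta * v_prev
        w_new = S @ w - a * w - beta * w_prev
        beta = np.linalg.norm(w_new)
        if beta < 1e-12:             # breakdown safeguard (assumed stopping rule)
            break
        v_prev, w_prev = v, w
        v, w = v_new / beta, w_new / beta
    return x

rng = np.random.default_rng(6)
t = np.linspace(-3, 3, 80)
y = np.sinc(t) + 0.1 * rng.normal(size=t.size)
d2 = (t[:, None] - t[None, :]) ** 2
K = np.exp(-d2 / 0.8) + np.exp(-d2 / 1.2) - np.exp(-d2 / 10.0)
alpha = mr2(K, y, n_iter=8)          # 8 iterations, as reported in Section 5.3
f_hat = K @ alpha
```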

6. Conclusion

The aim of this paper is to introduce the concept of an indefinite kernel to the machine learning community. These kernels, which induce an RKKS, exhibit many of the properties of positive definite kernels. Several examples of indefinite kernels are given, along with their spectral properties. Due to the lack of positivity, we stabilize the loss functional instead of minimizing it. We have proved that stabilization provides us with a representer theorem, and also generalization error bounds via the Rademacher average. We discussed regularization with respect to optimizing in Kreĭn spaces, and illustrated the spline smoothing problem on toy datasets.

References

Alpay, D. (2001). The Schur algorithm, reproducing kernel spaces and system theory, vol. 5 of SMF/AMS Texts and Monographs. SMF.

Azizov, T. Y., & Iokhvidov, I. S. (1989). Linear operators in spaces with an indefinite metric. John Wiley & Sons. Translated by E. R. Dawson.

Bognár, J. (1974). Indefinite inner product spaces. Springer Verlag.

Figure 2. Mean and one standard deviation of 100 random experiments to estimate sinc(x) (top row) and cos(exp(x)) (bottom row) using the Gaussian combination with σ1 = 0.8, σ2 = 1.2, σ3 = 10. The left column shows the results using TSF, the middle column using LF, and the right column using MR-II.

Engl, H. W., & Kügler, P. (2003). Nonlinear inverse problems: Theoretical aspects and some industrial applications. Inverse Problems: Computational Methods and Emerging Applications Tutorials, UCLA.

Haasdonk, B. (2003). Feature space interpretation of SVMs with non positive definite kernels. Unpublished.

Hanke, M. (1995). Conjugate gradient type methods for ill-posed problems. Pitman Research Notes in Mathematics Series. Longman Scientific & Technical.

Hanke, M., & Hansen, P. (1993). Regularization methods for large-scale problems. Surveys Math. Ind., 3, 253–315.

Hassibi, B., Sayed, A. H., & Kailath, T. (1999). Indefinite-quadratic estimation and control: A unified approach to H2 and H∞ theories. SIAM.

Lin, H.-T., & Lin, C.-J. (2003). A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. March.

Mary, X. (2003). Hilbertian subspaces, subdualities and applications. Doctoral dissertation, Institut National des Sciences Appliquees Rouen.

Mendelson, S. (2003). A few notes on statistical learning theory. Advanced Lectures in Machine Learning (pp. 1–40). Springer Verlag.

Ong, C. S., & Smola, A. J. (2003). Machine learning with hyperkernels. International Conference on Machine Learning (pp. 568–575).

Rockafellar, R. T. (1996). Convex analysis. Princeton Univ. Pr. Reprint edition.

Schölkopf, B., Herbrich, R., Smola, A., & Williamson, R. (2001). A generalized representer theorem. Proceedings of Computational Learning Theory.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.

Smola, A. J., Ovari, Z. L., & Williamson, R. C. (2000). Regularization with dot-product kernels. NIPS (pp. 308–314).

Vapnik, V. N. (1998). Statistical learning theory. John Wiley & Sons.

Wahba, G. (1990). Spline models for observational data. SIAM.