Learning with Non-Positive Kernels
Cheng Soon Ong cheng.ong@anu.edu.au
Computer Sciences Laboratory, RSISE, Australian National University, 0200 ACT, Australia
Xavier Mary xavier.mary@ensae.fr
ENSAE-CREST-LS, 3 avenue Pierre Larousse, 92240 Malakoff, France
Stéphane Canu scanu@insa-rouen.fr
Laboratoire PSI FRE CNRS 2645 - INSA de Rouen, B.P. 08, 76131 Mont-Saint-Aignan Cedex, France
Alexander J. Smola alex.smola@anu.edu.au
RSISE and NICTA Australia, Australian National University, 0200 ACT, Australia
Indefinite Kernels, Reproducing Kernel Kreĭn Space, Representer Theorem, Rademacher Average, Non-convex Optimization, Ill-posed Problems
Abstract
In this paper we show that many kernel methods can be adapted to deal with indefinite kernels, that is, kernels which are not positive semidefinite. They do not satisfy Mercer's condition and they induce associated functional spaces called Reproducing Kernel Kreĭn Spaces (RKKS), a generalization of Reproducing Kernel Hilbert Spaces (RKHS). Machine learning in RKKS shares many "nice" properties of learning in RKHS, such as orthogonality and projection. However, since the kernels are indefinite, we can no longer minimize the loss; instead we stabilize it. We show a general representer theorem for constrained stabilization and prove generalization bounds by computing the Rademacher averages of the kernel class. We list several examples of indefinite kernels and investigate regularization methods to solve spline interpolation. Some preliminary experiments with indefinite kernels for spline smoothing are reported for truncated spectral factorization, Landweber-Fridman iterations, and MR-II.
Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
1. Why Non-Positive Kernels?
Almost all current research on kernel methods in machine learning focuses on functions k(x, x') which are positive semidefinite. That is, it focuses on kernels which satisfy Mercer's condition and which consequently can be seen as scalar products in some Hilbert space. See (Vapnik, 1998; Schölkopf & Smola, 2002; Wahba, 1990) for details.
The purpose of this article is to point out that there is a much larger class of kernel functions available, which do not necessarily correspond to an RKHS but which nonetheless can be used for machine learning. Such kernels are known as indefinite kernels, as the scalar product matrix may contain a mix of positive and negative eigenvalues. There are several motivations for studying indefinite kernels:
• Testing Mercer's condition for a given kernel can be a challenging task which may well lie beyond the abilities of a practitioner.
• Sometimes functions which can be proven not to satisfy Mercer's condition may be of other interest. One such instance is the hyperbolic tangent kernel k(x, x') = tanh(⟨x, x'⟩ − 1) of Neural Networks, which is indefinite for any range of parameters or dimensions (Smola et al., 2000).
• There have been promising empirical reports on the use of indefinite kernels (Lin & Lin, 2003).
• In H∞ control applications and discrimination the cost function can be formulated as the difference between two quadratic norms (Haasdonk, 2003; Hassibi et al., 1999), corresponding to an indefinite inner product.
• RKKS theory (concerning function spaces arising from indefinite kernels) has become a rather active area in interpolation and approximation theory.
• In recent work on learning the kernel, such as (Ong & Smola, 2003), the solution is a linear combination of positive semidefinite kernels. However, an arbitrary linear combination of positive kernels is not necessarily positive semidefinite (Mary, 2003); see the numerical check following this list. While the elements of the associated vector space of kernels can always be defined as the difference between two positive kernels, what is the functional space associated with such a kernel?
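The following minimal numerical sketch (Python/NumPy; the grid and bandwidths are arbitrary choices made only for illustration) checks the last point: the Gram matrix of the difference of two Gaussian kernels already has eigenvalues of both signs.

import numpy as np

# Difference of two positive (Gaussian) kernels: k = k_plus - k_minus.
def k_plus(s, t, sigma=0.5):
    return np.exp(-(s - t) ** 2 / sigma)

def k_minus(s, t, sigma=5.0):
    return np.exp(-(s - t) ** 2 / sigma)

x = np.linspace(-2.0, 2.0, 50)              # uniformly spaced sample points
S, T = np.meshgrid(x, x, indexing="ij")
K = k_plus(S, T) - k_minus(S, T)            # symmetric, but not PSD in general

eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals[0])   # negative -> indefinite Gram matrix
print("largest  eigenvalue:", eigvals[-1])

Because the diagonal of this particular Gram matrix vanishes, its eigenvalues sum to zero, so negative eigenvalues must appear whenever the matrix is nonzero.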
We will discuss the above issues using topological spaces similar to Hilbert spaces except for the fact that the inner product is no longer necessarily positive. Section 2 defines RKKS and some properties required in the subsequent derivations. We also give some examples of indefinite kernels and describe their spectrum. Section 3 extends Rademacher-type generalization error bounds to learning with indefinite kernels. Section 4 shows that we can obtain a theorem similar to the representer theorem in RKHS. However, we note that there may be practical problems. Section 5 describes how we can approximate the interpolation problem using the spectrum of the kernel as well as iterative methods. It also shows preliminary results on spline regularization.
2. Reproducing Kernel Kreĭn Spaces
Kreĭn spaces are indefinite inner product spaces endowed with a Hilbertian topology, yet their inner product is no longer positive. Before we delve into definitions and state basic properties of Kreĭn spaces, we give an example:
Example 1 (4 dimensional space-time)
Indefinite spaces were first introduced by Minkowski for the solution of problems in special relativity. There the inner product in space-time (x, y, z, t) is given by
    ⟨(x, y, z, t), (x', y', z', t')⟩ = xx' + yy' + zz' − tt'.
Clearly it is not positive. The vector v = (1, 1, 1, √3) belongs to the cone of so-called neutral vectors which satisfy ⟨v, v⟩ = 0 (in coordinates x² + y² + z² − t² = 0). In special relativity this cone is also called the "light cone," as it corresponds to the propagation of light from a point event.
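A minimal numerical restatement of Example 1 (the helper function is only an illustration):

import numpy as np

def minkowski(u, v):
    # Space-time inner product <u, v> = x x' + y y' + z z' - t t'.
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2] - u[3] * v[3]

v = np.array([1.0, 1.0, 1.0, np.sqrt(3.0)])
print(minkowski(v, v))                                            # 0.0: v lies on the light cone
print(minkowski(np.array([1.0, 0, 0, 0]), np.array([1.0, 0, 0, 0])))  # +1 (positive direction)
print(minkowski(np.array([0, 0, 0, 1.0]), np.array([0, 0, 0, 1.0])))  # -1 (negative direction)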
2.1. Kreĭn spaces
The above example shows that there are several differences between Kreĭn spaces and Hilbert spaces. We now define Kreĭn spaces formally. More detailed expositions can be found in (Bognár, 1974; Azizov & Iokhvidov, 1989). The key difference is the fact that the inner products are indefinite.
Definition 1 (Inner product) Let K be a vector space on the scalar field.¹ An inner product ⟨·, ·⟩_K on K is a bilinear form where for all f, g, h ∈ K, α ∈ R:
• ⟨f, g⟩_K = ⟨g, f⟩_K
• ⟨αf + g, h⟩_K = α⟨f, h⟩_K + ⟨g, h⟩_K
• ⟨f, g⟩_K = 0 for all g ∈ K implies f = 0
An inner product is said to be positive if for all f ∈ K we have ⟨f, f⟩_K ≥ 0. It is negative if for all f ∈ K we have ⟨f, f⟩_K ≤ 0. Otherwise it is called indefinite.
A vector space K endowed with the inner product ⟨·, ·⟩_K is called an inner product space. Two vectors f, g of an inner product space are said to be orthogonal if ⟨f, g⟩_K = 0. Given an inner product, we can define the associated space.
Definition 2 (Kreĭn space) An inner product space (K, ⟨·, ·⟩_K) is a Kreĭn space if there exist two Hilbert spaces H+, H− spanning K such that
• All f ∈ K can be decomposed into f = f+ + f−, where f+ ∈ H+ and f− ∈ H−.
• For all f, g ∈ K, ⟨f, g⟩_K = ⟨f+, g+⟩_{H+} − ⟨f−, g−⟩_{H−}.
This suggests that there is an "associated" Hilbert space, where the difference in scalar products is replaced by a sum:
Definition 3 (Associated Hilbert Space) Let K be a Kreĭn space with decomposition into Hilbert spaces H+ and H−. Then we denote by K̄ the associated Hilbert space defined by
    K̄ = H+ ⊕ H−, hence ⟨f, g⟩_K̄ = ⟨f+, g+⟩_{H+} + ⟨f−, g−⟩_{H−}.
Likewise we can introduce the symbol ⊖ to indicate that
    K = H+ ⊖ H−, hence ⟨f, g⟩_K = ⟨f+, g+⟩_{H+} − ⟨f−, g−⟩_{H−}.
Note that K̄ is the smallest Hilbert space majorizing the Kreĭn space K, and one defines the strong topology on K as the Hilbertian topology of K̄. The topology does not depend on the decomposition chosen. Clearly |⟨f, f⟩_K| ≤ ‖f‖²_K̄ for all f ∈ K.
¹ Like Hilbert spaces, Kreĭn spaces can be defined on R or C. We use R in this paper.
K is said to be a Pontryagin space if it admits a decomposition with finite dimensional H−, and a Minkowski space if K itself is finite dimensional. We will see how Pontryagin spaces arise naturally when dealing with conditionally positive definite kernels (see Section 2.4).
For estimation we need to introduce Kreĭn spaces of functions. Let X be the learning domain, and R^X the set of functions from X to R. The evaluation functional tells us the value of a function at a certain point, and we shall see that an RKKS is a subset of R^X on which this functional is continuous.
Definition 4 (Evaluation functional) T_x : K → R, where f ↦ T_x f = f(x).
Definition 5 (RKKS) A Kreĭn space (K, ⟨·, ·⟩_K) is a Reproducing Kernel Kreĭn Space (Alpay, 2001, Chapter 7) if K ⊂ R^X and the evaluation functional is continuous on K endowed with its strong topology (that is, via K̄).
2.2. From Kreĭn spaces to Kernels
We prove an analog of the Moore-Aronszajn theorem (Wahba, 1990), which tells us that for every kernel there is an associated Kreĭn space, and for every RKKS there is a unique kernel.
Proposition 6 (Reproducing Kernel) Let K be an RKKS with K = H+ ⊖ H−. Then
1. H+ and H− are RKHS (with kernels k+ and k−),
2. there is a unique symmetric k(x, x') with k(x, ·) ∈ K such that for all f ∈ K, ⟨f, k(x, ·)⟩_K = f(x),
3. k = k+ − k−.
Proof Since K is an RKKS, the evaluation functional is continuous with respect to the strong topology. Hence the associated Hilbert space K̄ is an RKHS. It follows that H+ and H−, as Hilbertian subspaces of an RKHS, are RKHS themselves with kernels k+ and k− respectively. Let f = f+ + f−. Then T_x(f) is given by
    T_x(f) = T_x(f+) + T_x(f−)
           = ⟨f+, k+(x, ·)⟩_{H+} − ⟨f−, −k−(x, ·)⟩_{H−}
           = ⟨f, k+(x, ·) − k−(x, ·)⟩_K.
In both lines we exploited the orthogonality of H+ with H−. Clearly k := k+ − k− is symmetric. Moreover it is unique since the inner product is non-degenerate.
2.3. From Kernels to Kreĭn spaces
Let k be a symmetric real-valued function on X².
Proposition 7 The following are equivalent (Mary, 2003, Theorem 2.28):
• There exists (at least) one RKKS with kernel k.
• k admits a positive decomposition, that is, there exist two positive kernels k+ and k− such that k = k+ − k−.
• k is dominated by some positive kernel p (that is, p − k is a positive kernel).
There is no bijection but a surjection between the set of RKKS and the set of generalized kernels defined in the vector space generated out of the cone of positive kernels.
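At the level of a finite Gram matrix, one positive decomposition in the spirit of Proposition 7 can always be read off the eigendecomposition: splitting by the sign of the eigenvalues gives K = K_pos − K_neg with both parts positive semidefinite. The sketch below illustrates this for the Gaussian combination kernel of Table 1 (the grid, parameters, and helper names are illustrative assumptions only).

import numpy as np

def gaussian_combination(s, t, s1=0.8, s2=1.2, s3=10.0):
    r2 = (s - t) ** 2
    return np.exp(-r2 / s1) + np.exp(-r2 / s2) - np.exp(-r2 / s3)

x = np.linspace(-2.0, 2.0, 40)
K = gaussian_combination(x[:, None], x[None, :])

# Split K by the sign of its eigenvalues: K = K_pos - K_neg with both PSD.
mu, U = np.linalg.eigh(K)
K_pos = (U * np.clip(mu, 0, None)) @ U.T
K_neg = (U * np.clip(-mu, 0, None)) @ U.T

print(np.allclose(K, K_pos - K_neg))               # True
print(np.linalg.eigvalsh(K_pos).min() >= -1e-10)   # K_pos is PSD (up to rounding)
print(np.linalg.eigvalsh(K_neg).min() >= -1e-10)   # K_neg is PSD (up to rounding)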
2.4. Examples and Spectral Properties
We collect several examples of indefinite kernels in Table 1 and plot a 2-dimensional example as well as the 20 eigenvalues with the largest absolute value. We investigate the spectrum of radial kernels using the Hankel transform.
The Fourier transform allows one to find the eigenvalue decomposition of kernels of the form k(x, x') = κ(x − x') by computing the Fourier transform of κ. For x ∈ R^n we have
    F[κ](‖ω‖) = ‖ω‖^{−ν} H_ν[r^ν κ(r)](‖ω‖),
where ν = n/2 − 1 and H_ν is the Hankel transform of order ν. Table 1 depicts the spectra of these kernels. Negative values in the Hankel transform correspond to H−, positive ones to H+. Likewise the decomposition of k(x, x') = κ(⟨x, x'⟩) in terms of associated Legendre polynomials allows one to identify the positive and negative parts of the Kreĭn space, as the Legendre polynomials commute with the rotation group.
One common class of translation invariant kernels which are not positive definite are the so-called conditionally positive definite (cpd) kernels. A cpd kernel of order p leads to a positive semidefinite matrix in a subspace of coefficients orthogonal to polynomials of order up to p − 1. Moreover, in the subspace of (p − 1) degree polynomials, the inner product is typically negative definite. This means that there is a space of polynomials of degree up to order p − 1 (which constitutes an up to (n+p−2 choose p−1)-dimensional subspace) with negative inner product. In other words, we are dealing with a Pontryagin space.
The standard procedure to use such kernels is to project out the negative component, replace the latter by a suitably smoothed estimate in the polynomial subspace, and treat the remaining subspace as any RKHS (Wahba, 1990). Using Kreĭn spaces we can use these kernels directly, without the need to deal with the polynomial parts separately.

Table 1. Examples of indefinite kernels. Column 2 shows the 2D surface of the kernel with respect to the origin, column 3 shows plots of the 20 eigenvalues with largest magnitude on uniformly spaced data from the interval [−2, 2], column 4 shows plots of the Fourier spectra (plots not reproduced here; the kernel formulas are listed below).
    Epanechnikov kernel:   (1 − ‖s − t‖²/σ)^p,  for ‖s − t‖²/σ ≤ 1
    Gaussian combination:  exp(−‖s − t‖²/σ_1) + exp(−‖s − t‖²/σ_2) − exp(−‖s − t‖²/σ_3)
    Multiquadric kernel:   sqrt(‖s − t‖²/σ + c²)
    Thin plate spline:     (‖s − t‖/σ)^{2p} · ln(‖s − t‖²/σ)
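For n = 1 (where ν = −1/2 and the Hankel transform reduces, up to constants, to the cosine transform of κ), the spectrum of the Gaussian combination in Table 1 can be checked in closed form, since the Fourier transform of exp(−r²/σ) on the real line is sqrt(πσ)·exp(−σω²/4). The short sketch below uses σ_1 = 0.8, σ_2 = 1.2, σ_3 = 10, the values used later in Section 5.3; everything else is an arbitrary illustration. Both signs occur in the spectrum, so both H+ and H− are nontrivial.

import numpy as np

def ft_gaussian(omega, sigma):
    # Fourier transform of exp(-r**2 / sigma) on the real line.
    return np.sqrt(np.pi * sigma) * np.exp(-sigma * omega ** 2 / 4.0)

def ft_gaussian_combination(omega, s1=0.8, s2=1.2, s3=10.0):
    return ft_gaussian(omega, s1) + ft_gaussian(omega, s2) - ft_gaussian(omega, s3)

omega = np.linspace(0.0, 10.0, 2001)
spec = ft_gaussian_combination(omega)
print("spectrum at omega = 0:", spec[0])                 # negative: contributes to H-
print("fraction of grid with positive spectrum:", np.mean(spec > 0))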
3. Generalization Bounds via Rademacher Average
An important issue regarding learning algorithms is their ability to generalize (to give relevant predictions). This property is obtained when the learning process considered shows a uniform convergence behavior. In (Mendelson, 2003) such a result is demonstrated in the case of RKHS through the control of the Rademacher average of the class of functions considered. Here we present an adaptation of this proof to the case of Kreĭn spaces. We begin by setting up the functional framework for the result.
Let k be a kernel defined on a set X and choose a decomposition k = k+ − k−, where k+ and k− are both positive kernels. This given decomposition of the kernel can be associated with the RKHS K̄ defined by its positive kernel k̄ = k+ + k−, whose Hilbertian topology defines the strong topology of K. We will then consider the set B_K defined as follows:
    B_K = { f ∈ K : ‖f+‖² + ‖f−‖² = ‖f‖² ≤ 1 }.
Note that in a Kreĭn space the norm of a function is the associated Hilbertian norm; usually ‖f‖² ≠ ⟨f, f⟩_K, but always ⟨f, f⟩_K ≤ ‖f‖².
The Rademacher average of a class of functions F with respect to a measure µ is defined as follows. Let x_1, …, x_m ∈ X be i.i.d. random variables sampled according to µ. Let ε_i for i = 1, …, m be Rademacher random variables, that is, variables taking values in {−1, +1} with equal probability.
Definition 8 (Rademacher Average) The Rademacher average R_m(F) of a set of functions F (w.r.t. µ) is defined as
    R_m(F) = E_µ E_ε [ (1/m) sup_{f ∈ F} Σ_{i=1}^m ε_i f(x_i) ].
Using the Rademacher average as an estimate of the "size" of a function class, we can obtain generalization error bounds, which are also called uniform convergence or sample complexity bounds (Mendelson, 2003, Corollary 3): for any ε > 0 and δ > 0, there is an absolute constant C such that if m ≥ (C/ε²) max{R_m²(J(B_K)), log(1/δ)}, then
    Pr[ sup_{f ∈ B_K} | (1/m) Σ_{i=1}^m J(f(X_i)) − E J(f) | ≥ ε ] ≤ δ,
where J(f(x)) denotes the quadratic loss defined as in (Mendelson, 2003). To get the expected result we have to show that the Rademacher average is bounded by a constant independent of the sample size m. To control the Rademacher average, we first give a lemma regarding the topology of Kreĭn spaces, putting emphasis on both the difference from and the close relationship with the Hilbertian case.
Lemma 9 For all g ∈ K:
    sup_{f ∈ B_K} ⟨f(·), g(·)⟩_K = ‖g‖.
Proof It is trivial if g = 0. For g ∈ K, g ≠ 0, let h = g/‖g‖. By construction ‖h‖ = 1. Then
    sup_{f ∈ B_K} ⟨f(·), g(·)⟩_K = ‖g‖ sup_{f ∈ B_K} ⟨f(·), h(·)⟩_K
                                 = ‖g‖ sup_{f ∈ B_K} ( ⟨f+, h+⟩_{K+} − ⟨f−, h−⟩_{K−} )
                                 = ‖g‖ ( ⟨h+, h+⟩_{K+} + ⟨h−, h−⟩_{K−} )
                                 = ‖g‖.
In the unit ball of an RKKS, the Rademacher average with respect to the probability measure µ behaves the same way as that of its associated RKHS.
Proposition 10 (Rademacher Average) Let K̄ be the Gram matrix of the kernel k̄ at the points x_1, …, x_m. If, according to the measure µ on X, x ↦ k̄(x, x) ∈ L1(X, µ), then
    R_m(B_K) ≤ M^{1/2},  with  M = (1/m) E_µ tr K̄ = ∫_X k̄(x, x) dµ(x).
The proof works just as in the Hilbertian case (Mendelson, 2003, Theorem 16) with the application of Lemma 9. As a second slight difference, we choose to express the bound as a function of the L1(X, µ) norm of the kernel instead of going through its spectral representation. This is simpler since, for instance, for the unnormalized Gaussian kernel k(x, y) = exp(−(x − y)²) on X = R we have M = 1 regardless of the measure µ considered. Since we are back in the Hilbertian context, (Mendelson, 2003, Corollary 4) applies with Hilbert replaced by Kreĭn, providing a uniform convergence result as expected.
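Lemma 9 makes the Rademacher average of B_K computable on a given sample: sup_{f ∈ B_K} Σ_i ε_i f(x_i) = ‖Σ_i ε_i k(x_i, ·)‖, and the squared associated-Hilbertian norm of Σ_i ε_i k(x_i, ·) equals εᵀ K̄ ε, where K̄ is the Gram matrix of k̄ = k+ + k−. The Monte Carlo sketch below (the sample, the decomposition k+ = exp(−r²/σ_1) + exp(−r²/σ_2), k− = exp(−r²/σ_3), and the number of Rademacher draws are arbitrary choices) compares this empirical quantity with the bound M^{1/2} of Proposition 10.

import numpy as np

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-2.0, 2.0, size=m)
r2 = (x[:, None] - x[None, :]) ** 2

# A positive decomposition k = k_plus - k_minus of the Gaussian combination kernel.
K_plus  = np.exp(-r2 / 0.8) + np.exp(-r2 / 1.2)
K_minus = np.exp(-r2 / 10.0)
K_bar = K_plus + K_minus          # Gram matrix of the associated positive kernel k_bar

# Empirical Rademacher average of B_K via Lemma 9:
#   (1/m) E_eps sup_{f in B_K} sum_i eps_i f(x_i) = (1/m) E_eps sqrt(eps' K_bar eps).
eps = rng.choice([-1.0, 1.0], size=(5000, m))
rademacher = np.mean(np.sqrt(np.einsum("ri,ij,rj->r", eps, K_bar, eps))) / m

M = np.trace(K_bar) / m           # empirical version of M = integral of k_bar(x, x) dmu
print("empirical Rademacher average:", rademacher)
print("bound M**0.5 from Proposition 10:", np.sqrt(M))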
4. Machine Learning in RKKS
In order to perform machine learning, we need to be able to optimize over a class of functions, and also to be able to prove that the solution exists and is unique. Instead of minimizing over a class of functionals as in an RKHS, we look for the stationary point. This is motivated by the fact that in an RKHS, minimization of the cost functional can be seen as a projection problem. The equivalent projection problem in an RKKS gives us the stationary point of the cost functional.
4.1. Representer Theorem
The analysis of the learning problem in an RKKS gives representer theorems similar to the Hilbertian case (Schölkopf et al., 2001). The key difference is that the problem of minimizing a regularized risk functional becomes one of finding the stationary point of a similar functional. Moreover, the solution need not be unique any more. The proof technique, however, is rather similar. The main differences are that a) we deal with a constrained optimization problem directly and b) the Gateaux derivative has to vanish due to the nondegeneracy of the inner product. In the following, we denote the training data by X := (x_1, …, x_m), drawn from the learning domain.
Theorem 11 Let K be an RKKS with kernel k. Denote by L{f, X} a continuous convex loss functional depending on f ∈ K only via its evaluations f(x_i) with x_i ∈ X, let Ω(⟨f, f⟩) be a continuous stabilizer with strictly monotonic Ω : R → R, and let C{f, X} be a continuous functional imposing a set of constraints on f, that is, C : K × X^m → R^n. Then if the optimization problem
    stabilize_{f ∈ K}  L{f, X} + Ω(⟨f, f⟩_K)                    (1)
    subject to         C{f, X} ≤ d
has a saddle point f, it admits the expansion
    f = Σ_i α_i k(x_i, ·),  where x_i ∈ X and α_i ∈ R.          (2)
Proof The first order conditions for a solution of (1) imply that the Gateaux derivative of the Lagrangian
    L{f, λ} = L{f, X} + Ω(⟨f, f⟩_K) + λᵀ(C{f, X} − d)
needs to vanish. By the nondegeneracy of the inner product, ⟨f, g⟩_K = 0 for all g ∈ K implies f = 0. Next observe that the functional subdifferential of L{f, λ} with respect to f satisfies (Rockafellar, 1996)
    ∂_f L{f, λ} = Σ_{i=1}^m ∂_{f(x_i)} [L{f, X} + λᵀ C{f, X}] k(x_i, ·) + 2 f ∂_{⟨f,f⟩} Ω(⟨f, f⟩_K).    (3)
Here ∂ is understood to be the subdifferential with respect to the argument wherever the function is not differentiable; since C and Ω only depend on f(x_i), the subdifferential always exists with respect to [f(x_1), …, f(x_m)]ᵀ. Since for stationarity the variational derivative needs to vanish, we have 0 ∈ ∂_f L{f, λ} and consequently f = Σ_i α_i k(x_i, ·) for some α_i ∈ ∂_{f(x_i)} [L{f, X} + λᵀ C{f, X}]. This proves the claim.
Theorem 12 (Semiparametric Extension) The same result holds if the optimization is carried out over f + g, where f ∈ K and g is a parametric addition to f. Again f lies in the span of k(x_i, ·).
Proof [sketch only] In the Lagrange function the partial derivative with respect to f needs to vanish, just as in (3). This is only possible if f is contained in the span of the kernel functions on the data.
4.2. Application to general spline smoothing
We consider the general spline smoothing problem as presented in (Wahba, 1990), except that we are considering Kreĭn spaces. The general spline smoothing problem is defined as finding the function stabilizing (that is, finding the stationary point of) the following criterion:
    J_m(f) = (1/m) Σ_{i=1}^m (y_i − f(x_i))² + λ ⟨f, f⟩_K.       (4)
The form of the solution of equation (4) is given by the representer theorem, which says that the solution (if it exists) is the solution of the linear equation
    (K + λI) α = y,
where K_ij = k(x_i, x_j) is the Gram matrix.
The general spline smoothing problem can be viewed as applying Tikhonov regularization to the interpolation problem. However, since the matrix K is indefinite, it may have negative eigenvalues. For values of the regularization parameter λ such that −λ is a (negative) eigenvalue of the Gram matrix K, the matrix (K + λI) is singular. Note that when K is positive semidefinite this does not occur for any λ > 0. Hence, solving the Tikhonov regularization problem directly may not be successful. Instead, we use the subspace expansion from Theorem 11 directly.
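A small sketch of the smoothing system (4) in coefficient form (the data, kernel parameters, and λ values below are arbitrary illustrations, not the experiment of Section 5.3): for a generic λ > 0 the system (K + λI)α = y can be solved directly, but when −λ coincides with a negative eigenvalue of K the system becomes (numerically) singular.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 60)
y = np.sinc(x) + 0.05 * rng.standard_normal(x.size)

r2 = (x[:, None] - x[None, :]) ** 2
K = np.exp(-r2 / 0.8) + np.exp(-r2 / 1.2) - np.exp(-r2 / 10.0)   # indefinite Gram matrix

mu = np.linalg.eigvalsh(K)
print("most negative eigenvalue of K:", mu[0])

lam = 0.1                                    # a generic regularization parameter
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)
fit = K @ alpha                              # fitted values f(x_i) = sum_j alpha_j k(x_j, x_i)
print("training RMSE at lambda = 0.1:", np.sqrt(np.mean((fit - y) ** 2)))

# If lambda equals the magnitude of a negative eigenvalue, K + lam*I is singular:
lam_bad = -mu[0]
print("condition number at a critical lambda:",
      np.linalg.cond(K + lam_bad * np.eye(x.size)))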
5. Algorithms for Kreĭn Space Regularization
Tikhonov regularization restricts the solution of the interpolation error (1/m) Σ_{i=1}^m (y_i − f(x_i))² to a ball with respect to the stabilizer ⟨f, f⟩_K; hence, it projects the solution of the equation onto this ball. To avoid the problems of a singular (K + λI), the approach we take here is to set λ = 0 and to find an approximation to the solution in a small subspace of the possible solution space. That is, we solve the following optimization problem:
    stabilize_{f ∈ K}  (1/m) Σ_{i=1}^m (y_i − f(x_i))²
    subject to         f ∈ L ⊂ span{k(x_i, ·) : i = 1, …, m}.    (5)
We describe several different ways of choosing the subspace L. Defining T : K → R^m to be the evaluation functional (Definition 4) on the training data, we can express the interpolation problem f(x_i) = y_i, given the training data (x_1, y_1), …, (x_m, y_m) ∈ (X × R)^m, as the linear system T f = y, where f ∈ K and K is an RKKS. Define T* : R^m → K to be the adjoint operator of T, such that ⟨T f, y⟩ = ⟨f, T* y⟩. Note that since T operates on elements of a Kreĭn space, T T* = K is indefinite.
5.1. Truncated Spectral Factorization
We perform regularization by controlling the spectrum of K. We can obtain the eigenvalue decomposition of K, K u_i = μ_i u_i, where u_1, …, u_m are the orthonormal eigenvectors of K and μ_i are the associated nonzero eigenvalues (assume K is regular). Let v_i = (sign(μ_i)/√|μ_i|) T* u_i; then the v_i are the orthogonal eigenvectors of T* T. The solution of T f = y (if it exists) is given by
    f = Σ_{i=1}^m (⟨y, u_i⟩ / μ_i) T* u_i.
Intuitively, we associate eigenvalues with large absolute values with the underlying function, while eigenvalues close to zero correspond to signal noise. The Truncated Spectral Factorization (TSF) (Engl & Kügler, 2003) method is obtained by setting all the eigenvalues of small magnitude to zero. This means that the solution is in the subspace
    L = span{T* u_i : |μ_i| > λ}.
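In coefficient form (since T* c = Σ_j c_j k(x_j, ·), the expansion f = Σ_{|μ_i|>λ} (⟨y, u_i⟩/μ_i) T* u_i has coefficient vector α = Σ_{|μ_i|>λ} (⟨y, u_i⟩/μ_i) u_i and T f = Kα), TSF can be sketched as follows; the threshold, data, and kernel parameters are arbitrary illustrations.

import numpy as np

def tsf_coefficients(K, y, lam):
    # Truncated spectral factorization: keep only eigen-directions with |mu_i| > lam.
    mu, U = np.linalg.eigh(K)                  # K u_i = mu_i u_i
    keep = np.abs(mu) > lam                    # small-magnitude eigenvalues ~ noise
    # alpha = sum over kept i of (<y, u_i> / mu_i) u_i
    return U[:, keep] @ ((U[:, keep].T @ y) / mu[keep])

# Toy usage with the indefinite Gaussian-combination Gram matrix:
rng = np.random.default_rng(2)
x = np.linspace(-2.0, 2.0, 60)
y = np.sinc(x) + 0.05 * rng.standard_normal(x.size)
r2 = (x[:, None] - x[None, :]) ** 2
K = np.exp(-r2 / 0.8) + np.exp(-r2 / 1.2) - np.exp(-r2 / 10.0)

alpha = tsf_coefficients(K, y, lam=0.05)
print("training RMSE:", np.sqrt(np.mean((K @ alpha - y) ** 2)))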
5.2. Iterative Methods
Iterative methods can be used to minimize the squared error J(f) := (1/2) ‖T f − y‖². Since J(f) is convex, we can perform gradient descent. Since ∇_f J(f) = T*T f − T*y, we have the iterative definition f_{k+1} = f_k − λ(T*T f_k − T*y), which results in Landweber-Fridman (LF) iteration (Hanke & Hansen, 1993). The solution subspace in this case is the polynomial subspace
    L = span{(I − λ T*T)^k T*y : 1 ≤ k ≤ m}.
A more efficient method, which utilizes Krylov subspaces, is MR-II (Hanke, 1995). MR-II, which generalizes conjugate gradient methods to indefinite kernels, searches for the minimizer of the residual ‖Kα − y‖ within the Krylov subspace
    L = span{K r_0, K² r_0, …, K^{k−2} r_0},
where r_0 = y − K α_0 is the initial residual. The algorithm is shown in Figure 1. The convergence proof and regularization behavior can be found in (Hanke, 1995).
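In coefficient form (again using T* c = Σ_j c_j k(x_j, ·)), the LF update f_{k+1} = f_k − λ(T*T f_k − T*y) reads α_{k+1} = α_k − λ(K α_k − y). The sketch below is an illustration only; the step size and the stopping index, which together play the role of the regularization parameter, are hand-picked assumptions.

import numpy as np

def landweber_fridman(K, y, step, iterations):
    # LF iteration in coefficient form: alpha <- alpha - step * (K @ alpha - y).
    # On an indefinite K, residual components along negative eigenvalues of K are
    # amplified by this update, so the iteration has to be stopped early.
    alpha = np.zeros_like(y)
    for _ in range(iterations):
        alpha = alpha - step * (K @ alpha - y)
    return alpha

# Sanity check on a tiny positive definite system (exact solution alpha = [1, 1]):
K_toy = np.array([[2.0, 0.0], [0.0, 1.0]])
print(landweber_fridman(K_toy, np.array([2.0, 1.0]), step=0.2, iterations=200))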
5.3. Illustration with Toy Problem
We apply TSF, LF, and MR-II to the spline approximation of sinc(x) and cos(exp(x)). The experiments were performed using 100 random restarts. The results using a Gaussian combination kernel are shown in Figure 2. The aim of these experiments is to show that we can solve the regression problem using iterative methods. The three methods perform equally well on the toy data, based on visual inspection of the approximation. TSF requires the explicit computation of the largest eigenvalues, and hence would not be suitable for large problems. LF has previously been shown to have slow convergence (Hanke & Hansen, 1993), requiring a large number of iterations. MR-II has the benefits of being an iterative method and also has faster convergence. These results required 30 iterations for LF, but only 8 for MR-II.
r_0 = y − S x_0;  r_1 = r_0;  x_1 = x_0;
v_{-1} = 0;  v_0 = S r_0;  w_{-1} = 0;  w_0 = S v_0;
β = ‖w_0‖;  v_0 = v_0/β;  w_0 = w_0/β;
k = 1;
while (not stop) do
    ϱ = ⟨r_k, w_{k-1}⟩;  α = ⟨w_{k-1}, S w_{k-1}⟩;
    x_{k+1} = x_k + ϱ v_{k-1};  r_{k+1} = r_k − ϱ w_{k-1};
    v_k = w_{k-1} − α v_{k-1} − β v_{k-2};
    w_k = S w_{k-1} − α w_{k-1} − β w_{k-2};
    β = ‖w_k‖;  v_k = v_k/β;  w_k = w_k/β;
    k = k + 1;
end while
Figure 1. Algorithm: MR-II. Note that there is only one matrix-vector product in each iteration. Since a matrix-vector product is O(m²), the total number of operations is just O(km²), where k is the number of iterations.
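A direct transcription of Figure 1 into Python/NumPy follows (a sketch only: the fixed iteration count, the tolerance, and the final self-test are assumptions, not part of the algorithm as stated).

import numpy as np

def mr2(S, y, x0=None, max_iter=8, tol=1e-10):
    # MR-II for a symmetric (possibly indefinite) matrix S, following Figure 1.
    m = y.shape[0]
    x = np.zeros(m) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = y - S @ x                        # r_0 = y - S x_0  (and r_1 = r_0, x_1 = x_0)
    v_prev, v = np.zeros(m), S @ r       # v_{-1} = 0, v_0 = S r_0
    w_prev, w = np.zeros(m), S @ v       # w_{-1} = 0, w_0 = S v_0
    beta = np.linalg.norm(w)
    if beta == 0.0:                      # x0 already solves S x = y
        return x, r
    v, w = v / beta, w / beta
    for _ in range(max_iter):
        Sw = S @ w                       # the single matrix-vector product per iteration
        rho = r @ w                      # <r_k, w_{k-1}>
        alpha = w @ Sw                   # <w_{k-1}, S w_{k-1}>
        x = x + rho * v                  # x_{k+1} = x_k + rho v_{k-1}
        r = r - rho * w                  # r_{k+1} = r_k - rho w_{k-1}
        v_new = w - alpha * v - beta * v_prev
        w_new = Sw - alpha * w - beta * w_prev
        beta = np.linalg.norm(w_new)
        if beta < tol:                   # Krylov subspace exhausted
            break
        v_prev, v = v, v_new / beta
        w_prev, w = w, w_new / beta
    return x, r

# Quick check on a small random symmetric indefinite system:
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)); A = (A + A.T) / 2
b = rng.standard_normal(8)
x_sol, r_final = mr2(A, b, max_iter=8)
print(np.linalg.norm(A @ x_sol - b))     # small residual after a full Krylov sweep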
Figure 2. Mean and one standard deviation of 100 random experiments to estimate sinc(x) (top row) and cos(exp(x)) (bottom row) using the Gaussian combination with σ_1 = 0.8, σ_2 = 1.2, σ_3 = 10. The left column shows the results using TSF, the middle column using LF, and the right column using MR-II.
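A short driver in the spirit of Figure 2 can be assembled from the sketches above (it reuses the hypothetical helpers tsf_coefficients, landweber_fridman, and mr2 defined there; sample size, noise level, and all tuning constants are arbitrary and not the paper's settings).

import numpy as np
# Assumes tsf_coefficients, landweber_fridman and mr2 from the earlier sketches.

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-2.0, 2.0, 100))
y = np.sinc(x) + 0.1 * rng.standard_normal(x.size)

r2 = (x[:, None] - x[None, :]) ** 2
K = np.exp(-r2 / 0.8) + np.exp(-r2 / 1.2) - np.exp(-r2 / 10.0)   # sigma_1, sigma_2, sigma_3

def rmse(alpha):
    return np.sqrt(np.mean((K @ alpha - y) ** 2))

print("TSF  :", rmse(tsf_coefficients(K, y, lam=0.05)))
print("LF   :", rmse(landweber_fridman(K, y, step=0.005, iterations=10)))
print("MR-II:", rmse(mr2(K, y, max_iter=8)[0]))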
6. Conclusion
The aim of this paper is to introduce the concept of an indefinite kernel to the machine learning community. These kernels, which induce an RKKS, exhibit many of the properties of positive definite kernels. Several examples of indefinite kernels are given, along with their spectral properties. Due to the lack of positivity, we stabilize the loss functional instead of minimizing it. We have proved that stabilization provides us with a representer theorem, and also generalization error bounds via the Rademacher average. We discussed regularization with respect to optimizing in Kreĭn spaces, and illustrated the spline smoothing problem on toy datasets.
References
Alpay, D. (2001). The Schur algorithm, reproducing kernel spaces and system theory, vol. 5 of SMF/AMS Texts and Monographs. SMF.
Azizov, T. Y., & Iokhvidov, I. S. (1989). Linear operators in spaces with an indefinite metric. John Wiley & Sons. Translated by E. R. Dawson.
Bognár, J. (1974). Indefinite inner product spaces. Springer Verlag.
Engl, H. W., & Kügler, P. (2003). Nonlinear inverse problems: Theoretical aspects and some industrial applications. Inverse Problems: Computational Methods and Emerging Applications Tutorials, UCLA.
Haasdonk, B. (2003). Feature space interpretation of SVMs with non positive definite kernels. Unpublished.
Hanke, M. (1995). Conjugate gradient type methods for ill-posed problems. Pitman Research Notes in Mathematics Series. Longman Scientific & Technical.
Hanke, M., & Hansen, P. (1993). Regularization methods for large-scale problems. Surveys Math. Ind., 3, 253–315.
Hassibi, B., Sayed, A. H., & Kailath, T. (1999). Indefinite-quadratic estimation and control: A unified approach to H2 and H∞ theories. SIAM.
Lin, H.-T., & Lin, C.-J. (2003). A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods.
Mary, X. (2003). Hilbertian subspaces, subdualities and applications. Doctoral dissertation, Institut National des Sciences Appliquées de Rouen.
Mendelson, S. (2003). A few notes on statistical learning theory. Advanced Lectures in Machine Learning (pp. 1–40). Springer Verlag.
Ong, C. S., & Smola, A. J. (2003). Machine learning with hyperkernels. International Conference on Machine Learning (pp. 568–575).
Rockafellar, R. T. (1996). Convex analysis. Princeton University Press. Reprint edition.
Schölkopf, B., Herbrich, R., Smola, A., & Williamson, R. (2001). A generalized representer theorem. Proceedings of Computational Learning Theory.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.
Smola, A. J., Ovari, Z. L., & Williamson, R. C. (2000). Regularization with dot-product kernels. NIPS (pp. 308–314).
Vapnik, V. N. (1998). Statistical learning theory. John Wiley & Sons.
Wahba, G. (1990). Spline models for observational data. SIAM.