
arXiv:0910.1013v1 [stat.ML] 6 Oct 2009

Functional learning through kernel

Stéphane Canu, Xavier Mary and Alain Rakotomamonjy

PSI

INSA de Rouen,

St Etienne du Rouvray, France

stephane.canu,xavier.mary,alain.rakotomamonjy@insa-rouen.fr

October 6, 2009

Abstract

This paper reviews the functional aspects of statistical learning theory. The main point under consideration is the nature of the hypothesis set when no prior information is available other than the data. Within this framework we first discuss the hypothesis set: it is a vector space, it is a set of pointwise defined functions, and the evaluation functional on this set is a continuous mapping. Based on these principles, an original theory is developed that generalizes the notion of reproducing kernel Hilbert space to non-hilbertian sets. It is then shown that the hypothesis set of any learning machine has to be a generalized reproducing set. Therefore, thanks to a general “representer theorem”, the solution of the learning problem is still a linear combination of a kernel. Furthermore, a way to design these kernels is given. To illustrate this framework, some examples of such reproducing sets and kernels are given.

1 Some questions regarding machine learning

Kernels, and in particular Mercer or reproducing kernels, play a crucial role in statistical learning theory and functional estimation. But very little is known about the associated hypothesis set, the underlying functional space where learning machines look for the solution. How to choose it? How to build it? What is its relationship with regularization? The machine learning community has been interested in tackling the problem the other way round: for a given learning task, and therefore for a given hypothesis set, is there a learning machine capable of learning it? The answer to such a question makes it possible to distinguish between learnable and non-learnable problems. The remaining question is: is there a learning machine capable of learning any learnable set?

We have known since [13] that learning is closely related to approximation theory, to generalized spline theory, to regularization and, beyond, to the notion of reproducing kernel Hilbert space (r.k.h.s). This framework is based on the minimization of the empirical cost plus a stabilizer (i.e. a norm in some Hilbert space). Then, under these conditions, the solution to the learning task is a linear combination of some positive kernel whose shape depends on the nature of the stabilizer. This solution is characterized by strong and useful properties such as universal consistency.

But within this framework there remains a gap between theory and the practical solutions implemented by practitioners. For instance, in r.k.h.s, kernels are positive. Some practitioners use the hyperbolic tangent kernel $\tanh(w^\top x + w_0)$ although it is not a positive kernel: but it works. Another example is given by practitioners using a non-hilbertian framework. The advocates of sparsity use absolute values such as $\int |f| \, d\mu$ or $\sum_j |\alpha_j|$: these are $L^1$ norms. They are not hilbertian. Others escape the hilbertian approximation orthodoxy by introducing prior knowledge (i.e. a stabilizer) through information type criteria that are not norms.

This paper aims at revealing some underlying hypotheses of the learning task, extending the reproducing kernel Hilbert space framework. To do so we begin by reviewing some learning principles. We will stress that the hilbertian nature of the hypothesis set is not necessary while the reproducing property is. This leads us to define a non-hilbertian framework for reproducing kernels allowing non-positive kernels, non-hilbertian norms and other kinds of stabilizers.

The paper is organized as follows. The first point is to establish the three basic principles of learning.

Based on these principles and before entering the non-hilbertian framework, it appears necessary to recall

some basic elements of the theory of reproducing kernel Hilbert space and how to build them from non

reproducing Hilbert space. Then the construction of non-hilbertian reproducing space is presented by

replacing the dot (or inner) product by a more general duality map. This implies distinguishing between

two different sets put in duality, one for hypothesis and the other one for measuring. In the hilbertian

framework these two sets are merged in a single Hilbert space.

But before going into technical details we think it advisable to review the use of r.k.h.s in the learning

machine community.

2 r.k.h.s perspective

2.1 Positive kernels

The interest of r.k.h.s arises from its associated kernel. As it were, a r.k.h.s is a set of functions entirely defined by a kernel function. A kernel may be characterized as a function from $X \times X$ to $\mathbb{R}$ (usually $X \subseteq \mathbb{R}^d$). Mercer [11] first established some remarkable properties of a particular class of kernels: positive kernels defining an integral operator. These kernels have to belong to some functional space (typically $L^2(X \times X)$, the set of square integrable functions on $X \times X$) so that the associated integral operator is compact. The positivity of kernel $K$ is defined as follows:

\[ K(x,y) \text{ positive} \iff \forall f \in L^2,\ \langle \langle K, f \rangle_{L^2}, f \rangle_{L^2} \geq 0 \]

where $\langle .,. \rangle_{L^2}$ denotes the dot product in $L^2$. Then, because it is compact, the kernel operator admits a countable spectrum and thus the kernel can be decomposed. Based on that, the work by Aronszajn [2] can be presented as follows. Instead of defining the kernel operator from $L^2$ to $L^2$, Aronszajn focuses on the r.k.h.s $H$ endowed with its dot product $\langle .,. \rangle_H$. In this framework the kernel has to be a pointwise defined function. The positivity of kernel $K$ is then defined as follows:

\[ K(x,y) \text{ positive} \iff \forall g \in H,\ \langle \langle K, g \rangle_H, g \rangle_H \geq 0 \tag{1} \]

Aronszajn first established a bijection between kernels and r.k.h.s. Then L. Schwartz [16] showed that this was a particular case of a more general situation. The kernel does not have to be a genuine function. He generalized the notion of positive kernels to weakly continuous linear mappings from the dual set $E^*$ of a vector space $E$ to $E$ itself. To share interesting properties the kernel has to be positive in the following sense:

\[ K \text{ positive} \iff \forall h \in E^*,\ (K(h), h)_{E,E^*} \geq 0 \]

where $(.,.)_{E,E^*}$ denotes the duality product between $E$ and its dual set $E^*$. The positivity is no longer defined in terms of a scalar product. But there is still a bijection between positive Schwartz kernels and Hilbert spaces.

Of course this is only a short part of the story. For a detailed review on r.k.h.s and a complete literature survey see [3, 14]. Moreover some authors consider non-positive kernels. A generalization to Banach sets has been introduced [4] within the framework of approximation theory. Non-positive kernels have also been introduced in Kreĭn spaces as the difference between two positive ones ([1] and [16] section 12).

2.2 r.k.h.s and learning in the literature

The first contribution of r.k.h.s to statistical learning theory is the regression spline algorithm. For an overview of this method see Wahba’s book [20]. In this book two important hypotheses regarding the application of the r.k.h.s theory to statistics are stressed. These are the nature of pointwise defined functions and the continuity of the evaluation functional¹. An important and general result in this framework is the so-called representer theorem [9]. This theorem states that the solution of some class of approximation problems is a linear combination of a kernel evaluated at the training points. But only applications in one or two dimensions are given. This is due to the fact that, in that work, the way to build r.k.h.s was based on some derivative properties. For practical reasons only low dimension regressors were considered by this means.

¹ These definitions are formally given in section 3.5, definition 3.1 and equation (3).

Poggio and Girosi extended the framework to large input dimension by introducing radial functions through a regularization operator [13]. They show how to build such a kernel as the Green function of a differential operator defined by its Fourier transform.

Support vector machines (SVM) provide another important link between kernel, sparsity and bounds on the generalization error [19]. This algorithm is based on Mercer’s theorem and on the relationship between kernel and dot product. It is based on the ability of a positive kernel to be separated and decomposed according to some generating functions. But to use Mercer’s theorem the kernel has to define a compact operator. This is the case for instance when it belongs to $L^2$ functions defined on a compact domain.

Links between Green functions, SVM and reproducing kernel Hilbert space were introduced in [8] and [17]. The link between r.k.h.s and bounds on a compact learning domain has been presented in a mathematical way by Cucker and Smale [5].

Another important application of r.k.h.s to learning machines comes from the bayesian learning community. This is due to the fact that, in a probabilistic framework, a positive kernel is seen as a covariance function associated with a gaussian process.

3 Three principles on the nature of the hypothesis set

3.1 The learning problem

A supervised learning problem is defined by a learning domain $X \subseteq \mathbb{R}^d$, where $d$ denotes the number of explanatory variables, the learning codomain $Y \subseteq \mathbb{R}$ and a sample of size $n$, $\{(x_i, y_i),\ i = 1, \dots, n\}$: the training set.

The mainstream formulation of the learning problem considers the loading of a learning machine based on empirical data as the minimization of a given criterion with respect to some hypothesis lying in a hypothesis set $H$. In this framework hypotheses are functions $f$ from $X$ to $Y$ and the hypothesis space $H$ is a functional space.

Hypothesis H1: H is a functional vector space

Technically a convergence criterion is needed in $H$, i.e. $H$ has to be endowed with a topology. In the remainder, we will always assume $H$ to be a convex topological vector space.

Learning is also the minimization of some criterion. Very often the criterion to be minimized contains two

terms. The first one, C, represents the fidelity of the hypothesis with respect to data while Ω, the second

one, represents the compression required to make a difference between memorizing and learning. Thus the

learning machine solves the following minimization problem:

\[ \min_{f \in H} \ C(f(x_1), \dots, f(x_n), y) + \Omega(f) \tag{2} \]
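As an illustration of problem (2), the sketch below takes one particular instance that is not prescribed by the text: a squared-error fidelity term $C$, a stabilizer $\Omega(f) = \lambda \|f\|_H^2$ in the r.k.h.s of a Gaussian kernel, and a solution sought in the form $f = \sum_i \alpha_i K(., x_i)$. Under these assumptions the coefficients solve the linear system $(K + \lambda I)\alpha = y$. The names `gaussian_kernel`, `fit`, `predict` and the parameter `lam` are hypothetical and chosen only for this example.

```python
import math

def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel on the real line (an illustrative choice, not imposed by the text)."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))

def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit(xs, ys, lam=1e-3):
    """Coefficients alpha of f = sum_i alpha_i K(., x_i) for the assumed
    criterion sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2, i.e. (K + lam I) alpha = y."""
    n = len(xs)
    gram = [[gaussian_kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        gram[i][i] += lam
    return solve_linear_system(gram, ys)

def predict(alphas, xs, x):
    """Pointwise evaluation f(x) of the fitted hypothesis."""
    return sum(a * gaussian_kernel(x, xi) for a, xi in zip(alphas, xs))

lam = 1e-3
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
alphas = fit(xs, ys, lam)
```

With this choice the residual at each training point equals $\lambda \alpha_i$, which shows how the stabilizer trades fidelity against the norm of the hypothesis.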

The fact is, while writing this cost function, we implicitly assume that the value of function $f$ at any point $x_i$ is known. We will now discuss the important consequences this assumption has on the nature of the hypothesis space $H$.

3.2 The evaluation functional

By writing $f(x_i)$ we are assuming that function $f$ can be evaluated at this point. Furthermore if we want to be able to use our learning machine to make a prediction for a given input $x$, $f(x)$ has to exist for all $x \in X$: we want pointwise defined functions. This property is far from being shared by all functions. For instance the function $\sin(1/t)$ is not defined at 0. The Hilbert space $L^2$ of square integrable functions is a quotient space of functions defined only almost everywhere (i.e. not on the singletons $\{x\}, x \in X$). $L^2$ functions are not pointwise defined because the $L^2$ elements are equivalence classes.

To formalize our point of view we need to define $\mathbb{R}^X$ as the set of all pointwise defined functions from $X$ to $\mathbb{R}$. For instance when $X = \mathbb{R}$ all polynomials (including the constant functions) belong to $\mathbb{R}^X$. We can lay down our second principle:

Hypothesis H2: H is a set of pointwise defined functions (i.e. a subset of $\mathbb{R}^X$)

Of course this is not enough to define a hypothesis set properly and at least another fundamental property is required.

3.3 Continuity of the evaluation functional

The pointwise evaluation of the hypothesis function is not enough. We also want the pointwise convergence of the hypothesis. If two functions are close in some sense we do not want them to disagree at any point. Assume $t$ is our unknown target function to be learned. For a given sample of size $n$ a learning algorithm provides a hypothesis $f_n$. Assume this hypothesis converges in some sense to the target hypothesis. Actually the purpose of hypothesis $f_n$ is that it will be used to predict the value of $t$ at a given $x$. For any $x$ we want $f_n(x)$ to converge to $t(x)$ as follows:

\[ f_n \xrightarrow{\ H\ } t \implies \forall x \in X,\ f_n(x) \xrightarrow{\ \mathbb{R}\ } t(x) \]

We are not interested in global convergence properties but in local convergence properties. Note that it may be rather dangerous to define a learning machine without this property. Usually the topology on $H$ is defined by a norm. Then the pointwise convergence can be restated as follows:

\[ \forall x \in X,\ \exists M_x \in \mathbb{R}^+ \text{ such that } |f(x) - t(x)| \leq M_x \|f - t\|_H \tag{3} \]

At any point x, the error can be controlled.

It is interesting to restate this hypothesis with the evaluation functional.

Definition 3.1 (the evaluation functional)

\[ \delta_x : H \longrightarrow \mathbb{R}, \qquad f \longmapsto \delta_x f = f(x) \]

Applied to the evaluation functional, our prerequisite of pointwise convergence is equivalent to its continuity.

Hypothesis H3: the evaluation functional is continuous on H

Since the evaluation functional is linear and continuous, it belongs to the topological dual of H. We will

see that this is the key point to get the reproducing property.

Note that the continuity of the evaluation functional does not necessarily imply uniform convergence. But in many practical cases it does. For this, one additional hypothesis is needed: the constants $M_x$ have to be bounded, $\sup_{x \in X} M_x < \infty$. For instance this is the case when the learning domain $X$ is bounded. The difference between uniform convergence and evaluation functional continuity is a deep and important topic for learning machines but is out of the scope of this paper.
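The step from bounded constants to uniform convergence can be spelled out from equation (3). The symbol $M$ for the supremum is introduced here for convenience only.

```latex
\[
M = \sup_{x \in X} M_x < \infty
\;\Longrightarrow\;
\sup_{x \in X} |f_n(x) - t(x)|
\;\leq\; \sup_{x \in X} M_x \, \|f_n - t\|_H
\;=\; M \, \|f_n - t\|_H
\;\xrightarrow[n \to \infty]{}\; 0
\]
```

Without a finite $M$ the bound still holds at each fixed $x$, but nothing controls the error simultaneously over the whole domain.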

3.4 Important consequence

To build a learning machine we do need to choose our hypothesis set as a reproducing space to get the pointwise evaluation property and the continuity of this evaluation functional. But the hilbertian structure is not necessary. Endowing a set of functions with the property of continuity of the evaluation functional has many interesting consequences. The most useful one in the field of learning machines is the existence of a kernel $K$, a two-variable function with the generation property²:

\[ \forall f \in H,\ \exists \ell \in \mathbb{N},\ (\alpha_i)_{i=1,\dots,\ell} \text{ such that } f(x) \approx \sum_{i=1}^{\ell} \alpha_i K(x, x_i) \]

where the sum runs over a finite set of indices. Note that for practical reasons $f$ may have a different representation.

If the evaluation set is also a Hilbert space (a vector space endowed with a dot product) it is a reproducing kernel Hilbert space (r.k.h.s). Although not necessary, r.k.h.s are widely used for learning because they have a lot of useful practical properties. Before moving on to more general reproducing sets, let us review the most important properties of r.k.h.s for learning.

3.5 $\mathbb{R}^X$, the set of the pointwise defined functions on X

In the following, the function space of the pointwise defined functions $\mathbb{R}^X = \{f : X \to \mathbb{R}\}$ will be seen as a topological vector space endowed with the topology of simple convergence.

$\mathbb{R}^X$ will be put in duality with $\mathbb{R}^{[X]}$, the set of all functions on $X$ equal to zero everywhere except on a finite subset $\{x_i, i \in I\}$ of $X$. Thus all functions belonging to $\mathbb{R}^{[X]}$ can be written in the following way:

\[ g \in \mathbb{R}^{[X]} \iff \exists \{\alpha_i\},\ i = 1, \dots, n \text{ such that } g(x) = \sum_i \alpha_i \mathbb{1}_{x_i}(x) \]

where the indicator function $\mathbb{1}_{x_i}(x)$ is null everywhere except at $x_i$ where it is equal to one:

\[ \forall x \in X,\quad \mathbb{1}_{x_i}(x) = 0 \text{ if } x \neq x_i \quad\text{and}\quad \mathbb{1}_{x_i}(x) = 1 \text{ if } x = x_i \]

Note that the indicator function is closely related to the evaluation functional since they are in bijection through:

\[ \forall f \in \mathbb{R}^X,\ \forall x \in X,\quad \delta_x(f) = \sum_{y \in X} \mathbb{1}_x(y) f(y) = f(x) \]

But formally, $(\mathbb{R}^X)' = \mathrm{span}\{\delta_x\}$ is a set of linear forms while $\mathbb{R}^{[X]}$ is a set of pointwise defined functions.

4 Reproducing Kernel Hilbert Space (r.k.h.s)

Definition 4.1 (Hilbert space) A vector space $H$ endowed with the positive definite dot product $\langle .,. \rangle_H$ is a Hilbert space if it is complete for the induced norm $\|f\|_H^2 = \langle f, f \rangle_H$ (i.e. all Cauchy sequences converge in $H$).

For instance $\mathbb{R}^n$, $P_k$ the set of polynomials of order lower than or equal to $k$, $L^2$, and $\ell^2$ the set of square summable sequences seen as functions on $\mathbb{N}$ are Hilbert spaces. $L^1$ and the set of bounded functions $L^\infty$ are not.

Definition 4.2 (reproducing kernel Hilbert space (r.k.h.s)) A Hilbert space $(H, \langle .,. \rangle_H)$ is a r.k.h.s if it is a subset of $\mathbb{R}^X$ (pointwise defined functions) and if the evaluation functional is continuous on $H$ (see the definition of continuity in equation 3).

For instance $\mathbb{R}^n$ and $P_k$, like any finite dimensional set of genuine functions, are r.k.h.s. $\ell^2$ is also a r.k.h.s. The Cameron-Martin space defined in example 8.1.2 is a r.k.h.s while $L^2$ is not because it is not a set of pointwise defined functions.

Definition 4.3 (positive kernel) A function $K$ from $X \times X$ to $\mathbb{R}$ is a positive kernel if it is symmetric and if for any finite subset $\{x_i\}, i = 1, \dots, n$ of $X$ and any sequence of scalars $\{\alpha_i\}, i = 1, \dots, n$:

\[ \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) \geq 0 \]
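The quadratic form of definition 4.3 is easy to probe numerically. The sketch below evaluates it for the kernel $\min(x,y)$ on nonnegative points, where it stays nonnegative, and for a hyperbolic tangent kernel written here as $\tanh(w\,xy + w_0)$ with the arbitrary parameters $w = 1$, $w_0 = -1$ (an assumed scalar form chosen for illustration), where a single point already gives a negative value. Random sampling can only fail to find a violation; it does not prove positivity.

```python
import math
import random

def quadratic_form(kernel, points, alphas):
    """sum_i sum_j alpha_i alpha_j K(x_i, x_j) from definition 4.3."""
    return sum(ai * aj * kernel(xi, xj)
               for ai, xi in zip(alphas, points)
               for aj, xj in zip(alphas, points))

def min_kernel(x, y):
    """Kernel min(x, y), used here on nonnegative reals."""
    return min(x, y)

def tanh_kernel(x, y, w=1.0, w0=-1.0):
    """Hyperbolic tangent kernel; the parameters are arbitrary illustrative values."""
    return math.tanh(w * x * y + w0)

rng = random.Random(0)
points = [rng.uniform(0.0, 1.0) for _ in range(6)]

# Smallest value found over random coefficient vectors for the min kernel.
smallest = min(
    quadratic_form(min_kernel, points, [rng.uniform(-1.0, 1.0) for _ in points])
    for _ in range(200)
)

# One point x = 0 with alpha = 1 gives K(0, 0) = tanh(w0) < 0.
tanh_value = quadratic_form(tanh_kernel, [0.0], [1.0])
```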

² This property means that the set of all finite linear combinations of the kernel is dense in $H$. See proposition 4.1 for a more precise statement.


This definition is equivalent to Aronszajn’s definition of a positive kernel given in equation (1).

Proposition 4.1 (bijection between r.k.h.s and kernel) Corollary of proposition 23 in [16] and theorem 1.1.1 in [20]. There is a bijection between the set of all possible r.k.h.s and the set of all positive kernels.

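Proof.

⇒ from r.k.h.s to kernel. Let $(H, \langle .,. \rangle_H)$ be a r.k.h.s. By hypothesis the evaluation functional $\delta_x$ is a continuous linear form so that it belongs to the topological dual of $H$. Thanks to the Riesz theorem we know that for each $x \in X$ there exists a function $K_x(.)$ belonging to $H$ such that for any function $f(.) \in H$:

\[ \delta_x(f(.)) = \langle K_x(.), f(.) \rangle_H \]

The family $\{K_x(.), x \in X\}$ defines a function from $X \times X$ to $\mathbb{R}$ and thus can be written as a two-variable function $K(x,y)$. This function is symmetric and positive since, for any real finite sequence $\{\alpha_i\}, i = 1, \dots, \ell$, we have $\sum_{i=1}^{\ell} \alpha_i K(., x_i) \in H$ and its squared norm, which is nonnegative, reads:

\[ \Big\| \sum_{i=1}^{\ell} \alpha_i K(., x_i) \Big\|_H^2 = \Big\langle \sum_{i=1}^{\ell} \alpha_i K(., x_i), \sum_{j=1}^{\ell} \alpha_j K(., x_j) \Big\rangle_H = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j K(x_i, x_j) \]

⇐ from kernel to r.k.h.s. For any couple $(f(.), g(.))$ of $\mathbb{R}^{[X]}$ (there exist two finite sequences $\{\alpha_i\}, i = 1, \dots, \ell$ and $\{\beta_j\}, j = 1, \dots, m$ and two sequences of points of $X$, $\{x_i\}, i = 1, \dots, \ell$ and $\{y_j\}, j = 1, \dots, m$, such that $f(x) = \sum_{i=1}^{\ell} \alpha_i \mathbb{1}_{x_i}(x)$ and $g(x) = \sum_{j=1}^{m} \beta_j \mathbb{1}_{y_j}(x)$) we define the following bilinear form:

\[ \langle f(.), g(.) \rangle_{[X]} = \sum_{i=1}^{\ell} \sum_{j=1}^{m} \alpha_i \beta_j K(x_i, y_j) \]

Let $H_0 = \{f \in \mathbb{R}^{[X]} \mid \langle f(.), f(.) \rangle_{[X]} = 0\}$. $\langle .,. \rangle_{[X]}$ defines a dot product on the quotient set $\mathbb{R}^{[X]} / H_0$. Now let us define $H$ as the completion of $\mathbb{R}^{[X]}$ for the corresponding norm. $H$ is a r.k.h.s with kernel $K$ by construction.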
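The bilinear form used in the second half of the proof can be computed directly on finitely supported functions. The sketch below stores such a function as a dictionary `{point: coefficient}` (a representation chosen here for convenience) and uses the kernel $\min(x,y)$ on nonnegative reals as an example. It also checks, at a few points, the evaluation bound $|f(x)| \leq \sqrt{K(x,x)}\,\|f\|$ that follows from the Cauchy-Schwarz inequality, with $f$ identified with $\sum_i \alpha_i K(., x_i)$.

```python
import math

def min_kernel(x, y):
    """Example positive kernel on nonnegative reals."""
    return min(x, y)

def inner_product(f, g, kernel):
    """<f, g>_[X] = sum_i sum_j alpha_i beta_j K(x_i, y_j) for finitely
    supported f and g stored as {point: coefficient} dictionaries."""
    return sum(a * b * kernel(x, y) for x, a in f.items() for y, b in g.items())

def evaluate(f, kernel, x):
    """Value at x of sum_i alpha_i K(., x_i), the function associated with f."""
    return sum(a * kernel(x, xi) for xi, a in f.items())

f = {0.2: 1.0, 0.7: -2.0, 1.5: 0.5}
g = {0.4: 3.0, 1.0: 1.0}

fg = inner_product(f, g, min_kernel)
gf = inner_product(g, f, min_kernel)
norm_f = math.sqrt(inner_product(f, f, min_kernel))

# Continuity of the evaluation functional: M_x = sqrt(K(x, x)).
bounds_hold = all(
    abs(evaluate(f, min_kernel, x)) <= math.sqrt(min_kernel(x, x)) * norm_f + 1e-12
    for x in [0.1, 0.5, 1.0, 2.0]
)
```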

Proposition 4.2 (from basis to kernel) Let $H$ be a r.k.h.s. Its kernel $K$ can be written:

\[ K(x,y) = \sum_{i \in I} e_i(x) \, e_i(y) \]

for any orthonormal basis $\{e_i\}_{i \in I}$ of $H$, $I$ being a set of indices possibly infinite and non-countable.

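Proof. $K(., y) \in H$ implies there exists a real sequence $\{\alpha_i\}_{i \in I}$ such that $K(., y) = \sum_{i \in I} \alpha_i e_i(.)$. Then for every element $e_i(.)$ of the orthonormal basis:

\[ \langle K(., y), e_i(.) \rangle_H = e_i(y) \quad \text{because of the reproducing property of } K \]

and

\[ \langle K(., y), e_i(.) \rangle_H = \Big\langle \sum_{j \in I} \alpha_j e_j(.), e_i(.) \Big\rangle_H = \sum_{j \in I} \alpha_j \langle e_j(.), e_i(.) \rangle_H = \alpha_i \quad \text{because } \{e_i\}_{i \in I} \text{ is an orthonormal basis} \]

By identification we have $\alpha_i = e_i(y)$.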
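A two-dimensional case makes the statement concrete. The sketch below assumes the space spanned by $e_0(u) = 1$ and $e_1(u) = u$, with the dot product defined so that these two functions are orthonormal (an assumption of this example). The kernel built from the basis is then $1 + xy$, and the reproducing property can be checked on coordinates.

```python
# Basis functions, declared orthonormal by the choice of dot product below.
basis = [lambda u: 1.0, lambda u: u]

def kernel(x, y):
    """K(x, y) = sum_i e_i(x) e_i(y); here 1 + x * y."""
    return sum(e(x) * e(y) for e in basis)

def inner_product(coeffs_f, coeffs_g):
    """Dot product of two functions given by their coordinates in the basis."""
    return sum(a * b for a, b in zip(coeffs_f, coeffs_g))

def kernel_coefficients(y):
    """Coordinates of K(., y) in the basis: alpha_i = e_i(y)."""
    return [e(y) for e in basis]

coeffs_f = [2.0, -3.0]  # f(u) = 2 - 3u
y = 0.5

reproduced = inner_product(kernel_coefficients(y), coeffs_f)  # <K(., y), f>_H
direct = sum(c * e(y) for c, e in zip(coeffs_f, basis))       # f(y)
```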

Remark 4.1 Thanks to this result it is also possible to associate to any positive kernel a basis, possibly uncountable. As a consequence of proposition 4.1 we know how to associate a r.k.h.s to any positive kernel, and we get the result because every Hilbert space admits an orthonormal basis.

The fact that the basis is countable or uncountable (that the corresponding r.k.h.s is separable or not) has no consequences on the nature of the hypothesis set (see example 8.1.7). Thus Mercer kernels are a particular case of a more general situation since every Mercer kernel is positive in the Aronszajn sense (definition 4.3) while the converse is false. Consequently, when possible, a functional formulation is preferable to a kernel formulation of a learning algorithm.


5 Kernel and kernel operator

5.1 How to build r.k.h.s?

It is possible to build r.k.h.s from a $L^2(G,\mu)$ Hilbert space where $G$ is a set (usually $G = X$) and $\mu$ a measure. To do so, an operator $S$ is defined to map $L^2$ functions onto the set of the pointwise valued functions $\mathbb{R}^X$. A general way to define such an operator consists in remarking that the scalar product performs such a linear mapping. Based on that remark this operator is built from a family $\Gamma_x$ of $L^2(G,\mu)$ functions, $x \in X$, in the following way:

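Definition 5.1 (Carleman operator) Let $\Gamma = \{\Gamma_x, x \in X\}$ be a family of $L^2(G,\mu)$ functions. The associated Carleman operator $S$ is

\[ S : L^2 \longrightarrow \mathbb{R}^X, \qquad f \longmapsto g(.) = (Sf)(.) = \langle \Gamma_{(.)}, f \rangle_{L^2} = \int_G \Gamma_{(.)} \, f \, d\mu \]

That is to say $\forall x \in X,\ g(x) = \langle \Gamma_x, f \rangle_{L^2}$. To make apparent the bijective restriction of $S$ it is convenient to factorize it as follows:

\[ S : L^2 \longrightarrow L^2 / \mathrm{Ker}(S) \xrightarrow{\ T\ } \mathrm{Im}(S) \xrightarrow{\ i\ } \mathbb{R}^X \tag{4} \]

where $L^2 / \mathrm{Ker}(S)$ is the quotient set, $T$ the bijective restriction of $S$ and $i$ the canonical injection.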

This class of integral operators is known as Carleman operators [18]. Note that this operator, unlike Hilbert-Schmidt operators, need be neither compact nor bounded. But when $G$ is a compact set or when $\Gamma_x \in L^2(G \times G)$ (it is a square integrable function with respect to both of its variables) $S$ is a Hilbert-Schmidt operator. As an illustration of this property, see the gaussian example on $G = X = \mathbb{R}$ in table 1. In that case $\Gamma_x(\tau) \notin L^2(X \times X)$³.

Proposition 5.1 (bijection between Carleman operators and the set of r.k.h.s) Proposition 21 in [16] or theorems 1 and 4 in [14]. Let $S$ be a Carleman operator. Its image set $H = \mathrm{Im}(S)$ is a r.k.h.s. If $H$ is a r.k.h.s there exists a measure $\mu$ on some set $G$ and a Carleman operator $S$ on $L^2(G,\mu)$ such that $H = \mathrm{Im}(S)$.

Proof.

⇒ Consider $T$ the bijective restriction of $S$ defined in equation (4). $H = \mathrm{Im}(S)$ can be endowed with the induced dot product defined as follows:

\[ \forall g_1(.), g_2(.) \in H,\quad \langle g_1(.), g_2(.) \rangle_H = \langle T^{-1} g_1, T^{-1} g_2 \rangle_{L^2} = \langle f_1, f_2 \rangle_{L^2} \]

where $g_1(.) = T f_1$ and $g_2(.) = T f_2$. With respect to the induced norm, $T$ is an isometry. To prove $H$ is a r.k.h.s, we have to check the continuity of the evaluation functional. This works as follows:

\[ g(x) = (Tf)(x) = \langle \Gamma_x, f \rangle_{L^2} \leq \|\Gamma_x\|_{L^2} \|f\|_{L^2} \leq M_x \|g(.)\|_H \]

with $M_x = \|\Gamma_x\|_{L^2}$. In this framework the reproducing kernel $K$ of $H$ verifies $S\Gamma_x = K(x,.)$. It can be built based on $\Gamma$:

\[ K(x,y) = \langle K(x,.), K(y,.) \rangle_H = \langle \Gamma_x, \Gamma_y \rangle_{L^2} \]

⇐ Let $\{e_i\}, i \in I$ be a $L^2(G,\mu)$ orthonormal basis and $\{h_j(.)\}, j \in J$ an orthonormal basis of $H$. We assume there exists a couple $(G,\mu)$ such that $\mathrm{card}(I) \geq \mathrm{card}(J)$ (take for instance the counting measure on the suitable set). Define $\Gamma_x = \sum_{j \in J} h_j(x) e_j$ as a $L^2$ family. Let $T$ be the associated Carleman operator. The image of this Carleman operator is the r.k.h.s spanned by $h_j(.)$ since, writing $f = \sum_{i \in I} \alpha_i e_i$:

\[ \forall f \in L^2,\quad (Tf)(x) = \langle \Gamma_x, f \rangle_{L^2} = \Big\langle \sum_{j \in J} h_j(x) e_j, \sum_{i \in I} \alpha_i e_i \Big\rangle_{L^2} = \sum_{j \in J} h_j(x) \sum_{i \in I} \alpha_i \langle e_j, e_i \rangle_{L^2} = \sum_{j \in J} \alpha_j h_j(x) \]

and family $\{h_i(.)\}$ is orthonormal since $h_i(.) = T e_i$.

Name              $\Gamma_x(u)$                               $K(x,y)$
Cameron Martin    $\mathbb{1}_{\{x \leq u\}}$                 $\min(x,y)$
Polynomial        $e_0(u) + \sum_{i=1}^{d} x_i e_i(u)$        $x^\top y + 1$
Gaussian          $\frac{1}{Z} \exp(-(x-u)^2 / 2)$            $\frac{1}{Z'} \exp(-(x-y)^2 / 4)$

Table 1: Examples of Carleman operators and their associated reproducing kernels. Note that functions $\{e_i\}_{i=1,\dots,d}$ are a finite subfamily of a $L^2$ orthonormal basis. $Z$ and $Z'$ are two constants.

³ To clarify the not so obvious notion of pointwise defined function, whenever possible, we use the notation $f$ when the function is not a pointwise defined function, while $f(.)$ denotes $\mathbb{R}^X$ functions. Here $\Gamma_x(\tau)$ is a pointwise defined function with respect to variable $x$ but not with respect to variable $\tau$. Thus, whenever possible, the confusing notation $(\tau)$ is omitted.

To put this framework to work the relevant function $\Gamma_x$ has to be found. Some examples with popular kernels illustrating this definition are shown in table 1.
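The gaussian row of table 1 can be checked numerically from $K(x,y) = \langle \Gamma_x, \Gamma_y \rangle_{L^2}$. The sketch below assumes $G = \mathbb{R}$ with the Lebesgue measure and takes the constant $Z = 1$, in which case the integral evaluates to $\sqrt{\pi} \exp(-(x-y)^2/4)$, so that $1/Z' = \sqrt{\pi}$. The integration bounds and step count are arbitrary choices for this example.

```python
import math

def gamma(x, u):
    """Gaussian family Gamma_x(u) = exp(-(x - u)^2 / 2), with Z taken equal to 1."""
    return math.exp(-((x - u) ** 2) / 2.0)

def kernel_by_quadrature(x, y, lower=-20.0, upper=20.0, steps=40000):
    """Midpoint-rule approximation of <Gamma_x, Gamma_y>_{L2}."""
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        u = lower + (k + 0.5) * h
        total += gamma(x, u) * gamma(y, u)
    return total * h

def kernel_closed_form(x, y):
    """sqrt(pi) * exp(-(x - y)^2 / 4), the value of the integral when Z = 1."""
    return math.sqrt(math.pi) * math.exp(-((x - y) ** 2) / 4.0)

approx = kernel_by_quadrature(0.3, 1.1)
exact = kernel_closed_form(0.3, 1.1)
```

The width of the resulting kernel is larger than that of $\Gamma_x$: the exponent goes from $(x-u)^2/2$ to $(x-y)^2/4$, as in the table.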

5.2 Carleman operator and the regularization operator

The same kind of operator has been introduced by Poggio and Girosi in the regularization framework [13]. They proposed to define the regularization term $\Omega(f)$ (defined in equation 2) by introducing a regularization operator $P$ from hypothesis set $H$ to $L^2$ such that $\Omega(f) = \|Pf\|_{L^2}^2$. This framework is very attractive since operator $P$ models the prior knowledge about the solution, defining its regularity in terms of derivative or Fourier decomposition properties. Furthermore the authors show that, in their framework, the solution of the learning problem is a linear combination of a kernel (a representer theorem). They also give a methodology to build this kernel as the Green function of a differential operator. Following [2] in its introduction, the link between Green function and r.k.h.s is straightforward when the Green function is a positive kernel. But a problem arises when operator $P$ is chosen as a derivative operator and the resulting kernel is not differentiable (for instance when $P$ is the simple derivation, the associated kernel is the non-differentiable function $\min(x,y)$). A way to overcome this technical difficulty is to consider things the other way round by defining the regularization term as the norm of the function in the r.k.h.s built from the Carleman operator $T$. In this case we have $\Omega(g) = \|g\|_H^2 = \|T^{-1} g\|_{L^2}^2$. Thus, since $T$ is bijective, we can define operator $P$ as $P = T^{-1}$. This is no longer a derivative operator but a generalized derivative operator where the derivation is defined as the inverse of the integration ($P$ is defined as $T^{-1}$).

5.3 Generalization

It is important to notice that the above framework can be generalized to non-$L^2$ Hilbert spaces. A way to see this is to use Kolmogorov’s dilation theorem [7]. Furthermore, the notion of reproducing kernel itself can be generalized to non-pointwise defined functions by emphasizing the role played by continuity through positive generalized kernels called Schwartz or hilbertian kernels [16]. But this is out of the scope of our work.


6 Reproducing kernel spaces (RKS)

By focusing on the relevant hypothesis for learning we are going to generalize the above framework to

non-hilbertian spaces.

6.1 Evaluation spaces

Definition 6.1 (ES) Let $H$ be a real topological vector space (t.v.s.) on an arbitrary set $X$, $H \subset \mathbb{R}^X$. $H$ is an evaluation space if and only if:

\[ \forall x \in X,\quad \delta_x : H \longrightarrow \mathbb{R},\ f \longmapsto \delta_x(f) = f(x) \quad \text{is continuous} \]

ES are then topological vector spaces in which $\delta_t$ (the evaluation functional at $t$) is continuous, i.e. belongs to the topological dual $H^*$ of $H$.

Remark 6.1 The topological vector space $\mathbb{R}^X$ with the topology of simple convergence is by construction an ETS (evaluation topological space).

In the case of normed vector space, another characterization can be given:

Proposition 6.1 (normed ES or BES) Let $(H, \|.\|_H)$ be a real normed vector space on an arbitrary set $X$, $H \subset \mathbb{R}^X$. $H$ is an evaluation kernel space if and only if the evaluation functional satisfies:

\[ \forall x \in X,\ \exists M_x \in \mathbb{R},\ \forall f \in H,\quad |f(x)| \leq M_x \|f\|_H \]

If it is complete for the corresponding norm it is a Banach evaluation space (BES).

Remark 6.2 In the case of a Hilbert space, we can identify $H^*$ and $H$ and, thanks to the Riesz theorem, the evaluation functional can be seen as a function belonging to $H$: it is called the reproducing kernel.

This is an important point: thanks to the hilbertian structure the evaluation functional can be seen as a hypothesis function and therefore the solution of the learning problem can be built as a linear combination of this reproducing kernel taken at different points. The representer theorem [9] demonstrates this property when the learning machine minimizes a regularized quadratic error criterion. We shall now generalize these properties to the case when no hilbertian structure is available.

6.2 Reproducing kernels

The key point when using a Hilbert space is the dot product. When no such bilinear positive functional is available its role can be played by a duality map. Without a dot product, the hypothesis set $H$ is no longer in self duality. We need another set $M$ to put in duality with $H$. This second set $M$ is a set of functions measuring how the information available at point $x_1$ helps to measure the quality of the hypothesis at point $x_2$. These two sets have to be related through a specific bilinear form. This relation is called a duality.

Definition 6.2 (Duality between two sets) Two sets $(H, M)$ are in duality if there exists a bilinear form $L$ on $H \times M$ that separates $H$ and $M$ (see [10] for details on the topological aspect of this definition).

Let $L$ be such a bilinear form on $H \times M$ that separates them. Then we can define a linear mapping $\gamma_H$ and its reciprocal $\theta_H$ as follows:

\[ \gamma_H : M \longrightarrow H^*,\quad f \longmapsto \gamma_H f = L(., f) \qquad\qquad \theta_H : \mathrm{Im}(\gamma_H) \longrightarrow M,\quad g = L(., f) \longmapsto \theta_H g = f \]

where $H^*$ (resp. $M^*$) denotes the dual set of $H$ (resp. $M$).

Let’s take an important example of such a duality.

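[Figure 1: illustration of the subduality map. Left panel (Hilbertian case): diagram relating $(\mathbb{R}^X)'$, $H'$, $H$ and $\mathbb{R}^X$ through the maps $i^*$, Riesz, $\kappa$ and $i$, with $K(s,t) = \langle K(s,.), K(.,t) \rangle_H$. Right panel (General case): diagram relating $(\mathbb{R}^X)'$, $M'$, $H'$, $H$, $M$ and $\mathbb{R}^X$ through the maps $i^*$, $j^*$, $\kappa$, $\theta_M$, $\theta_H$, $i$ and $j$, with $K(s,t) = L_H(\kappa^*(\delta_s), \kappa(\delta_t))$.]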

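Proposition 6.2 (duality of pointwise defined functions) Let $X$ be any set (not necessarily compact). $\mathbb{R}^X$ and $\mathbb{R}^{[X]}$ are in duality.

Proof. Let us define the bilinear mapping $L$ as follows:

\[ L : \mathbb{R}^X \times \mathbb{R}^{[X]} \longrightarrow \mathbb{R},\qquad \Big(f(.),\ g(.) = \sum_{i \in I} \alpha_i \mathbb{1}_{x_i}(.)\Big) \longmapsto \sum_{i \in I} \alpha_i f(x_i) = \sum_{x \in X} f(x) g(x) \]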
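This pairing is simple to compute because $g$ has finite support. In the sketch below, $f$ is any Python callable and $g$ is stored as a dictionary `{x_i: alpha_i}`; both representations, and the names `pairing` and `indicator`, are choices made for this example. Pairing $f$ with the indicator of a single point returns $f(x)$, which is how the evaluation functional appears in this duality.

```python
import math

def pairing(f, g):
    """L(f, g) = sum_i alpha_i f(x_i), with g stored as {x_i: alpha_i}."""
    return sum(alpha * f(x) for x, alpha in g.items())

def indicator(x):
    """Finitely supported function equal to one at x and zero elsewhere."""
    return {x: 1.0}

g = {0.0: 2.0, 1.0: -1.0, 3.0: 0.5}

value = pairing(math.exp, g)                     # 2 e^0 - e^1 + 0.5 e^3
evaluation = pairing(math.cos, indicator(0.0))   # cos(0)
```

Since every point evaluation is obtained this way, a function $f$ whose pairing with all $g$ vanishes must be zero at every point.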

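Another example is given by the two following functional spaces:

\[ L^1 = \Big\{ f \ \Big|\ \int_X |f| \, d\mu < \infty \Big\} \qquad\text{and}\qquad L^\infty = \Big\{ f \ \Big|\ \operatorname*{ess\,sup}_{x \in X} |f| < \infty \Big\} \]

where for instance $\mu$ denotes the Lebesgue measure. These two spaces are put in duality through the following duality map:

\[ L : L^1 \times L^\infty \longrightarrow \mathbb{R},\qquad (f, g) \longmapsto L(f,g) = \int_X f \, g \, d\mu \]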

Definition 6.3 (Evaluation subduality) Two sets $H$ and $M$ form an evaluation subduality iff:

- they are in duality through their duality map $\gamma_H$,

- they both are subsets of $\mathbb{R}^X$,

- the continuity of the evaluation functional is preserved through:

\[ \mathrm{Span}(\delta_x) = \gamma_{\mathbb{R}^X}\big((\mathbb{R}^X)'\big) \subseteq \gamma_H(M) \qquad\text{and}\qquad \gamma_{\mathbb{R}^X}\big((\mathbb{R}^X)'\big) \subseteq \theta_H(H) \]

The key point is the way of preserving the continuity. Here the strategy to do so is first to consider two sets in duality and then to build the (weak) topology such that the dual elements are (weakly) continuous.

Proposition 6.3 (Subduality kernel) A unique weakly continuous linear mapping $\kappa$ is associated with each subduality. This linear mapping, called the subduality kernel, is defined as follows:

\[ \kappa : (\mathbb{R}^X)' \longrightarrow \mathbb{R}^X,\qquad \sum_{i \in I} \delta_{x_i} \longmapsto i \circ \theta_M \circ j^* \Big( \sum_{i \in I} \delta_{x_i} \Big) \]

where $i$ and $j^*$ are the canonical injections from $H$ to $\mathbb{R}^X$ and from $(\mathbb{R}^X)'$ to $M'$ respectively (figure 1).

[Figure 2: illustration of the building operators for reproducing kernel subduality from a duality (A,B). Left panel (Duality): diagram relating $A$, $B$, $A'$ and $B'$ through $\theta_{(A,B)}$ and $\theta_{(B,A)}$. Right panel (Evaluation subduality): diagram relating $(\mathbb{R}^X)'$, $M'$, $H'$, $H$, $M$ and $\mathbb{R}^X$ through $i^*$, $j^*$, $\kappa$, $\theta_{(M,H)}$, $\theta_{(H,M)}$, $i$ and $j$.]

Proof. for details see [10].

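We can illustrate this mapping by detailing all the mappings involved, as in figure 1:

\[ \begin{array}{ccccccccc} (\mathbb{R}^X)' & \xrightarrow{\text{see 3.5}} & \mathbb{R}^{[X]} & \xrightarrow{\ j^*\ } & M' & \xrightarrow{\ \theta_M\ } & H & \xrightarrow{\ i\ } & \mathbb{R}^X \\ \delta_x & \longmapsto & \mathbb{1}_{\{x\}} & \longmapsto & L(K_x, .) & \longmapsto & K_x(.) & \longmapsto & K(x,.) \end{array} \]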

Definition 6.4 (Reproducing kernel of an evaluation subduality) Let $(H, M)$ be an evaluation subduality with respect to map $L_H$ associated with subduality kernel $\kappa$. The reproducing kernel associated with this evaluation subduality is the function of two variables defined as follows:

\[ K : X \times X \longrightarrow \mathbb{R},\qquad (x,y) \longmapsto K(x,y) = L_H(\kappa^*(\delta_y), \kappa(\delta_x)) \]

This structure is illustrated in figure 1. Note that this kernel no longer needs to be positive definite. If the kernel is positive definite it is associated with a unique r.k.h.s. However, as shown in example 8.2.1, it can also be associated with evaluation subdualities. A way of looking at things is to define $\kappa$ as the generalization of the Schwartz kernel while $K$ is the generalization of the Aronszajn kernel to non-hilbertian structures. Based on these definitions the important expression property is preserved.

Proposition 6.4 (generation property)

$$\forall f \in H,\ \exists (\alpha_i)_{i \in I} \text{ such that } f(x) \approx \sum_{i \in I} \alpha_i K(x, x_i) \quad \text{and} \quad \forall g \in M,\ \exists (\alpha_i)_{i \in I} \text{ such that } g(x) \approx \sum_{i \in I} \alpha_i K(x_i, x)$$

Proof. This property is due to the density of Span{K(.,x), x ∈ X} in H. For more details see [10] Lemma 4.3.

Just as for r.k.h.s, another important point is the possibility of building an evaluation subduality, and of course its kernel, starting from any duality.

Proposition 6.5 (building evaluation subdualities) Let (A,B) be a duality with respect to map $L_A$. Let {Γx, x ∈ X} be a total family in A and {Λx, x ∈ X} be a total family in B. Let S (resp. T) be the linear mapping from A (resp. B) to $\mathbb{R}^X$ associated with Λx (resp. Γx) as follows:

$$S : A \longrightarrow \mathbb{R}^X, \qquad g \longmapsto Sg(x) = L_A(g, \Lambda_x)$$

$$T : B \longrightarrow \mathbb{R}^X, \qquad f \longmapsto Tf(x) = L_A(\Gamma_x, f)$$

Then S and T are injective and (S(A),T(B)) is an evaluation subduality with the reproducing kernel K

defined by:

K(x,y) = LA(Γx,Λy)

Proof. see [10] Lemma 4.5 and proposition 4.6
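The construction of Proposition 6.5 can be sketched in finite dimension. The following is only an illustration under assumptions that are not in the paper: A = B = R^3, X is a set of six points, the duality map is L_A(a,b) = aᵀMb for a random non-symmetric matrix M, and the families Γx and Λx are random vectors. The names `L_A`, `L_X`, `Gamma`, `Lam` are hypothetical. The sketch checks numerically that K(x,.) = S(Γx) reproduces the values of any h in T(B) through the transported duality map, although K is not symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3                      # |X| = 6 points, dim A = dim B = 3 (assumed)

M = rng.normal(size=(d, d))      # non-symmetric matrix defining L_A(a, b) = a^T M b
Gamma = rng.normal(size=(n, d))  # row x is Gamma_x, a family in A
Lam = rng.normal(size=(n, d))    # row x is Lambda_x, a family in B

def L_A(a, b):
    """Duality map on (A, B) = (R^d, R^d)."""
    return a @ M @ b

# S g(x) = L_A(g, Lambda_x) and T f(x) = L_A(Gamma_x, f), written as matrices.
S = Lam @ M.T                    # (S @ g)[x] = g^T M Lambda_x
T = Gamma @ M                    # (T @ f)[x] = Gamma_x^T M f

# Kernel K(x, y) = L_A(Gamma_x, Lambda_y); row x of K is the function K(x, .).
K = Gamma @ M @ Lam.T

def L_X(g, h):
    """Duality map transported to (S(A), T(B)): L_X(S a, T b) = L_A(a, b)."""
    a = np.linalg.lstsq(S, g, rcond=None)[0]   # recover a from g = S a
    b = np.linalg.lstsq(T, h, rcond=None)[0]   # recover b from h = T b
    return L_A(a, b)

# Reproducing property: L_X(K(x, .), h) = h(x) for h in T(B).
h = T @ rng.normal(size=d)
reproduced = np.array([L_X(K[x, :], h) for x in range(n)])
max_error = np.max(np.abs(reproduced - h))
is_symmetric = np.allclose(K, K.T)
```

Here `max_error` measures how well the reproducing property holds numerically, and `is_symmetric` reports whether the resulting kernel happens to be symmetric, which a generic choice of M does not give.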

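An example of such a subduality is obtained by mapping the (L1,L∞) duality to $\mathbb{R}^X$ using injective operators defined by the families $\Gamma_x(\tau) = \mathbb{1}_{\{\tau<x\}}$ and $\Lambda_y(\tau) = \mathbb{1}_{\{\tau<y\}}$ (the orientation used in example 8.2.1):

$$T : L^1 \longrightarrow \mathbb{R}^X, \qquad f \longmapsto Tf(x) = (\Gamma_x, f)_{L^\infty, L^1} = \int \mathbb{1}_{\{\tau<x\}}\, f(\tau)\, d\tau$$

and

$$S : L^\infty \longrightarrow \mathbb{R}^X, \qquad g \longmapsto Sg(y) = (g, \Lambda_y)_{L^\infty, L^1} = \int g(\tau)\, \mathbb{1}_{\{\tau<y\}}\, d\tau$$

In this case H = Im(T), M = Im(S) and $K(y,x) = \int \Lambda(y,\tau)\Gamma(x,\tau)\, d\tau = \min(x,y)$. We define the duality map between H and M through:

$$L_X(g_1, g_2) = L_X(Sf_1, Tf_2) = L(f_1, f_2)$$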

See example 8.2.1 for details.
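The kernel of this (L1,L∞) example can be checked numerically. The sketch below assumes X = [0,1] with the Lebesgue measure and takes the indicator families with the orientation of example 8.2.1, that is Γx(τ) = 1{τ ≤ x} and Λy(τ) = 1{τ ≤ y}. The grid size and the helper names `kernel` and `T` are choices made for illustration only.

```python
import numpy as np

tau = np.linspace(0.0, 1.0, 200001)      # quadrature grid on [0, 1]
dtau = tau[1] - tau[0]

def kernel(x, y):
    """Approximate K(x, y) = integral of 1{tau <= x} * 1{tau <= y} d tau."""
    integrand = (tau <= x) * (tau <= y)
    return float(np.sum(integrand) * dtau)   # Riemann sum

def T(f, x):
    """T f(x) = integral of 1{tau <= x} * f(tau) d tau, a primitive of f."""
    return float(np.sum(f(tau) * (tau <= x)) * dtau)

pairs = [(0.2, 0.7), (0.5, 0.5), (0.9, 0.1)]
errors = [abs(kernel(x, y) - min(x, y)) for x, y in pairs]

# T maps an integrable f to its primitive vanishing at 0, here f(t) = 2t.
primitive = T(lambda t: 2.0 * t, 0.5)
```

Up to quadrature error, `errors` should be close to zero, which is the identity K(x,y) = min(x,y), and `primitive` approximates the value of t² at t = 0.5.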

All useful properties of r.k.h.s – pointwise evaluation, continuity of the evaluation functional, representa-

tion and building technique – are preserved. A missing dot product has no consequence on this functional

aspect of the learning problem.

7 Representer theorem

Another issue is of paramount practical importance: determining the shape of the solution. To this end, the representer theorem states that, when H is a r.k.h.s, the solution of the minimization of the regularized cost defined in equation (2) is a linear combination of the reproducing kernel evaluated at the training examples [9, 15]. When the hypothesis set H is a reproducing space associated with a subduality, we have the same kind of result: the solution lies in a finite n-dimensional subspace of H. However, we do not yet know how to systematically build a convenient generating family in this subspace.

Theorem 7.1 (representer) Assume (H,M) is a subduality of $\mathbb{R}^X$ with kernel K(x,y). Assume the stabilizer Ω is convex and differentiable (∂Ω denotes its subdifferential set). If $\partial\Omega\big(\sum_i \alpha_i K(x_i, x)\big) \subseteq \big\{\sum_i \beta_i \delta_{x_i}\big\} \subset H^*$, then the solution of cost minimization lies in an n-dimensional subspace of H.

Proof. Define a subset of M, $M_1 = \big\{\sum_{i=1}^n \alpha_i K(x_i, \cdot)\big\}$. Let $H_2 \subset H$ be the orthogonal of $M_1$ in the sense of the duality map (i.e. ∀f ∈ H2, ∀g ∈ M1, L(f,g) = 0). Then for all f ∈ H2, f(xi) = 0, i = 1,...,n. Now let H1 be the complement vector space defined such that

$$H = H_1 \oplus H_2 \iff \forall f \in H,\ \exists f_1 \in H_1 \text{ and } f_2 \in H_2 \text{ such that } f = f_1 + f_2$$

The solution of the minimization problem lies in H1 since:

- ∀f2 ∈ H2, C(f2) = constant

- $\Omega(f_1 + f_2) \ge \Omega(f_1) + (\partial\Omega(f_1), f_2)_{M,H}$ (thanks to the convexity of Ω)

- and ∀f2 ∈ H2, $(\partial\Omega(f_1), f_2)_{M,H} = 0$ (by hypothesis)

By construction H1 is an n-dimensional subspace of H.

The nature of vector space H1 depends on kernel K and on regularizer Ω. In some cases it is possible to be more precise and retrieve the nature of H1. Let us assume regularizer Ω(f) is given. H may be chosen as the set of functions such that Ω(f) < ∞. Then, if it is possible to build a subduality (H,M) with kernel K such that

$$E = \underbrace{\mathrm{Vect}\{K(x_i, \cdot)\}}_{H_1} \oplus \underbrace{\big(\mathrm{Vect}\{K(\cdot, x_i)\}\big)^{\top}}_{M_1^{\top}}$$

and if the vector space spanned by the kernel belongs to the regularizer subdifferential ∂Ω(f):

$$\forall f \in H_1,\ \exists g \in M_1 \text{ such that } g \in \partial\Omega(f)$$

then the solution f* of the minimization of the regularized empirical cost is a linear combination of the kernel:

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$$
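To make the expansion f*(x) = Σ αi K(xi,x) concrete, here is a minimal sketch restricted to the classical Hilbertian case, which is an assumption and not the general subduality setting of the theorem. It uses the kernel K(s,t) = min(s,t), a square loss and the regularizer Ω(f) = ‖f‖²_H, for which the coefficients solve the linear system (K + nλI)α = y. The synthetic data, the value of `lam` and the function names are hypothetical.

```python
import numpy as np

def min_kernel(s, t):
    """K(s, t) = min(s, t) evaluated on all pairs of two 1-D arrays."""
    return np.minimum.outer(s, t)

rng = np.random.default_rng(0)
n = 30
x_train = np.sort(rng.uniform(0.05, 1.0, size=n))   # training inputs in (0, 1]
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=n)

lam = 1e-3                                           # regularization weight (assumed)
K = min_kernel(x_train, x_train)

# Minimizer of (1/n) * ||K a - y||^2 + lam * a^T K a over the coefficients a.
alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)

def f_star(x):
    """f*(x) = sum_i alpha_i K(x_i, x)."""
    return min_kernel(np.atleast_1d(x), x_train) @ alpha

# Gradient of the regularized cost with respect to alpha; it should vanish.
objective_grad = (2 / n) * K @ (K @ alpha - y_train) + 2 * lam * K @ alpha
```

The solution is fully described by the n coefficients `alpha`, which is the practical content of the representer theorem in this special case. Note that `f_star(0.0)` is zero by construction, since every function in the Cameron-Martin space vanishes at the origin.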


An example of such a result is given by the following regularizer based on the p-norm on G = [0,1]:

$$\Omega(f) = \int_0^1 (f')^p \, d\mu$$

The hypothesis set is the Sobolev space Hp (the set of functions defined on [0,1] whose generalized derivative is p-integrable) put in duality with Hq (with 1/p + 1/q = 1) through the following duality map:

$$L(f,g) = \int_0^1 f' g' \, d\mu$$

The associated kernel is, just as in the Cameron-Martin case, K(x,y) = min(x,y). Some tedious derivations lead to:

$$\forall h \in H, \quad L(h, \partial\Omega(f)) = \int_0^1 h'\, p\, (f')^{p-1} \, d\mu$$

Thus the kernel verifies $p\,(K(\cdot,y)')^{p-1} \propto K(x,\cdot)$.

The question of the representer theorem is far from closed. We are still looking for a way to derive a generating family from the kernel and the regularizer. To obtain more general and constructive results, one possible direction is to work with the Fenchel dual of Ω.

8 Examples

8.1 Examples in Hilbert space

The examples in this section all deal with r.k.h.s included in an L2 space.

1. Schmidt ellipsoid:

Let (X,µ) be a measure space and {ei, i ∈ I} a basis of L2(X,µ), I being a countable set of indices. Any sequence $\{\alpha_i, i \in I, \sum_{i \in I} \alpha_i^2 < +\infty\}$ defines a Hilbert-Schmidt operator on L2(X,µ) with kernel function $\Gamma(x,y) = \sum_{i \in I} \alpha_i e_i(x) e_i(y)$, thus a reproducing kernel Hilbert space with kernel function:

$$\forall (x,y) \in X^2, \quad K(x,y) = \sum_{i \in I} \alpha_i^2\, e_i(x)\, e_i(y)$$

The closed unit ball $B_H$ of the r.k.h.s verifies

$$B_H = T(B_{L^2}) = \Big\{ f \in L^2,\ f = \sum_{i \in I} f_i e_i,\ \sum_{i \in I} \Big(\frac{f_i}{\alpha_i}\Big)^2 \le 1 \Big\}$$

and is then a Schmidt ellipsoid in L2. An interesting discussion about Schmidt ellipsoids and their

applications to sample continuity of Gaussian measures may be found in [6].

2. Cameron-Martin space:

Let T be the Carleman integral operator on L2([0,1],µ) (µ is the Lebesgue measure) with kernel function

$$\Gamma(x,y) = Y(x-y) = \mathbb{1}_{\{y \le x\}}$$

It defines a r.k.h.s with reproducing kernel K(x,y) = min(x,y). The space $(H, \langle \cdot,\cdot \rangle_H)$ is the Sobolev space of degree 1, also called the Cameron-Martin space:

$$H = \Big\{ f \text{ absolutely continuous},\ \exists f' \in L^2([0,1]),\ f(x) = \int_0^x f' \, d\mu \Big\}, \qquad \langle f,g \rangle_H = \langle f',g' \rangle_{L^2}$$


3. A Carleman but non Hilbert-Schmidt operator:

Let T be the integral operator on $L^2(\mathbb{R},\mu)$ (µ is the Lebesgue measure) with kernel function

$$\Gamma(x,y) = \exp\Big(-\frac{1}{2}(x-y)^2\Big)$$

It is a Carleman integral operator, thus we can define a r.k.h.s $(H, \langle\cdot,\cdot\rangle_H) = \mathrm{Im}(T)$, but T is not a Hilbert-Schmidt operator. The reproducing kernel of H is:

$$K(x,y) = \frac{1}{Z} \exp\Big(-\frac{1}{4}(x-y)^2\Big)$$

where Z is a suitable constant.

4. Continuous kernel:

This example is based on theorem 3.11 in [12]. Let X be a compact subspace of $\mathbb{R}$ and K(.,.) a continuous symmetric positive definite kernel. It defines a r.k.h.s $(H, \langle\cdot,\cdot\rangle_H)$ and any Radon measure µ of full support is kernel-injective. Then, for any such µ, there exists a Carleman operator T on L2(X,µ) such that $(H, \langle\cdot,\cdot\rangle_H) = \mathrm{Im}(T)$.

5. Hilbert space of constants:

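Let $(H, \langle\cdot,\cdot\rangle_H)$ be the Hilbert space of constant functions on $\mathbb{R}$ with scalar product $\langle f,g \rangle_H = f(0)g(0)$. It is obviously a r.k.h.s with reproducing kernel K(.,.) ≡ 1. For any probability measure µ on $\mathbb{R}$ let:

$$\forall f \in L^2(\mathbb{R},\mu), \quad Tf = \int_{\mathbb{R}} f(s)\, \mu(ds)$$

Then $H = T(L^2(\mathbb{R},\mu))$ and ∀f,g ∈ H, $\langle f,g \rangle_H = \langle f,g \rangle_{L^2}$.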

6. A non-separable r.k.h.s - the L2 space of almost surely null functions:

Define the positive definite kernel function on $X \subset \mathbb{R}$ by ∀s,t ∈ X, $K(s,t) = \mathbb{1}_{\{s=t\}}$. It defines a r.k.h.s $(H, \langle\cdot,\cdot\rangle_H)$ and its functions are null except on a countable set. Define a measure µ on (X,B), where B is the Borel σ-algebra on X, by µ({t}) = 1 ∀t ∈ X. µ verifies µ({t1,...,tn}) = n and µ(A) = +∞ for any non-finite A ∈ B. The kernel function is then square integrable and H is injectively included in L2(X,B,µ). Moreover, $K(s,t) = \int_X K(t,u)K(u,s)\,d\mu(u)$ with K Carleman integrable and $T = \mathrm{Id}_{L^2}$ (note that the identity is a non-compact Carleman integral operator). Finally, $(H, \langle\cdot,\cdot\rangle_H) = L^2(X,B,\mu)$.

7. Separable r.k.h.s :

Let H be a separable r.k.h.s. It is well known that any separable Hilbert space is isomorphic to ℓ2. Then there exists a kernel operator T such that Im(T) = H. It is easy to construct such a T explicitly: let {hn(.), n ∈ N} be an orthonormal basis of H and define the kernel operator T on ℓ2 with kernel Γ : x ↦ {hn(x), n ∈ N} (∈ ℓ2). Then Im(T) = H.
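Several of the examples above obtain the reproducing kernel from the kernel function Γ of a Carleman operator. The sketch below checks example 3 numerically, assuming the usual relation K(x,y) = ∫ Γ(x,u)Γ(y,u) dµ(u) between the two (the same relation that example 6 writes out for T = Id). The truncation of the real line to [-30, 30], the grid size and the function names are choices made for illustration only.

```python
import numpy as np

u = np.linspace(-30.0, 30.0, 600001)   # quadrature grid standing in for the real line
du = u[1] - u[0]

def gamma(x, t):
    """Gamma(x, t) = exp(-(x - t)^2 / 2), the kernel function of T."""
    return np.exp(-0.5 * (x - t) ** 2)

def kernel_from_operator(x, y):
    """K(x, y) = integral of Gamma(x, u) * Gamma(y, u) du (Riemann sum)."""
    return float(np.sum(gamma(x, u) * gamma(y, u)) * du)

def closed_form(x, y, Z):
    """K(x, y) = exp(-(x - y)^2 / 4) / Z."""
    return np.exp(-0.25 * (x - y) ** 2) / Z

# Fix the constant Z from the diagonal value K(0, 0) = 1 / Z.
Z = 1.0 / kernel_from_operator(0.0, 0.0)

pairs = [(0.0, 1.0), (-2.0, 0.5), (3.0, 3.0)]
errors = [abs(kernel_from_operator(x, y) - closed_form(x, y, Z)) for x, y in pairs]
```

Under this assumption, completing the square in the exponent gives Z = 1/√π, which the estimate `Z` should match up to quadrature error.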

8.2 Other examples

Applications to non-hilbertian spaces are also feasible:

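1. (L1,L∞) - "Cameron-Martin" evaluation subduality:

Let T be the kernel operator on L1([0,1],µ) (µ is the Lebesgue measure) with kernel function

$$\Gamma(t,s) = Y(t-s) = \mathbb{1}_{\{s \le t\}}, \qquad \Gamma(t,\cdot) \in L^\infty$$

It defines an evaluation duality (H1;H∞) with reproducing kernel

$$\forall (s,t) \in X^2, \quad K(s,t) = \min(s,t)$$

$$H_1 = \Big\{ f \text{ absolutely continuous},\ \exists f' \in L^1([0,1]),\ f(t) = \int_0^t f'(s)\,ds \Big\}, \qquad \|f\|_{H_1} = \|f'\|_{L^1}$$

and

$$H_\infty = \Big\{ f \text{ absolutely continuous},\ \exists f' \in L^\infty([0,1]),\ f(t) = \int_0^t f'(s)\,ds \Big\}, \qquad \|f\|_{H_\infty} = \|f'\|_{L^\infty}$$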


2. $(\mathbb{R}^X, \mathbb{R}^{[X]})$:

We have seen that $\mathbb{R}^X$ endowed with the topology of simple convergence is an ETS. However, $\mathbb{R}^X$ endowed with the topology of almost sure convergence is never an ETS unless every singleton of X has strictly positive measure.

9 Conclusion

It is always possible to learn without a kernel. But even if it is not visible, one is hidden somewhere! We have shown, from some basic principles (we want to be able to compute the value of a hypothesis at any point and we want the evaluation functional to be continuous), how to derive a framework generalizing r.k.h.s to non-hilbertian spaces. In our reproducing kernel dualities, all the nice properties of r.k.h.s are preserved except the dot product, which is replaced by a duality map. Based on the generalization of the hilbertian case, it is possible to build associated kernels thanks to simple operators. The construction of evaluation subdualities without Hilbert structure is easy within this framework (and rather new). The derivation of evaluation subdualities from any kernel operator has many practical outcomes. First, such operators on separable Hilbert spaces can be represented by matrices, and we can build any separable r.k.h.s from well-known ℓ2 structures (like wavelets in an L2 space for instance). Furthermore, the set of kernel operators is a vector space whereas the set of evaluation subdualities is not (the set of r.k.h.s is for instance a convex cone), hence practical combinations of such operators are feasible. On the other hand, from the Bayesian point of view, this result may have many theoretical and practical implications in the theory of Gaussian or Laplacian measures and abstract Wiener spaces.

Unfortunately, even if some work has been done, a general representer theorem is not available yet. We are looking for an automatic mechanism designing the shape of the solution of the learning problem in the following way:

$$f(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x) + \sum_{j=1}^{k} \beta_j \varphi_j(x)$$

where kernel K, number of components m and functions ϕj(x), j = 1,...,k are derived from regularizer Ω. The remaining questions are: how to learn the coefficients and how to determine the cost function?

Acknowledgements

Part of this work was carried out while the authors were visiting B. Schölkopf in Tuebingen. The section dedicated to the representer theorem benefits from O. Bousquet's ideas. This work also benefits from comments and discussions with students of the NATO ASI on Learning Theory and Practice in Leuven.

References

[1] D.A. Alpay, Some Krein spaces of analytic functions and an inverse scattering problem, Michigan Mathematical Journal 34 (1987) 349–359.

[2] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society 68 (1950) 337–404.

[3] M. Attéia, Hilbertian kernels and spline functions, North-Holland (1992).

[4] M. Attéia and J. Audounet, Inf-compact potentials and banachic kernels, In Banach space theory and its

applications, volume 991 of Lecture notes in mathematics, Springer-Verlag (1981) 7–27.

[5] F. Cucker and S. Smale, On the mathematical foundations of learning. Bulletin of the American Mathematical

Society 39 (2002) 1–49.

[6] R. M. Dudley, Uniform central limit theorems, Cambridge university press (1999).