Local optimisation of Nyström samples
through stochastic gradient descent
Matthew Hutchings§ and Bertrand Gauthier§
March 28, 2022
Abstract
We study a relaxed version of the column-sampling problem for the Nyström approximation of
kernel matrices, where approximations are defined from multisets of landmark points in the ambient
space; such multisets are referred to as Nyström samples. We consider an unweighted variation of
the radial squared-kernel discrepancy (SKD) criterion as a surrogate for the classical criteria used to
assess the Nyström approximation accuracy; in this setting, we discuss how Nyström samples can be
efficiently optimised through stochastic gradient descent. We perform numerical experiments which
demonstrate that the local minimisation of the radial SKD yields Nyström samples with improved
Nyström approximation accuracy.
Keywords: Low-rank matrix approximation; Nyström method; reproducing kernel Hilbert spaces;
stochastic gradient descent.
1 Introduction
In Data Science, the Nyström method refers to a specific technique for the low-rank approximation of
symmetric positive-semidefinite (SPSD) matrices; see e.g. [4, 5, 10, 11, 18]. Given an $N \times N$ SPSD
matrix $\mathbf{K}$, with $N \in \mathbb{N}$, the Nyström method consists of selecting a sample of $n$ columns of $\mathbf{K}$,
generally with $n \ll N$, and next defining a low-rank approximation $\hat{\mathbf{K}}$ of $\mathbf{K}$ based on this sample of
columns. More precisely, let $\boldsymbol{c}_1, \dots, \boldsymbol{c}_N \in \mathbb{R}^N$ be the columns of $\mathbf{K}$, so that $\mathbf{K} = (\boldsymbol{c}_1 \cdots \boldsymbol{c}_N)$, and let
$I = \{i_1, \dots, i_n\} \subseteq \{1, \dots, N\}$ denote the indices of a sample of $n$ columns of $\mathbf{K}$ (note that $I$ is a multiset,
i.e. the indices of some columns might potentially be repeated). Let $\mathbf{C} = (\boldsymbol{c}_{i_1} \cdots \boldsymbol{c}_{i_n})$ be the $N \times n$
matrix defined from the considered sample of columns of $\mathbf{K}$, and let $\mathbf{W}$ be the $n \times n$ principal submatrix
of $\mathbf{K}$ defined by the indices in $I$, i.e. the $k,l$ entry of $\mathbf{W}$ is $[\mathbf{K}]_{i_k, i_l}$, the $i_k, i_l$ entry of $\mathbf{K}$. The Nyström
approximation of $\mathbf{K}$ defined from the sample of columns indexed by $I$ is given by
$$\hat{\mathbf{K}} = \mathbf{C} \mathbf{W}^{\dagger} \mathbf{C}^{T}, \qquad (1)$$
with $\mathbf{W}^{\dagger}$ the Moore-Penrose pseudoinverse of $\mathbf{W}$. The column-sampling problem for Nyström approximation
consists of designing samples of columns such that the induced approximations are as accurate
as possible (see Section 1.2 for more details).
HutchingsM1@cardiff.ac.uk
GauthierB@cardiff.ac.uk
§Cardiff University, School of Mathematics
Abacws, Senghennydd Road, Cardiff, CF24 4AG, United Kingdom
1.1 Kernel Matrix Approximation
If the initial SPSD matrix $\mathbf{K}$ is a kernel matrix, defined from an SPSD kernel $K$ and a set or multiset of
points $\mathcal{D} = \{x_1, \dots, x_N\} \subseteq \mathcal{X}$ (and with $\mathcal{X}$ a general ambient space), i.e. the $i,j$ entry of $\mathbf{K}$ is $K(x_i, x_j)$,
then a sample of columns of $\mathbf{K}$ is naturally associated with a subset of $\mathcal{D}$; more precisely, a sample of
columns $\{\boldsymbol{c}_{i_1}, \dots, \boldsymbol{c}_{i_n}\}$, indexed by $I$, naturally defines a multiset $\{x_{i_1}, \dots, x_{i_n}\} \subseteq \mathcal{D}$, so that the induced
Nyström approximation can in this case be regarded as an approximation induced by a subset of points in
$\mathcal{D}$. Consequently, in the kernel-matrix framework, instead of relying only on subsets of columns, we may
more generally consider Nyström approximations defined from a multiset $\mathcal{S} \subset \mathcal{X}$. Using matrix notation,
the Nyström approximation of $\mathbf{K}$ defined by a subset $\mathcal{S} = \{s_1, \dots, s_n\}$ is the $N \times N$ SPSD matrix $\hat{\mathbf{K}}(\mathcal{S})$,
with $i,j$ entry
$$\big[\hat{\mathbf{K}}(\mathcal{S})\big]_{i,j} = \mathbf{k}_{\mathcal{S}}^{T}(x_i) \, \mathbf{K}_{\mathcal{S}}^{\dagger} \, \mathbf{k}_{\mathcal{S}}(x_j), \qquad (2)$$
where $\mathbf{K}_{\mathcal{S}}$ is the $n \times n$ kernel matrix defined by the kernel $K$ and the subset $\mathcal{S}$, and where
$\mathbf{k}_{\mathcal{S}}(x) = \big( K(x, s_1), \dots, K(x, s_n) \big)^{T} \in \mathbb{R}^n$.
We shall refer to such a set or multiset $\mathcal{S}$ as a Nyström sample, and to the elements of $\mathcal{S}$ as landmark points;
the notation $\hat{\mathbf{K}}(\mathcal{S})$ emphasises that the considered Nyström approximation of $\mathbf{K}$ is induced by $\mathcal{S}$. As in
the column-sampling case, the landmark-point-based framework naturally raises questions related to the
characterisation and the design of efficient Nyström samples (i.e. samples leading to accurate approximations
of $\mathbf{K}$). As an interesting feature, Nyström samples of size $n$ may be regarded as elements of $\mathcal{X}^n$, and if the
underlying set $\mathcal{X}$ is regular enough, they might be directly optimised over $\mathcal{X}^n$; the situation we consider
in this work corresponds to the case $\mathcal{X} = \mathbb{R}^d$, with $d \in \mathbb{N}$, but $\mathcal{X}$ may more generally be a differentiable
manifold.
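As an illustration of (2), the following minimal NumPy sketch assembles $\hat{\mathbf{K}}(\mathcal{S})$ from a multiset of landmark points for a Gaussian kernel; the kernel choice and all function and variable names are illustrative assumptions rather than part of the paper.

```python
# Minimal sketch (illustrative assumptions): Nystrom approximation (2) for a
# Gaussian kernel K(x, t) = exp(-rho * ||x - t||^2).
import numpy as np

def gaussian_kernel(A, B, rho=1.0):
    # Pairwise kernel matrix with entries K(a_i, b_j).
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-rho * sq_dists)

def nystrom_approximation(X, S, rho=1.0):
    # X: (N, d) points defining K; S: (n, d) landmark points.
    K_S = gaussian_kernel(S, S, rho)   # n x n kernel matrix K_S
    C = gaussian_kernel(X, S, rho)     # N x n matrix with rows k_S(x_i)^T
    # K_hat(S) has entries k_S(x_i)^T K_S^+ k_S(x_j), i.e. C K_S^+ C^T.
    return C @ np.linalg.pinv(K_S) @ C.T
```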
Remark 1.1. If we denote by $\mathcal{H}$ the reproducing kernel Hilbert space (RKHS, see e.g. [1, 14]) of real-valued
functions on $\mathcal{X}$ associated with $K$, we may then note that the matrix $\hat{\mathbf{K}}(\mathcal{S})$ is the kernel matrix
defined by $K_{\mathcal{S}}$ and the set $\mathcal{D}$, with $K_{\mathcal{S}}$ the reproducing kernel of the subspace
$\mathcal{H}_{\mathcal{S}} = \operatorname{span}\{k_{s_1}, \dots, k_{s_n}\} \subseteq \mathcal{H}$,
where, for $t \in \mathcal{X}$, the function $k_t \in \mathcal{H}$ is defined as $k_t(x) = K(x, t)$, for all $x \in \mathcal{X}$.
1.2 Assessing the Accuracy of Nyström Approximations
In the classical literature on the Nyström approximation of SPSD matrices, the accuracy of the approximation
induced by a Nyström sample $\mathcal{S}$ is often assessed through the following criteria:

(C.1) $\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{*}$, with $\|\cdot\|_{*}$ the trace norm;

(C.2) $\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{F}$, with $\|\cdot\|_{F}$ the Frobenius norm;

(C.3) $\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{2}$, with $\|\cdot\|_{2}$ the spectral norm.

Although defining relevant and easily interpretable measures of the approximation error, these criteria are
relatively costly to evaluate. Indeed, each of them involves the inversion or pseudoinversion of the kernel
matrix $\mathbf{K}_{\mathcal{S}}$, with complexity $\mathcal{O}(n^3)$. The evaluation of the criterion (C.1) also involves the computation
of the $N$ diagonal entries of $\hat{\mathbf{K}}(\mathcal{S})$, leading to an overall complexity of $\mathcal{O}(n^3 + Nn^2)$. The evaluation of
(C.2) involves the full construction of the matrix $\hat{\mathbf{K}}(\mathcal{S})$, with an overall complexity of $\mathcal{O}(n^3 + n^2 N^2)$, and
the evaluation of (C.3) in addition requires the computation of the largest eigenvalue of an $N \times N$ SPSD
matrix, leading to an overall complexity of $\mathcal{O}(n^3 + n^2 N^2 + N^3)$. If $\mathcal{X} = \mathbb{R}^d$, then the evaluation of the
partial derivatives of these criteria (regarded as maps from $\mathcal{X}^n$ to $\mathbb{R}$) with respect to a single coordinate
of a landmark point has a complexity similar to the complexity of evaluating the criteria themselves. As
a result, a direct optimisation of these criteria over $\mathcal{X}^n$ is intractable in most practical applications.
1.3 Radial Squared-Kernel Discrepancy
As a surrogate for the criteria (C.1)-(C.3), and following the connections between the Nyström approximation
of SPSD matrices, the approximation of integral operators with SPSD kernels and the kernel
embedding of measures, we consider the following radial squared-kernel discrepancy criterion (radial
SKD, see [7, 9]), denoted by $R$ and given by, for $\mathcal{S} = \{s_1, \dots, s_n\}$,
$$R(\mathcal{S}) = \|\mathbf{K}\|_{F}^{2} - \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}} \Big( \sum_{i=1}^{N} \sum_{j=1}^{n} K^2(x_i, s_j) \Big)^{2}, \quad \text{if } \|\mathbf{K}_{\mathcal{S}}\|_{F} > 0, \qquad (3)$$
and $R(\mathcal{S}) = \|\mathbf{K}\|_{F}^{2}$ if $\|\mathbf{K}_{\mathcal{S}}\|_{F} = 0$; the notation $K^2(x_i, s_j)$ stands for $\big(K(x_i, s_j)\big)^2$. We may note that
$R(\mathcal{S}) \geq 0$. In (3), the evaluation of the term $\|\mathbf{K}\|_{F}^{2}$ has complexity $\mathcal{O}(N^2)$; nevertheless, this term does
not depend on the Nyström sample $\mathcal{S}$, and may thus be regarded as a constant. The complexity of the
evaluation of the term $R(\mathcal{S}) - \|\mathbf{K}\|_{F}^{2}$, i.e. of the radial SKD up to the constant $\|\mathbf{K}\|_{F}^{2}$, is $\mathcal{O}(n^2 + nN)$,
and the same holds for the complexity of the evaluation of the partial derivative of $R(\mathcal{S})$ with respect to
a coordinate of a landmark point, see equation (5) below. We may in particular note that the evaluation
of the radial SKD criterion or its partial derivatives does not involve the inversion or pseudoinversion of
the $n \times n$ matrix $\mathbf{K}_{\mathcal{S}}$.
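The following short sketch evaluates the radial SKD up to the constant $\|\mathbf{K}\|_{F}^{2}$ for the Gaussian kernel used later in the experiments; it only touches $n \times n$ and $n \times N$ squared-kernel blocks, in line with the $\mathcal{O}(n^2 + nN)$ cost stated above (the names and the kernel choice are illustrative assumptions).

```python
# Illustrative sketch: R(S) - ||K||_F^2 for a Gaussian kernel, cost O(n^2 + nN).
import numpy as np

def radial_skd_minus_const(X, S, rho=1.0):
    # X: (N, d) points defining K; S: (n, d) landmark points.
    # The squared kernel is K^2(s, x) = exp(-2 * rho * ||s - x||^2).
    K2_SS = np.exp(-2.0 * rho * np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1))
    K2_SX = np.exp(-2.0 * rho * np.sum((S[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    frob2 = np.sum(K2_SS)              # ||K_S||_F^2
    if frob2 == 0.0:
        return 0.0                     # in this case R(S) = ||K||_F^2
    return -np.sum(K2_SX) ** 2 / frob2
```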
Remark 1.2. From a theoretical standpoint, the radial SKD criterion measures the distance, in the Hilbert
space of all Hilbert-Schmidt operators on $\mathcal{H}$, between the integral operator corresponding to the initial
matrix $\mathbf{K}$, and the projection of this operator onto the subspace spanned by an integral operator defined
from the kernel $K$ and a uniform measure on $\mathcal{S}$. The radial SKD may also be defined for non-uniform
measures, and the criterion in this case depends not only on $\mathcal{S}$, but also on a set of relative weights
associated with each landmark point in $\mathcal{S}$; in this work, we only focus on the uniform-weight case. See
[7, 9] for more details.
The following inequalities hold:
$$\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{2}^{2} \leq \|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{F}^{2} \leq R(\mathcal{S}) \leq \|\mathbf{K}\|_{F}^{2}, \quad \text{and} \quad \frac{1}{N} \|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{*}^{2} \leq \|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{F}^{2},$$
which, in complement to the theoretical properties enjoyed by the radial SKD, further support the use of
the radial SKD as a numerically affordable surrogate for (C.1)-(C.3) (see also the numerical experiments
in Section 4).
From now on, we assume that $\mathcal{X} = \mathbb{R}^d$. Let $[s]_l$, with $l \in \{1, \dots, d\}$, be the $l$-th coordinate of $s \in \mathcal{X}$ in
the canonical basis of $\mathcal{X} = \mathbb{R}^d$. For $x \in \mathcal{X}$, we denote by (assuming they exist)
$$\partial^{[\mathrm{l}]}_{[s]_l} K^2(s, x) \quad \text{and} \quad \partial^{[\mathrm{d}]}_{[s]_l} K^2(s, s) \qquad (4)$$
the partial derivatives of the maps $s \mapsto K^2(s, x)$ and $s \mapsto K^2(s, s)$ at $s$ and with respect to the $l$-th
coordinate of $s$, respectively; the notation $\partial^{[\mathrm{l}]}$ indicates that the left entry of the kernel is considered,
while $\partial^{[\mathrm{d}]}$ refers to the diagonal of the kernel; we use similar notation for any kernel function on $\mathcal{X} \times \mathcal{X}$.
For a fixed number of landmark points $n \in \mathbb{N}$, the radial SKD criterion can be regarded as a function
from $\mathcal{X}^n$ to $\mathbb{R}$. For a Nyström sample $\mathcal{S} = \{s_1, \dots, s_n\} \in \mathcal{X}^n$, and for $k \in \{1, \dots, n\}$ and $l \in \{1, \dots, d\}$,
we denote by $\partial_{[s_k]_l} R(\mathcal{S})$ the partial derivative of the map $R \colon \mathcal{X}^n \to \mathbb{R}$ at $\mathcal{S}$ with respect to the $l$-th
coordinate of the $k$-th landmark point $s_k \in \mathcal{X}$. We have
$$\partial_{[s_k]_l} R(\mathcal{S}) = \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{4}} \Big( \sum_{i=1}^{N} \sum_{j=1}^{n} K^2(s_j, x_i) \Big)^{2} \Big( \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) + 2 \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j) \Big) - \frac{2}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}} \Big( \sum_{i=1}^{N} \sum_{j=1}^{n} K^2(s_j, x_i) \Big) \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i). \qquad (5)$$
In this work, we investigate the possibility of using the partial derivatives (5), or stochastic approximations
of these derivatives, to directly optimise the radial SKD criterion $R$ over $\mathcal{X}^n$ via gradient or stochastic
gradient descent; the stochastic approximation schemes we consider aim at reducing the numerical cost
induced by the evaluation of the partial derivatives of $R$ when $N$ is large.
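As a concrete instance of (5), the sketch below computes the full gradient for the Gaussian kernel $K(x, t) = e^{-\rho \|x - t\|^2}$, for which $K^2(s, x) = e^{-2\rho \|s - x\|^2}$, $\partial^{[\mathrm{l}]}_{[s]_l} K^2(s, x) = -4\rho ([s]_l - [x]_l) K^2(s, x)$ and the diagonal term $\partial^{[\mathrm{d}]}_{[s]_l} K^2(s, s)$ vanishes; the implementation and its names are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch: exact gradient (5) of the radial SKD, Gaussian kernel.
import numpy as np

def radial_skd_gradient(X, S, rho=1.0):
    # X: (N, d) points defining K; S: (n, d) landmark points; returns (n, d).
    diff_SS = S[:, None, :] - S[None, :, :]                  # (n, n, d)
    diff_SX = S[:, None, :] - X[None, :, :]                  # (n, N, d)
    K2_SS = np.exp(-2.0 * rho * np.sum(diff_SS ** 2, axis=-1))
    K2_SX = np.exp(-2.0 * rho * np.sum(diff_SX ** 2, axis=-1))
    frob2 = np.sum(K2_SS)                                    # ||K_S||_F^2
    T1 = np.sum(K2_SX)                                       # sum_{i,j} K^2(s_j, x_i)
    # Partial derivatives of K^2 with respect to the coordinates of s_k.
    dK2_SS = -4.0 * rho * diff_SS * K2_SS[:, :, None]        # (n, n, d)
    dK2_SX = -4.0 * rho * diff_SX * K2_SX[:, :, None]        # (n, N, d)
    Upsilon = 2.0 * np.sum(dK2_SS, axis=1)                   # diagonal term is 0 here
    T2 = np.sum(dK2_SX, axis=1)                              # (n, d), the terms T_2^{k,l}
    return (T1 ** 2 / frob2 ** 2) * Upsilon - (2.0 * T1 / frob2) * T2
```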
The document is organised as follows. In Section 2, we discuss the convergence of a gradient descent
with fixed step size for the minimisation of $R$ over $\mathcal{X}^n$. The stochastic approximation of the gradient
of the radial SKD criterion (3) is discussed in Section 3, and some numerical experiments are carried
out in Section 4. Section 5 consists of a concluding discussion, and the Appendix contains a proof of
Theorem 2.1.
2 A Convergence Result
We use the same notation as in Section 1.3 (in particular, we still assume that $\mathcal{X} = \mathbb{R}^d$), and by analogy
with (4), for $s$ and $x \in \mathcal{X}$, and for $l \in \{1, \dots, d\}$, we denote by $\partial^{[\mathrm{r}]}_{[s]_l} K^2(x, s)$ the partial derivative of
the map $s \mapsto K^2(x, s)$ with respect to the $l$-th coordinate of $s$. Also, for a fixed $n \in \mathbb{N}$, we denote by
$\nabla R(\mathcal{S}) \in \mathcal{X}^n = \mathbb{R}^{nd}$ the gradient of $R \colon \mathcal{X}^n \to \mathbb{R}$ at $\mathcal{S}$; in matrix notation, we have
$$\nabla R(\mathcal{S}) = \big( \nabla_{s_1} R(\mathcal{S})^{T}, \dots, \nabla_{s_n} R(\mathcal{S})^{T} \big)^{T},$$
with $\nabla_{s_k} R(\mathcal{S}) = \big( \partial_{[s_k]_1} R(\mathcal{S}), \dots, \partial_{[s_k]_d} R(\mathcal{S}) \big)^{T} \in \mathbb{R}^d$ for $k \in \{1, \dots, n\}$.
Theorem 2.1. We make the following assumptions on the squared kernel $K^2$, which we assume hold for
all $x$ and $y \in \mathcal{X} = \mathbb{R}^d$, and all $l$ and $l' \in \{1, \dots, d\}$, uniformly:

(A.1) there exists $\alpha > 0$ such that $K^2(x, x) \geq \alpha$;

(A.2) there exists $M_1 > 0$ such that $\big| \partial^{[\mathrm{d}]}_{[x]_l} K^2(x, x) \big| \leq M_1$ and $\big| \partial^{[\mathrm{l}]}_{[x]_l} K^2(x, y) \big| \leq M_1$;

(A.3) there exists $M_2 > 0$ such that $\big| \partial^{[\mathrm{d}]}_{[x]_{l'}} \partial^{[\mathrm{d}]}_{[x]_l} K^2(x, x) \big| \leq M_2$, $\big| \partial^{[\mathrm{l}]}_{[x]_{l'}} \partial^{[\mathrm{l}]}_{[x]_l} K^2(x, y) \big| \leq M_2$ and
$\big| \partial^{[\mathrm{l}]}_{[x]_{l'}} \partial^{[\mathrm{r}]}_{[y]_{l}} K^2(x, y) \big| \leq M_2$.

Let $\mathcal{S}$ and $\mathcal{S}' \in \mathbb{R}^{nd}$ be two Nyström samples; under the above assumptions, there exists $L > 0$ such that
$$\|\nabla R(\mathcal{S}) - \nabla R(\mathcal{S}')\| \leq L \, \|\mathcal{S} - \mathcal{S}'\|,$$
with $\|\cdot\|$ the Euclidean norm of $\mathbb{R}^{nd}$; in other words, the gradient of $R \colon \mathbb{R}^{nd} \to \mathbb{R}$ is Lipschitz-continuous
with Lipschitz constant $L$.
Since $R$ is bounded from below, for $0 < \gamma \leq 1/L$ and independently of the considered initial Nyström
sample $\mathcal{S}^{(0)}$, Theorem 2.1 entails that a gradient descent from $\mathcal{S}^{(0)}$, with fixed stepsize $\gamma$, for the minimisation
of $R$ over $\mathcal{X}^n$, produces a sequence of iterates that converges to a critical point of $R$. Barring some
specific and largely pathological cases, the resulting critical point is likely to be a local minimum of $R$;
see for instance [12]. See the Appendix for a proof of Theorem 2.1.
The assumptions considered in Theorem 2.1 ensure the existence of a general Lipschitz constant $L$
for the gradient of $R$; they, for instance, hold for all sufficiently regular Matérn kernels (thus including
the Gaussian, or squared-exponential, kernel). These assumptions are only sufficient conditions for
the convergence of a gradient descent for the minimisation of $R$. By introducing additional problem-dependent
conditions, some convergence results might be obtained for more general squared kernels $K^2$
and adequate initial Nyström samples $\mathcal{S}^{(0)}$. For instance, the assumption (A.1) simply aims at ensuring that
$\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2} \geq n\alpha > 0$ for all $\mathcal{S} \in \mathcal{X}^n$; this assumption might be relaxed to account for kernels with vanishing
diagonal, but one might then need to introduce ad hoc conditions to ensure that $\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}$ remains large
enough during the minimisation process.
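A fixed-stepsize gradient descent of the type covered by Theorem 2.1 then takes the following minimal form; this is a sketch reusing the hypothetical radial_skd_gradient helper introduced earlier, and in practice $\gamma$ would be chosen no larger than $1/L$, or simply tuned by hand as in Section 4.

```python
# Minimal sketch: gradient descent with fixed stepsize for the radial SKD.
def gradient_descent(X, S0, gamma, T, rho=1.0):
    # S0: (n, d) initial Nystrom sample; gamma: fixed stepsize; T: iterations.
    S = S0.copy()
    for _ in range(T):
        S = S - gamma * radial_skd_gradient(X, S, rho)
    return S
```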
3 Stochastic Approximation of the Radial SKD Gradient
The complexity of evaluating a partial derivative of $R \colon \mathcal{X}^n \to \mathbb{R}$ is $\mathcal{O}(n^2 + nN)$, which might become
prohibitive for large values of $N$. To overcome this limitation, stochastic approximations of the gradient
of $R$ might be considered (see e.g. [2]).
The evaluation of (5) involves, for instance, terms of the form $\sum_{i=1}^{N} K^2(s, x_i)$, with $s \in \mathcal{X}$ and
$\mathcal{D} = \{x_1, \dots, x_N\}$. Introducing a random variable $X$ with uniform distribution on $\mathcal{D}$, we can note that
$$\sum_{i=1}^{N} K^2(s, x_i) = N \, \mathbb{E}\big[ K^2(s, X) \big],$$
and the mean $\mathbb{E}[K^2(s, X)]$ may then, classically, be approximated by random sampling. More precisely,
if $X_1, \dots, X_b$ are $b$ copies of $X$, we have
$$\mathbb{E}\big[ K^2(s, X) \big] = \frac{1}{b} \sum_{j=1}^{b} \mathbb{E}\big[ K^2(s, X_j) \big] \quad \text{and} \quad \mathbb{E}\big[ \partial^{[\mathrm{l}]}_{[s]_l} K^2(s, X) \big] = \frac{1}{b} \sum_{j=1}^{b} \mathbb{E}\big[ \partial^{[\mathrm{l}]}_{[s]_l} K^2(s, X_j) \big],$$
so that we can easily define unbiased estimators of the various terms appearing in (5). We refer to the
sample size $b$ as the batch size.
Let $k \in \{1, \dots, n\}$ and $l \in \{1, \dots, d\}$; the partial derivative (5) can be rewritten as
$$\partial_{[s_k]_l} R(\mathcal{S}) = \frac{T_1^2}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{4}} \, \Upsilon(\mathcal{S}) - \frac{2 \, T_1 T_2^{k,l}}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}},$$
with $T_1 = \sum_{i=1}^{N} \sum_{j=1}^{n} K^2(s_j, x_i)$ and $T_2^{k,l} = \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i)$, and
$$\Upsilon(\mathcal{S}) = \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) + 2 \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j).$$
The terms $T_1$ and $T_2^{k,l}$ are the only terms in (5) that depend on $\mathcal{D}$. From a uniform random sample
$\mathbf{X} = \{X_1, \dots, X_b\}$, we define the unbiased estimators $\hat{T}_1(\mathbf{X})$ of $T_1$, and $\hat{T}_2^{k,l}(\mathbf{X})$ of $T_2^{k,l}$, as
$$\hat{T}_1(\mathbf{X}) = \frac{N}{b} \sum_{i=1}^{n} \sum_{j=1}^{b} K^2(s_i, X_j), \quad \text{and} \quad \hat{T}_2^{k,l}(\mathbf{X}) = \frac{N}{b} \sum_{j=1}^{b} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, X_j).$$
In what follows, we discuss the properties of some stochastic approximations of the gradient of $R$ that
can be defined from such estimators.
One-Sample Approximation. Using a single random sample $\mathbf{X} = \{X_1, \dots, X_b\}$ of size $b$, we can
define the following stochastic approximation of the partial derivative (5):
$$\hat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X}) = \frac{\hat{T}_1(\mathbf{X})^2}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{4}} \, \Upsilon(\mathcal{S}) - \frac{2 \, \hat{T}_1(\mathbf{X}) \hat{T}_2^{k,l}(\mathbf{X})}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}}. \qquad (6)$$
An evaluation of $\hat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X})$ has complexity $\mathcal{O}(n^2 + nb)$, as opposed to $\mathcal{O}(n^2 + nN)$ for the corresponding
exact partial derivative. However, due to the dependence between $\hat{T}_1(\mathbf{X})$ and $\hat{T}_2^{k,l}(\mathbf{X})$, and to the fact that
$\hat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X})$ involves the square of $\hat{T}_1(\mathbf{X})$, the stochastic partial derivative $\hat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X})$ will generally
be a biased estimator of $\partial_{[s_k]_l} R(\mathcal{S})$.
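A possible NumPy sketch of the one-sample estimator (6) for the Gaussian kernel is given below; the i.i.d. uniform batch is drawn with replacement from $\mathcal{D}$, and all names are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch: one-sample stochastic gradient (6), Gaussian kernel.
import numpy as np

def radial_skd_sgd_gradient(X, S, b, rho=1.0, rng=None):
    # X: (N, d) points defining K; S: (n, d) landmarks; b: batch size.
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    batch = X[rng.integers(0, N, size=b)]                    # i.i.d. uniform on D
    diff_SS = S[:, None, :] - S[None, :, :]
    diff_SB = S[:, None, :] - batch[None, :, :]
    K2_SS = np.exp(-2.0 * rho * np.sum(diff_SS ** 2, axis=-1))
    K2_SB = np.exp(-2.0 * rho * np.sum(diff_SB ** 2, axis=-1))
    frob2 = np.sum(K2_SS)                                    # ||K_S||_F^2
    T1_hat = (N / b) * np.sum(K2_SB)                         # estimator of T_1
    dK2_SS = -4.0 * rho * diff_SS * K2_SS[:, :, None]
    dK2_SB = -4.0 * rho * diff_SB * K2_SB[:, :, None]
    Upsilon = 2.0 * np.sum(dK2_SS, axis=1)
    T2_hat = (N / b) * np.sum(dK2_SB, axis=1)                # estimator of T_2^{k,l}
    return (T1_hat ** 2 / frob2 ** 2) * Upsilon - (2.0 * T1_hat / frob2) * T2_hat
```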
Two-Sample Approximation. To obtain an unbiased estimator of the partial derivative (5), instead of
considering a single random sample, we may define a stochastic approximation based on two independent
random samples $\mathbf{X} = \{X_1, \dots, X_{b_{\mathbf{X}}}\}$ and $\mathbf{Y} = \{Y_1, \dots, Y_{b_{\mathbf{Y}}}\}$, consisting of $b_{\mathbf{X}}$ and $b_{\mathbf{Y}}$ copies of $X$
(i.e. consisting of uniform random variables on $\mathcal{D}$), with $b = b_{\mathbf{X}} + b_{\mathbf{Y}}$. The two-sample estimator of (5)
is then given by
$$\hat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X}, \mathbf{Y}) = \frac{\hat{T}_1(\mathbf{X}) \hat{T}_1(\mathbf{Y})}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{4}} \, \Upsilon(\mathcal{S}) - \frac{2 \, \hat{T}_1(\mathbf{X}) \hat{T}_2^{k,l}(\mathbf{Y})}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}}, \qquad (7)$$
and since $\mathbb{E}\big[ \hat{T}_1(\mathbf{X}) \hat{T}_1(\mathbf{Y}) \big] = T_1^2$ and $\mathbb{E}\big[ \hat{T}_1(\mathbf{X}) \hat{T}_2^{k,l}(\mathbf{Y}) \big] = T_1 T_2^{k,l}$, we have
$$\mathbb{E}\big[ \hat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X}, \mathbf{Y}) \big] = \partial_{[s_k]_l} R(\mathcal{S}).$$
Although unbiased, for a common batch size $b$, the two-sample estimator (7) will generally have a larger
variance than the one-sample estimator (6). In our numerical experiments, the larger variance of the
unbiased estimator (7) appears to actually slow down the descent when compared with the descent
obtained with the one-sample estimator (6).
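For completeness, a compact sketch of the two-sample estimator (7) is shown below; it draws the two independent batches itself and otherwise mirrors the one-sample sketch (again a hedged illustration with assumed names, not the authors' implementation).

```python
# Illustrative sketch: two-sample (unbiased) stochastic gradient (7).
import numpy as np

def radial_skd_two_sample_gradient(X, S, b_x, b_y, rho=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    batch_x = X[rng.integers(0, N, size=b_x)]
    batch_y = X[rng.integers(0, N, size=b_y)]
    diff_SS = S[:, None, :] - S[None, :, :]
    K2_SS = np.exp(-2.0 * rho * np.sum(diff_SS ** 2, axis=-1))
    frob2 = np.sum(K2_SS)
    Upsilon = 2.0 * np.sum(-4.0 * rho * diff_SS * K2_SS[:, :, None], axis=1)

    def t1_hat(batch):
        K2_SB = np.exp(-2.0 * rho * np.sum((S[:, None, :] - batch[None, :, :]) ** 2, axis=-1))
        return (N / batch.shape[0]) * np.sum(K2_SB)

    diff_SY = S[:, None, :] - batch_y[None, :, :]
    K2_SY = np.exp(-2.0 * rho * np.sum(diff_SY ** 2, axis=-1))
    T2_hat_y = (N / b_y) * np.sum(-4.0 * rho * diff_SY * K2_SY[:, :, None], axis=1)
    # T1 is estimated from each of the two independent batches, T2 from batch_y only.
    return (t1_hat(batch_x) * t1_hat(batch_y) / frob2 ** 2) * Upsilon \
           - (2.0 * t1_hat(batch_x) / frob2) * T2_hat_y
```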
Remark 3.1. While the two samples $\mathbf{X}$ and $\mathbf{Y}$ are independent, the two terms $\hat{T}_1(\mathbf{X}) \hat{T}_1(\mathbf{Y})$ and
$\hat{T}_1(\mathbf{X}) \hat{T}_2^{k,l}(\mathbf{Y})$ appearing in (7) are dependent. This dependence may complicate the analysis of the
properties of the resulting SGD; nevertheless, this issue might be overcome by considering four independent
samples instead of two.
4 Numerical Experiments
Throughout this section, the matrices $\mathbf{K}$ are defined from multisets $\mathcal{D} = \{x_1, \dots, x_N\} \subset \mathbb{R}^d$ and from
kernels $K$ of the form $K(x, t) = e^{-\rho \|x - t\|^2}$, with $\rho > 0$ and where $\|\cdot\|$ is the Euclidean norm of $\mathbb{R}^d$
(Gaussian kernel). Except for the synthetic example of Section 4.1, all the multisets $\mathcal{D}$ we consider
consist of the entries of data sets available on the UCI Machine Learning Repository; see [6].
Our experiments are based on the following protocol: for a given $n \in \mathbb{N}$, we consider an initial
Nyström sample $\mathcal{S}^{(0)}$ consisting of $n$ points drawn uniformly at random, without replacement, from $\mathcal{D}$.
The initial sample $\mathcal{S}^{(0)}$ is regarded as an element of $\mathcal{X}^n$, and used to initialise a GD or SGD, with fixed
stepsize $\gamma > 0$, for the minimisation of $R$ over $\mathcal{X}^n$, yielding, after $T$ iterations, a locally optimised
Nyström sample $\mathcal{S}^{(T)}$. The SGDs are performed with the one-sample estimator (6) and are based on
independent and identically distributed uniform random variables on $\mathcal{D}$ (i.e. i.i.d. sampling), with batch
size $b$; see Section 3. We assess the accuracy of the Nyström approximations of $\mathbf{K}$ induced by $\mathcal{S}^{(0)}$
and $\mathcal{S}^{(T)}$ in terms of the radial SKD and of the classical criteria (C.1)-(C.3).
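The protocol just described can be summarised by the following sketch, which draws $\mathcal{S}^{(0)}$ uniformly without replacement from $\mathcal{D}$ and runs the SGD with the one-sample estimator; it reuses the hypothetical radial_skd_sgd_gradient helper sketched in Section 3, and the names and structure are again illustrative assumptions.

```python
# Illustrative sketch of the experimental protocol of Section 4.
import numpy as np

def optimise_nystrom_sample(X, n, gamma, T, b, rho=1.0, seed=0):
    # X: (N, d) data set D; n: number of landmark points; gamma: stepsize;
    # T: number of SGD iterations; b: batch size.
    rng = np.random.default_rng(seed)
    S = X[rng.choice(X.shape[0], size=n, replace=False)].copy()   # S^(0)
    for _ in range(T):
        S = S - gamma * radial_skd_sgd_gradient(X, S, b, rho, rng)
    return S                                                      # S^(T)
```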
For a Nyström sample $\mathcal{S} \in \mathcal{X}^n$ of size $n$, the matrix $\hat{\mathbf{K}}(\mathcal{S})$ is of rank at most $n$. Following [4, 10],
to further assess the efficiency of the approximation of $\mathbf{K}$ induced by $\mathcal{S}$, we introduce the approximation
factors
$$\mathcal{E}_{\mathrm{tr}}(\mathcal{S}) = \frac{\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{*}}{\|\mathbf{K} - \mathbf{K}_{n}\|_{*}}, \quad \mathcal{E}_{\mathrm{F}}(\mathcal{S}) = \frac{\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{F}}{\|\mathbf{K} - \mathbf{K}_{n}\|_{F}}, \quad \text{and} \quad \mathcal{E}_{\mathrm{sp}}(\mathcal{S}) = \frac{\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{2}}{\|\mathbf{K} - \mathbf{K}_{n}\|_{2}}, \qquad (8)$$
where $\mathbf{K}_{n}$ denotes an optimal rank-$n$ approximation of $\mathbf{K}$ (i.e. the approximation of $\mathbf{K}$ obtained by truncation
of a spectral expansion of $\mathbf{K}$ and based on the $n$ largest eigenvalues of $\mathbf{K}$). The closer $\mathcal{E}_{\mathrm{tr}}(\mathcal{S})$,
$\mathcal{E}_{\mathrm{F}}(\mathcal{S})$ and $\mathcal{E}_{\mathrm{sp}}(\mathcal{S})$ are to $1$, the more efficient the approximation is.
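A sketch of how the approximation factors (8) can be computed is given below; the optimal rank-$n$ approximation $\mathbf{K}_{n}$ is obtained by truncating the eigendecomposition of $\mathbf{K}$, and the function name is an illustrative assumption.

```python
# Illustrative sketch: approximation factors (8) from an eigendecomposition.
import numpy as np

def approximation_factors(K, K_hat, n):
    eigvals, eigvecs = np.linalg.eigh(K)
    top = np.argsort(eigvals)[-n:]                              # n largest eigenvalues
    K_n = (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T  # optimal rank-n approx.

    def norms(E):
        ev = np.linalg.eigvalsh(E)
        return np.sum(np.abs(ev)), np.linalg.norm(E, "fro"), np.max(np.abs(ev))

    num, den = norms(K - K_hat), norms(K - K_n)
    return tuple(a / b for a, b in zip(num, den))               # (E_tr, E_F, E_sp)
```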
4.1 Bi-Gaussian Example
We consider a kernel matrix $\mathbf{K}$ defined by a set $\mathcal{D}$ consisting of $N = 2,000$ points in $[-1, 1]^2 \subset \mathbb{R}^2$ (i.e.
$d = 2$); for the kernel parameter, we use $\rho = 1$. A graphical representation of the set $\mathcal{D}$ is given in Figure 1;
it consists of $N$ independent realisations of a bivariate random variable whose density is proportional to
the restriction of a bi-Gaussian density to the set $[-1, 1]^2$ (the two modes of the underlying distribution
are located at $(-0.8, 0.8)$ and $(0.8, -0.8)$, and the covariance matrix of each Gaussian component is $\mathbb{I}_2/2$,
with $\mathbb{I}_2$ the $2 \times 2$ identity matrix).
Figure 1: Graphical representation of the path followed by the landmark points of a Nyström sample
during the local minimisation of $R$ through GD, with $n = 50$, $\gamma = 10^{-6}$ and $T = 1,300$; the green squares
are the landmark points of the initial sample $\mathcal{S}^{(0)}$, the red dots are the landmark points of the locally
optimised sample $\mathcal{S}^{(T)}$, and the purple lines correspond to the paths followed by each landmark point
(left). The corresponding decay of the radial SKD, $t \mapsto R(\mathcal{S}^{(t)})$, is also presented (right).
The initial samples $\mathcal{S}^{(0)}$ are optimised via GD with stepsize $\gamma = 10^{-6}$ and for a fixed number of iterations
$T$. A graphical representation of the paths followed by the landmark points during the optimisation
process is given in Figure 1 (for $n = 50$ and $T = 1,300$); we observe that the landmark points exhibit a
relatively complex dynamic, some of them showing significant displacements from their initial positions.
The optimised landmark points concentrate around the regions where the density of points in $\mathcal{D}$ is the
largest, and inherit a space-filling-type property in accordance with the stationarity of the kernel $K$.
To assess the improvement yielded by the optimisation process, for a given number of landmark points
$n$, we randomly draw an initial Nyström sample $\mathcal{S}^{(0)}$ from $\mathcal{D}$ (uniform sampling without replacement)
and compute the corresponding locally optimised sample $\mathcal{S}^{(T)}$ (GD with $\gamma = 10^{-6}$ and $T = 1,000$). We
then compare $R(\mathcal{S}^{(0)})$ with $R(\mathcal{S}^{(T)})$, and compute the corresponding approximation factors with respect
to the trace, Frobenius and spectral norms, see (8). We consider three different values of $n$, namely $n = 20$,
$50$ and $80$, and each time perform $m = 1,000$ repetitions of this experiment. Our results are presented in
Figure 2; we observe that, independently of $n$, the local optimisation produces a significant improvement
of the Nyström approximation accuracy for all the criteria considered; the improvements are particularly
noticeable for the trace and Frobenius norms, and slightly less so for the spectral norm (which, of the three,
appears to be the coarsest measure of the approximation accuracy). Remarkably, the efficiencies of the locally
optimised Nyström samples are relatively close to each other, in particular in terms of trace and Frobenius
norms, suggesting that a large proportion of the local minima of the radial SKD induce approximations
of comparable quality.
Figure 2: For the Bi-Gaussian example, comparison of the efficiency of the Nyström approximations for
the initial samples $\mathcal{S}^{(0)}$ and the locally optimised samples $\mathcal{S}^{(T)}$ (optimisation through GD with $\gamma = 10^{-6}$
and $T = 1,000$). Each row corresponds to a given value of $n$ ($n = 20$, $50$ and $80$); in each case $m = 1,000$
repetitions are performed. The first column corresponds to the radial SKD $R(\mathcal{S})$, and the following three
correspond to the approximation factors $\mathcal{E}_{\mathrm{tr}}(\mathcal{S})$, $\mathcal{E}_{\mathrm{F}}(\mathcal{S})$ and $\mathcal{E}_{\mathrm{sp}}(\mathcal{S})$ defined in (8).
4.2 Abalone Data Set
We now consider the $d = 8$ attributes of the Abalone data set. After removing two observations that are
clear outliers, we are left with $N = 4,175$ entries. Each of the $8$ features is standardised so that it has
zero mean and unit variance. We set $n = 50$ and consider three different values of the kernel parameter $\rho$,
namely $\rho = 0.25$, $1$, and $4$; these values are chosen so that the eigenvalues of the kernel matrix $\mathbf{K}$ exhibit
sharp, moderate and shallow decays, respectively. For the Nyström sample optimisation, we use SGD
with i.i.d. sampling and batch size $b = 50$, $T = 10,000$ and $\gamma = 8 \times 10^{-7}$; these values were chosen to
obtain relatively efficient optimisations for the whole range of values of $\rho$ we consider. For each value of
$\rho$, we perform $m = 200$ repetitions. The results are presented in Figure 3.
Figure 3: For the Abalone data set with $n = 50$ and $\rho \in \{0.25, 1, 4\}$, comparison of the efficiency of
the Nyström approximations for the initial Nyström samples $\mathcal{S}^{(0)}$ and the locally optimised samples $\mathcal{S}^{(T)}$
(SGD with i.i.d. sampling, $b = 50$, $\gamma = 8 \times 10^{-7}$ and $T = 10,000$). Each row corresponds to a given value
of $\rho$; in each case, $m = 200$ repetitions are performed.
We observe that, regardless of the value of $\rho$ and in comparison with the initial Nyström samples,
the efficiencies of the locally optimised samples in terms of trace, Frobenius and spectral norms are
significantly improved. As observed in Section 4.1, the gains yielded by the local optimisations are more
evident in terms of trace and Frobenius norms, and the impact of the initialisation appears limited.
4.3 MAGIC Data Set
We consider the $d = 10$ attributes of the MAGIC Gamma Telescope data set. In pre-processing, we
remove the 115 duplicated entries in the data set, leaving us with $N = 18,905$ data points; we then
standardise each of the $d = 10$ features of the data set. For the kernel parameter, we use $\rho = 0.2$.
In Figure 4, we present the results obtained after the local optimisation of $m = 200$ random initial
Nyström samples of size $n = 100$ and $200$. Each optimisation was performed through SGD with i.i.d.
sampling, batch size $b = 50$ and stepsize $\gamma = 5 \times 10^{-8}$; for the number of iterations, we used $T = 3,000$
for $n = 100$, and $T = 4,000$ for $n = 200$. The optimisation parameters were chosen to obtain relatively
efficient but not fully completed descents, as illustrated in Figure 4. Alongside the radial SKD, we only
compute the approximation factor corresponding to the trace norm (the trace norm is indeed the least
costly to evaluate of the three matrix norms we consider, see Section 1.2). As in the previous experiments,
we observe a significant improvement of the initial Nyström samples obtained by local optimisation of
the radial SKD.
Figure 4: For the MAGIC data set, boxplots of the radial SKD $R(\mathcal{S})$ and of the approximation factor $\mathcal{E}_{\mathrm{tr}}(\mathcal{S})$
before and after the local optimisation via SGD of random Nyström samples of size $n = 100$ and $200$; for
each value of $n$, $m = 200$ repetitions are performed. The SGD is based on i.i.d. sampling, with $b = 50$
and $\gamma = 5 \times 10^{-8}$; for $n = 100$, the descent is stopped after $T = 3,000$ iterations, and after $T = 4,000$
iterations for $n = 200$ (left). A graphical representation of the decay of the radial SKD, $t \mapsto R(\mathcal{S}^{(t)})$, is
also presented for $n = 200$ (right).
4.4 MiniBooNE Data Set
In this last experiment, we consider the $d = 50$ attributes of the MiniBooNE particle identification
data set. In pre-processing, we remove the 471 entries in the data set with missing values, and one entry
appearing as a clear outlier, leaving us with $N = 129,592$ data points; we then standardise each of the
$d = 50$ features of the data set. We use $\rho = 0.04$ (kernel parameter).
Figure 5: For the MiniBooNE data set, decay of the radial SKD during the optimisation of a random
initial Nyström sample of size $n = 1,000$. The SGD is based on i.i.d. sampling with batch size $b = 200$
and stepsize $\gamma = 2 \times 10^{-7}$, and the descent is stopped after $T = 8,000$ iterations; the cost is evaluated
every 100 iterations. The reported trace norms of the approximation error are $\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S}^{(0)})\|_{*} = 63{,}272.7$
and $\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S}^{(T)})\|_{*} = 53{,}657.2$.
We consider a random initial Nyström sample of size $n = 1,000$, and optimise it through SGD with
i.i.d. sampling, batch size $b = 200$ and stepsize $\gamma = 2 \times 10^{-7}$; the descent is stopped after $T = 8,000$
iterations. The resulting decay of the radial SKD is presented in Figure 5 (the cost is evaluated every 100
iterations), and the trace norms of the Nyström approximation error for the initial and locally optimised
samples are reported. In terms of computation time, on our machine (equipped with a 3.5 GHz Dual-Core
Intel Core i7 processor, and using a single-threaded C implementation interfaced with R), for $n = 1,000$,
an evaluation of the radial SKD (up to the constant $\|\mathbf{K}\|_{F}^{2}$) takes 6.8 s, while an evaluation of the term
$\|\mathbf{K} - \hat{\mathbf{K}}(\mathcal{S})\|_{*}$ takes 6,600 s; performing the optimisation reported in Figure 5 without checking the decay
of the cost takes 1,350 s. This experiment illustrates the ability of the considered framework to tackle
relatively large problems.
5 Conclusion
We demonstrated the relevance of the radial-SKD-based framework for the local optimisation, through
SGD, of Nyström samples for SPSD kernel-matrix approximation. We studied the Lipschitz continuity
of the underlying gradient and discussed its stochastic approximation. We performed numerical experiments
illustrating that local optimisation of the radial SKD yields significant improvements of the Nyström
approximation in terms of trace, Frobenius and spectral norms.
In our experiments, we implemented SGD with i.i.d. sampling, fixed stepsize and a fixed number of
iterations; although this already brings satisfactory results, to improve the time efficiency of the approach,
the optimisation strategy could be accelerated by considering, for instance, adaptive stepsizes, parallelisation
or momentum-type techniques (see [16] for an overview). The initial Nyström samples $\mathcal{S}^{(0)}$ we
considered were drawn uniformly at random without replacement; while our experiments suggest that the
local minima of the radial SKD often induce approximations of comparable quality, the use of more
efficient initialisation strategies may be investigated (see e.g. [3, 4, 11, 13, 18]).
As a side note, when considering the trace norm, the Nyström sampling problem is intrinsically related
to the integrated-mean-squared-error design criterion in kernel regression (see e.g. [8, 15, 17]);
consequently, the approach considered in this paper may be used for the design of experiments for such
models.
Appendix
Proof of Theorem 2.1. We consider a Nyström sample $\mathcal{S} \in \mathcal{X}^n$ and introduce
$$c = \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}} \sum_{i=1}^{N} \sum_{j=1}^{n} K^2(x_i, s_j). \qquad (9)$$
In view of (5), the partial derivative of $R$ at $\mathcal{S}$ with respect to the $l$-th coordinate of the $k$-th landmark
point $s_k$ can be written as
$$\partial_{[s_k]_l} R(\mathcal{S}) = c^2 \Big( \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) + 2 \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j) \Big) - 2 c \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i). \qquad (10)$$
For $k$ and $k' \in \{1, \dots, n\}$ with $k \neq k'$, and for $l$ and $l' \in \{1, \dots, d\}$, the second-order partial derivatives
of $R$ at $\mathcal{S}$, with respect to the coordinates of the landmark points in $\mathcal{S}$, verify
$$\begin{aligned}
\partial_{[s_k]_{l'}} \partial_{[s_k]_l} R(\mathcal{S}) &= c^2 \, \partial^{[\mathrm{d}]}_{[s_k]_{l'}} \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) + 2 c \big( \partial_{[s_k]_{l'}} c \big) \, \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) \\
&\quad + 2 c^2 \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_{l'}} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j) + 4 c \big( \partial_{[s_k]_{l'}} c \big) \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j) \\
&\quad - 2 c \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_{l'}} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i) - 2 \big( \partial_{[s_k]_{l'}} c \big) \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i), \quad \text{and}
\end{aligned} \qquad (11)$$
$$\begin{aligned}
\partial_{[s_{k'}]_{l'}} \partial_{[s_k]_l} R(\mathcal{S}) &= 2 c \big( \partial_{[s_{k'}]_{l'}} c \big) \, \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) + 2 c^2 \, \partial^{[\mathrm{l}]}_{[s_k]_l} \partial^{[\mathrm{r}]}_{[s_{k'}]_{l'}} K^2(s_k, s_{k'}) \\
&\quad + 4 c \big( \partial_{[s_{k'}]_{l'}} c \big) \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j) - 2 \big( \partial_{[s_{k'}]_{l'}} c \big) \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i),
\end{aligned} \qquad (12)$$
where the partial derivative of $c$ with respect to the $l$-th coordinate of the $k$-th landmark point $s_k$ is given
by
$$\partial_{[s_k]_l} c = \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}} \Big( \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, x_i) - c \, \partial^{[\mathrm{d}]}_{[s_k]_l} K^2(s_k, s_k) - 2 c \sum_{j=1, j \neq k}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^2(s_k, s_j) \Big). \qquad (13)$$
From (A.1), we have
$$\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2} = \sum_{i=1}^{n} \sum_{j=1}^{n} K^2(s_i, s_j) \geq \sum_{i=1}^{n} K^2(s_i, s_i) \geq n \alpha. \qquad (14)$$
By the Schur product theorem, the squared kernel $K^2$ is SPSD; we denote by $\mathcal{G}$ the RKHS of real-valued
functions on $\mathcal{X}$ for which $K^2$ is reproducing. For $x$ and $y \in \mathcal{X}$, we have $K^2(x, y) = \langle k^2_x, k^2_y \rangle_{\mathcal{G}}$,
with $\langle \cdot, \cdot \rangle_{\mathcal{G}}$ the inner product on $\mathcal{G}$, and where $k^2_x \in \mathcal{G}$ is such that $k^2_x(t) = K^2(t, x)$, for all $t \in \mathcal{X}$. From
the Cauchy-Schwarz inequality, we have
$$\sum_{i=1}^{N} \sum_{j=1}^{n} K^2(s_j, x_i) = \sum_{i=1}^{N} \sum_{j=1}^{n} \big\langle k^2_{s_j}, k^2_{x_i} \big\rangle_{\mathcal{G}} = \Big\langle \sum_{j=1}^{n} k^2_{s_j}, \sum_{i=1}^{N} k^2_{x_i} \Big\rangle_{\mathcal{G}} \leq \Big\| \sum_{j=1}^{n} k^2_{s_j} \Big\|_{\mathcal{G}} \Big\| \sum_{i=1}^{N} k^2_{x_i} \Big\|_{\mathcal{G}} = \|\mathbf{K}_{\mathcal{S}}\|_{F} \|\mathbf{K}\|_{F}. \qquad (15)$$
By combining (9) with the inequalities (14) and (15), we obtain
$$0 \leq c \leq \frac{\|\mathbf{K}_{\mathcal{S}}\|_{F} \|\mathbf{K}\|_{F}}{\|\mathbf{K}_{\mathcal{S}}\|_{F}^{2}} = \frac{\|\mathbf{K}\|_{F}}{\|\mathbf{K}_{\mathcal{S}}\|_{F}} \leq \frac{\|\mathbf{K}\|_{F}}{\sqrt{n \alpha}} = C_0. \qquad (16)$$
Let $k \in \{1, \dots, n\}$ and let $l \in \{1, \dots, d\}$. From equation (13), and using the inequalities (14) and (16)
together with (A.2), we obtain
$$\big| \partial_{[s_k]_l} c \big| \leq \frac{M_1}{n \alpha} \big[ N + (2n - 1) C_0 \big] = C_1. \qquad (17)$$
In addition, let $k' \in \{1, \dots, n\} \setminus \{k\}$ and $l' \in \{1, \dots, d\}$; from equations (11), (12), (16) and (17), and
assumptions (A.2) and (A.3), we get
$$\big| \partial_{[s_k]_{l'}} \partial_{[s_k]_l} R(\mathcal{S}) \big| \leq C_0^2 M_2 + 2 C_0 C_1 M_1 + 2(n-1) C_0^2 M_2 + 4(n-1) C_0 C_1 M_1 + 2 C_0 M_2 N + 2 C_1 M_1 N = (2n-1) C_0^2 M_2 + (4n-2) C_0 C_1 M_1 + 2 N (C_0 M_2 + C_1 M_1), \qquad (18)$$
and
$$\big| \partial_{[s_{k'}]_{l'}} \partial_{[s_k]_l} R(\mathcal{S}) \big| \leq 2 C_0 C_1 M_1 + 2 C_0^2 M_2 + 4(n-1) C_0 C_1 M_1 + 2 C_1 M_1 N = 2 C_0^2 M_2 + (4n-2) C_0 C_1 M_1 + 2 N C_1 M_1. \qquad (19)$$
For $k, k' \in \{1, \dots, n\}$, we denote by $\mathbf{B}_{k,k'}$ the $d \times d$ matrix with $l, l'$ entry given by (11) if $k = k'$, and
by (12) otherwise. The Hessian $\nabla^2 R(\mathcal{S})$ can then be represented as a block matrix, that is
$$\nabla^2 R(\mathcal{S}) = \begin{pmatrix} \mathbf{B}_{1,1} & \cdots & \mathbf{B}_{1,n} \\ \vdots & \ddots & \vdots \\ \mathbf{B}_{n,1} & \cdots & \mathbf{B}_{n,n} \end{pmatrix} \in \mathbb{R}^{nd \times nd}.$$
The $d^2$ entries of the $n$ diagonal blocks of $\nabla^2 R(\mathcal{S})$ are of the form (11), and the $d^2$ entries of the $n(n-1)$
off-diagonal blocks of $\nabla^2 R(\mathcal{S})$ are of the form (12). From the inequalities (18) and (19), we obtain
$$\|\nabla^2 R(\mathcal{S})\|_{2}^{2} \leq \|\nabla^2 R(\mathcal{S})\|_{F}^{2} = \sum_{k=1}^{n} \sum_{l=1}^{d} \sum_{l'=1}^{d} [\mathbf{B}_{k,k}]_{l,l'}^{2} + \sum_{k=1}^{n} \sum_{k'=1, k' \neq k}^{n} \sum_{l=1}^{d} \sum_{l'=1}^{d} [\mathbf{B}_{k,k'}]_{l,l'}^{2} \leq L^2,$$
with
$$L = \Big( n d^2 \big[ (2n-1) C_0^2 M_2 + (4n-2) C_0 C_1 M_1 + 2 N (C_0 M_2 + C_1 M_1) \big]^2 + 4 n (n-1) d^2 \big[ C_0^2 M_2 + (2n-1) C_0 C_1 M_1 + N C_1 M_1 \big]^2 \Big)^{1/2}.$$
For all $\mathcal{S} \in \mathcal{X}^n$, the constant $L$ is an upper bound on the spectral norm of the Hessian matrix $\nabla^2 R(\mathcal{S})$,
so the gradient of $R$ is Lipschitz continuous over $\mathcal{X}^n$, with Lipschitz constant $L$.
References
[1] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability
and Statistics. Springer Science, 2004.
[2] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311, 2018.
[3] Difeng Cai, Edmond Chow, Lucas Erlandson, Yousef Saad, and Yuanzhe Xi. SMASH: Structured
matrix approximation by separation and hierarchy. Numerical Linear Algebra with Applications,
25, 2018.
[4] Michal Derezinski, Rajiv Khanna, and Michael W. Mahoney. Improved guarantees and a multiple-descent
curve for Column Subset Selection and the Nyström method. In Advances in Neural Information
Processing Systems, 2020.
[5] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix
for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
[6] Dheeru Dua and Casey Graff. UCI machine learning repository, 2019.
[7] Bertrand Gauthier. Nyström approximation and reproducing kernels: embeddings, projections and
squared-kernel discrepancy. Preprint, 2021.
[8] Bertrand Gauthier and Luc Pronzato. Convex relaxation for IMSE optimal design in random-field
models. Computational Statistics and Data Analysis, 113:375–394, 2017.
[9] Bertrand Gauthier and Johan Suykens. Optimal quadrature-sparsification for integral operator
approximation. SIAM Journal on Scientific Computing, 40:A3636–A3674, 2018.
[10] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale
machine learning. Journal of Machine Learning Research, 17:1–65, 2016.
[11] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method.
Journal of Machine Learning Research, 13:981–1006, 2012.
[12] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only
converges to minimizers. In Conference on Learning Theory, pages 1246–1257. PMLR, 2016.
[13] Harald Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. SIAM, 1992.
[14] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel
Hilbert Spaces. Cambridge University Press, 2016.
[15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT press,
Cambridge, MA, 2006.
[16] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint
arXiv:1609.04747, 2016.
[17] Thomas J. Santner, Brian J. Williams, and William I. Notz. The Design and Analysis of Computer
Experiments. Springer, 2018.
[18] Shusen Wang, Zhihua Zhang, and Tong Zhang. Towards more efficient SPSD matrix approximation
and CUR matrix decomposition. Journal of Machine Learning Research, 17:7329–7377, 2016.