Local optimisation of Nyström samples
through stochastic gradient descent
Matthew HUTCHINGS†§   Bertrand GAUTHIER‡§
March 28, 2022
Abstract
We study a relaxed version of the column-sampling problem for the Nyström approximation of
kernel matrices, where approximations are defined from multisets of landmark points in the ambient
space; such multisets are referred to as Nyström samples. We consider an unweighted variation of
the radial squared-kernel discrepancy (SKD) criterion as a surrogate for the classical criteria used to
assess the Nyström approximation accuracy; in this setting, we discuss how Nyström samples can be
efficiently optimised through stochastic gradient descent. We perform numerical experiments which
demonstrate that the local minimisation of the radial SKD yields Nyström samples with improved
Nyström approximation accuracy.
Keywords: Low-rank matrix approximation; Nyström method; reproducing kernel Hilbert spaces;
stochastic gradient descent.
1 Introduction
In Data Science, the Nyström method refers to a specific technique for the low-rank approximation of symmetric positive-semidefinite (SPSD) matrices; see e.g. [4, 5, 10, 11, 18]. Given an $N \times N$ SPSD matrix $\mathbf{K}$, with $N \in \mathbb{N}$, the Nyström method consists of selecting a sample of $n \in \mathbb{N}$ columns of $\mathbf{K}$, generally with $n \ll N$, and next defining a low-rank approximation $\widehat{\mathbf{K}}$ of $\mathbf{K}$ based on this sample of columns. More precisely, let $\boldsymbol{c}_1, \dots, \boldsymbol{c}_N \in \mathbb{R}^N$ be the columns of $\mathbf{K}$, so that $\mathbf{K} = (\boldsymbol{c}_1 \cdots \boldsymbol{c}_N)$, and let $I = \{i_1, \dots, i_n\} \subseteq \{1, \dots, N\}$ denote the indices of a sample of $n$ columns of $\mathbf{K}$ (note that $I$ is a multiset, i.e. the indices of some columns might potentially be repeated). Let $\mathbf{C} = (\boldsymbol{c}_{i_1} \cdots \boldsymbol{c}_{i_n})$ be the $N \times n$ matrix defined from the considered sample of columns of $\mathbf{K}$, and let $\mathbf{W}$ be the $n \times n$ principal submatrix of $\mathbf{K}$ defined by the indices in $I$, i.e. the $k, l$ entry of $\mathbf{W}$ is $[\mathbf{K}]_{i_k, i_l}$, the $i_k, i_l$ entry of $\mathbf{K}$. The Nyström approximation of $\mathbf{K}$ defined from the sample of columns indexed by $I$ is given by
$$\widehat{\mathbf{K}} = \mathbf{C} \mathbf{W}^{\dagger} \mathbf{C}^{T}, \tag{1}$$
with $\mathbf{W}^{\dagger}$ the Moore-Penrose pseudoinverse of $\mathbf{W}$. The column-sampling problem for Nyström approximation consists of designing samples of columns such that the induced approximations are as accurate as possible (see Section 1.2 for more details).
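As an illustration, a minimal NumPy sketch of the column-based Nyström approximation (1) could read as follows; the function name is ours, and the Moore-Penrose pseudoinverse is simply taken from numpy.linalg.

```python
import numpy as np

def nystrom_from_columns(K, I):
    """Nystrom approximation (1) of the SPSD matrix K from the columns indexed by I."""
    C = K[:, I]                          # N x n matrix of the sampled columns
    W = K[np.ix_(I, I)]                  # n x n principal submatrix of K
    return C @ np.linalg.pinv(W) @ C.T   # K_hat = C W^+ C^T
```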
†HutchingsM1@cardiff.ac.uk
‡GauthierB@cardiff.ac.uk
§Cardiff University, School of Mathematics
Abacws, Senghennydd Road, Cardiff, CF24 4AG, United Kingdom
arXiv:2203.13284v1 [stat.ML] 24 Mar 2022
1.1 Kernel Matrix Approximation
If the initial SPSD matrix $\mathbf{K}$ is a kernel matrix, defined from an SPSD kernel $K$ and a set or multiset of points $\mathcal{D} = \{x_1, \dots, x_N\} \subseteq \mathcal{X}$ (with $\mathcal{X}$ a general ambient space), i.e. the $i, j$ entry of $\mathbf{K}$ is $K(x_i, x_j)$, then a sample of columns of $\mathbf{K}$ is naturally associated with a subset of $\mathcal{D}$; more precisely, a sample of columns $\{\boldsymbol{c}_{i_1}, \dots, \boldsymbol{c}_{i_n}\}$, indexed by $I$, naturally defines a multiset $\{x_{i_1}, \dots, x_{i_n}\} \subseteq \mathcal{D}$, so that the induced Nyström approximation can in this case be regarded as an approximation induced by a subset of points in $\mathcal{D}$. Consequently, in the kernel-matrix framework, instead of relying only on subsets of columns, we may more generally consider Nyström approximations defined from a multiset $\mathcal{S} \subseteq \mathcal{X}$. Using matrix notation, the Nyström approximation of $\mathbf{K}$ defined by a subset $\mathcal{S} = \{s_1, \dots, s_n\}$ is the $N \times N$ SPSD matrix $\widehat{\mathbf{K}}(\mathcal{S})$, with $i, j$ entry
$$\big[\widehat{\mathbf{K}}(\mathcal{S})\big]_{i,j} = \mathbf{k}_{\mathcal{S}}^{T}(x_i) \, \mathbf{K}_{\mathcal{S}}^{\dagger} \, \mathbf{k}_{\mathcal{S}}(x_j), \tag{2}$$
where $\mathbf{K}_{\mathcal{S}}$ is the $n \times n$ kernel matrix defined by the kernel $K$ and the subset $\mathcal{S}$, and where
$$\mathbf{k}_{\mathcal{S}}(x) = \big(K(x, s_1), \dots, K(x, s_n)\big)^{T} \in \mathbb{R}^{n}.$$
We shall refer to such a set or multiset $\mathcal{S}$ as a Nyström sample, and to the elements of $\mathcal{S}$ as landmark points; the notation $\widehat{\mathbf{K}}(\mathcal{S})$ emphasises that the considered Nyström approximation of $\mathbf{K}$ is induced by $\mathcal{S}$. As in the column-sampling case, the landmark-point-based framework naturally raises questions related to the characterisation and the design of efficient Nyström samples (i.e. leading to accurate approximations of $\mathbf{K}$). As an interesting feature, Nyström samples of size $n$ may be regarded as elements of $\mathcal{X}^{n}$, and if the underlying set $\mathcal{X}$ is regular enough, they might be directly optimised on $\mathcal{X}^{n}$; the situation we consider in this work corresponds to the case $\mathcal{X} = \mathbb{R}^{d}$, with $d \in \mathbb{N}$, but $\mathcal{X}$ may more generally be a differentiable manifold.
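For concreteness, a minimal sketch of the landmark-based Nyström approximation (2) is given below, assuming the Gaussian kernel used in Section 4; the helper names are ours and purely illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, rho=1.0):
    """Kernel matrix [K(a_i, b_j)] for K(x, t) = exp(-rho * ||x - t||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-rho * sq_dists)

def nystrom_from_landmarks(X, S, rho=1.0):
    """N x N Nystrom approximation (2) of the kernel matrix of X induced by the landmarks S."""
    K_S = gaussian_kernel(S, S, rho)   # n x n kernel matrix defined by K and S
    k_X = gaussian_kernel(X, S, rho)   # N x n matrix whose i-th row is k_S(x_i)^T
    return k_X @ np.linalg.pinv(K_S) @ k_X.T
```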
Remark 1.1. If we denote by $\mathcal{H}$ the reproducing kernel Hilbert space (RKHS, see e.g. [1, 14]) of real-valued functions on $\mathcal{X}$ associated with $K$, we may then note that the matrix $\widehat{\mathbf{K}}(\mathcal{S})$ is the kernel matrix defined by $K_S$ and the set $\mathcal{D}$, with $K_S$ the reproducing kernel of the subspace
$$\mathcal{H}_S = \operatorname{span}\{k_{s_1}, \dots, k_{s_n}\} \subseteq \mathcal{H},$$
where, for $t \in \mathcal{X}$, the function $k_t \in \mathcal{H}$ is defined as $k_t(x) = K(x, t)$, for all $x \in \mathcal{X}$. ⊲
1.2 Assessing the Accuracy of Nyström Approximations
In the classical literature on the Nyström approximation of SPSD matrices, the accuracy of the approximation induced by a Nyström sample $\mathcal{S}$ is often assessed through the following criteria:

(C.1) $\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{*}$, with $\|\cdot\|_{*}$ the trace norm;

(C.2) $\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{\mathrm{F}}$, with $\|\cdot\|_{\mathrm{F}}$ the Frobenius norm;

(C.3) $\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{2}$, with $\|\cdot\|_{2}$ the spectral norm.
Although defining relevant and easily interpretable measures of the approximation error, these criteria are relatively costly to evaluate. Indeed, each of them involves the inversion or pseudoinversion of the kernel matrix $\mathbf{K}_{\mathcal{S}}$, with complexity $\mathcal{O}(n^{3})$. The evaluation of the criterion (C.1) also involves the computation of the $N$ diagonal entries of $\widehat{\mathbf{K}}(\mathcal{S})$, leading to an overall complexity of $\mathcal{O}(n^{3} + N n^{2})$. The evaluation of (C.2) involves the full construction of the matrix $\widehat{\mathbf{K}}(\mathcal{S})$, with an overall complexity of $\mathcal{O}(n^{3} + n^{2} N^{2})$, and the evaluation of (C.3) in addition requires the computation of the largest eigenvalue of an $N \times N$ SPSD matrix, leading to an overall complexity of $\mathcal{O}(n^{3} + n^{2} N^{2} + N^{3})$. If $\mathcal{X} = \mathbb{R}^{d}$, then the evaluation of the partial derivatives of these criteria (regarded as maps from $\mathcal{X}^{n}$ to $\mathbb{R}$) with respect to a single coordinate of a landmark point has a complexity similar to the complexity of evaluating the criteria themselves. As a result, a direct optimisation of these criteria over $\mathcal{X}^{n}$ is intractable in most practical applications.
1.3 Radial Squared-Kernel Discrepancy
As a surrogate for the criteria (C.1)-(C.3), and following the connections between the Nyström approximation of SPSD matrices, the approximation of integral operators with SPSD kernels and the kernel embedding of measures, we consider the following radial squared-kernel discrepancy criterion (radial SKD, see [7, 9]), denoted by $R$ and given by, for $\mathcal{S} = \{s_1, \dots, s_n\}$,
$$R(\mathcal{S}) = \|\mathbf{K}\|_{\mathrm{F}}^{2} - \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}} \Big( \sum_{i=1}^{N} \sum_{j=1}^{n} K^{2}(x_i, s_j) \Big)^{2}, \quad \text{if } \|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}} > 0, \tag{3}$$
and $R(\mathcal{S}) = \|\mathbf{K}\|_{\mathrm{F}}^{2}$ if $\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}} = 0$; the notation $K^{2}(x_i, s_j)$ stands for $\big(K(x_i, s_j)\big)^{2}$. We may note that $R(\mathcal{S}) \geqslant 0$. In (3), the evaluation of the term $\|\mathbf{K}\|_{\mathrm{F}}^{2}$ has complexity $\mathcal{O}(N^{2})$; nevertheless, this term does not depend on the Nyström sample $\mathcal{S}$, and may thus be regarded as a constant. The complexity of the evaluation of the term $R(\mathcal{S}) - \|\mathbf{K}\|_{\mathrm{F}}^{2}$, i.e. of the radial SKD up to the constant $\|\mathbf{K}\|_{\mathrm{F}}^{2}$, is $\mathcal{O}(n^{2} + nN)$, and the same holds for the complexity of the evaluation of the partial derivative of $R(\mathcal{S})$ with respect to a coordinate of a landmark point; see equation (5) below. We may in particular note that the evaluation of the radial SKD criterion or its partial derivatives does not involve the inversion or pseudoinversion of the $n \times n$ matrix $\mathbf{K}_{\mathcal{S}}$.
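A minimal sketch of the evaluation of the shifted criterion $R(\mathcal{S}) - \|\mathbf{K}\|_{\mathrm{F}}^{2}$, for the Gaussian kernel of Section 4 (so that $K^{2}(x, t) = e^{-2\rho\|x - t\|^{2}}$), is given below; only the $n \times n$ and $n \times N$ squared-kernel matrices are formed, in line with the $\mathcal{O}(n^{2} + nN)$ cost (names are ours).

```python
import numpy as np

def squared_kernel(A, B, rho=1.0):
    """Matrix [K^2(a_i, b_j)] for the Gaussian kernel K(x, t) = exp(-rho * ||x - t||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-2.0 * rho * sq_dists)

def radial_skd_shifted(S, X, rho=1.0):
    """R(S) - ||K||_F^2, i.e. the radial SKD (3) up to its S-independent constant."""
    frob2_S = squared_kernel(S, S, rho).sum()   # ||K_S||_F^2
    cross = squared_kernel(S, X, rho).sum()     # sum_{i,j} K^2(x_i, s_j)
    return -(cross ** 2) / frob2_S if frob2_S > 0 else 0.0
```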
Remark 1.2. From a theoretical standpoint, the radial SKD criterion measures the distance, in the Hilbert space of all Hilbert-Schmidt operators on $\mathcal{H}$, between the integral operator corresponding to the initial matrix $\mathbf{K}$, and the projection of this operator onto the subspace spanned by an integral operator defined from the kernel $K$ and a uniform measure on $\mathcal{S}$. The radial SKD may also be defined for non-uniform measures, and the criterion in this case depends not only on $\mathcal{S}$, but also on a set of relative weights associated with each landmark point in $\mathcal{S}$; in this work, we only focus on the uniform-weight case. See [7, 9] for more details. ⊲
The following inequalities hold:
$$\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{2}^{2} \leqslant \|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{\mathrm{F}}^{2} \leqslant R(\mathcal{S}) \leqslant \|\mathbf{K}\|_{\mathrm{F}}^{2}, \quad \text{and} \quad \frac{1}{N} \|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{*}^{2} \leqslant \|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{\mathrm{F}}^{2},$$
which, in complement to the theoretical properties enjoyed by the radial SKD, further support the use of the radial SKD as a numerically affordable surrogate for (C.1)-(C.3) (see also the numerical experiments in Section 4).
From now on, we assume that $\mathcal{X} = \mathbb{R}^{d}$. Let $[s]_l$, with $l \in \{1, \dots, d\}$, be the $l$-th coordinate of $s$ in the canonical basis of $\mathcal{X} = \mathbb{R}^{d}$. For $x \in \mathcal{X}$, we denote by (assuming they exist)
$$\partial^{[\mathrm{l}]}_{[s]_l} K^{2}(s, x) \quad \text{and} \quad \partial^{[\mathrm{d}]}_{[s]_l} K^{2}(s, s) \tag{4}$$
the partial derivatives of the maps $s \mapsto K^{2}(s, x)$ and $s \mapsto K^{2}(s, s)$ at $s$ and with respect to the $l$-th coordinate of $s$, respectively; the notation $\partial^{[\mathrm{l}]}$ indicates that the left entry of the kernel is considered, while $\partial^{[\mathrm{d}]}$ refers to the diagonal of the kernel; we use similar notations for any kernel function on $\mathcal{X} \times \mathcal{X}$.

For a fixed number of landmark points $n \in \mathbb{N}$, the radial SKD criterion can be regarded as a function from $\mathcal{X}^{n}$ to $\mathbb{R}$. For a Nyström sample $\mathcal{S} = \{s_1, \dots, s_n\} \in \mathcal{X}^{n}$, and for $k \in \{1, \dots, n\}$ and $l \in \{1, \dots, d\}$, we denote by $\partial_{[s_k]_l} R(\mathcal{S})$ the partial derivative of the map $R \colon \mathcal{X}^{n} \to \mathbb{R}$ at $\mathcal{S}$ with respect to the $l$-th
coordinate of the $k$-th landmark point $s_k \in \mathcal{X}$. We have
$$\partial_{[s_k]_l} R(\mathcal{S}) = \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{4}} \Big( \sum_{i=1}^{N} \sum_{j=1}^{n} K^{2}(s_j, x_i) \Big)^{2} \Big( \partial^{[\mathrm{d}]}_{[s_k]_l} K^{2}(s_k, s_k) + 2 \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, s_j) \Big) - \frac{2}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}} \Big( \sum_{i=1}^{N} \sum_{j=1}^{n} K^{2}(s_j, x_i) \Big) \Big( \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, x_i) \Big). \tag{5}$$
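As an illustration, for the Gaussian kernel of Section 4 the diagonal term in (5) vanishes (since $K^{2}(s, s) = 1$ for all $s$), and the full gradient can be assembled as in the sketch below, which reuses the squared_kernel helper introduced above; the fully vectorised layout is our own choice and is not meant to be memory-optimal for large $N$.

```python
import numpy as np

def radial_skd_gradient(S, X, rho=1.0):
    """Exact gradient (5) of R at S for the Gaussian kernel, returned as an n x d array."""
    K2_SS = squared_kernel(S, S, rho)   # n x n squared-kernel matrix on the landmarks
    K2_SX = squared_kernel(S, X, rho)   # n x N cross squared-kernel matrix
    frob2 = K2_SS.sum()                 # ||K_S||_F^2
    T1 = K2_SX.sum()                    # sum_{i,j} K^2(s_j, x_i)
    # derivative of K^2(s_k, y) w.r.t. [s_k]_l is -4 rho (s_k - y)_l K^2(s_k, y)
    d_SS = -4.0 * rho * (K2_SS[:, :, None] * (S[:, None, :] - S[None, :, :])).sum(axis=1)
    d_SX = -4.0 * rho * (K2_SX[:, :, None] * (S[:, None, :] - X[None, :, :])).sum(axis=1)
    upsilon = 2.0 * d_SS                # the diagonal term of (5) is zero for this kernel
    return (T1 ** 2 / frob2 ** 2) * upsilon - (2.0 * T1 / frob2) * d_SX
```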
In this work, we investigate the possibility of using the partial derivatives (5), or stochastic approximations of these derivatives, to directly optimise the radial SKD criterion $R$ over $\mathcal{X}^{n}$ via gradient or stochastic gradient descent; the stochastic approximation schemes we consider aim at reducing the numerical cost induced by the evaluation of the partial derivatives of $R$ when $N$ is large.
The document is organised as follows. In Section 2, we discuss the convergence of a gradient descent with fixed stepsize for the minimisation of $R$ over $\mathcal{X}^{n}$. The stochastic approximation of the gradient of the radial SKD criterion (3) is discussed in Section 3, and some numerical experiments are carried out in Section 4. Section 5 consists of a concluding discussion, and the Appendix contains a proof of Theorem 2.1.
2 A Convergence Result
We use the same notation as in Section 1.3 (in particular, we still assume that $\mathcal{X} = \mathbb{R}^{d}$), and by analogy with (4), for $s$ and $x \in \mathcal{X}$, and for $l \in \{1, \dots, d\}$, we denote by $\partial^{[\mathrm{r}]}_{[s]_l} K^{2}(x, s)$ the partial derivative of the map $s \mapsto K^{2}(x, s)$ with respect to the $l$-th coordinate of $s$. Also, for a fixed $n \in \mathbb{N}$, we denote by $\nabla R(\mathcal{S}) \in \mathcal{X}^{n} = \mathbb{R}^{nd}$ the gradient of $R \colon \mathcal{X}^{n} \to \mathbb{R}$ at $\mathcal{S}$; in matrix notation, we have
$$\nabla R(\mathcal{S}) = \big( \nabla_{s_1} R(\mathcal{S})^{T}, \dots, \nabla_{s_n} R(\mathcal{S})^{T} \big)^{T},$$
with $\nabla_{s_k} R(\mathcal{S}) = \big( \partial_{[s_k]_1} R(\mathcal{S}), \dots, \partial_{[s_k]_d} R(\mathcal{S}) \big)^{T} \in \mathbb{R}^{d}$ for $k \in \{1, \dots, n\}$.
Theorem 2.1. We make the following assumptions on the squared kernel $K^{2}$, which we assume hold for all $x$ and $y \in \mathcal{X} = \mathbb{R}^{d}$, and all $l$ and $l' \in \{1, \dots, d\}$, uniformly:

(C.1) there exists $\alpha > 0$ such that $K^{2}(x, x) \geqslant \alpha$;

(C.2) there exists $M_1 > 0$ such that $\big| \partial^{[\mathrm{d}]}_{[x]_l} K^{2}(x, x) \big| \leqslant M_1$ and $\big| \partial^{[\mathrm{l}]}_{[x]_l} K^{2}(x, y) \big| \leqslant M_1$;

(C.3) there exists $M_2 > 0$ such that $\big| \partial^{[\mathrm{d}]}_{[x]_l} \partial^{[\mathrm{d}]}_{[x]_{l'}} K^{2}(x, x) \big| \leqslant M_2$, $\big| \partial^{[\mathrm{l}]}_{[x]_l} \partial^{[\mathrm{l}]}_{[x]_{l'}} K^{2}(x, y) \big| \leqslant M_2$ and $\big| \partial^{[\mathrm{l}]}_{[x]_l} \partial^{[\mathrm{r}]}_{[y]_{l'}} K^{2}(x, y) \big| \leqslant M_2$.

Let $\mathcal{S}$ and $\mathcal{S}' \in \mathbb{R}^{nd}$ be two Nyström samples; under the above assumptions, there exists $L > 0$ such that
$$\big\| \nabla R(\mathcal{S}) - \nabla R(\mathcal{S}') \big\| \leqslant L \big\| \mathcal{S} - \mathcal{S}' \big\|,$$
with $\|\cdot\|$ the Euclidean norm of $\mathbb{R}^{nd}$; in other words, the gradient of $R \colon \mathbb{R}^{nd} \to \mathbb{R}$ is Lipschitz-continuous with Lipschitz constant $L$.
Since $R$ is bounded from below, for $0 < \gamma \leqslant 1/L$ and independently of the considered initial Nyström sample $\mathcal{S}^{(0)}$, Theorem 2.1 entails that a gradient descent from $\mathcal{S}^{(0)}$, with fixed stepsize $\gamma$, for the minimisation of $R$ over $\mathcal{X}^{n}$ produces a sequence of iterates that converges to a critical point of $R$. Barring some specific and largely pathological cases, the resulting critical point is likely to be a local minimum of $R$; see for instance [12]. See the Appendix for a proof of Theorem 2.1.
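In practice, the corresponding fixed-stepsize descent amounts to the simple loop sketched below (reusing the radial_skd_gradient sketch from Section 1.3; the stepsize and the number of iterations are left to the user).

```python
def gd_nystrom(S0, X, rho=1.0, gamma=1e-6, T=1000):
    """Gradient descent with fixed stepsize gamma for the minimisation of R over X^n."""
    S = S0.copy()
    for _ in range(T):
        S = S - gamma * radial_skd_gradient(S, X, rho)
    return S
```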
The conditions considered in Theorem 2.1 ensure the existence of a general Lipschitz constant $L$ for the gradient of $R$; they, for instance, hold for all sufficiently regular Matérn kernels (thus including the Gaussian or squared-exponential kernel). These conditions are only sufficient conditions for the convergence of a gradient descent for the minimisation of $R$. By introducing additional problem-dependent conditions, some convergence results might be obtained for more general squared kernels $K^{2}$ and adequate initial Nyström samples $\mathcal{S}^{(0)}$. For instance, the condition (C.1) simply aims at ensuring that $\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2} \geqslant n \alpha > 0$ for all $\mathcal{S} \in \mathcal{X}^{n}$; this condition might be relaxed to account for kernels with vanishing diagonal, but one might then need to introduce ad hoc conditions to ensure that $\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}$ remains large enough during the minimisation process.
3 Stochastic Approximation of the Radial SKD Gradient
The complexity of evaluating a partial derivative of $R \colon \mathcal{X}^{n} \to \mathbb{R}$ is $\mathcal{O}(n^{2} + nN)$, which might become prohibitive for large values of $N$. To overcome this limitation, stochastic approximations of the gradient of $R$ might be considered (see e.g. [2]).

The evaluation of (5) involves, for instance, terms of the form $\sum_{i=1}^{N} K^{2}(s, x_i)$, with $s \in \mathcal{X}$ and $\mathcal{D} = \{x_1, \dots, x_N\}$. Introducing a random variable $X$ with uniform distribution on $\mathcal{D}$, we can note that
$$\sum_{i=1}^{N} K^{2}(s, x_i) = N \, \mathbb{E}\big[ K^{2}(s, X) \big],$$
and the mean $\mathbb{E}[K^{2}(s, X)]$ may then, classically, be approximated by random sampling. More precisely, if $X_1, \dots, X_b$ are $b \in \mathbb{N}$ copies of $X$, we have
$$\mathbb{E}\big[ K^{2}(s, X) \big] = \frac{1}{b} \sum_{j=1}^{b} \mathbb{E}\big[ K^{2}(s, X_j) \big] \quad \text{and} \quad \mathbb{E}\Big[ \partial^{[\mathrm{l}]}_{[s]_l} K^{2}(s, X) \Big] = \frac{1}{b} \sum_{j=1}^{b} \mathbb{E}\Big[ \partial^{[\mathrm{l}]}_{[s]_l} K^{2}(s, X_j) \Big],$$
so that we can easily define unbiased estimators of the various terms appearing in (5). We refer to the sample size $b$ as the batch size.
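In code, such a batch estimator of $\sum_{i=1}^{N} K^{2}(s, x_i)$ may for instance be sketched as follows (uniform i.i.d. indices, so the batch is drawn with replacement; the function name is ours and the squared_kernel helper is the one sketched in Section 1.3).

```python
import numpy as np

def batch_sum_estimate(s, X, b, rho=1.0, rng=None):
    """Unbiased estimate of sum_i K^2(s, x_i) from a uniform i.i.d. batch of size b."""
    rng = np.random.default_rng() if rng is None else rng
    batch = X[rng.integers(0, len(X), size=b)]
    return len(X) * squared_kernel(s[None, :], batch, rho).mean()
```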
Let $k \in \{1, \dots, n\}$ and $l \in \{1, \dots, d\}$; the partial derivative (5) can be rewritten as
$$\partial_{[s_k]_l} R(\mathcal{S}) = \frac{T_1^{2}}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{4}} \, \Upsilon(\mathcal{S}) - \frac{2 \, T_1 \, T_2^{k,l}}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}},$$
with $T_1 = \sum_{i=1}^{N} \sum_{j=1}^{n} K^{2}(s_j, x_i)$ and $T_2^{k,l} = \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, x_i)$, and
$$\Upsilon(\mathcal{S}) = \partial^{[\mathrm{d}]}_{[s_k]_l} K^{2}(s_k, s_k) + 2 \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, s_j).$$
The terms $T_1$ and $T_2^{k,l}$ are the only terms in (5) that depend on $\mathcal{D}$. From a uniform random sample $\mathbf{X} = \{X_1, \dots, X_b\}$, we define the unbiased estimators $\widehat{T}_1(\mathbf{X})$ of $T_1$, and $\widehat{T}_2^{k,l}(\mathbf{X})$ of $T_2^{k,l}$, as
$$\widehat{T}_1(\mathbf{X}) = \frac{N}{b} \sum_{i=1}^{n} \sum_{j=1}^{b} K^{2}(s_i, X_j), \quad \text{and} \quad \widehat{T}_2^{k,l}(\mathbf{X}) = \frac{N}{b} \sum_{j=1}^{b} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, X_j).$$
In what follows, we discuss the properties of some stochastic approximations of the gradient of $R$ that can be defined from such estimators.
One-Sample Approximation. Using a single random sample $\mathbf{X} = \{X_1, \dots, X_b\}$ of size $b$, we can define the following stochastic approximation of the partial derivative (5):
$$\widehat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X}) = \frac{\widehat{T}_1(\mathbf{X})^{2}}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{4}} \, \Upsilon(\mathcal{S}) - \frac{2 \, \widehat{T}_1(\mathbf{X}) \, \widehat{T}_2^{k,l}(\mathbf{X})}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}}. \tag{6}$$
An evaluation of $\widehat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X})$ has complexity $\mathcal{O}(n^{2} + nb)$, as opposed to $\mathcal{O}(n^{2} + nN)$ for the corresponding exact partial derivative. However, due to the dependence between $\widehat{T}_1(\mathbf{X})$ and $\widehat{T}_2^{k,l}(\mathbf{X})$, and to the fact that $\widehat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X})$ involves the square of $\widehat{T}_1(\mathbf{X})$, the stochastic partial derivative $\widehat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X})$ will generally be a biased estimator of $\partial_{[s_k]_l} R(\mathcal{S})$.
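A sketch of the resulting one-sample stochastic gradient, again for the Gaussian kernel and reusing the squared_kernel helper from Section 1.3, is given below; the whole $n \times d$ stochastic gradient is formed at once.

```python
import numpy as np

def one_sample_gradient(S, X, b, rho=1.0, rng=None):
    """One-sample stochastic approximation (6) of the gradient of R, as an n x d array."""
    rng = np.random.default_rng() if rng is None else rng
    batch = X[rng.integers(0, len(X), size=b)]   # uniform i.i.d. batch X = {X_1, ..., X_b}
    K2_SS = squared_kernel(S, S, rho)            # n x n
    K2_SB = squared_kernel(S, batch, rho)        # n x b
    frob2 = K2_SS.sum()                          # ||K_S||_F^2
    T1_hat = (len(X) / b) * K2_SB.sum()          # hat T_1(X)
    upsilon = -8.0 * rho * (K2_SS[:, :, None] * (S[:, None, :] - S[None, :, :])).sum(axis=1)
    T2_hat = -4.0 * rho * (len(X) / b) * (
        K2_SB[:, :, None] * (S[:, None, :] - batch[None, :, :])).sum(axis=1)
    return (T1_hat ** 2 / frob2 ** 2) * upsilon - (2.0 * T1_hat / frob2) * T2_hat
```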
Two-Sample Approximation. To obtain an unbiased estimator of the partial derivative (5), instead of considering a single random sample, we may define a stochastic approximation based on two independent random samples $\mathbf{X} = \{X_1, \dots, X_{b_{\mathbf{X}}}\}$ and $\mathbf{Y} = \{Y_1, \dots, Y_{b_{\mathbf{Y}}}\}$, consisting of $b_{\mathbf{X}}$ and $b_{\mathbf{Y}} \in \mathbb{N}$ copies of $X$ (i.e. consisting of uniform random variables on $\mathcal{D}$), with $b = b_{\mathbf{X}} + b_{\mathbf{Y}}$. The two-sample estimator of (5) is then given by
$$\widehat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X}, \mathbf{Y}) = \frac{\widehat{T}_1(\mathbf{X}) \, \widehat{T}_1(\mathbf{Y})}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{4}} \, \Upsilon(\mathcal{S}) - \frac{2 \, \widehat{T}_1(\mathbf{X}) \, \widehat{T}_2^{k,l}(\mathbf{Y})}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}}, \tag{7}$$
and since $\mathbb{E}\big[ \widehat{T}_1(\mathbf{X}) \widehat{T}_1(\mathbf{Y}) \big] = T_1^{2}$ and $\mathbb{E}\big[ \widehat{T}_1(\mathbf{X}) \widehat{T}_2^{k,l}(\mathbf{Y}) \big] = T_1 T_2^{k,l}$, we have
$$\mathbb{E}\big[ \widehat{\partial}_{[s_k]_l} R(\mathcal{S}; \mathbf{X}, \mathbf{Y}) \big] = \partial_{[s_k]_l} R(\mathcal{S}).$$
Although being unbiased, for a common batch size $b$, the variance of the two-sample estimator (7) will generally be larger than the variance of the one-sample estimator (6). In our numerical experiments, the larger variance of the unbiased estimator (7) seems to actually slow down the descent when compared to the descent obtained with the one-sample estimator (6).
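For completeness, a sketch of the two-sample estimator (7) under the same assumptions is given below; the helper returning the pair of batch estimates is ours.

```python
import numpy as np

def batch_T_estimates(S, X, b, rho, rng):
    """Return (hat T_1, hat T_2) computed from one uniform i.i.d. batch of size b."""
    batch = X[rng.integers(0, len(X), size=b)]
    K2_SB = squared_kernel(S, batch, rho)        # n x b
    T1_hat = (len(X) / b) * K2_SB.sum()
    T2_hat = -4.0 * rho * (len(X) / b) * (
        K2_SB[:, :, None] * (S[:, None, :] - batch[None, :, :])).sum(axis=1)
    return T1_hat, T2_hat

def two_sample_gradient(S, X, b_X, b_Y, rho=1.0, rng=None):
    """Two-sample (unbiased) stochastic approximation (7) of the gradient of R."""
    rng = np.random.default_rng() if rng is None else rng
    K2_SS = squared_kernel(S, S, rho)
    frob2 = K2_SS.sum()
    upsilon = -8.0 * rho * (K2_SS[:, :, None] * (S[:, None, :] - S[None, :, :])).sum(axis=1)
    T1_X, _ = batch_T_estimates(S, X, b_X, rho, rng)
    T1_Y, T2_Y = batch_T_estimates(S, X, b_Y, rho, rng)
    return (T1_X * T1_Y / frob2 ** 2) * upsilon - (2.0 * T1_X / frob2) * T2_Y
```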
Remark 3.1. While considering two independent samples $\mathbf{X}$ and $\mathbf{Y}$, the two terms $\widehat{T}_1(\mathbf{X}) \widehat{T}_1(\mathbf{Y})$ and $\widehat{T}_1(\mathbf{X}) \widehat{T}_2^{k,l}(\mathbf{Y})$ appearing in (7) are dependent. This dependence may complicate the analysis of the properties of the resulting SGD; nevertheless, this issue might be overcome by considering four independent samples instead of two. ⊲
4 Numerical Experiments
Throughout this section, the matrices $\mathbf{K}$ are defined from multisets $\mathcal{D} = \{x_1, \dots, x_N\} \subset \mathbb{R}^{d}$ and from kernels $K$ of the form $K(x, t) = e^{-\rho \|x - t\|^{2}}$, with $\rho > 0$ and where $\|\cdot\|$ is the Euclidean norm of $\mathbb{R}^{d}$ (Gaussian kernel). Except for the synthetic example of Section 4.1, all the multisets $\mathcal{D}$ we consider consist of the entries of data sets available on the UCI Machine Learning Repository; see [6].

Our experiments are based on the following protocol: for a given $n \in \mathbb{N}$, we consider an initial Nyström sample $\mathcal{S}^{(0)}$ consisting of $n$ points drawn uniformly at random, without replacement, from $\mathcal{D}$. The initial sample $\mathcal{S}^{(0)}$ is regarded as an element of $\mathcal{X}^{n}$, and used to initialise a GD or SGD, with fixed stepsize $\gamma > 0$, for the minimisation of $R$ over $\mathcal{X}^{n}$, yielding, after $T \in \mathbb{N}$ iterations, a locally optimised Nyström sample $\mathcal{S}^{(T)}$. The SGDs are performed with the one-sample estimator (6) and are based on independent and identically distributed uniform random variables on $\mathcal{D}$ (i.e. i.i.d. sampling), with batch size $b \in \mathbb{N}$; see Section 3. We assess the accuracy of the Nyström approximations of $\mathbf{K}$ induced by $\mathcal{S}^{(0)}$ and $\mathcal{S}^{(T)}$ in terms of radial SKD and of the classical criteria (C.1)-(C.3).
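The protocol can be summarised by the sketch below, which combines uniform initialisation without replacement with the one_sample_gradient sketch from Section 3; all names and defaults are ours.

```python
import numpy as np

def optimise_nystrom_sample(X, n, rho, gamma, T, b, seed=0):
    """Draw S^(0) uniformly without replacement from X, then run T SGD iterations."""
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=n, replace=False)].copy()   # initial sample S^(0)
    for _ in range(T):
        S = S - gamma * one_sample_gradient(S, X, b, rho, rng)
    return S                                                  # locally optimised sample S^(T)
```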
For a Nyström sample $\mathcal{S} \in \mathcal{X}^{n}$ of size $n \in \mathbb{N}$, the matrix $\widehat{\mathbf{K}}(\mathcal{S})$ is of rank at most $n$. Following [4, 10], to further assess the efficiency of the approximation of $\mathbf{K}$ induced by $\mathcal{S}$, we introduce the approximation factors
$$\mathcal{E}_{\mathrm{tr}}(\mathcal{S}) = \frac{\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{*}}{\|\mathbf{K} - \mathbf{K}_n\|_{*}}, \quad \mathcal{E}_{\mathrm{F}}(\mathcal{S}) = \frac{\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{\mathrm{F}}}{\|\mathbf{K} - \mathbf{K}_n\|_{\mathrm{F}}}, \quad \text{and} \quad \mathcal{E}_{\mathrm{sp}}(\mathcal{S}) = \frac{\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{2}}{\|\mathbf{K} - \mathbf{K}_n\|_{2}}, \tag{8}$$
where $\mathbf{K}_n$ denotes an optimal rank-$n$ approximation of $\mathbf{K}$ (i.e. the approximation of $\mathbf{K}$ obtained by truncation of a spectral expansion of $\mathbf{K}$ and based on $n$ of the largest eigenvalues of $\mathbf{K}$). The closer $\mathcal{E}_{\mathrm{tr}}(\mathcal{S})$, $\mathcal{E}_{\mathrm{F}}(\mathcal{S})$ and $\mathcal{E}_{\mathrm{sp}}(\mathcal{S})$ are to $1$, the more efficient the approximation is.
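A sketch of the computation of the factors (8), with the optimal rank-$n$ approximation $\mathbf{K}_n$ obtained from an eigendecomposition of $\mathbf{K}$, could be as follows (the function name is ours).

```python
import numpy as np

def approximation_factors(K, K_hat, n):
    """Approximation factors (8) of K_hat relative to an optimal rank-n approximation of K."""
    vals, vecs = np.linalg.eigh(K)                        # eigenvalues in ascending order
    K_n = (vecs[:, -n:] * vals[-n:]) @ vecs[:, -n:].T     # truncated spectral expansion of K
    E, E_n = K - K_hat, K - K_n
    e_tr = np.abs(np.linalg.eigvalsh(E)).sum() / np.abs(np.linalg.eigvalsh(E_n)).sum()
    e_frob = np.linalg.norm(E, 'fro') / np.linalg.norm(E_n, 'fro')
    e_spec = np.linalg.norm(E, 2) / np.linalg.norm(E_n, 2)
    return e_tr, e_frob, e_spec
```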
4.1 Bi-Gaussian Example
We consider a kernel matrix $\mathbf{K}$ defined by a set $\mathcal{D}$ consisting of $N = 2{,}000$ points in $[-1, 1]^{2} \subset \mathbb{R}^{2}$ (i.e. $d = 2$); for the kernel parameter, we use $\rho = 1$. A graphical representation of the set $\mathcal{D}$ is given in Figure 1; it consists of $N$ independent realisations of a bivariate random variable whose density is proportional to the restriction of a bi-Gaussian density to the set $[-1, 1]^{2}$ (the two modes of the underlying distribution are located at $(-0.8, 0.8)$ and $(0.8, -0.8)$, and the covariance matrix of each Gaussian density is $\mathbb{I}_2/2$, with $\mathbb{I}_2$ the $2 \times 2$ identity matrix).
Figure 1: Graphical representation of the paths followed by the landmark points of a Nyström sample during the local minimisation of $R$ through GD, with $n = 50$, $\gamma = 10^{-6}$ and $T = 1{,}300$; the green squares are the landmark points of the initial sample $\mathcal{S}^{(0)}$, the red dots are the landmark points of the locally optimised sample $\mathcal{S}^{(T)}$, and the purple lines correspond to the paths followed by each landmark point (left). The corresponding decay of the radial SKD, i.e. the map $t \mapsto R(\mathcal{S}^{(t)})$, is also presented (right).
The initial samples $\mathcal{S}^{(0)}$ are optimised via GD with stepsize $\gamma = 10^{-6}$ and for a fixed number of iterations $T$. A graphical representation of the paths followed by the landmark points during the optimisation process is given in Figure 1 (for $n = 50$ and $T = 1{,}300$); we observe that the landmark points exhibit a relatively complex dynamic, some of them showing significant displacements from their initial positions. The optimised landmark points concentrate around the regions where the density of points in $\mathcal{D}$ is the largest, and inherit a space-filling-type property in accordance with the stationarity of the kernel $K$.
To assess the improvement yielded by the optimisation process, for a given number of landmark points $n \in \mathbb{N}$, we randomly draw an initial Nyström sample $\mathcal{S}^{(0)}$ from $\mathcal{D}$ (uniform sampling without replacement) and compute the corresponding locally optimised sample $\mathcal{S}^{(T)}$ (GD with $\gamma = 10^{-6}$ and $T = 1{,}000$). We then compare $R(\mathcal{S}^{(0)})$ with $R(\mathcal{S}^{(T)})$, and compute the corresponding approximation factors with respect to the trace, Frobenius and spectral norms; see (8). We consider three different values of $n$, namely $n = 20$, $50$ and $80$, and each time perform $m = 1{,}000$ repetitions of this experiment. Our results are presented in Figure 2; we observe that, independently of $n$, the local optimisation produces a significant improvement of the Nyström approximation accuracy for all the criteria considered; the improvements are particularly noticeable for the trace and Frobenius norms, and slightly less so for the spectral norm (which, of the three, appears to be the coarsest measure of the approximation accuracy). Remarkably, the efficiencies of the locally optimised Nyström samples are relatively close to each other, in particular in terms of trace and Frobenius norms, suggesting that a large proportion of the local minima of the radial SKD induce approximations of comparable quality.
Figure 2: For the Bi-Gaussian example, comparison of the efficiency of the Nyström approximations for the initial samples $\mathcal{S}^{(0)}$ and the locally optimised samples $\mathcal{S}^{(T)}$ (optimisation through GD with $\gamma = 10^{-6}$ and $T = 1{,}000$). Each row corresponds to a given value of $n$; in each case, $m = 1{,}000$ repetitions are performed. The first column corresponds to the radial SKD, and the following three correspond to the approximation factors defined in (8).
4.2 Abalone Data Set
We now consider the $d = 8$ attributes of the Abalone data set. After removing two observations that are clear outliers, we are left with $N = 4{,}175$ entries. Each of the 8 features is standardised such that it has zero mean and unit variance. We set $n = 50$ and consider three different values of the kernel parameter $\rho$, namely $\rho = 0.25$, $1$ and $4$; these values are chosen so that the eigenvalues of the kernel matrix $\mathbf{K}$ exhibit sharp, moderate and shallower decays, respectively. For the Nyström sample optimisation, we use SGD with i.i.d. sampling and batch size $b = 50$, $T = 10{,}000$ and $\gamma = 8 \times 10^{-7}$; these values were chosen to obtain relatively efficient optimisations for the whole range of values of $\rho$ we consider. For each value of $\rho$, we perform $m = 200$ repetitions. The results are presented in Figure 3.
Figure 3: For the Abalone data set with $n = 50$ and $\rho \in \{0.25, 1, 4\}$, comparison of the efficiency of the Nyström approximations for the initial Nyström samples $\mathcal{S}^{(0)}$ and the locally optimised samples $\mathcal{S}^{(T)}$ (SGD with i.i.d. sampling, $b = 50$, $\gamma = 8 \times 10^{-7}$ and $T = 10{,}000$). Each row corresponds to a given value of $\rho$; in each case, $m = 200$ repetitions are performed.
We observe that, regardless of the value of $\rho$ and in comparison with the initial Nyström samples, the efficiencies of the locally optimised samples in terms of trace, Frobenius and spectral norms are significantly improved. As observed in Section 4.1, the gains yielded by the local optimisations are more evident in terms of trace and Frobenius norms, and the impact of the initialisation appears limited.
4.3 MAGIC Data Set
We consider the $d = 10$ attributes of the MAGIC Gamma Telescope data set. In pre-processing, we remove the 115 duplicated entries in the data set, leaving us with $N = 18{,}905$ data points; we then standardise each of the $d = 10$ features of the data set. For the kernel parameter, we use $\rho = 0.2$.

In Figure 4, we present the results obtained after the local optimisation of $m = 200$ random initial Nyström samples of size $n = 100$ and $200$. Each optimisation was performed through SGD with i.i.d. sampling, batch size $b = 50$ and stepsize $\gamma = 5 \times 10^{-8}$; for the number of iterations, we used $T = 3{,}000$ for $n = 100$, and $T = 4{,}000$ for $n = 200$. The optimisation parameters were chosen to obtain relatively efficient but not fully completed descents, as illustrated in Figure 4. Alongside the radial SKD, we only compute the approximation factor corresponding to the trace norm (the trace norm is indeed the least costly to evaluate of the three matrix norms we consider; see Section 1.2). As in the previous experiments, we observe a significant improvement of the initial Nyström samples obtained by local optimisation of the radial SKD.
Figure 4: For the MAGIC data set, boxplots of the radial SKD $R$ and of the approximation factor $\mathcal{E}_{\mathrm{tr}}$ before and after the local optimisation via SGD of random Nyström samples of size $n = 100$ and $200$; for each value of $n$, $m = 200$ repetitions are performed. The SGD is based on i.i.d. sampling, with $b = 50$ and $\gamma = 5 \times 10^{-8}$; for $n = 100$, the descent is stopped after $T = 3{,}000$ iterations, and after $T = 4{,}000$ iterations for $n = 200$ (left). A graphical representation of the decay of the radial SKD is also presented for $n = 200$ (right).
4.4 MiniBooNE Data Set
In this last experiment, we consider the $d = 50$ attributes of the MiniBooNE particle identification data set. In pre-processing, we remove the 471 entries in the data set with missing values, and 1 entry appearing as a clear outlier, leaving us with $N = 129{,}592$ data points; we then standardise each of the $d = 50$ features of the data set. We use $\rho = 0.04$ (kernel parameter).
[Plot: decay of $t \mapsto R(\mathcal{S}^{(t)})$, with annotated values $\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S}^{(0)})\|_{*} \approx 63{,}272.7$ and $\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S}^{(T)})\|_{*} \approx 53{,}657.2$.]
Figure 5: For the MiniBooNE data set, decay of the radial SKD during the optimisation of a random initial Nyström sample of size $n = 1{,}000$. The SGD is based on i.i.d. sampling with batch size $b = 200$ and stepsize $\gamma = 2 \times 10^{-7}$, and the descent is stopped after $T = 8{,}000$ iterations; the cost is evaluated every 100 iterations.
We consider a random initial Nyström sample of size $n = 1{,}000$, and optimise it through SGD with i.i.d. sampling, batch size $b = 200$ and stepsize $\gamma = 2 \times 10^{-7}$; the descent is stopped after $T = 8{,}000$ iterations. The resulting decay of the radial SKD is presented in Figure 5 (the cost is evaluated every 100 iterations), and the trace norms of the Nyström approximation error for the initial and locally optimised samples are reported. In terms of computation time, on our machine (endowed with a 3.5 GHz Dual-Core Intel Core i7 processor, and using a single-threaded C implementation interfaced with R), for $n = 1{,}000$, an evaluation of the radial SKD (up to the constant $\|\mathbf{K}\|_{\mathrm{F}}^{2}$) takes 6.8 s, while an evaluation of the term $\|\mathbf{K} - \widehat{\mathbf{K}}(\mathcal{S})\|_{*}$ takes 6,600 s; performing the optimisation reported in Figure 5 without checking the decay of the cost takes 1,350 s. This experiment illustrates the ability of the considered framework to tackle relatively large problems.
5 Conclusion
We demonstrated the relevance of the radial-SKD-based framework for the local optimisation, through
SGD, of Nyström samples for SPSD kernel-matrix approximation. We studied the Lipschitz continuity
of the underlying gradient and discussed its stochastic approximation. We performed numerical experi-
ments illustrating that local optimisation of the radial SKD yields significant improvement of the Nyström
approximation in terms of trace, Frobenius and spectral norms.
In our experiments, we implemented SGD with i.i.d. sampling, fixed stepsize and fixed number of iterations; although this already brings satisfactory results, the time efficiency of the approach could be improved by accelerating the optimisation strategy, for instance through adaptive stepsizes, parallelisation or momentum-type techniques (see [16] for an overview). The initial Nyström samples $\mathcal{S}^{(0)}$ we considered were drawn uniformly at random without replacement; while our experiments suggest that the local minima of the radial SKD often induce approximations of comparable quality, the use of more efficient initialisation strategies may be investigated (see e.g. [3, 4, 11, 13, 18]).
As a side note, when considering the trace norm, the Nyström sampling problem is intrinsically re-
lated to the integrated-mean-squared-error design criterion in kernel regression (see e.g. [8, 15, 17]);
consequently the approach considered in this paper may be used for the design of experiments for such
models.
Appendix
Proof of Theorem 2.1. We consider a Nyström sample $\mathcal{S} \in \mathcal{X}^{n}$ and introduce
$$c = \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}} \sum_{i=1}^{N} \sum_{j=1}^{n} K^{2}(x_i, s_j). \tag{9}$$
In view of (5), the partial derivative of $R$ at $\mathcal{S}$ with respect to the $l$-th coordinate of the $k$-th landmark point $s_k$ can be written as
$$\partial_{[s_k]_l} R(\mathcal{S}) = c^{2} \Big( \partial^{[\mathrm{d}]}_{[s_k]_l} K^{2}(s_k, s_k) + 2 \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, s_j) \Big) - 2 c \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, x_i). \tag{10}$$
For $k$ and $k' \in \{1, \dots, n\}$ with $k \neq k'$, and for $l$ and $l' \in \{1, \dots, d\}$, the second-order partial derivatives of $R$ at $\mathcal{S}$, with respect to the coordinates of the landmark points in $\mathcal{S}$, verify
$$\begin{aligned}
\partial_{[s_k]_l} \partial_{[s_k]_{l'}} R(\mathcal{S}) ={}& c^{2} \, \partial^{[\mathrm{d}]}_{[s_k]_l} \partial^{[\mathrm{d}]}_{[s_k]_{l'}} K^{2}(s_k, s_k) + 2 c \big( \partial_{[s_k]_{l'}} c \big) \, \partial^{[\mathrm{d}]}_{[s_k]_l} K^{2}(s_k, s_k) \\
&+ 2 c^{2} \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} \partial^{[\mathrm{l}]}_{[s_k]_{l'}} K^{2}(s_k, s_j) + 4 c \big( \partial_{[s_k]_{l'}} c \big) \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, s_j) \\
&- 2 c \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} \partial^{[\mathrm{l}]}_{[s_k]_{l'}} K^{2}(s_k, x_i) - 2 \big( \partial_{[s_k]_{l'}} c \big) \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, x_i), \quad \text{and}
\end{aligned} \tag{11}$$
$$\begin{aligned}
\partial_{[s_k]_l} \partial_{[s_{k'}]_{l'}} R(\mathcal{S}) ={}& 2 c \big( \partial_{[s_{k'}]_{l'}} c \big) \, \partial^{[\mathrm{d}]}_{[s_k]_l} K^{2}(s_k, s_k) + 2 c^{2} \, \partial^{[\mathrm{l}]}_{[s_k]_l} \partial^{[\mathrm{r}]}_{[s_{k'}]_{l'}} K^{2}(s_k, s_{k'}) \\
&+ 4 c \big( \partial_{[s_{k'}]_{l'}} c \big) \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, s_j) - 2 \big( \partial_{[s_{k'}]_{l'}} c \big) \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, x_i),
\end{aligned} \tag{12}$$
where the partial derivative of $c$ with respect to the $l$-th coordinate of the $k$-th landmark point $s_k$ is given by
$$\partial_{[s_k]_l} c = \frac{1}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2}} \Big( \sum_{i=1}^{N} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, x_i) - c \, \partial^{[\mathrm{d}]}_{[s_k]_l} K^{2}(s_k, s_k) - 2 c \sum_{\substack{j=1, \\ j \neq k}}^{n} \partial^{[\mathrm{l}]}_{[s_k]_l} K^{2}(s_k, s_j) \Big). \tag{13}$$
From (C.1), we have
$$\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}^{2} = \sum_{i=1}^{n} \sum_{j=1}^{n} K^{2}(s_i, s_j) \geqslant \sum_{i=1}^{n} K^{2}(s_i, s_i) \geqslant n \alpha. \tag{14}$$
By the Schur product theorem, the squared kernel $K^{2}$ is SPSD; we denote by $\mathcal{G}$ the RKHS of real-valued functions on $\mathcal{X}$ for which $K^{2}$ is reproducing. For $x$ and $y \in \mathcal{X}$, we have $K^{2}(x, y) = \langle k^{2}_{x}, k^{2}_{y} \rangle_{\mathcal{G}}$, with $\langle \cdot, \cdot \rangle_{\mathcal{G}}$ the inner product on $\mathcal{G}$, and where $k^{2}_{x} \in \mathcal{G}$ is such that $k^{2}_{x}(t) = K^{2}(t, x)$, for all $t \in \mathcal{X}$. From the Cauchy-Schwarz inequality, we have
$$\sum_{i=1}^{N} \sum_{j=1}^{n} K^{2}(s_j, x_i) = \sum_{i=1}^{N} \sum_{j=1}^{n} \big\langle k^{2}_{s_j}, k^{2}_{x_i} \big\rangle_{\mathcal{G}} = \Big\langle \sum_{j=1}^{n} k^{2}_{s_j}, \sum_{i=1}^{N} k^{2}_{x_i} \Big\rangle_{\mathcal{G}} \leqslant \Big\| \sum_{j=1}^{n} k^{2}_{s_j} \Big\|_{\mathcal{G}} \Big\| \sum_{i=1}^{N} k^{2}_{x_i} \Big\|_{\mathcal{G}} = \|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}} \|\mathbf{K}\|_{\mathrm{F}}. \tag{15}$$
By combining (9) with inequalities (14) and (15), we obtain
$$0 \leqslant c \leqslant \frac{\|\mathbf{K}\|_{\mathrm{F}}}{\|\mathbf{K}_{\mathcal{S}}\|_{\mathrm{F}}} \leqslant \frac{\|\mathbf{K}\|_{\mathrm{F}}}{\sqrt{n \alpha}} = C_0. \tag{16}$$
Let $k \in \{1, \dots, n\}$ and let $l \in \{1, \dots, d\}$. From equation (13), and using inequalities (14) and (16) together with (C.2), we obtain
$$\big| \partial_{[s_k]_l} c \big| \leqslant \frac{M_1}{n \alpha} \big[ N + (2n - 1) C_0 \big] = C_1. \tag{17}$$
In addition, let $k' \in \{1, \dots, n\} \setminus \{k\}$ and $l' \in \{1, \dots, d\}$; from equations (11), (12), (16) and (17), and conditions (C.2) and (C.3), we get
$$\begin{aligned}
\big| \partial_{[s_k]_l} \partial_{[s_k]_{l'}} R(\mathcal{S}) \big| &\leqslant C_0^{2} M_2 + 2 C_0 C_1 M_1 + 2(n-1) C_0^{2} M_2 + 4(n-1) C_0 C_1 M_1 + 2 C_0 M_2 N + 2 C_1 M_1 N \\
&= (2n-1) C_0^{2} M_2 + (4n-2) C_0 C_1 M_1 + 2N (C_0 M_2 + C_1 M_1),
\end{aligned} \tag{18}$$
and
$$\begin{aligned}
\big| \partial_{[s_k]_l} \partial_{[s_{k'}]_{l'}} R(\mathcal{S}) \big| &\leqslant 2 C_0 C_1 M_1 + 2 C_0^{2} M_2 + 4(n-1) C_0 C_1 M_1 + 2 C_1 M_1 N \\
&= 2 C_0^{2} M_2 + (4n-2) C_0 C_1 M_1 + 2 N C_1 M_1.
\end{aligned} \tag{19}$$
For $k, k' \in \{1, \dots, n\}$, we denote by $\mathbf{B}_{k,k'}$ the $d \times d$ matrix with $l, l'$ entry given by (11) if $k = k'$, and by (12) otherwise. The Hessian $\nabla^{2} R(\mathcal{S})$ can then be represented as a block-matrix, that is
$$\nabla^{2} R(\mathcal{S}) = \begin{pmatrix} \mathbf{B}_{1,1} & \cdots & \mathbf{B}_{1,n} \\ \vdots & \ddots & \vdots \\ \mathbf{B}_{n,1} & \cdots & \mathbf{B}_{n,n} \end{pmatrix} \in \mathbb{R}^{nd \times nd}.$$
The $d^{2}$ entries of the $n$ diagonal blocks of $\nabla^{2} R(\mathcal{S})$ are of the form (11), and the $d^{2}$ entries of the $n(n-1)$ off-diagonal blocks of $\nabla^{2} R(\mathcal{S})$ are of the form (12). From inequalities (18) and (19), we obtain
$$\big\| \nabla^{2} R(\mathcal{S}) \big\|_{2}^{2} \leqslant \big\| \nabla^{2} R(\mathcal{S}) \big\|_{\mathrm{F}}^{2} = \sum_{k=1}^{n} \sum_{l=1}^{d} \sum_{l'=1}^{d} [\mathbf{B}_{k,k}]_{l,l'}^{2} + \sum_{k=1}^{n} \sum_{\substack{k'=1, \\ k' \neq k}}^{n} \sum_{l=1}^{d} \sum_{l'=1}^{d} [\mathbf{B}_{k,k'}]_{l,l'}^{2} \leqslant L^{2},$$
with
$$L = \Big( n d^{2} \big[ (2n-1) C_0^{2} M_2 + (4n-2) C_0 C_1 M_1 + 2N (C_0 M_2 + C_1 M_1) \big]^{2} + 4 n(n-1) d^{2} \big[ C_0^{2} M_2 + (2n-1) C_0 C_1 M_1 + N C_1 M_1 \big]^{2} \Big)^{\frac{1}{2}}.$$
For all $\mathcal{S} \in \mathcal{X}^{n}$, the constant $L$ is an upper bound for the spectral norm of the Hessian matrix $\nabla^{2} R(\mathcal{S})$, so the gradient of $R$ is Lipschitz continuous over $\mathcal{X}^{n}$, with Lipschitz constant $L$.
References
[1] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability
and Statistics. Springer Science, 2004.
[2] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311, 2018.
[3] Difeng Cai, Edmond Chow, Lucas Erlandson, Yousef Saad, and Yuanzhe Xi. SMASH: Structured
matrix approximation by separation and hierarchy. Numerical Linear Algebra with Applications,
25, 2018.
[4] Michal Derezinski, Rajiv Khanna, and Michael W. Mahoney. Improved guarantees and a multiple-
descent curve for Column Subset Selection and the Nyström method. In Advances in Neural Infor-
mation Processing Systems, 2020.
[5] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix
for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
[6] Dheeru Dua and Casey Graff. UCI machine learning repository, 2019.
[7] Bertrand Gauthier. Nyström approximation and reproducing kernels: embeddings, projections and
squared-kernel discrepancy. Preprint, 2021.
[8] Bertrand Gauthier and Luc Pronzato. Convex relaxation for IMSE optimal design in random-field
models. Computational Statistics and Data Analysis, 113:375–394, 2017.
[9] Bertrand Gauthier and Johan Suykens. Optimal quadrature-sparsification for integral operator ap-
proximation. SIAM Journal on Scientific Computing, 40:A3636–A3674, 2018.
[10] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale
machine learning. Journal of Machine Learning Research, 17:1–65, 2016.
[11] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method.
Journal of Machine Learning Research, 13:981–1006, 2012.
[12] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only
converges to minimizers. In Conference on learning theory, pages 1246–1257. PMLR, 2016.
[13] Harald Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. SIAM, 1992.
[14] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel
Hilbert Spaces. Cambridge University Press, 2016.
[15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT press,
Cambridge, MA, 2006.
[16] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint
arXiv:1609.04747, 2016.
[17] Thomas J. Santner, Brian J. Williams, and William I. Notz. The Design and Analysis of Computer
Experiments. Springer, 2018.
[18] Shusen Wang, Zhihua Zhang, and Tong Zhang. Towards more efficient SPSD matrix approximation
and CUR matrix decomposition. Journal of Machine Learning Research, 17:7329–7377, 2016.