Kernel Support Matrix Machines with Ramp Loss
Shihai Chen¹, Han Feng², Rongrong Lin¹, and Yulan Liu¹
1. School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou 510520, P. R. China
2. Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Hong Kong, P. R. China
wyyxcsh@163.com; hanfeng@cityu.edu.hk; linrr@gdut.edu.cn; ylliu@gdut.edu.cn
To overcome the shortcomings of classical support vector machines in classifying matrix-type data and in handling outliers, we study kernel support matrix machines with ramp loss. For this purpose, a class of proximal stationary points is introduced. First, the relationship between the proximal stationary point, the Karush-Kuhn-Tucker point, and the locally optimal solution of the proposed model is established. Second, to solve the kernel support matrix machine with ramp loss, an alternating direction method of multipliers algorithm is developed. Any limit point of the sequence generated by this algorithm is shown to be a proximal stationary point. Finally, through extensive numerical experiments, we showcase the superiority of the proposed model with convolutional neural tangent kernels over existing state-of-the-art methods for matrix input data.
Keywords: Support matrix machines; proximal stationary points; Karush-Kuhn-Tucker
points; local minimizers; convolutional neural tangent kernels
Mathematics Subject Classification 2020: 90C46; 90C26; 65K05
1. Introduction
Support vector machines (SVMs) have been widely acknowledged as one of the
most successful classification methods in machine learning [7,25,32]. The classical
SVMs require the input data to be in vector form. Specifically, given a set of $n$ vector input data $x_i \in \mathbb{R}^d$ labeled with $y_i \in \{-1, 1\}$, $i \in [n] := \{1, 2, \ldots, n\}$, the standard soft-margin linear SVM [24] can be modeled as follows:
\[
\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \bigl(1 - y_i(w^\top x_i + b)\bigr)_+, \qquad (1.1)
\]
where $C > 0$ is a regularization parameter and the hinge loss is denoted by $(t)_+ := \max\{0, t\}$ for any $t \in \mathbb{R}$. However, much of the data encountered in practical applications is in matrix form, such as the MNIST database of handwritten digits, and is typically fed into SVMs by converting these matrices into vectors. The process of vectorization eliminates the spatial relationships that are inherent in the matrix data. Additionally, transforming matrices into vectors often leads to input
data with very high dimensionality, which can significantly increase computational
complexity and weaken generalization ability.
To address these challenges in classifying data that is originally in matrix form,
numerous variants of SVMs have been introduced, for example, the support matrix
machines (SMMs) [16]. Given a matrix input dataset $\mathcal{D} := \{(X_i, y_i) : i \in [n]\}$ with $X_i \in \mathbb{R}^{p\times q}$ and $y_i \in \{-1, 1\}$, SMMs generally have the following form:
\[
\min_{W \in \mathbb{R}^{p\times q},\, b \in \mathbb{R}} \ P(W) + C \sum_{i=1}^{n} \bigl(1 - y_i(\langle W, X_i\rangle_{\mathbb{R}^{p\times q}} + b)\bigr)_+, \qquad (1.2)
\]
where $\langle A, B\rangle_{\mathbb{R}^{p\times q}} := \sum_{i=1}^{p}\sum_{j=1}^{q} a_{ij} b_{ij}$ for any matrices $A = (a_{ij}), B = (b_{ij}) \in \mathbb{R}^{p\times q}$, and $P : \mathbb{R}^{p\times q} \to [0, +\infty)$ is a penalty function with respect to the regression matrix $W$. Clearly, the model (1.2) reduces to the classical SVM (1.1) when $P(W) = \frac{1}{2}\langle W, W\rangle_{\mathbb{R}^{p\times q}}$. The existing research on this model (1.2) can be
roughly divided into two categories. The first line of research exploits low-rank or sparsity priors on the regression matrix. A rank-$k$ SMM, taking $P(W) = \frac{1}{2}\langle W, W\rangle_{\mathbb{R}^{p\times q}}$ together with an additional constraint $\mathrm{rank}(W) \le k$, was studied in [29]. The regression matrix $W$ in [19] was factorized as $W = UV$ with $U \in \mathbb{R}^{p\times d}$, $V \in \mathbb{R}^{d\times q}$ and $d \le \min\{p, q\}$. Rank regularization of matrix regression problems, that is, $P(W) = \mathrm{rank}(W)$ in (1.2), was investigated in [35]. Luo et al. [16] introduced the spectral elastic net regularization with $P(W) = \frac{1}{2}\langle W, W\rangle_{\mathbb{R}^{p\times q}} + \lambda\|W\|_*$, where $\lambda > 0$ and $\|\cdot\|_*$ denotes the nuclear norm of a matrix. Later, Zheng et al. [33] proposed a robust SMM by decomposing each sample into the sum of a low-rank matrix and a sparse matrix and assuming the low-rankness of the learned regression matrix. The sparse SMM with $P(W) = \sum_{j=1}^{q}\sum_{i=1}^{p} |w_{ij}| + \lambda\|W\|_*$ was considered in [34]. In addition, other loss functions, such as the truncated hinge loss [20] and the pinball loss [8], have been applied to the sparse SMM to improve the robustness of the model. For a thorough review, we refer the reader to [12].
This work concerns the second category, which uses the kernel function to assess
the structural similarity of two matrices or grayscale images. The generalization
capacity of (1.2) can be enhanced by a suitable kernel function incorporating prior
information underlying image-type data [13]. Obviously, commonly used kernels such as the Gaussian kernel and the polynomial kernel are designed for vector-form data and cannot be directly adopted for matrix-type data. To this end, B. Schölkopf et al. [23] suggested constructing a locality-improved kernel via the general polynomial kernel and local correlations, which results in an incomplete polynomial kernel; see (5.1) for its specific formula. In the same spirit, neighborhood kernels were proposed by V. L. Brailovsky et al. [5], and a histogram intersection kernel for image classification was introduced by A. Barla et al. [2]. Recently, Y. Ye [31] proposed a class of nonlinear kernels via the matrix inner product in matrix Hilbert spaces. Specifically, the kernel constructed in [31] is $\kappa(A, B) := \langle [A, B]_{\mathbb{R}^{p\times q}}, V\rangle_{\mathbb{R}^{q\times q}}$ for any given weight matrix $V \in \mathbb{R}^{q\times q}$, where $[\cdot,\cdot]_{\mathbb{R}^{p\times q}} : \mathbb{R}^{p\times q} \times \mathbb{R}^{p\times q} \to \mathbb{R}^{q\times q}$ is a mapping. Inspired by the great success of deep convolutional neural networks with the ReLU activation function in image classification, [1] designed a convolutional neural tangent kernel (CNTK) derived from a convolutional neural network in the infinite-width limit trained by gradient descent. Later, an enhanced CNTK via local average pooling or random image patches was suggested in [14]. A recent review on neural tangent kernels is given in [26]. Both the CNTK and the incomplete polynomial kernel will be used as matrix-input kernels in our numerical experiments.
Motivated by the aforementioned work and by training data contaminated with outlying observations, we focus on combining the kernel method with the ramp loss, which is well known for its robustness and insensitivity to noise [6,10,28,30,27]. The ramp loss is defined as follows:
\[
l_r(t) := \max\{0, \min\{1, t\}\}. \qquad (1.3)
\]
Let $\mathcal{F}$ be a Hilbert space endowed with the inner product $\langle\cdot,\cdot\rangle_{\mathcal{F}}$ and let $\Phi : \mathbb{R}^{p\times q} \to \mathcal{F}$ be a nonlinear feature map. Define a kernel function $\kappa : \mathbb{R}^{p\times q} \times \mathbb{R}^{p\times q} \to \mathbb{R}$ by $\kappa(A, B) := \langle\Phi(A), \Phi(B)\rangle_{\mathcal{F}}$ for any $A, B \in \mathbb{R}^{p\times q}$, as in [24]. The kernel SMM with ramp loss (R-KSMM) with respect to the dataset $\mathcal{D}$ can be described as follows:
\[
\min_{w \in \mathcal{F},\, b \in \mathbb{R}} \ \frac{1}{2}\|w\|_{\mathcal{F}}^2 + C \sum_{i=1}^{n} l_r\bigl(1 - y_i(\langle w, \Phi(X_i)\rangle_{\mathcal{F}} + b)\bigr). \qquad (1.4)
\]
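As a quick illustration of the loss in (1.3) and (1.4), the following minimal NumPy sketch (the function name is ours, not taken from the paper) evaluates the ramp loss and the corresponding data-fitting term.

```python
import numpy as np

def ramp_loss(t):
    """Ramp loss l_r(t) = max{0, min{1, t}}, applied entrywise."""
    return np.clip(t, 0.0, 1.0)

# Loss term of (1.4) for margins m_i = y_i * (<w, Phi(X_i)>_F + b):
# once a sample is badly misclassified (margin far below 0), its loss is
# capped at 1, which is the source of the robustness to outliers.
margins = np.array([-3.0, -0.5, 0.3, 1.2])
print(ramp_loss(1.0 - margins))   # [1.  1.  0.7 0. ]
```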
The ramp loss has previously been introduced into SMMs [9,17]. However, in those works the ramp loss was typically approximated by a smooth function in order to solve the resulting models. To the best of our knowledge, little work examines the first-order optimality conditions of the R-KSMM and solves it directly. This paper focuses on the theory and algorithm for the R-KSMM (1.4) with generic matrix-input kernels. We develop an alternating direction method of multipliers (ADMM) algorithm to solve the R-KSMM and show that any limit point of the sequence generated by ADMM is a proximal stationary point. In addition, the relationship between the proximal stationary point, the Karush-Kuhn-Tucker point, and the locally optimal point of the R-KSMM (1.4) is established.
The rest of this paper is organized as follows. In Section 2, the subdifferential and the proximal operator of the ramp loss $l_r$ are characterized. In Section 3, the first-order necessary conditions of the R-KSMM (1.4) are discussed. To solve it, an ADMM algorithm is applied in Section 4. Experiments on real datasets highlight the advantages of the R-KSMM over state-of-the-art techniques in Section 5. A brief conclusion is given in the last section.
2. Preliminaries
In this section, the limiting subdifferential of the ramp loss is characterized, and its
proximal operator is recalled.
2.1. Subdifferential of the ramp loss
The limiting subdifferential of the ramp loss $l_r$ was first characterized in [28, Page 12], but one case of its expression there is incorrect. In this subsection, the correct expression of $\partial l_r$ is derived from the definition of the limiting subdifferential. To this end, we first recall from the monograph [21] the regular and limiting subdifferentials of a function $f : \mathbb{R}^l \to \overline{\mathbb{R}} := \mathbb{R} \cup \{+\infty\}$ at a point $\bar z \in \mathrm{dom}\, f := \{z \in \mathbb{R}^l : f(z) < \infty\}$. Let $\mathbb{N}$ denote the set of all positive integers. The inner product in $\mathbb{R}^l$ is denoted by $\langle\cdot,\cdot\rangle$.
Definition 2.1. (see [21, Definition 1.5 & Lemma 1.7]) Consider a function $f : \mathbb{R}^l \to \overline{\mathbb{R}}$ and a point $\bar z \in \mathrm{dom}\, f$. The lower limit of $f$ at $\bar z$ is defined by
\[
\liminf_{z\to\bar z} f(z) := \lim_{\delta\downarrow 0}\Bigl(\inf_{z\in U_\delta(\bar z)} f(z)\Bigr) = \min\bigl\{\alpha\in\overline{\mathbb{R}} : \exists\, z^k\to\bar z \text{ with } f(z^k)\to\alpha\bigr\}.
\]
The function $f$ is lower semi-continuous (lsc) at $\bar z$ if
\[
\liminf_{z\to\bar z} f(z) \ge f(\bar z), \quad\text{or equivalently}\quad \liminf_{z\to\bar z} f(z) = f(\bar z).
\]
Definition 2.2. (see [21, Definition 8.3]) Consider a function $f : \mathbb{R}^l \to \overline{\mathbb{R}}$ and a point $\bar z \in \mathrm{dom}\, f$. The regular and limiting subdifferentials of $f$ at $\bar z$ are, respectively, defined as
\[
\widehat{\partial} f(\bar z) := \Bigl\{ v\in\mathbb{R}^l : \liminf_{\substack{z\to\bar z \\ z\neq\bar z}} \frac{f(z) - f(\bar z) - \langle v, z - \bar z\rangle}{\|z - \bar z\|_2} \ge 0 \Bigr\},
\]
and
\[
\partial f(\bar z) := \bigl\{ v\in\mathbb{R}^l : \exists\, z^k \xrightarrow{f} \bar z \text{ and } v^k \in \widehat{\partial} f(z^k) \text{ with } v^k\to v \text{ as } k\to\infty \bigr\},
\]
where $z^k \xrightarrow{f} \bar z$ means that $z^k\to\bar z$ with $f(z^k)\to f(\bar z)$.
It is well known that the sets $\partial f(\bar z)$ and $\widehat{\partial} f(\bar z)$ are closed, and $\widehat{\partial} f(\bar z) \subset \partial f(\bar z)$ with $\widehat{\partial} f(\bar z)$ being convex.
Lemma 2.1. The regular and limiting subdifferentials of the ramp loss $l_r(\cdot)$ defined as in (1.3) are, respectively, given as follows:
\[
\widehat{\partial} l_r(t) = \begin{cases} \{0\}, & \text{if } t < 0, \\ [0, 1], & \text{if } t = 0, \\ \{1\}, & \text{if } 0 < t < 1, \\ \emptyset, & \text{if } t = 1, \\ \{0\}, & \text{if } t > 1, \end{cases}
\qquad\text{and}\qquad
\partial l_r(t) = \begin{cases} \{0\}, & \text{if } t < 0, \\ [0, 1], & \text{if } t = 0, \\ \{1\}, & \text{if } 0 < t < 1, \\ \{0, 1\}, & \text{if } t = 1, \\ \{0\}, & \text{if } t > 1. \end{cases}
\]
Proof. Notice that
\[
l_r(t) = \max\{0, \min\{1, t\}\} = \begin{cases} 0, & \text{if } t \le 0, \\ t, & \text{if } 0 < t < 1, \\ 1, & \text{if } t \ge 1. \end{cases} \qquad (2.1)
\]
Obviously, from [21, Exercise 8.8], it follows that
\[
\partial l_r(t) = \widehat{\partial} l_r(t) = \begin{cases} \{0\}, & \text{if } t < 0, \\ \{1\}, & \text{if } 0 < t < 1, \\ \{0\}, & \text{if } t > 1. \end{cases}
\]
Hence, it suffices to characterize $\partial l_r(t)$ and $\widehat{\partial} l_r(t)$ when $t = 0$ and $t = 1$. We divide the rest of the proof into the following two cases.
Case 1. $t = 0$. In this case, we need to check that $\partial l_r(0) = \widehat{\partial} l_r(0) = [0,1]$. First, we argue that $\widehat{\partial} l_r(0) = [0,1]$. Fix any $v \in \widehat{\partial} l_r(0)$. Notice that
\[
0 \le \liminf_{t\to 0} \frac{l_r(t) - l_r(0) - \langle v, t\rangle}{|t|}
= \min\Bigl\{ \lim_{\substack{t\to 0 \\ t>0}} \frac{l_r(t) - l_r(0) - \langle v, t\rangle}{|t|},\ \lim_{\substack{t\to 0 \\ t<0}} \frac{l_r(t) - l_r(0) - \langle v, t\rangle}{|t|} \Bigr\} = \min\{1 - v,\, v\},
\]
which implies that $v \in [0,1]$. Hence, $\widehat{\partial} l_r(0) \subset [0,1]$. Conversely, fix any $v \in [0,1]$. Notice that
\[
\liminf_{t\to 0} \frac{l_r(t) - l_r(0) - \langle v, t\rangle}{|t|}
= \min\Bigl\{ \lim_{\substack{t\to 0 \\ t>0}} \frac{l_r(t) - l_r(0) - \langle v, t\rangle}{|t|},\ \lim_{\substack{t\to 0 \\ t<0}} \frac{l_r(t) - l_r(0) - \langle v, t\rangle}{|t|} \Bigr\} = \min\{1 - v,\, v\} \ge 0.
\]
Hence, $v \in \widehat{\partial} l_r(0)$ by Definition 2.2, and then $[0,1] \subset \widehat{\partial} l_r(0)$ by the arbitrariness of $v$ in $[0,1]$. This, together with $\widehat{\partial} l_r(0) \subset [0,1]$, yields $\widehat{\partial} l_r(0) = [0,1]$.
In what follows, we prove that $\partial l_r(0) = [0,1]$. Fix any $v \in \partial l_r(0)$. By Definition 2.2, there exist two sequences $t^k \to 0$ and $v^k \to v$ such that $l_r(t^k) \to l_r(0)$ and $v^k \in \widehat{\partial} l_r(t^k)$. Note that $t^k \to 0$. If there exists an infinite integer set $J_1 \subset \mathbb{N}$ such that $0 < t^k < 1$ for any $k \in J_1$, then, together with $v^k \in \widehat{\partial} l_r(t^k)$ and (2.1), we know that $v^k = 1$ for any $k \in J_1$, and hence $v = 1$ by the fact $v^k \to v$, which implies that $v \in [0,1]$. If there exists an infinite integer set $J_2 \subset \mathbb{N}$ such that $t^k < 0$ for any $k \in J_2$, then, together with $v^k \in \widehat{\partial} l_r(t^k)$ and (2.1), we have $v^k = 0$ for any $k \in J_2$, and hence $v = 0$ by the fact $v^k \to v$, which implies that $v \in [0,1]$. Otherwise, there exists an integer $\bar k$ such that $t^k = 0$ for any $k > \bar k$. Together with $v^k \in \widehat{\partial} l_r(t^k)$ and (2.1), we have $v^k \in [0,1]$ and then $v \in [0,1]$ by the fact $v^k \to v$. In summary, $v \in [0,1]$ and $\partial l_r(0) \subset [0,1]$ by the arbitrariness of $v$. This, together with $[0,1] = \widehat{\partial} l_r(0) \subset \partial l_r(0)$, leads to $\widehat{\partial} l_r(0) = \partial l_r(0) = [0,1]$.
Case 2. $t = 1$. In this case, we need to check that $\widehat{\partial} l_r(1) = \emptyset$ and $\partial l_r(1) = \{0,1\}$. First, we argue that $\widehat{\partial} l_r(1) = \emptyset$. Notice that
\[
\liminf_{t\to 1} \frac{l_r(t) - l_r(1) - \langle v, t - 1\rangle}{|t - 1|}
= \min\Bigl\{ \lim_{\substack{t\to 1 \\ t>1}} \frac{l_r(t) - l_r(1) - \langle v, t - 1\rangle}{|t - 1|},\ \lim_{\substack{t\to 1 \\ t<1}} \frac{l_r(t) - l_r(1) - \langle v, t - 1\rangle}{|t - 1|} \Bigr\} = \min\{-v,\, v - 1\}.
\]
Obviously, there does not exist $v \in \mathbb{R}$ such that $\min\{-v, v - 1\} \ge 0$; that is, $\widehat{\partial} l_r(1) = \emptyset$ by Definition 2.2.
In the rest of the proof, we argue that $\partial l_r(1) = \{0,1\}$. It is easy to check that $\{0,1\} \subset \partial l_r(1)$ by Definition 2.2. Hence, we only need to prove that $\partial l_r(1) \subset \{0,1\}$. Fix any $v \in \partial l_r(1)$. By Definition 2.2, there exist two sequences $t^k \to 1$ and $v^k \to v$ such that $l_r(t^k) \to l_r(1)$ and $v^k \in \widehat{\partial} l_r(t^k)$. Since $\widehat{\partial} l_r(1) = \emptyset$, we conclude from $v^k \in \widehat{\partial} l_r(t^k)$ that $t^k \neq 1$ for each $k$. Notice that $t^k \to 1$. If there exists an infinite integer set $J_1 \subset \mathbb{N}$ such that $t^k > 1$ for any $k \in J_1$, then, together with $v^k \in \widehat{\partial} l_r(t^k)$ and (2.1), we know $v^k = 0$ for each $k \in J_1$ and hence $v = 0$ by the fact $v^k \to v$. Otherwise, there exists an infinite integer set $J_2 \subset \mathbb{N}$ such that $0 < t^k < 1$ for any $k \in J_2$. Together with $v^k \in \widehat{\partial} l_r(t^k)$ and (2.1), we see that $v^k = 1$ for any $k \in J_2$ and hence $v = 1$ by the assumption $v^k \to v$. In summary, $v \in \{0,1\}$ and then $\partial l_r(1) \subset \{0,1\}$.
Remark 2.1. The expression of $\partial l_r(\cdot)$ in Lemma 2.1 differs from that in [28] at $t = 1$, because $\partial l_r(1) = [0, 1]$ in [28].
2.2. Proximal operator
The proximal algorithm is a prevalent methodology for addressing nonconvex and
nonsmooth optimization problems, provided that all pertinent proximal operators
can be efficiently computed. Now, we recall the definition of the proximal operator.
Definition 2.3. (see [18, Definition 1.1]) Given $\gamma > 0$ and $C > 0$, the proximal operator with respect to a function $f : \mathbb{R}^l \to \mathbb{R}$ is defined by
\[
\mathrm{Prox}_{\gamma C f}(z) := \arg\min_{x\in\mathbb{R}^l}\ C f(x) + \frac{1}{2\gamma}\|x - z\|_2^2, \qquad \forall\, z\in\mathbb{R}^l.
\]
The proximal operator of the ramp loss $l_r$ is provided by [28].

Lemma 2.2. (see [28, Proposition 1]) Given two constants $\gamma, C > 0$. If $0 < \gamma C < 2$, the proximal operator $\mathrm{Prox}_{\gamma C l_r}$ at $t\in\mathbb{R}$ is given as
\[
\mathrm{Prox}_{\gamma C l_r}(t) = \begin{cases} \{t\}, & \text{if } t > 1 + \frac{\gamma C}{2} \text{ or } t \le 0, \\ \{t,\ t - \gamma C\}, & \text{if } t = 1 + \frac{\gamma C}{2}, \\ \{t - \gamma C\}, & \text{if } \gamma C \le t < 1 + \frac{\gamma C}{2}, \\ \{0\}, & \text{if } 0 < t < \gamma C. \end{cases}
\]
To guarantee the single-valuedness of the proximal operator of $l_r$, the following modified version will be adopted:
\[
\mathrm{Prox}_{\gamma C l_r}(t) = \begin{cases} \{t\}, & \text{if } t \ge 1 + \frac{\gamma C}{2} \text{ or } t \le 0, \\ \{t - \gamma C\}, & \text{if } \gamma C \le t < 1 + \frac{\gamma C}{2}, \\ \{0\}, & \text{if } 0 < t < \gamma C. \end{cases} \qquad (2.2)
\]
Remark 2.2. When $\gamma C \ge 2$, the proximal operator $\mathrm{Prox}_{\gamma C l_r}$ is also characterized in [28, Proposition 2]. However, it will not be used in this paper, so we omit it.
With the ramp loss $l_r$ defined as in (1.3), define a function $L_r : \mathbb{R}^n \to \mathbb{R}$ by
\[
L_r(z) := \sum_{i=1}^{n} l_r(z_i), \qquad \forall\, z\in\mathbb{R}^n. \qquad (2.3)
\]
Notice that the function $L_r$ is separable. By [3, Theorem 6.6], the proximal operator $\mathrm{Prox}_{\gamma C L_r}$ of the function $L_r$ has the following form:
\[
\mathrm{Prox}_{\gamma C L_r}(z) = \mathrm{Prox}_{\gamma C l_r}(z_1) \times \mathrm{Prox}_{\gamma C l_r}(z_2) \times \cdots \times \mathrm{Prox}_{\gamma C l_r}(z_n), \qquad \forall\, z\in\mathbb{R}^n. \qquad (2.4)
\]
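For concreteness, the modified proximal operator (2.2), applied componentwise as in (2.4), can be sketched in NumPy as follows. This is a sketch under the assumption $0 < \gamma C < 2$, and the function name is ours.

```python
import numpy as np

def prox_ramp(z, gamma, C):
    """Componentwise proximal operator of gamma*C*L_r, following (2.2) and (2.4).

    Valid for 0 < gamma*C < 2; the tie t = 1 + gamma*C/2 is resolved by
    keeping t, as in the modified version (2.2).
    """
    gc = gamma * C
    assert 0.0 < gc < 2.0, "formula (2.2) assumes 0 < gamma*C < 2"
    z = np.asarray(z, dtype=float)
    out = z.copy()                                   # t <= 0 or t >= 1 + gc/2: keep t
    out[(z >= gc) & (z < 1.0 + gc / 2.0)] -= gc      # gamma*C <= t < 1 + gamma*C/2: t - gamma*C
    out[(z > 0.0) & (z < gc)] = 0.0                  # 0 < t < gamma*C: 0
    return out
```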
2.3. Kernel support matrix machine
In this subsection, we rewrite the R-KSMM (1.4) in vector form by applying the well-known representer theorem. Observe that the term $\langle w, \Phi(X_i)\rangle_{\mathcal{F}}$ in (1.4) cannot be handled directly once $\mathcal{F}$ is a high-dimensional or even infinite-dimensional Hilbert space. To overcome this, we resort to the representer theorem for regularized learning, which lies at the basis of kernel-based methods in machine learning [22,24]. It was first proven in the context of the squared loss function [11], and later extended to any general pointwise loss function [22]. On the basis of the orthogonal decomposition theorem in a Hilbert space, we immediately have the following representer theorem for matrix input data.

Lemma 2.3. (see [24, Theorem 4.2]) For any solution $(w, b) \in \mathcal{F}\times\mathbb{R}$ of the R-KSMM (1.4), there exist constants $c_j$, $j\in[n]$, such that
\[
w = \sum_{j=1}^{n} c_j \Phi(X_j). \qquad (2.5)
\]
Write $\mathbf{1} := (1, 1, \ldots, 1)^\top$ and $y := (y_1, y_2, \ldots, y_n)^\top \in \mathbb{R}^n$. For the given matrix input dataset $\mathcal{D}$, we introduce the kernel matrix
\[
K := [K_{ij} : i, j\in[n]] := [\kappa(X_i, X_j) : i, j\in[n]].
\]
By equation (2.5), it follows that
\[
\|w\|_{\mathcal{F}}^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j \langle\Phi(X_i), \Phi(X_j)\rangle_{\mathcal{F}} = c^\top K c,
\]
and $\langle w, \Phi(X_i)\rangle_{\mathcal{F}} = \sum_{j=1}^{n} \kappa(X_i, X_j)\, c_j = \sum_{j=1}^{n} K_{ij} c_j$. Hence, the R-KSMM (1.4) can be written in the compact form
\[
\min_{c\in\mathbb{R}^n,\ b\in\mathbb{R}} \ \frac{1}{2} c^\top K c + C L_r\bigl(\mathbf{1} - \mathrm{diag}(y) K c - b y\bigr), \qquad (2.6)
\]
where $L_r$ is given by (2.3). By introducing an extra variable $u\in\mathbb{R}^n$, (2.6) can be transformed into the following form:
\[
\begin{aligned}
\min_{c\in\mathbb{R}^n,\ b\in\mathbb{R},\ u\in\mathbb{R}^n} \ & \frac{1}{2} c^\top K c + C L_r(u) \\
\text{s.t.}\ & u + \mathrm{diag}(y) K c + b y = \mathbf{1}.
\end{aligned} \qquad (2.7)
\]
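To make the reformulation concrete, the kernel matrix $K$ and the objective of (2.6) can be assembled as in the sketch below for any user-supplied matrix kernel. The function names are ours; this merely spells out the quantities defined above.

```python
import numpy as np

def kernel_matrix(X_list, kappa):
    """Gram matrix K with K[i, j] = kappa(X_i, X_j) for matrix-valued inputs."""
    n = len(X_list)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kappa(X_list[i], X_list[j])
    return K

def rksmm_objective(c, b, K, y, C):
    """Objective of (2.6): 0.5 c^T K c + C * L_r(1 - diag(y) K c - b y)."""
    z = 1.0 - y * (K @ c) - b * y          # diag(y) K c computed as y * (K c)
    return 0.5 * c @ K @ c + C * np.clip(z, 0.0, 1.0).sum()
```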
3. First-order necessary conditions of R-KSMM
To develop an algorithm to solve the problem (2.7), we first establish its first-order
optimality condition. Based on this, the definitions of the Karush-Kuhn-Tucker
(KKT) point and proximal stationary point (P-stationary point) of the problem
(2.7) are introduced as follows.
Definition 3.1. Consider the problem (2.7). A point $(c^*, b^*, u^*) \in \mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$ is called a KKT point if there exists a Lagrangian multiplier $\lambda^*\in\mathbb{R}^n$ such that
\[
\begin{cases}
K c^* + K\,\mathrm{diag}(y)\,\lambda^* = \mathbf{0}, \\
\langle y, \lambda^*\rangle = 0, \\
u^* + \mathrm{diag}(y) K c^* + b^* y = \mathbf{1}, \\
\mathbf{0} \in C\,\partial L_r(u^*) + \lambda^*,
\end{cases} \qquad (3.1)
\]
where $\mathbf{0}$ stands for the zero vector in $\mathbb{R}^n$.
Definition 3.2. Consider the problem (2.7). A point $(c^*, b^*, u^*) \in \mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$ is called a P-stationary point if there exist a Lagrangian multiplier $\lambda^*\in\mathbb{R}^n$ and a constant $\gamma > 0$ such that
\[
\begin{aligned}
& K c^* + K\,\mathrm{diag}(y)\,\lambda^* = \mathbf{0}, && (3.2a)\\
& \langle y, \lambda^*\rangle = 0, && (3.2b)\\
& u^* + \mathrm{diag}(y) K c^* + b^* y = \mathbf{1}, && (3.2c)\\
& \mathrm{Prox}_{\gamma C L_r}(u^* - \gamma\lambda^*) = u^*, && (3.2d)
\end{aligned}
\]
where $\mathrm{Prox}_{\gamma C L_r}(\cdot)$ is given by (2.4).
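A P-stationary point can also be checked numerically. The sketch below verifies (3.2a)-(3.2d) up to a tolerance, using the componentwise proximal formula (2.2); it is our own illustration and assumes $0 < \gamma C < 2$.

```python
import numpy as np

def is_p_stationary(c, b, u, lam, K, y, gamma, C, tol=1e-8):
    """Check conditions (3.2a)-(3.2d) for a candidate (c, b, u) and multiplier lam."""
    gc = gamma * C                                   # assumes 0 < gamma*C < 2, see (2.2)
    t = u - gamma * lam
    prox = np.where((t >= gc) & (t < 1.0 + gc / 2.0), t - gc, t)
    prox = np.where((t > 0.0) & (t < gc), 0.0, prox)
    r_a = np.linalg.norm(K @ c + K @ (y * lam))              # (3.2a)
    r_b = abs(y @ lam)                                       # (3.2b)
    r_c = np.linalg.norm(u + y * (K @ c) + b * y - 1.0)      # (3.2c)
    r_d = np.linalg.norm(prox - u)                           # (3.2d)
    return max(r_a, r_b, r_c, r_d) <= tol
```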
Now we are in a position to reveal the relationship between the KKT point and
the P-stationary point.
Theorem 3.1. Consider the problem (2.7) and a point $(c^*, b^*, u^*) \in \mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$. If $(c^*, b^*, u^*)$ is a P-stationary point, then it is also a KKT point. Conversely, if $(c^*, b^*, u^*)$ with $I_1 := \{i\in[n] : u^*_i = 1\} = \emptyset$ is a KKT point, then it is also a P-stationary point.
Proof. Suppose that $(c^*, b^*, u^*)$ is a P-stationary point of (2.7). Then there exist $\lambda^*\in\mathbb{R}^n$ and $\gamma > 0$ such that equations (3.2a)-(3.2d) hold by Definition 3.2. To prove that $(c^*, b^*, u^*)$ is a KKT point, it is sufficient to argue that $\mathbf{0} \in C\,\partial L_r(u^*) + \lambda^*$ by Definition 3.1. From Definition 2.3 and (3.2d), it follows that
\[
u^* = \arg\min_{v\in\mathbb{R}^n}\ C L_r(v) + \frac{1}{2\gamma}\bigl\|v - (u^* - \gamma\lambda^*)\bigr\|_2^2.
\]
This, together with [21, Theorem 10.1], yields that
\[
\mathbf{0} \in \partial\bigl(C L_r(u^*)\bigr) + \frac{1}{\gamma}\bigl(u^* - (u^* - \gamma\lambda^*)\bigr),
\]
which implies that $\mathbf{0} \in C\,\partial L_r(u^*) + \lambda^*$. So, the equations in (3.1) hold and then $(c^*, b^*, u^*)$ is a KKT point by Definition 3.1.
Conversely, suppose now that $(c^*, b^*, u^*)$ is a KKT point. Then there exists $\lambda^*\in\mathbb{R}^n$ such that equation (3.1) holds by Definition 3.1. Obviously, to argue that $(c^*, b^*, u^*)$ is a P-stationary point of (2.7), it suffices to prove that there exists a constant $\gamma > 0$ such that $\mathrm{Prox}_{\gamma C L_r}(u^* - \gamma\lambda^*) = u^*$ by Definition 3.2. Since $-\lambda^* \in C\,\partial L_r(u^*)$, we arrive at $-\lambda^*_i \in C\,\partial l_r(u^*_i)$ for any $i\in[n]$ by [21, Proposition 10.5]. This, along with Lemma 2.1 and $I_1 = \emptyset$, yields that for any $\gamma > 0$,
\[
u^*_i - \gamma\lambda^*_i = \begin{cases} u^*_i, & \text{if } u^*_i < 0, \\ -\gamma\lambda^*_i \in [0, \gamma C], & \text{if } u^*_i = 0, \\ u^*_i + \gamma C, & \text{if } 0 < u^*_i < 1, \\ u^*_i, & \text{if } u^*_i > 1. \end{cases} \qquad (3.3)
\]
Write $I_< := \{i\in[n] : 0 < u^*_i < 1\}$ and $I_> := \{i\in[n] : u^*_i > 1\}$. Take any $\gamma$ satisfying
\[
0 < \gamma C < \min\Bigl\{2,\ \min_{i\in I_>} 2(u^*_i - 1),\ \min_{i\in I_<} 2(1 - u^*_i)\Bigr\}.
\]
Together with (2.2) and (3.3), we conclude that $\mathrm{Prox}_{\gamma C l_r}(u^*_i - \gamma\lambda^*_i) = u^*_i$ for each $i\in[n]$ and then $\mathrm{Prox}_{\gamma C L_r}(u^* - \gamma\lambda^*) = u^*$ by (2.4). Consequently, $(c^*, b^*, u^*)$ is a P-stationary point of (2.7).
Remark 3.1. From the proof of the second part of Theorem 3.1, we know that any KKT point with $I_1 = \emptyset$ is also a P-stationary point with the parameter $\gamma$ satisfying $0 < \gamma C < 2$. In fact, when $\gamma C \ge 2$, the proximal operator $\mathrm{Prox}_{\gamma C L_r}$ is characterized in [28, Proposition 2]. In this case, we observe that $\mathrm{Prox}_{\gamma C L_r}$ coincides with the proximal operator of the 0-norm hinge loss $\|\max\{0, t\}\|_0$ [15]. It was proved in [15, Theorem 3.13] that the KKT points and the P-stationary points are equivalent for the SVM with the 0-norm hinge loss, whereas the KKT points and the P-stationary points with the parameter $\gamma$ satisfying $\gamma C \ge 2$ are not equivalent for the R-KSMM.
The following result characterizes the relationship between the locally optimal solutions and the KKT points of the R-KSMM (2.7).

Theorem 3.2. Consider the problem (2.7) and a point $(c^*, b^*, u^*) \in \mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$. If $(c^*, b^*, u^*)$ is a locally optimal solution, then it is also a KKT point.

Proof. Let $(c^*, b^*, u^*)$ be a locally optimal solution of (2.7). Then $u^* + \mathrm{diag}(y) K c^* + b^* y = \mathbf{1}$ and $(c^*, b^*)$ is a locally optimal solution of (2.6). By [21, Theorems 10.1 & 10.6], there exists $\lambda^* \in -C\,\partial L_r(u^*)$ such that
\[
\begin{pmatrix} K c^* \\ 0 \end{pmatrix} + \begin{pmatrix} K\,\mathrm{diag}(y) \\ y^\top \end{pmatrix}\lambda^* = \mathbf{0},
\]
which implies that (3.2a) and (3.2b) hold. Hence, the equations in (3.1) hold and then $(c^*, b^*, u^*)$ is a KKT point by Definition 3.1.
4. ADMM algorithm
In this section, we develop an ADMM algorithm [4] to solve the R-KSMM (2.7). The framework of the ADMM algorithm is given in Algorithm 1 and its convergence analysis is established. Besides, the notion of support matrices for the R-KSMM is defined via the P-stationary point.
4.1. ADMM for R-KSMM
Given a parameter $\sigma > 0$, the augmented Lagrangian function of the problem (2.7) is defined by
\[
L_\sigma(c, b, u; \lambda) := \frac{1}{2} c^\top K c + C L_r(u) + \bigl\langle\lambda,\ u + \mathrm{diag}(y) K c + b y - \mathbf{1}\bigr\rangle + \frac{\sigma}{2}\bigl\|u + \mathrm{diag}(y) K c + b y - \mathbf{1}\bigr\|_2^2.
\]
After a simple calculation, it can be rewritten as
\[
L_\sigma(c, b, u; \lambda) = \frac{1}{2} c^\top K c + C L_r(u) + \frac{\sigma}{2}\Bigl\|u - \Bigl(\mathbf{1} - \mathrm{diag}(y) K c - b y - \frac{\lambda}{\sigma}\Bigr)\Bigr\|_2^2 - \frac{\|\lambda\|_2^2}{2\sigma}.
\]
Consequently, the $k$-th iteration of the ADMM algorithm is given as follows:
\[
\begin{cases}
u^{k+1} := \arg\min_{u\in\mathbb{R}^n} L_\sigma(c^k, b^k, u; \lambda^k), \\
c^{k+1} := \arg\min_{c\in\mathbb{R}^n} L_\sigma(c, b^k, u^{k+1}; \lambda^k), \\
b^{k+1} := \arg\min_{b\in\mathbb{R}} L_\sigma(c^{k+1}, b, u^{k+1}; \lambda^k), \\
\lambda^{k+1} := \lambda^k + \iota\sigma\bigl(u^{k+1} + \mathrm{diag}(y) K c^{k+1} + b^{k+1} y - \mathbf{1}\bigr),
\end{cases} \qquad (4.1)
\]
where $\iota > 0$ is a given constant.
Next, we simplify each subproblem in (4.1) as follows. Denote
\[
\eta^k := \mathbf{1} - \mathrm{diag}(y) K c^k - b^k y - \frac{\lambda^k}{\sigma}. \qquad (4.2)
\]
In the following, we take any $\sigma > 0$ satisfying $\frac{C}{\sigma} < 2$. Define three index sets $\Gamma_0^k$, $\Gamma_1^k$ and $\Gamma_2^k$ with respect to $\eta^k$ at the $k$-th step by
\[
\Gamma_0^k := \Bigl\{i\in[n] : 0 \le \eta_i^k < \frac{C}{\sigma}\Bigr\}, \quad
\Gamma_1^k := \Bigl\{i\in[n] : \frac{C}{\sigma} \le \eta_i^k < 1 + \frac{C}{2\sigma}\Bigr\}, \quad
\Gamma_2^k := [n]\setminus(\Gamma_0^k\cup\Gamma_1^k). \qquad (4.3)
\]
(i) Updating $u^{k+1}$. From the first equation in (4.1) and (4.2), it follows that
\[
u^{k+1} = \arg\min_{u\in\mathbb{R}^n}\ C L_r(u) + \frac{\sigma}{2}\|u - \eta^k\|_2^2 = \mathrm{Prox}_{\frac{C}{\sigma} L_r}(\eta^k).
\]
By (2.2), we can deduce that
\[
u^{k+1}_{\Gamma_0^k} = \mathbf{0}_{\Gamma_0^k}, \qquad
u^{k+1}_{\Gamma_1^k} = \eta^k_{\Gamma_1^k} - \frac{C}{\sigma}\mathbf{1}_{\Gamma_1^k}, \qquad
u^{k+1}_{\Gamma_2^k} = \eta^k_{\Gamma_2^k}. \qquad (4.4)
\]
Here, given an index subset $\Gamma$ of $[n]$, denote by $x_\Gamma$ the subvector of $x\in\mathbb{R}^n$ obtained by deleting the entries $x_i$ for any $i\notin\Gamma$.
(ii) Updating $c^{k+1}$. Write $\xi^k := \mathbf{1} - u^{k+1} - b^k y - \frac{\lambda^k}{\sigma}$. By the second equation in (4.1), after a simple calculation, it holds that
\[
c^{k+1} = \arg\min_{c\in\mathbb{R}^n}\ f(c) := \frac{1}{2} c^\top K c + \frac{\sigma}{2}\bigl\|\mathrm{diag}(y) K c - \xi^k\bigr\|_2^2.
\]
Obviously, the above problem $\min_{c\in\mathbb{R}^n} f(c)$ is a convex quadratic program. Notice that
\[
\nabla_c f(c) = (K + \sigma K K)\, c - \sigma K\,\mathrm{diag}(y)\,\xi^k.
\]
Consequently, $c^{k+1}$ can be obtained by solving the following linear equation:
\[
(K + \sigma K K)\, c^{k+1} = \sigma K\,\mathrm{diag}(y)\,\xi^k. \qquad (4.5)
\]
(iii) Updating $b^{k+1}$. From the third equation in (4.1), it follows that
\[
b^{k+1} = \arg\min_{b\in\mathbb{R}}\ \Bigl\|b y - \Bigl(\mathbf{1} - u^{k+1} - \mathrm{diag}(y) K c^{k+1} - \frac{\lambda^k}{\sigma}\Bigr)\Bigr\|_2^2.
\]
Similar to the update rule of $c^{k+1}$, we can deduce that
\[
b^{k+1} = \frac{y^\top r^k}{y^\top y} = \frac{y^\top r^k}{n}, \qquad\text{with}\quad r^k := \mathbf{1} - u^{k+1} - \mathrm{diag}(y) K c^{k+1} - \frac{\lambda^k}{\sigma}. \qquad (4.6)
\]
(iv) Updating $\lambda^{k+1}$. Write $\omega^k := u^{k+1} + \mathrm{diag}(y) K c^{k+1} + b^{k+1} y - \mathbf{1}$. We set
\[
\lambda^{k+1}_{\Gamma_0^k\cup\Gamma_1^k} = \lambda^k_{\Gamma_0^k\cup\Gamma_1^k} + \iota\sigma\,\omega^k_{\Gamma_0^k\cup\Gamma_1^k}, \qquad
\lambda^{k+1}_{\Gamma_2^k} = \mathbf{0}_{\Gamma_2^k}. \qquad (4.7)
\]
Overall, the ADMM algorithm for the R-KSMM is stated in Algorithm 1.

Algorithm 1 ADMM for R-KSMM
Initialize $(c^0, b^0, u^0, \lambda^0)$. Set parameters $C, \sigma, \iota > 0$ with $\sigma > \frac{C}{2}$. Choose the maximum iteration number $N$.
while the tolerance is not satisfied or $k \le N$ do
    Update $u^{k+1}$ as in (4.4);
    Update $c^{k+1}$ as in (4.5);
    Update $b^{k+1}$ as in (4.6);
    Update $\lambda^{k+1}$ as in (4.7);
    Set $k = k + 1$;
end while
return the final solution $(c^k, b^k, u^k, \lambda^k)$ to (2.7).
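For readers who prefer code, a compact NumPy sketch of Algorithm 1 based on the updates (4.4)-(4.7) is given below. Variable names, the stopping rule, and the use of np.linalg.solve (which assumes $K + \sigma K K$ is nonsingular) are our own choices; the sketch is not claimed to reproduce the authors' released implementation.

```python
import numpy as np

def admm_rksmm(K, y, C=1.0, sigma=1.0, iota=1.0, max_iter=300, tol=1e-6):
    """ADMM for the R-KSMM reformulation (2.7); a sketch of Algorithm 1 (requires sigma > C/2)."""
    n = len(y)
    c, b, u, lam = np.zeros(n), 0.0, np.zeros(n), np.zeros(n)
    gc = C / sigma                          # plays the role of gamma*C with gamma = 1/sigma
    A = K + sigma * K @ K                   # coefficient matrix of the c-update (4.5)
    for _ in range(max_iter):
        # (4.4): u-update, i.e., u = Prox_{(C/sigma) L_r}(eta) with eta as in (4.2)
        eta = 1.0 - y * (K @ c) - b * y - lam / sigma
        u = np.where((eta >= gc) & (eta < 1.0 + gc / 2.0), eta - gc, eta)
        u = np.where((eta >= 0.0) & (eta < gc), 0.0, u)
        # (4.5): c-update, solve (K + sigma*K*K) c = sigma * K * diag(y) * xi
        xi = 1.0 - u - b * y - lam / sigma
        c = np.linalg.solve(A, sigma * K @ (y * xi))
        # (4.6): b-update
        r = 1.0 - u - y * (K @ c) - lam / sigma
        b = (y @ r) / n
        # (4.7): multiplier update on Gamma_0 and Gamma_1, zero on Gamma_2
        omega = u + y * (K @ c) + b * y - 1.0
        in_gamma01 = (eta >= 0.0) & (eta < 1.0 + gc / 2.0)
        lam = np.where(in_gamma01, lam + iota * sigma * omega, 0.0)
        if np.linalg.norm(omega) <= tol:
            break
    return c, b, u, lam
```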
In the following, we prove that if the sequence generated by Algorithm 1 has a limit point, then this limit point must be a P-stationary point of the R-KSMM (2.7).

Theorem 4.1. Consider the problem (2.7) and fix any constant $\sigma > 0$ satisfying $\sigma > \frac{C}{2}$. Then every limit point of the sequence generated by Algorithm 1 is a P-stationary point.
Proof. Suppose that the sequence $\{(c^k, b^k, u^k, \lambda^k)\}$ is generated by Algorithm 1 and that $(c^*, b^*, u^*, \lambda^*)$ is any limit point of this sequence. Let $\eta^k$ be given as in (4.2) and the three index sets $\Gamma_0^k$, $\Gamma_1^k$ and $\Gamma_2^k$ be defined as in (4.3). Notice that $\Gamma_0^k, \Gamma_1^k, \Gamma_2^k \subset [n]$ for all $k\in\mathbb{N}$ and the number of subsets of $[n]$ is finite. Hence, there exist three subsets $\Gamma_0, \Gamma_1, \Gamma_2 \subset [n]$ and an infinite integer set $J\subset\mathbb{N}$ such that $\{(c^k, b^k, u^k, \lambda^k)\}_{k\in J}$ converges to $(c^*, b^*, u^*, \lambda^*)$ and
\[
\Gamma_0^k \equiv \Gamma_0, \qquad \Gamma_1^k \equiv \Gamma_1 \qquad\text{and}\qquad \Gamma_2^k \equiv \Gamma_2 = [n]\setminus(\Gamma_0\cup\Gamma_1) \quad\text{for any } k\in J.
\]
Taking the limit along $J$ in (4.7), we arrive at
\[
\lambda^*_{\Gamma_0\cup\Gamma_1} = \lambda^*_{\Gamma_0\cup\Gamma_1} + \iota\sigma\,\omega^*_{\Gamma_0\cup\Gamma_1}, \qquad \lambda^*_{\Gamma_2} = \mathbf{0}_{\Gamma_2}, \qquad (4.8)
\]
where
\[
\omega^* := u^* + \mathrm{diag}(y) K c^* + b^* y - \mathbf{1}. \qquad (4.9)
\]
Again taking the limit along $J$ in (4.2) and (4.4), we obtain
\[
u^*_{\Gamma_0} = \mathbf{0}_{\Gamma_0}, \qquad u^*_{\Gamma_1} = \eta^*_{\Gamma_1} - \frac{C}{\sigma}\mathbf{1}_{\Gamma_1}, \qquad u^*_{\Gamma_2} = \eta^*_{\Gamma_2}, \qquad (4.10)
\]
and
\[
\eta^* := \mathbf{1} - \mathrm{diag}(y) K c^* - b^* y - \frac{\lambda^*}{\sigma} = -\omega^* + u^* - \frac{\lambda^*}{\sigma}, \qquad (4.11)
\]
where the last equality holds by (4.9). By the above equality, we know that $\eta^*_{\Gamma_2} = -\omega^*_{\Gamma_2} + u^*_{\Gamma_2} - \frac{\lambda^*_{\Gamma_2}}{\sigma}$, which, along with the second equality in (4.8) and the third equality in (4.10), implies that $\omega^*_{\Gamma_2} = \mathbf{0}_{\Gamma_2}$. On the other hand, $\omega^*_{\Gamma_0\cup\Gamma_1} = \mathbf{0}_{\Gamma_0\cup\Gamma_1}$ follows from the first equation in (4.8). Therefore, $\omega^* = \mathbf{0}$. This, along with (4.9) and (4.11), leads to $\eta^* = u^* - \frac{\lambda^*}{\sigma}$ and
\[
u^* + \mathrm{diag}(y) K c^* + b^* y = \mathbf{1}. \qquad (4.12)
\]
Notice that $\eta^* = u^* - \frac{\lambda^*}{\sigma}$ and the fact $\frac{C}{\sigma} < 2$. By (4.10), (4.3) and (2.2), one has
\[
u^* = \mathrm{Prox}_{\frac{C}{\sigma} L_r}(\eta^*) = \mathrm{Prox}_{\frac{C}{\sigma} L_r}\Bigl(u^* - \frac{\lambda^*}{\sigma}\Bigr). \qquad (4.13)
\]
Similarly, taking the limit along $J$ in (4.6), we have
\[
b^* = \frac{\bigl\langle y,\ \mathbf{1} - u^* - \mathrm{diag}(y) K c^* - \frac{\lambda^*}{\sigma}\bigr\rangle}{n} \overset{(4.12)}{=} \frac{\bigl\langle y,\ b^* y - \frac{\lambda^*}{\sigma}\bigr\rangle}{n} = b^* - \frac{\langle y, \lambda^*\rangle}{n\sigma},
\]
which means that
\[
\langle y, \lambda^*\rangle = 0. \qquad (4.14)
\]
Taking the limit along $J$ in (4.5) and using (4.12), we arrive at
\[
(K + \sigma K K)\, c^* = \sigma K\,\mathrm{diag}(y)\Bigl(\mathrm{diag}(y) K c^* - \frac{\lambda^*}{\sigma}\Bigr),
\]
which implies that
\[
K c^* + K\,\mathrm{diag}(y)\,\lambda^* = \mathbf{0}. \qquad (4.15)
\]
Therefore, $(c^*, b^*, u^*)$ is a P-stationary point (with $\gamma = \frac{1}{\sigma}$) by equations (4.15), (4.14), (4.12) and (4.13) and Definition 3.2.
4.2. Support matrices
Let $(c^*, b^*, u^*)$ be a proximal stationary point of the R-KSMM (2.7). Then, from Definition 3.2, there exist a Lagrangian multiplier $\lambda^*\in\mathbb{R}^n$ and a constant $\gamma > 0$ such that (3.2a)-(3.2d) hold. By Theorem 3.1, it is also a KKT point of the R-KSMM (2.7), and hence $\mathbf{0} \in C\,\partial L_r(u^*) + \lambda^*$. This, along with Lemma 2.1, yields that $\lambda^*_i = 0$ for any $i\notin S$, where $S := \{i\in[n] : 0 \le u^*_i \le 1\}$.
Now, let $\kappa$ be a strictly positive definite kernel on $\mathbb{R}^{p\times q}$ (see [24, Subsection 2.2.1] for its generic definition on any non-empty set), so that the kernel matrix $K$ is nonsingular. From (3.2a), it follows that $K c^* = -K\,\mathrm{diag}(y)\,\lambda^*$ and then $c^* = -\mathrm{diag}(y)\,\lambda^*$. Along with (2.5), we arrive at
\[
w^* = \sum_{i=1}^{n} -y_i\lambda^*_i\,\Phi(X_i) = \sum_{i\in S} y_i(-\lambda^*_i)\,\Phi(X_i). \qquad (4.16)
\]
In view of (4.16), we call any $X_i$ with $i\in S$ a support matrix. In addition, again by (4.16), the R-KSMM decision function is given by
\[
h(X) = \langle w^*, \Phi(X)\rangle_{\mathcal{F}} + b^* = \sum_{i\in S} -y_i\lambda^*_i\,\kappa(X_i, X) + b^*, \qquad \forall\, X\in\mathbb{R}^{p\times q}.
\]
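Given a P-stationary point (for instance, the output of Algorithm 1), the decision function above can be evaluated as in the following sketch; the kernel, training matrices and labels are assumed to be available, and one may equivalently use $h(X) = \sum_i c^*_i\kappa(X_i, X) + b^*$ since $c^* = -\mathrm{diag}(y)\lambda^*$.

```python
import numpy as np

def decision_function(X, X_train, y, lam, b, kappa):
    """h(X) = sum_{i in S} -y_i * lam_i * kappa(X_i, X) + b, with S the support set."""
    support = np.nonzero(lam)[0]          # lam_i = 0 outside the support set S
    if support.size == 0:
        return float(b)
    values = np.array([kappa(X_train[i], X) for i in support])
    return float(-(y[support] * lam[support]) @ values + b)

# Predicted label: np.sign(decision_function(X, X_train, y, lam, b, kappa)).
```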
5. Numerical experiments
In this section, we evaluate the performance of R-KSMM on real datasets in binary
classification tasks. To this end, we shall compare the proposed R-KSMM (1.4) with
the kernel SMM with the hinge loss (KSMM) [23], SMM with spectral elastic net
regularization (SMM) [16], standard kernel SVM (KSVM) [25], and sparse SMM
(SSMM) [34]. All numerical experiments are implemented in Python 3.12 on a
laptop (11th Gen Intel(R) Core(TM) i7-11850H 2.50 GHz, 32 GB RAM). All codes
and numerical results are available at https://github.com/Rongrong-Lin/RKSMM.
We now specify the kernel functions for the SVM and SMM. The two matrix-input kernels employed for the R-KSMM and KSMM are the incomplete polynomial kernel [23] and the CNTK [1], both mentioned in the introduction. Given a matrix $A$, let $A_{ij}$ be its $(i,j)$-th entry. The incomplete polynomial kernel constructed in [23] takes the form
\[
\kappa^{d_1, d_2}_{s}(A, B) := \Bigl(\sum_{j=1}^{q}\sum_{i=1}^{p}\bigl(((A\odot B) * Z)_{ij}\bigr)^{d_1}\Bigr)^{d_2}, \qquad (5.1)
\]
where $A, B\in\mathbb{R}^{p\times q}$ are two input matrices, $d_1, d_2, s$ are all positive integers, and $Z := [(s - \max\{|i - s|, |j - s|\})_+ : i, j\in[2s-1]]$ denotes a two-dimensional spatial filter kernel of size $(2s-1)\times(2s-1)$. The symbol $\odot$ denotes the entry-wise product and $*$ represents the two-dimensional convolution operation. The CNTK is an extension of the neural tangent kernel to convolutional neural networks, where each layer of the network is viewed as a kernel transformation, and the composition of these kernel transformations ultimately forms a complex kernel function. Only two convolutional layers are used for the CNTK: the size of the first convolutional layer is half the number of rows of the input matrix, and the size of the second convolutional layer is half the number of columns of the input matrix. A detailed derivation of the CNTK can be found in [1]. For the kernel SVM, each matrix-type sample is vectorized and the Gaussian kernel $\exp(-\rho\|x - z\|_2^2)$, $x, z\in\mathbb{R}^d$, is chosen as its kernel function, where $\rho$ is set to $1/d$.
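The incomplete polynomial kernel (5.1) can be sketched with SciPy's two-dimensional convolution as follows. The 'same'-mode convolution (so that the output stays on the $p\times q$ grid indexed in (5.1)) is our reading of the formula, and the function name is ours.

```python
import numpy as np
from scipy.signal import convolve2d

def incomplete_poly_kernel(A, B, d1=2, d2=2, s=3):
    """Incomplete polynomial kernel (5.1) for two p x q matrices A and B."""
    idx = np.arange(1, 2 * s)                                     # i, j = 1, ..., 2s - 1
    Z = np.maximum(s - np.maximum(np.abs(idx[:, None] - s),
                                  np.abs(idx[None, :] - s)), 0)   # pyramidal spatial filter
    local = convolve2d(A * B, Z, mode="same")                     # ((A ⊙ B) * Z)_{ij}
    return float(((local ** d1).sum()) ** d2)
```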
To assess the performance of the aforementioned classifiers, we adopt three evaluation criteria: accuracy (Test ACC), F1-score, and the area under the ROC curve (AUC). The accuracy is defined as the proportion of correctly predicted samples among all samples. The higher the F1-score and AUC are, the better the classification performance of the model is. For more details on classification performance measures, we refer to [32, Chapter 22].
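The three criteria can be computed with scikit-learn, for instance as follows (the toy labels and decision values below are only placeholders to make the snippet runnable):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1, -1, 1, 1, -1])               # ground-truth labels in {-1, +1}
h_vals = np.array([0.8, -0.3, 0.1, -0.2, -0.9])    # decision values h(X)
y_pred = np.sign(h_vals)

print("Test ACC:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred, pos_label=1))
print("AUC:", roc_auc_score(y_true, h_vals))
```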
We evaluate the performance of the R-KSMM on six real-world datasets: MNIST^a, Fashion-MNIST^b, ORL^c, INRIA^d, CIFAR-10^e, and EEG alcoholism^f. Table 1 describes the basic information of these datasets.
Table 1. Summary of six datasets.

Dataset          Type             Classes  Size   Dimension
MNIST            grayscale image  10       70000  28 × 28
Fashion-MNIST    grayscale image  10       70000  28 × 28
INRIA            color image      2        3634   96 × 160
ORL              grayscale image  10       400    112 × 92
CIFAR-10         color image      10       70000  32 × 32
EEG alcoholism   EEG signal       2        120    256 × 64
Except for EEG alcoholism and INRIA, the other datasets have multiple labels. For each multi-label dataset, we select the first two labels and their corresponding data to form a binary classification dataset; for example, the handwritten digits 0 and 1 are chosen for binary classification on MNIST. For all image datasets, each image's pixel values are first divided by 255, and each sample is then normalized to follow a standard normal distribution. In addition, color images are converted into grayscale images for training. The EEG data are normalized separately, so that each feature follows a standard normal distribution.
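A minimal sketch of this preprocessing for a stack of grayscale images of shape (n, p, q) might look as follows; this is our own illustration, and the authors' exact pipeline may differ in details.

```python
import numpy as np

def preprocess_images(X):
    """Scale pixel values to [0, 1], then standardize each sample to zero mean and unit std."""
    X = X.astype(float) / 255.0
    mean = X.mean(axis=(1, 2), keepdims=True)
    std = X.std(axis=(1, 2), keepdims=True) + 1e-12   # guard against constant images
    return (X - mean) / std
```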
In our experiments, the parameters are selected from $C\in\{2^{-2},\ldots,2^{3},2^{4}\}$, $\sigma\in\{2^{-2},\ldots,2^{3},2^{4}\}$, $d_1\in\{1,2,3,4\}$, $d_2\in\{1,2,3,4\}$, $s\in\{2,3,4,10\}$, the maximum iteration number $N = 300$, and the dual step-size $\iota\in\{10^{-2},10^{-1},0.5,1,1.5\}$. To ensure the reliability of the experimental results, all experiments are repeated five times.
a. https://yann.lecun.com/exdb/mnist/
b. https://github.com/zalandoresearch/fashion-mnist
c. https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
d. https://pascal.inrialpes.fr/data/human
e. https://www.cs.toronto.edu/~kriz/cifar.html
f. http://kdd.ics.uci.edu/databases/eeg/eeg.html
The final results are listed in Table 2. First, the R-KSMM (CNTK) achieves the best performance on all the datasets in terms of Test ACC, F1-score, and AUC. Besides, the CNTK with only two convolutional layers is more powerful than the incomplete polynomial kernel in both the R-KSMM and KSMM. Finally, the ramp loss is more robust than the hinge loss in kernel SMM models.
6. Conclusion
In this paper, we developed the theoretical and algorithmic analysis of the R-KSMM (2.7). Theoretically, the relationship between the proximal stationary point, the Karush-Kuhn-Tucker point and the local minimizer of the R-KSMM has been established by means of the proximal operator and the limiting subdifferential of the ramp loss; see Theorems 3.1 and 3.2. Algorithmically, the ADMM algorithm in Algorithm 1 is applied to solve the R-KSMM, and it is shown that any limit point of the sequence produced by Algorithm 1 is a proximal stationary point of the R-KSMM; see Theorem 4.1. Finally, the strength and robustness of the proposed algorithm have been verified on six real datasets in Table 2, which also matches our theoretical and algorithmic analysis.
Acknowledgments
Lin was supported in part by the National Natural Science Foundation of China
(12371103) and the Center for Mathematics and Interdisciplinary Sciences, School
of Mathematics and Statistics, Guangdong University of Technology. Feng was
supported in part by the Research Grants Council of Hong Kong (11303821 and
11315522). Liu was supported in part by the Guangdong Basic and Applied Basic
Research Foundation (2023A1515012891).
ORCID
Shihai Chen - https://orcid.org/0009-0009-8259-3674
Han Feng - https://orcid.org/0000-0003-2933-6205
Rongrong Lin - https://orcid.org/0000-0002-6234-2183
Yulan Liu - https://orcid.org/0000-0003-2370-2323
References
[1] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov and R. Wang, On exact
computation with an infinitely wide neural net, Advances in Neural Information
Processing Systems 32 (2019).
[2] A. Barla, F. Odone and A. Verri, Histogram intersection kernel for image classification, in Proceedings of the 2003 International Conference on Image Processing, vol. 3 (Barcelona, Spain, 2003), pp. III-513–III-516.
Table 2. Results on real datasets.

Dataset          Method                 Train Acc(%)  Test Acc(%)  F1-score  AUC
MNIST            R-KSMM (κ^{2,2}_3)     100           99.60        99.50     99.61
MNIST            R-KSMM (CNTK)          100           99.70        100       99.70
MNIST            KSMM (κ^{2,2}_3)       99.03         98.20        98.24     98.27
MNIST            KSMM (CNTK)            99.90         99.49        99.90     99.60
MNIST            SSMM                   98.90         98.50        97.97     98.50
MNIST            SMM                    100           99.50        99.50     99.50
MNIST            SVM (Gaussian)         100           98.60        98.62     98.83
Fashion-MNIST    R-KSMM (κ^{2,2}_3)     100           98.10        99.27     97.57
Fashion-MNIST    R-KSMM (CNTK)          100           98.60        100       97.80
Fashion-MNIST    KSMM (κ^{2,2}_3)       100           97.40        97.56     97.53
Fashion-MNIST    KSMM (CNTK)            100           97.90        98.75     97.61
Fashion-MNIST    SSMM                   100           96.50        96.55     96.54
Fashion-MNIST    SMM                    100           97.40        97.41     97.41
Fashion-MNIST    SVM (Gaussian)         100           97.20        97.20     97.22
ORL              R-KSMM (κ^{2,2}_10)    100           95.00        100       96.67
ORL              R-KSMM (CNTK)          100           95.00        100       96.67
ORL              KSMM (κ^{2,2}_10)      100           95.00        93.33     96.67
ORL              KSMM (CNTK)            100           95.00        100       90.00
ORL              SSMM                   100           85.00        80.00     90.00
ORL              SMM                    100           80.00        66.67     73.33
ORL              SVM (Gaussian)         100           85.00        80.00     90.00
INRIA            R-KSMM (κ^{2,2}_10)    100           80.33        96.55     82.29
INRIA            R-KSMM (CNTK)          100           86.67        100       88.53
INRIA            KSMM (κ^{2,2}_10)      100           74.69        72.43     76.25
INRIA            KSMM (CNTK)            100           81.33        88.61     83.89
INRIA            SSMM                   100           78.67        79.27     80.66
INRIA            SMM                    100           78.67        78.53     79.37
INRIA            SVM (Gaussian)         100           80.66        79.27     80.66
CIFAR-10         R-KSMM (κ^{2,2}_4)     100           73.80        100       75.79
CIFAR-10         R-KSMM (CNTK)          100           78.90        100       79.50
CIFAR-10         KSMM (κ^{2,2}_4)       100           67.40        67.03     67.43
CIFAR-10         KSMM (CNTK)            100           73.30        82.18     75.54
CIFAR-10         SSMM                   100           68.50        67.38     74.62
CIFAR-10         SMM                    100           65.00        64.63     65.02
CIFAR-10         SVM (Gaussian)         100           65.10        64.84     65.14
EEG alcoholism   R-KSMM (κ^{2,2}_4)     100           83.33        92.70     87.50
EEG alcoholism   R-KSMM (CNTK)          100           96.67        100       94.29
EEG alcoholism   KSMM (κ^{2,2}_4)       100           67.40        67.03     67.43
EEG alcoholism   KSMM (CNTK)            100           95.00        100       95.24
EEG alcoholism   SSMM                   100           83.33        82.35     87.50
EEG alcoholism   SMM                    100           83.33        83.88     86.67
EEG alcoholism   SVM (Gaussian)         100           81.66        76.84     86.90
[3] A. Beck, First-order Methods in Optimization (SIAM, 2017).
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., Distributed optimization
and statistical learning via the alternating direction method of multipliers, Found.
Trends Mach. Learn. 3(1) (2011) 1–122.
[5] V. L. Brailovsky, O. Barzilay and R. Shahave, On global, local, mixed and neigh-
borhood kernels for support vector machines, Pattern Recogn. Lett. 20(11-13) (1999)
1183–1190.
[6] J. P. Brooks, Support vector machines with the ramp loss and the hard margin loss,
Operations Res. 59(2) (2011) 467–479.
[7] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-Mazahua and A. Lopez, A compre-
hensive survey on support vector machine classification: Applications, challenges and
trends, Neurocomput. 408 (2020) 189–215.
[8] R. Feng and Y. Xu, Support matrix machine with pinball loss for classification, Neural
Comput. Appl. 34 (2022) 18643–18661.
[9] M. Gu, J. Zheng, H. Pan and J. Tong, Ramp sparse support matrix machine and
its application in roller bearing fault diagnosis, Appl. Soft Comput. 113 (2021) p.
107928.
[10] X. Huang, L. Shi and J. A. Suykens, Ramp loss linear programming support vector
machine, J. Mach. Learn. Res. 15(1) (2014) 2185–2211.
[11] G. Kimeldorf and G. Wahba, Some results on Tchebycheffian spline functions, J.
Math. Anal. Appl. 33(1) (1971) 82–95.
[12] A. Kumari, M. Akhtar, R. Shah and M. Tanveer, Support matrix machine: A review,
Neural Netw. (2024) p. 106767.
[13] F. Lauer and G. Bloch, Incorporating prior knowledge in support vector machines
for classification: A review, Neurocomput. 71(7-9) (2008) 1578–1594.
[14] Z. Li, R. Wang, D. Yu, S. S. Du, W. Hu, R. Salakhutdinov and S. Arora, Enhanced
convolutional neural tangent kernels, arXiv:1911.00809 (2019).
[15] R. Lin, Y. Yao and Y. Liu, Kernel support vector machine classifiers with 0-norm
hinge loss, Neurocomput. 589 (2024) p. 127669.
[16] L. Luo, Y. Xie, Z. Zhang and W.-J. Li, Support matrix machines, in International
Conference on Machine Learning, (Lille, France, 2015), pp. 938–947.
[17] H. Pan, H. Xu, J. Zheng, J. Tong and J. Cheng, Twin robust matrix machine for
intelligent fault identification of outlier samples in roller bearing, Knowl. Based. Syst.
252 (2022) p. 109391.
[18] N. Parikh, S. Boyd et al., Proximal algorithms, Found. Trends Optim. 1(3) (2014)
127–239.
[19] H. Pirsiavash, D. Ramanan and C. Fowlkes, Bilinear classifiers for visual recognition,
in Proceedings of the 22nd International Conference on Neural Information Process-
ing Systems, (Vancouver, Canada, 2009), pp. 1482–1490.
[20] C. Qian, Q. Tran-Dinh, S. Fu, C. Zou and Y. Liu, Robust multicategory support
matrix machines, Math. Program. 176 (2019) 429–463.
[21] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, volume 317 (Springer
Science & Business Media, 2009).
[22] B. Schölkopf, R. Herbrich and A. J. Smola, A generalized representer theorem, in International Conference on Computational Learning Theory, (Amsterdam, Netherlands, 2001), pp. 416–426.
[23] B. Schölkopf, P. Simard, A. Smola and V. Vapnik, Prior knowledge in support vector kernels, Advances in Neural Information Processing Systems 10 (1997).
[24] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, Cambridge, 2001).
[25] I. Steinwart and A. Christmann, Support Vector Machines, Information Science and
Statistics (Springer, 2008).
[26] Y. Tan and H. Liu, How does a kernel based on gradients of infinite-width neural
networks come to be widely used: a review of the neural tangent kernel, Int. J.
Multimed. Inf. Retr. 13(1) (2024) p. 8.
[27] H. Wang and Y. Shao, Fast generalized ramp loss support vector machine for pattern
classification, Pattern Recogn. 146 (2024) p. 109987.
[28] H. Wang, Y. Shao and N. Xiu, Proximal operator and optimality conditions for ramp
loss svm, Optim. Lett. 16 (2020) 999–1014.
[29] L. Wolf, H. Jhuang and T. Hazan, Modeling appearances with low-rank SVM, in
2007 IEEE Conference on Computer Vision and Pattern Recognition, (Minneapolis, America, 2007), pp. 1–6.
[30] L. Xu, K. Crammer and D. Schuurmans, Robust support vector machine training via
convex outlier ablation, in Proceedings of the 21st National Conference on Artificial
Intelligence, (Boston, America, 2006), pp. 536–542.
[31] Y. Ye, A nonlinear kernel support matrix machine for matrix learning, Int. J. Mach.
Learn. Cyb. 10(10, SI) (2019) 2725–2738.
[32] M. J. Zaki and W. Meira Jr, Data Mining and Machine Learning: Fundamental
Concepts and Algorithms (Cambridge University Press, 2020).
[33] Q. Zheng, F. Zhu and P.-A. Heng, Robust support matrix machine for single trial
EEG classification, Trans. Neural Syst. Rehab. Eng. 26(3) (2018) 551–562.
[34] Q. Zheng, F. Zhu, J. Qin, B. Chen and P.-A. Heng, Sparse support matrix machine,
Pattern Recogn. 76 (2018) 715–726.
[35] H. Zhou and L. Li, Regularized matrix regression, J. R. Stat. Soc., B: Stat. Methodol.
76(2) (2014) 463–483.