Kernel Support Matrix Machines with Ramp Loss
Shihai Chen1, Han Feng2, Rongrong Lin1, and Yulan Liu1∗
1. School of Mathematics and Statistics, Guangdong University of Technology,
Guangzhou 510520, P. R. China
2. Department of Mathematics, City University of Hong Kong, Tat Chee Avenue,
Kowloon Tong, Hong Kong, P. R. China
wyyxcsh@163.com; hanfeng@cityu.edu.hk; linrr@gdut.edu.cn; ylliu@gdut.edu.cn
∗ Corresponding author
To overcome the shortcomings of classical support vector machines in classifying matrix-
type data and outliers, we aim at studying kernel support matrix machines with ramp
loss. For this purpose, a class of proximal stationary points is introduced. First, the relationship between the proximal stationary point, the Karush–Kuhn–Tucker point, and
the locally optimal solution to the proposed model is established. Second, to solve the kernel
support matrix machines with ramp loss, an alternating direction method of multipliers
algorithm is developed. Any limit point of the sequence generated by this algorithm is
shown to be a proximal stationary point. Finally, through extensive numerical simu-
lations, we showcase the superiority of the proposed model with convolutional neural
tangent kernels over existing state-of-the-art methods for matrix input data.
Keywords: Support matrix machines; proximal stationary points; Karush-Kuhn-Tucker
points; local minimizers; convolutional neural tangent kernels
Mathematics Subject Classification 2020: 90C46; 90C26; 65K05
1. Introduction
Support vector machines (SVMs) have been widely acknowledged as one of the
most successful classification methods in machine learning [7,25,32]. The classical
SVMs require the input data to be in vector form. Specifically, given a set of $n$ vector input data $x_i\in\mathbb{R}^d$ labeled with $y_i\in\{-1,1\}$, $i\in[n]:=\{1,2,\dots,n\}$, the standard soft-margin linear SVM [24] can be modeled as follows:
$$\min_{w\in\mathbb{R}^d,\, b\in\mathbb{R}} \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\big(1-y_i(w^\top x_i+b)\big)_+, \qquad(1.1)$$
where $C>0$ is a regularization parameter and the hinge loss is denoted by $(t)_+:=\max\{0,t\}$ for any $t\in\mathbb{R}$. However, much of the data encountered in practical applications is in matrix form, as in the MNIST database of handwritten digits, and such data is typically fed into SVMs by converting the matrices into vectors. This vectorization eliminates the spatial relationships inherent in the matrix data. Additionally, transforming matrices into vectors often leads to input
data with very high dimensionality, which can significantly increase computational
complexity and weaken generalization ability.
To address these challenges in classifying data that is originally in matrix form,
numerous variants of SVMs have been introduced, for example, the support matrix
machines (SMMs) [16]. Given a matrix input dataset $\mathcal{D}:=\{(X_i,y_i): i\in[n]\}$ with $X_i\in\mathbb{R}^{p\times q}$ and $y_i\in\{-1,1\}$, SMMs generally have the following form:
$$\min_{W\in\mathbb{R}^{p\times q},\, b\in\mathbb{R}} \ P(W) + C\sum_{i=1}^{n}\big(1-y_i(\langle W,X_i\rangle_{\mathbb{R}^{p\times q}}+b)\big)_+, \qquad(1.2)$$
where $\langle A,B\rangle_{\mathbb{R}^{p\times q}}:=\sum_{i=1}^{p}\sum_{j=1}^{q}a_{ij}b_{ij}$ for any matrices $A=(a_{ij}),\,B=(b_{ij})\in\mathbb{R}^{p\times q}$, and $P:\mathbb{R}^{p\times q}\to[0,+\infty)$ is a penalty function on the regression matrix $W$. Clearly, the model (1.2) reduces to the classical SVM (1.1) when $P(W)=\frac{1}{2}\langle W,W\rangle_{\mathbb{R}^{p\times q}}$. The existing research on this model can be roughly divided into two categories. The first line of work imposes low-rank or sparsity priors on the regression matrix. A rank-$k$ SMM with $P(W)=\frac{1}{2}\langle W,W\rangle_{\mathbb{R}^{p\times q}}$ and the additional constraint $\mathrm{rank}(W)\le k$ was studied in [29]. The regression matrix $W$ in [19] was factorized as $W=UV$ with $U\in\mathbb{R}^{p\times d}$, $V\in\mathbb{R}^{d\times q}$ and $d\le\min\{p,q\}$. Rank regularization for matrix regression problems, that is, $P(W)=\mathrm{rank}(W)$ in (1.2), was investigated in [35]. Luo et al. [16] introduced the spectral elastic net regularization with $P(W)=\frac{1}{2}\langle W,W\rangle_{\mathbb{R}^{p\times q}}+\lambda\|W\|_*$, where $\lambda>0$ and $\|\cdot\|_*$ denotes the nuclear norm of a matrix. Later, Zheng et al. [33] proposed a robust SMM by decomposing each sample into the sum of a low-rank matrix and a sparse matrix and assuming low-rankness of the learned regression matrix. The sparse SMM with $P(W)=\sum_{j=1}^{q}\sum_{i=1}^{p}|w_{ij}|+\lambda\|W\|_*$ was considered in [34]. In addition, other loss functions such as the truncated hinge loss [20] and the pinball loss [8] have been applied to the sparse SMM to improve its robustness. For a thorough review, we refer the reader to [12].
This work concerns the second category, which uses the kernel function to assess
the structural similarity of two matrices or grayscale images. The generalization
capacity of (1.2) can be enhanced by a suitable kernel function incorporating prior
information underlying image-type data [13]. Obviously, commonly used kernels such as the Gaussian kernel and the polynomial kernel, designed for vector-form data, cannot be directly adopted for matrix-type data. To this end, B. Schölkopf et al. [23] suggested constructing a locality-improved kernel via the general polynomial kernel and local correlations, which results in an incomplete polynomial kernel; see (5.1) for its specific formula. In the same spirit, neighborhood kernels were proposed by V. L. Brailovsky et al. [5], and a histogram intersection kernel for image classification was introduced by A. Barla et al. [2]. Recently, Y. Ye [31] proposed a class of nonlinear kernels via the matrix inner product in matrix Hilbert spaces. Specifically, the kernel constructed in [31] is $\kappa(A,B):=\langle [A,B]_{\mathbb{R}^{p\times q}},\,V\rangle_{\mathbb{R}^{q\times q}}$ for any given weight matrix $V\in\mathbb{R}^{q\times q}$, where $[\cdot,\cdot]_{\mathbb{R}^{p\times q}}:\mathbb{R}^{p\times q}\times\mathbb{R}^{p\times q}\to\mathbb{R}^{q\times q}$ is a matrix-valued mapping. Inspired by the great success of deep convolutional neural networks with the ReLU activation function in image classification, [1] designed a convolutional
neural tangent kernel (CNTK) derived from a convolutional neural network of infinite width trained by gradient descent. Later, an enhanced CNTK via local average pooling or random image patches was suggested in [14]. A recent review of neural tangent kernels is given in [26]. Both the CNTK and the incomplete polynomial kernel will be used as matrix-input kernels in our numerical experiments.
Motivated by the aforementioned work and training data contaminated by out-
lying observations, we focus on combining the kernel method and the ramp loss,
which is well known for its robustness and insensitivity to noise [6,10,27,28,30]. The ramp loss is defined as follows:
$$l_r(t) := \max\{0,\min\{1,t\}\}. \qquad(1.3)$$
Let $\mathcal{F}$ be a Hilbert space endowed with the inner product $\langle\cdot,\cdot\rangle_{\mathcal{F}}$ and $\Phi:\mathbb{R}^{p\times q}\to\mathcal{F}$ be a nonlinear feature map. Define a kernel function $\kappa:\mathbb{R}^{p\times q}\times\mathbb{R}^{p\times q}\to\mathbb{R}$ by $\kappa(A,B):=\langle\Phi(A),\Phi(B)\rangle_{\mathcal{F}}$ for any $A,B\in\mathbb{R}^{p\times q}$, as in [24]. The kernel SMM with ramp loss (R-KSMM) with respect to the dataset $\mathcal{D}$ can be described as follows:
$$\min_{w\in\mathcal{F},\, b\in\mathbb{R}} \ \frac{1}{2}\|w\|_{\mathcal{F}}^2 + C\sum_{i=1}^{n} l_r\big(1-y_i(\langle w,\Phi(X_i)\rangle_{\mathcal{F}}+b)\big). \qquad(1.4)$$
The ramp loss has been introduced to SMMs in [9,17]. However, in those works the ramp loss was typically approximated by a smooth function in order to solve the resulting models. To the best of our knowledge, there is only a small body of work that examines the first-order optimality conditions of the R-KSMM and solves it directly. This paper focuses on the theory and algorithms for R-KSMM (1.4) with generic matrix-input kernels. We develop an alternating direction method of multipliers (ADMM) algorithm to solve R-KSMM and show that any limit point of the sequence generated by ADMM is a proximal stationary point. In addition, the relationship between the proximal stationary point, Karush–Kuhn–Tucker point, and locally optimal solution of R-KSMM (1.4) is established.
The rest of this paper is organized as follows. In Section 2, the limiting subdifferential and the proximal operator of the ramp loss $l_r$ are characterized. In Section 3, the first-order necessary conditions of R-KSMM (1.4) are discussed. To solve it, an ADMM algorithm is developed in Section 4. Experiments on real datasets in Section 5 highlight the advantages of R-KSMM over state-of-the-art techniques. A brief conclusion is given in the last section.
2. Preliminaries
In this section, the limiting subdifferential of the ramp loss is characterized, and its
proximal operator is recalled.
2.1. Subdifferential of the ramp loss
The limiting subdifferential of the ramp loss $l_r$ was first characterized in [28, Page 12], but in one case its expression is incorrect. In this subsection, the correct expression of $\partial l_r$ is derived from the definition of the limiting subdifferential. To this end, we first recall from the monograph [21] the regular and limiting subdifferentials of a function $f:\mathbb{R}^l\to\overline{\mathbb{R}}:=\mathbb{R}\cup\{+\infty\}$ at a point $\bar z\in\mathrm{dom}\,f:=\{z\in\mathbb{R}^l: f(z)<\infty\}$. Let $\mathbb{N}$ denote the set of all positive integers. The inner product in $\mathbb{R}^l$ is denoted by $\langle\cdot,\cdot\rangle$.
Definition 2.1. (see [21, Definition 1.5 & Lemma 1.7]) Consider a function $f:\mathbb{R}^l\to\overline{\mathbb{R}}$ and a point $\bar z\in\mathrm{dom}\,f$. The lower limit of $f$ at $\bar z$ is defined by
$$\liminf_{z\to\bar z} f(z) := \lim_{\delta\downarrow 0}\Big(\inf_{z\in U_\delta(\bar z)} f(z)\Big) = \min\{\alpha\in\overline{\mathbb{R}} : \exists\, z^k\to\bar z \text{ with } f(z^k)\to\alpha\}.$$
The function $f$ is lower semicontinuous (lsc) at $\bar z$ if
$$\liminf_{z\to\bar z} f(z) \ge f(\bar z), \quad\text{or equivalently}\quad \liminf_{z\to\bar z} f(z) = f(\bar z).$$
Definition 2.2. (see [21, Definition 8.3]) Consider a function $f:\mathbb{R}^l\to\overline{\mathbb{R}}$ and a point $\bar z\in\mathrm{dom}\,f$. The regular and limiting subdifferentials of $f$ at $\bar z$ are, respectively, defined as
$$\widehat\partial f(\bar z) := \Big\{ v\in\mathbb{R}^l : \liminf_{\bar z\ne z\to\bar z} \frac{f(z)-f(\bar z)-\langle v,\,z-\bar z\rangle}{\|z-\bar z\|_2} \ge 0 \Big\},$$
and
$$\partial f(\bar z) := \Big\{ v\in\mathbb{R}^l : \exists\, z^k \xrightarrow{f} \bar z \text{ and } v^k\in\widehat\partial f(z^k) \text{ with } v^k\to v \text{ as } k\to\infty \Big\},$$
where $z^k \xrightarrow{f} \bar z$ means that $z^k\to\bar z$ with $f(z^k)\to f(\bar z)$.
It is well known that the sets $\partial f(\bar z)$ and $\widehat\partial f(\bar z)$ are closed, and $\widehat\partial f(\bar z)\subseteq\partial f(\bar z)$ with $\widehat\partial f(\bar z)$ being convex.
Lemma 2.1. The regular and limiting subdifferentials of the ramp loss $l_r(\cdot)$ defined in (1.3) are, respectively, given by
$$\widehat\partial l_r(t) = \begin{cases} \{0\}, & \text{if } t<0,\\ [0,1], & \text{if } t=0,\\ \{1\}, & \text{if } 0<t<1,\\ \emptyset, & \text{if } t=1,\\ \{0\}, & \text{if } t>1, \end{cases} \qquad\text{and}\qquad \partial l_r(t) = \begin{cases} \{0\}, & \text{if } t<0,\\ [0,1], & \text{if } t=0,\\ \{1\}, & \text{if } 0<t<1,\\ \{0,1\}, & \text{if } t=1,\\ \{0\}, & \text{if } t>1. \end{cases}$$
Proof. Notice that
$$l_r(t) = \max\{0,\min\{1,t\}\} = \begin{cases} 0, & \text{if } t\le 0,\\ t, & \text{if } 0<t<1,\\ 1, & \text{if } t\ge 1. \end{cases} \qquad(2.1)$$
Obviously, from [21, Exercise 8.8], it follows that
$$\partial l_r(t) = \widehat\partial l_r(t) = \begin{cases} \{0\}, & \text{if } t<0,\\ \{1\}, & \text{if } 0<t<1,\\ \{0\}, & \text{if } t>1. \end{cases}$$
Hence, it suffices to characterize $\partial l_r(t)$ and $\widehat\partial l_r(t)$ at $t=0$ and $t=1$. We divide the rest of the proof into the following two cases.
Case 1. $t=0$. In this case, we need to check that $\partial l_r(0)=\widehat\partial l_r(0)=[0,1]$. First, we argue that $\widehat\partial l_r(0)=[0,1]$. Fix any $v\in\widehat\partial l_r(0)$. Notice that
$$0 \le \liminf_{t\to 0}\frac{l_r(t)-l_r(0)-\langle v,t\rangle}{|t|} = \min\Big\{\lim_{t\to 0,\,t>0}\frac{l_r(t)-l_r(0)-\langle v,t\rangle}{|t|},\ \lim_{t\to 0,\,t<0}\frac{l_r(t)-l_r(0)-\langle v,t\rangle}{|t|}\Big\} = \min\{1-v,\,v\},$$
which implies that $v\in[0,1]$. Hence, $\widehat\partial l_r(0)\subseteq[0,1]$. Conversely, fix any $v\in[0,1]$. Notice that
$$\liminf_{t\to 0}\frac{l_r(t)-l_r(0)-\langle v,t\rangle}{|t|} = \min\Big\{\lim_{t\to 0,\,t>0}\frac{l_r(t)-l_r(0)-\langle v,t\rangle}{|t|},\ \lim_{t\to 0,\,t<0}\frac{l_r(t)-l_r(0)-\langle v,t\rangle}{|t|}\Big\} = \min\{1-v,\,v\} \ge 0.$$
Hence, $v\in\widehat\partial l_r(0)$ by Definition 2.2, and then $[0,1]\subseteq\widehat\partial l_r(0)$ by the arbitrariness of $v$ in $[0,1]$. This, together with $\widehat\partial l_r(0)\subseteq[0,1]$, yields $\widehat\partial l_r(0)=[0,1]$.
In what follows, we prove that $\partial l_r(0)=[0,1]$. Fix any $v\in\partial l_r(0)$. By Definition 2.2, there exist two sequences $t^k\to 0$ and $v^k\to v$ such that $l_r(t^k)\to l_r(0)$ and $v^k\in\widehat\partial l_r(t^k)$. If there exists an infinite integer set $J_1\subseteq\mathbb{N}$ such that $0<t^k<1$ for any $k\in J_1$, then, together with $v^k\in\widehat\partial l_r(t^k)$ and (2.1), we know that $v^k=1$ for any $k\in J_1$ and hence $v=1$ by the fact $v^k\to v$, which implies that $v\in[0,1]$. If there exists an infinite integer set $J_2\subseteq\mathbb{N}$ such that $t^k<0$ for any $k\in J_2$, then, together with $v^k\in\widehat\partial l_r(t^k)$ and (2.1), we have $v^k=0$ for any $k\in J_2$ and hence $v=0$ by the fact $v^k\to v$, which implies that $v\in[0,1]$. Otherwise, there exists an integer $\bar k$ such that $t^k=0$ for any $k>\bar k$; together with $v^k\in\widehat\partial l_r(t^k)$ and (2.1), we have $v^k\in[0,1]$ and hence $v\in[0,1]$ by the fact $v^k\to v$. In summary, $v\in[0,1]$ and $\partial l_r(0)\subseteq[0,1]$ by the arbitrariness of $v$. This, together with $[0,1]=\widehat\partial l_r(0)\subseteq\partial l_r(0)$, leads to $\widehat\partial l_r(0)=\partial l_r(0)=[0,1]$.
Case 2. $t=1$. In this case, we need to check that $\widehat\partial l_r(1)=\emptyset$ and $\partial l_r(1)=\{0,1\}$. First, we argue that $\widehat\partial l_r(1)=\emptyset$. Notice that
$$\liminf_{t\to 1}\frac{l_r(t)-l_r(1)-\langle v,t-1\rangle}{|t-1|} = \min\Big\{\lim_{t\to 1,\,t>1}\frac{l_r(t)-l_r(1)-\langle v,t-1\rangle}{|t-1|},\ \lim_{t\to 1,\,t<1}\frac{l_r(t)-l_r(1)-\langle v,t-1\rangle}{|t-1|}\Big\} = \min\{-v,\,v-1\}.$$
Obviously, there does not exist $v\in\mathbb{R}$ such that $\min\{-v,v-1\}\ge 0$; that is, $\widehat\partial l_r(1)=\emptyset$ by Definition 2.2.
It remains to argue that $\partial l_r(1)=\{0,1\}$. It is easy to check that $\{0,1\}\subseteq\partial l_r(1)$ by Definition 2.2. Hence, we only need to prove that $\partial l_r(1)\subseteq\{0,1\}$. Fix any $v\in\partial l_r(1)$. By Definition 2.2, there exist two sequences $t^k\to 1$ and $v^k\to v$ such that $l_r(t^k)\to l_r(1)$ and $v^k\in\widehat\partial l_r(t^k)$. Since $\widehat\partial l_r(1)=\emptyset$ and $v^k\in\widehat\partial l_r(t^k)$ for each $k$, we conclude that $t^k\ne 1$. If there exists an infinite integer set $J_1\subseteq\mathbb{N}$ such that $t^k>1$ for any $k\in J_1$, then, together with $v^k\in\widehat\partial l_r(t^k)$ and (2.1), we know $v^k=0$ for each $k\in J_1$ and hence $v=0$ by the fact $v^k\to v$. Otherwise, there exists an infinite integer set $J_2\subseteq\mathbb{N}$ such that $0<t^k<1$ for any $k\in J_2$; together with $v^k\in\widehat\partial l_r(t^k)$ and (2.1), we see that $v^k=1$ for any $k\in J_2$ and hence $v=1$ by the assumption $v^k\to v$. In summary, $v\in\{0,1\}$ and then $\partial l_r(1)\subseteq\{0,1\}$.
Remark 2.1. The expression of $\partial l_r(\cdot)$ in Lemma 2.1 differs from that in [28] at $t=1$, where [28] states that $\partial l_r(1)=[0,1]$.
2.2. Proximal operator
The proximal algorithm is a prevalent methodology for addressing nonconvex and
nonsmooth optimization problems, provided that all pertinent proximal operators
can be efficiently computed. Now, we recall the definition of the proximal operator.
Definition 2.3. (see [18, Definition 1.1]) Given $\gamma>0$ and $C>0$, the proximal operator of a function $f:\mathbb{R}^l\to\overline{\mathbb{R}}$ is defined by
$$\mathrm{Prox}_{\gamma C f}(z) := \arg\min_{x\in\mathbb{R}^l}\ C f(x) + \frac{1}{2\gamma}\|x-z\|_2^2,\qquad \forall\, z\in\mathbb{R}^l.$$
The proximal operator of the ramp loss $l_r$ was provided in [28].
Lemma 2.2. (see [28, Proposition 1]) Given two constants $\gamma, C>0$. If $0<\gamma C<2$, the proximal operator $\mathrm{Prox}_{\gamma C l_r}$ at $t\in\mathbb{R}$ is given by
$$\mathrm{Prox}_{\gamma C l_r}(t) = \begin{cases} \{t\}, & \text{if } t>1+\frac{\gamma C}{2} \text{ or } t\le 0,\\ \{t,\,t-\gamma C\}, & \text{if } t=1+\frac{\gamma C}{2},\\ \{t-\gamma C\}, & \text{if } \gamma C\le t<1+\frac{\gamma C}{2},\\ \{0\}, & \text{if } 0<t<\gamma C. \end{cases}$$
To guarantee the single-valuedness of the proximal operator of $l_r$, the following modified version will be adopted:
$$\mathrm{Prox}_{\gamma C l_r}(t) = \begin{cases} t, & \text{if } t\ge 1+\frac{\gamma C}{2} \text{ or } t\le 0,\\ t-\gamma C, & \text{if } \gamma C\le t<1+\frac{\gamma C}{2},\\ 0, & \text{if } 0<t<\gamma C. \end{cases} \qquad(2.2)$$
Remark 2.2. When $\gamma C\ge 2$, the proximal operator $\mathrm{Prox}_{\gamma C l_r}$ is also characterized in [28, Proposition 2]. Since it will not be used in this paper, we omit it.
With the ramp loss $l_r$ defined as in (1.3), define a function $L_r:\mathbb{R}^n\to\mathbb{R}$ by
$$L_r(\mathbf{z}) := \sum_{i=1}^{n} l_r(z_i),\qquad \forall\, \mathbf{z}\in\mathbb{R}^n. \qquad(2.3)$$
Notice that the function $L_r$ is separable. By [3, Theorem 6.6], the proximal operator $\mathrm{Prox}_{\gamma C L_r}$ of $L_r$ has the form
$$\mathrm{Prox}_{\gamma C L_r}(\mathbf{z}) = \mathrm{Prox}_{\gamma C l_r}(z_1)\times\mathrm{Prox}_{\gamma C l_r}(z_2)\times\cdots\times\mathrm{Prox}_{\gamma C l_r}(z_n),\qquad \forall\, \mathbf{z}\in\mathbb{R}^n. \qquad(2.4)$$
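For concreteness, the following Python sketch applies the single-valued proximal formula (2.2) entrywise, which by (2.4) is exactly how $\mathrm{Prox}_{\gamma C L_r}$ acts on a vector. The function name prox_ramp is ours, and the code assumes $0<\gamma C<2$.

```python
import numpy as np

def prox_ramp(z, gamma, C):
    """Entrywise proximal operator of gamma*C*L_r, following (2.2) and (2.4).
    Assumes 0 < gamma*C < 2; z may be a scalar or a NumPy array."""
    z = np.asarray(z, dtype=float)
    gc = gamma * C
    keep = (z >= 1.0 + gc / 2.0) | (z <= 0.0)   # identity region of (2.2)
    shift = (~keep) & (z >= gc)                  # region mapped to t - gamma*C
    return np.where(keep, z, np.where(shift, z - gc, 0.0))
```

For instance, with $\gamma=1$ and $C=0.5$, the input $(-0.5,\,0.3,\,1.2,\,2.5)$ is mapped to $(-0.5,\,0,\,0.7,\,2.5)$, matching (2.2) componentwise.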
2.3. Kernel support matrix machine
In this subsection, we rewrite R-KSMM (1.4) in vector form by applying the well-known representer theorem. Observe that the term $\langle w,\Phi(X_i)\rangle_{\mathcal{F}}$ in (1.4) cannot be handled directly when $\mathcal{F}$ is a high-dimensional or even infinite-dimensional Hilbert space. To overcome this, we resort to the representer theorem for regularized learning, which lies at the basis of kernel-based methods in machine learning [22,24]. It was first proven in the context of the squared loss function [11], and later extended to any general pointwise loss function [22]. On the basis of the orthogonal decomposition theorem in a Hilbert space, we immediately have the following representer theorem for matrix input data.
Lemma 2.3. (see [24, Theorem 4.2]) For any solution $(w^*, b)\in\mathcal{F}\times\mathbb{R}$ of R-KSMM (1.4), there exist constants $c_j$, $j\in[n]$, such that
$$w^* = \sum_{j=1}^{n} c_j\Phi(X_j). \qquad(2.5)$$
Write $\mathbf{1}:=(1,1,\dots,1)^\top$ and $\mathbf{y}:=(y_1,y_2,\dots,y_n)^\top\in\mathbb{R}^n$. For the given matrix input dataset $\mathcal{D}$, we introduce the kernel matrix
$$K := [K_{ij}: i,j\in[n]] := [\kappa(X_i,X_j): i,j\in[n]].$$
By equation (2.5), it follows that
$$\|w^*\|_{\mathcal{F}}^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j\langle\Phi(X_i),\Phi(X_j)\rangle_{\mathcal{F}} = \mathbf{c}^\top K\mathbf{c},$$
and $\langle w^*,\Phi(X_i)\rangle_{\mathcal{F}} = \sum_{j=1}^{n}\kappa(X_i,X_j)c_j = \sum_{j=1}^{n}K_{ij}c_j$. Hence, R-KSMM (1.4) can be written in the compact form
$$\min_{\mathbf{c}\in\mathbb{R}^n,\, b\in\mathbb{R}}\ \frac{1}{2}\mathbf{c}^\top K\mathbf{c} + C L_r(\mathbf{1}-\mathrm{diag}(\mathbf{y})K\mathbf{c}-b\mathbf{y}), \qquad(2.6)$$
where $L_r$ is given by (2.3). By introducing an extra variable $\mathbf{u}\in\mathbb{R}^n$, (2.6) can be transformed into the following form:
$$\begin{aligned} \min_{\mathbf{c}\in\mathbb{R}^n,\, b\in\mathbb{R},\, \mathbf{u}\in\mathbb{R}^n}\ & \frac{1}{2}\mathbf{c}^\top K\mathbf{c} + C L_r(\mathbf{u})\\ \text{s.t.}\ & \mathbf{u} + \mathrm{diag}(\mathbf{y})K\mathbf{c} + b\mathbf{y} = \mathbf{1}. \end{aligned} \qquad(2.7)$$
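To make the reduction concrete, here is a small Python sketch (with our own function names gram_matrix and rksmm_objective) that assembles the kernel matrix $K$ from a generic matrix-input kernel $\kappa$ and evaluates the objective of the compact form (2.6); it is only an illustration, not the solver developed below.

```python
import numpy as np

def gram_matrix(kernel, X_list):
    """Kernel matrix K with K_ij = kappa(X_i, X_j) for matrix-valued inputs."""
    n = len(X_list)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X_list[i], X_list[j])
    return K

def rksmm_objective(c, b, K, y, C):
    """Objective of the compact form (2.6):
    0.5*c^T K c + C * sum_i l_r(1 - y_i((Kc)_i + b))."""
    ramp = lambda t: np.clip(t, 0.0, 1.0)   # l_r(t) = max{0, min{1, t}}
    margins = 1.0 - y * (K @ c + b)
    return 0.5 * (c @ K @ c) + C * np.sum(ramp(margins))
```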
3. First-order necessary conditions of R-KSMM
To develop an algorithm to solve the problem (2.7), we first establish its first-order
optimality condition. Based on this, the definitions of the Karush-Kuhn-Tucker
(KKT) point and proximal stationary point (P-stationary point) of the problem
(2.7) are introduced as follows.
Definition 3.1. Consider the problem (2.7). A point $(\mathbf{c}^*, b^*, \mathbf{u}^*)\in\mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$ is called a KKT point if there exists a Lagrangian multiplier $\boldsymbol\lambda^*\in\mathbb{R}^n$ such that
$$\begin{cases} K\mathbf{c}^* + K\,\mathrm{diag}(\mathbf{y})\boldsymbol\lambda^* = \mathbf{0},\\ \langle\mathbf{y},\boldsymbol\lambda^*\rangle = 0,\\ \mathbf{u}^* + \mathrm{diag}(\mathbf{y})K\mathbf{c}^* + b^*\mathbf{y} = \mathbf{1},\\ \mathbf{0}\in C\,\partial L_r(\mathbf{u}^*) + \boldsymbol\lambda^*, \end{cases} \qquad(3.1)$$
where $\mathbf{0}$ stands for the zero vector in $\mathbb{R}^n$.
Definition 3.2. Consider the problem (2.7). A point $(\mathbf{c}^*, b^*, \mathbf{u}^*)\in\mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$ is called a P-stationary point if there exist a Lagrangian multiplier $\boldsymbol\lambda^*\in\mathbb{R}^n$ and a constant $\gamma>0$ such that
$$K\mathbf{c}^* + K\,\mathrm{diag}(\mathbf{y})\boldsymbol\lambda^* = \mathbf{0}, \qquad(3.2a)$$
$$\langle\mathbf{y},\boldsymbol\lambda^*\rangle = 0, \qquad(3.2b)$$
$$\mathbf{u}^* + \mathrm{diag}(\mathbf{y})K\mathbf{c}^* + b^*\mathbf{y} = \mathbf{1}, \qquad(3.2c)$$
$$\mathrm{Prox}_{\gamma C L_r}(\mathbf{u}^* - \gamma\boldsymbol\lambda^*) = \mathbf{u}^*, \qquad(3.2d)$$
where $\mathrm{Prox}_{\gamma C L_r}(\cdot)$ is given by (2.4).
Now we are in a position to reveal the relationship between the KKT point and
the P-stationary point.
Theorem 3.1. Consider the problem (2.7) and a point $(\mathbf{c}^*, b^*, \mathbf{u}^*)\in\mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$. If $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a P-stationary point, then it is also a KKT point. Conversely, if $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a KKT point with $I_1:=\{i\in[n]: u_i^*=1\}=\emptyset$, then it is also a P-stationary point.
Proof. Suppose that $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a P-stationary point of (2.7). Then there exist $\boldsymbol\lambda^*\in\mathbb{R}^n$ and $\gamma>0$ such that equations (3.2a)-(3.2d) hold by Definition 3.2. To prove that $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a KKT point, it is sufficient to argue that $\mathbf{0}\in C\,\partial L_r(\mathbf{u}^*)+\boldsymbol\lambda^*$ by Definition 3.1. From Definition 2.3 and (3.2d), it follows that
$$\mathbf{u}^* = \arg\min_{\mathbf{v}\in\mathbb{R}^n}\ C L_r(\mathbf{v}) + \frac{1}{2\gamma}\|\mathbf{v}-(\mathbf{u}^*-\gamma\boldsymbol\lambda^*)\|_2^2.$$
This, together with [21, Theorem 10.1], yields that
$$\mathbf{0}\in\partial\big(C L_r(\mathbf{u}^*)\big) + \frac{1}{\gamma}\big(\mathbf{u}^*-(\mathbf{u}^*-\gamma\boldsymbol\lambda^*)\big),$$
which implies that $\mathbf{0}\in C\,\partial L_r(\mathbf{u}^*)+\boldsymbol\lambda^*$. So equation (3.1) holds and then $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a KKT point by Definition 3.1.
Conversely, suppose that $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a KKT point. Then there exists $\boldsymbol\lambda^*\in\mathbb{R}^n$ such that equation (3.1) holds by Definition 3.1. Obviously, to argue that $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a P-stationary point of (2.7), it suffices to prove that there exists a constant $\gamma>0$ such that $\mathrm{Prox}_{\gamma C L_r}(\mathbf{u}^*-\gamma\boldsymbol\lambda^*)=\mathbf{u}^*$ by Definition 3.2. Since $-\boldsymbol\lambda^*\in C\,\partial L_r(\mathbf{u}^*)$, we arrive at $-\lambda_i^*\in C\,\partial l_r(u_i^*)$ for any $i\in[n]$ by [21, Proposition 10.5]. This, along with Lemma 2.1 and $I_1=\emptyset$, yields that for any $\gamma>0$,
$$u_i^*-\gamma\lambda_i^* = \begin{cases} u_i^*, & \text{if } u_i^*<0,\\ -\gamma\lambda_i^*\in[0,\gamma C], & \text{if } u_i^*=0,\\ u_i^*+\gamma C, & \text{if } 0<u_i^*<1,\\ u_i^*, & \text{if } u_i^*>1. \end{cases} \qquad(3.3)$$
Write $I_<:=\{i\in[n]: 0<u_i^*<1\}$ and $I_>:=\{i\in[n]: u_i^*>1\}$. Take any $\gamma$ satisfying
$$0<\gamma C<\min\Big\{2,\ \min_{i\in I_>} 2(u_i^*-1),\ \min_{i\in I_<} 2(1-u_i^*)\Big\}.$$
Together with (2.2) and (3.3), we conclude that $\mathrm{Prox}_{\gamma C l_r}(u_i^*-\gamma\lambda_i^*)=u_i^*$ for each $i\in[n]$, and then $\mathrm{Prox}_{\gamma C L_r}(\mathbf{u}^*-\gamma\boldsymbol\lambda^*)=\mathbf{u}^*$ by (2.4). Consequently, $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a P-stationary point of (2.7).
Remark 3.1. From the second part of the proof of Theorem 3.1, we know that any KKT point with $I_1=\emptyset$ is also a P-stationary point with parameter $\gamma$ satisfying $0<\gamma C<2$. In fact, when $\gamma C\ge 2$, the proximal operator $\mathrm{Prox}_{\gamma C L_r}$ is characterized in [28, Proposition 2]. In this case, we observe that $\mathrm{Prox}_{\gamma C L_r}$ coincides with the proximal operator of the $\ell_0$-norm hinge loss $\|\max\{0,t\}\|_0$ [15]. It was proved in [15, Theorem 3.13] that the KKT points and the P-stationary points are equivalent for SVM with the $\ell_0$-norm hinge loss, whereas for R-KSMM the KKT points and the P-stationary points with parameter $\gamma$ satisfying $\gamma C\ge 2$ need not be equivalent.
The following conclusion characterizes the relationship between the locally op-
timal solutions and KKT points of R-KSMM (2.7).
Theorem 3.2. Consider the problem (2.7) and a point $(\mathbf{c}^*, b^*, \mathbf{u}^*)\in\mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^n$. If $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a locally optimal solution, then it is also a KKT point.
Proof. Let $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ be a locally optimal solution of (2.7). Then $\mathbf{u}^*+\mathrm{diag}(\mathbf{y})K\mathbf{c}^*+b^*\mathbf{y}=\mathbf{1}$ and $(\mathbf{c}^*, b^*)$ is a locally optimal solution of (2.6). By [21, Theorems 10.1 & 10.6], there exists $-\boldsymbol\lambda^*\in C\,\partial L_r(\mathbf{u}^*)$ such that
$$\begin{pmatrix} K\mathbf{c}^*\\ 0 \end{pmatrix} + \begin{pmatrix} K\,\mathrm{diag}(\mathbf{y})\\ \mathbf{y}^\top \end{pmatrix}\boldsymbol\lambda^* = \mathbf{0},$$
which implies that (3.2a) and (3.2b) hold. Hence, the equations in (3.1) hold and then $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a KKT point by Definition 3.1.
4. ADMM algorithm
In this section, we develop an ADMM algorithm [4] to solve R-KSMM (2.7). The framework of the ADMM algorithm is given as Algorithm 1 and its convergence is analyzed. Besides, the notion of support matrices for R-KSMM is defined via the P-stationary point.
4.1. ADMM for R-KSMM
Given a parameter $\sigma>0$, the augmented Lagrangian function of the problem (2.7) is defined by
$$L_\sigma(\mathbf{c},b,\mathbf{u};\boldsymbol\lambda) := \frac{1}{2}\mathbf{c}^\top K\mathbf{c} + C L_r(\mathbf{u}) + \langle\boldsymbol\lambda,\,\mathbf{u}+\mathrm{diag}(\mathbf{y})K\mathbf{c}+b\mathbf{y}-\mathbf{1}\rangle + \frac{\sigma}{2}\|\mathbf{u}+\mathrm{diag}(\mathbf{y})K\mathbf{c}+b\mathbf{y}-\mathbf{1}\|_2^2.$$
After a simple calculation, it can be rewritten as
$$L_\sigma(\mathbf{c},b,\mathbf{u};\boldsymbol\lambda) = \frac{1}{2}\mathbf{c}^\top K\mathbf{c} + C L_r(\mathbf{u}) + \frac{\sigma}{2}\Big\|\mathbf{u}-\Big(\mathbf{1}-\mathrm{diag}(\mathbf{y})K\mathbf{c}-b\mathbf{y}-\frac{\boldsymbol\lambda}{\sigma}\Big)\Big\|_2^2 - \frac{\|\boldsymbol\lambda\|_2^2}{2\sigma}.$$
Consequently, the $k$-th iteration of the ADMM algorithm is given as follows:
$$\begin{cases} \mathbf{u}^{k+1} := \arg\min_{\mathbf{u}\in\mathbb{R}^n} L_\sigma(\mathbf{c}^k,b^k,\mathbf{u};\boldsymbol\lambda^k),\\ \mathbf{c}^{k+1} := \arg\min_{\mathbf{c}\in\mathbb{R}^n} L_\sigma(\mathbf{c},b^k,\mathbf{u}^{k+1};\boldsymbol\lambda^k),\\ b^{k+1} := \arg\min_{b\in\mathbb{R}} L_\sigma(\mathbf{c}^{k+1},b,\mathbf{u}^{k+1};\boldsymbol\lambda^k),\\ \boldsymbol\lambda^{k+1} := \boldsymbol\lambda^k + \iota\sigma(\mathbf{u}^{k+1}+\mathrm{diag}(\mathbf{y})K\mathbf{c}^{k+1}+b^{k+1}\mathbf{y}-\mathbf{1}), \end{cases} \qquad(4.1)$$
where $\iota>0$ is a given constant.
Next, we simplify each subproblem in (4.1). Denote
$$\boldsymbol\eta^k := \mathbf{1}-\mathrm{diag}(\mathbf{y})K\mathbf{c}^k-b^k\mathbf{y}-\frac{\boldsymbol\lambda^k}{\sigma}. \qquad(4.2)$$
In the following, we take any $\sigma>0$ satisfying $\frac{C}{\sigma}<2$. Define three index sets $\Gamma_0^k$, $\Gamma_1^k$ and $\Gamma_2^k$ with respect to $\boldsymbol\eta^k$ at the $k$-th step by
$$\Gamma_0^k := \Big\{i\in[n]: 0<\eta_i^k<\tfrac{C}{\sigma}\Big\},\quad \Gamma_1^k := \Big\{i\in[n]: \tfrac{C}{\sigma}\le\eta_i^k<1+\tfrac{C}{2\sigma}\Big\},\quad \Gamma_2^k := [n]\setminus(\Gamma_0^k\cup\Gamma_1^k). \qquad(4.3)$$
(i) Updating $\mathbf{u}^{k+1}$. From the first equation in (4.1) and (4.2), it follows that
$$\mathbf{u}^{k+1} = \arg\min_{\mathbf{u}\in\mathbb{R}^n}\ C L_r(\mathbf{u}) + \frac{\sigma}{2}\|\mathbf{u}-\boldsymbol\eta^k\|_2^2 = \mathrm{Prox}_{\frac{C}{\sigma}L_r}(\boldsymbol\eta^k).$$
By (2.2), we can deduce that
$$\mathbf{u}^{k+1}_{\Gamma_0^k} = \mathbf{0}_{\Gamma_0^k},\qquad \mathbf{u}^{k+1}_{\Gamma_1^k} = \boldsymbol\eta^k_{\Gamma_1^k}-\tfrac{C}{\sigma}\mathbf{1}_{\Gamma_1^k},\qquad \mathbf{u}^{k+1}_{\Gamma_2^k} = \boldsymbol\eta^k_{\Gamma_2^k}. \qquad(4.4)$$
Here, given an index subset $\Gamma$ of $[n]$, $\mathbf{x}_\Gamma$ denotes the subvector of $\mathbf{x}\in\mathbb{R}^n$ obtained by deleting the entries $x_i$ for all $i\notin\Gamma$.
(ii) Updating $\mathbf{c}^{k+1}$. Write $\boldsymbol\xi^k := \mathbf{1}-\mathbf{u}^{k+1}-b^k\mathbf{y}-\frac{\boldsymbol\lambda^k}{\sigma}$. By the second equation in (4.1), after a simple calculation, it holds that
$$\mathbf{c}^{k+1} = \arg\min_{\mathbf{c}\in\mathbb{R}^n} f(\mathbf{c}) := \frac{1}{2}\mathbf{c}^\top K\mathbf{c} + \frac{\sigma}{2}\big\|\mathrm{diag}(\mathbf{y})K\mathbf{c}-\boldsymbol\xi^k\big\|_2^2.$$
Obviously, $f$ is a convex quadratic function, and
$$\nabla_{\mathbf{c}} f(\mathbf{c}) = (K+\sigma KK)\mathbf{c} - \sigma K\,\mathrm{diag}(\mathbf{y})\boldsymbol\xi^k.$$
Consequently, $\mathbf{c}^{k+1}$ can be obtained by solving the linear system
$$(K+\sigma KK)\mathbf{c}^{k+1} = \sigma K\,\mathrm{diag}(\mathbf{y})\boldsymbol\xi^k. \qquad(4.5)$$
(iii) Updating $b^{k+1}$. From the third equation in (4.1), it follows that
$$b^{k+1} = \arg\min_{b\in\mathbb{R}}\ \Big\|b\mathbf{y}-\Big(\mathbf{1}-\mathbf{u}^{k+1}-\mathrm{diag}(\mathbf{y})K\mathbf{c}^{k+1}-\frac{\boldsymbol\lambda^k}{\sigma}\Big)\Big\|_2^2.$$
Similar to the update rule of $\mathbf{c}^{k+1}$, we can deduce that
$$b^{k+1} = \frac{\mathbf{y}^\top\mathbf{r}^k}{\mathbf{y}^\top\mathbf{y}} = \frac{\mathbf{y}^\top\mathbf{r}^k}{n},\qquad \text{with}\ \ \mathbf{r}^k := \mathbf{1}-\mathbf{u}^{k+1}-\mathrm{diag}(\mathbf{y})K\mathbf{c}^{k+1}-\frac{\boldsymbol\lambda^k}{\sigma}. \qquad(4.6)$$
(iv) Updating $\boldsymbol\lambda^{k+1}$. Write $\boldsymbol\omega^k := \mathbf{u}^{k+1}+\mathrm{diag}(\mathbf{y})K\mathbf{c}^{k+1}+b^{k+1}\mathbf{y}-\mathbf{1}$. We set
$$\boldsymbol\lambda^{k+1}_{\Gamma_0^k\cup\Gamma_1^k} = \boldsymbol\lambda^k_{\Gamma_0^k\cup\Gamma_1^k}+\iota\sigma\boldsymbol\omega^k_{\Gamma_0^k\cup\Gamma_1^k},\qquad \boldsymbol\lambda^{k+1}_{\Gamma_2^k} = \mathbf{0}_{\Gamma_2^k}. \qquad(4.7)$$
Overall, the ADMM algorithm for R-KSMM is stated in Algorithm 1.
Algorithm 1 ADMM for R-KSMM
Initialize $(\mathbf{c}^0, b^0, \mathbf{u}^0, \boldsymbol\lambda^0)$. Set parameters $C, \sigma, \iota>0$ with $\sigma>\frac{C}{2}$. Choose the maximum number of iterations $N$.
while the stopping tolerance is not satisfied and $k\le N$ do
    Update $\mathbf{u}^{k+1}$ as in (4.4);
    Update $\mathbf{c}^{k+1}$ as in (4.5);
    Update $b^{k+1}$ as in (4.6);
    Update $\boldsymbol\lambda^{k+1}$ as in (4.7);
    Set $k = k+1$;
end while
return the final solution $(\mathbf{c}^k, b^k, \mathbf{u}^k, \boldsymbol\lambda^k)$ to (2.7).
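The following Python sketch is one possible realization of Algorithm 1 with the updates (4.4)-(4.7), assuming $\sigma>\frac{C}{2}$ and a symmetric kernel matrix $K$. The zero initialization, the use of numpy.linalg.solve for (4.5) (which presumes the matrix $K+\sigma KK$ is nonsingular), and the stopping rule based on the primal residual $\|\boldsymbol\omega^k\|_2$ are our own choices and are not prescribed above.

```python
import numpy as np

def admm_rksmm(K, y, C, sigma, iota=1.0, max_iter=300, tol=1e-6):
    """A minimal sketch of Algorithm 1 (ADMM for R-KSMM), assuming sigma > C/2.
    K is the n-by-n kernel matrix and y contains labels in {-1, +1}."""
    n = K.shape[0]
    c = np.zeros(n)
    b = 0.0
    u = np.zeros(n)
    lam = np.zeros(n)
    gc = C / sigma                      # plays the role of gamma*C with gamma = 1/sigma
    A = K + sigma * K @ K               # coefficient matrix of the c-update (4.5)
    for _ in range(max_iter):
        # u-update (4.4): proximal step (2.2) applied entrywise to eta^k from (4.2)
        eta = 1.0 - y * (K @ c) - b * y - lam / sigma
        u = np.where((eta >= 1.0 + gc / 2.0) | (eta <= 0.0), eta,
                     np.where(eta >= gc, eta - gc, 0.0))
        # c-update (4.5): solve (K + sigma*K*K) c = sigma*K*diag(y)*xi^k
        xi = 1.0 - u - b * y - lam / sigma
        c = np.linalg.solve(A, sigma * K @ (y * xi))   # assumes A is nonsingular
        # b-update (4.6): least-squares fit of b*y to r^k
        r = 1.0 - u - y * (K @ c) - lam / sigma
        b = (y @ r) / n
        # multiplier update (4.7): lambda is set to zero on Gamma_2^k
        omega = u + y * (K @ c) + b * y - 1.0
        on_gamma01 = (eta > 0.0) & (eta < 1.0 + gc / 2.0)
        lam = np.where(on_gamma01, lam + iota * sigma * omega, 0.0)
        if np.linalg.norm(omega) <= tol:   # primal feasibility as a simple stopping rule
            break
    return c, b, u, lam
```

Since the matrix $K+\sigma KK$ does not change with $k$, a factorization computed once before the loop could be reused in every c-update; the sketch above keeps the plain solve for readability.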
In the following, we prove that if the sequence generated by Algorithm 1 has a
limit point, it must be a P-stationary point for R-KSMM (2.7).
Theorem 4.1. Consider the problem (2.7) and fix any constant $\sigma>0$ satisfying $\sigma>\frac{C}{2}$. Then every limit point of the sequence generated by Algorithm 1 is a P-stationary point.
Proof. Suppose that the sequence $\{(\mathbf{c}^k, b^k, \mathbf{u}^k, \boldsymbol\lambda^k)\}$ is generated by Algorithm 1 and $(\mathbf{c}^*, b^*, \mathbf{u}^*, \boldsymbol\lambda^*)$ is any of its limit points. Let $\boldsymbol\eta^k$ be given by (4.2) and the three index sets $\Gamma_0^k$, $\Gamma_1^k$ and $\Gamma_2^k$ be defined as in (4.3). Notice that $\Gamma_0^k, \Gamma_1^k, \Gamma_2^k\subseteq[n]$ for all $k\in\mathbb{N}$ and the number of subsets of $[n]$ is finite. Hence, there exist three subsets $\Gamma_0, \Gamma_1, \Gamma_2\subseteq[n]$ and an infinite integer set $J\subseteq\mathbb{N}$ such that
$$\Gamma_0^k\equiv\Gamma_0,\quad \Gamma_1^k\equiv\Gamma_1\quad\text{and}\quad \Gamma_2^k\equiv\Gamma_2=[n]\setminus(\Gamma_0\cup\Gamma_1)\quad\text{for any } k\in J.$$
Taking the limit along $J$ in (4.7), we arrive at
$$\boldsymbol\lambda^*_{\Gamma_0\cup\Gamma_1} = \boldsymbol\lambda^*_{\Gamma_0\cup\Gamma_1}+\iota\sigma\boldsymbol\omega^*_{\Gamma_0\cup\Gamma_1},\qquad \boldsymbol\lambda^*_{\Gamma_2} = \mathbf{0}_{\Gamma_2}, \qquad(4.8)$$
where
$$\boldsymbol\omega^* := \mathbf{u}^*+\mathrm{diag}(\mathbf{y})K\mathbf{c}^*+b^*\mathbf{y}-\mathbf{1}. \qquad(4.9)$$
Again taking the limit along $J$ in (4.2) and (4.4), we obtain
$$\mathbf{u}^*_{\Gamma_0} = \mathbf{0}_{\Gamma_0},\qquad \mathbf{u}^*_{\Gamma_1} = \boldsymbol\eta^*_{\Gamma_1}-\tfrac{C}{\sigma}\mathbf{1}_{\Gamma_1},\qquad \mathbf{u}^*_{\Gamma_2} = \boldsymbol\eta^*_{\Gamma_2}, \qquad(4.10)$$
where
$$\boldsymbol\eta^* := \mathbf{1}-\mathrm{diag}(\mathbf{y})K\mathbf{c}^*-b^*\mathbf{y}-\frac{\boldsymbol\lambda^*}{\sigma} = -\boldsymbol\omega^*+\mathbf{u}^*-\frac{\boldsymbol\lambda^*}{\sigma}, \qquad(4.11)$$
and the last equality holds by (4.9). By the above equality, we know that $\boldsymbol\eta^*_{\Gamma_2}=-\boldsymbol\omega^*_{\Gamma_2}+\mathbf{u}^*_{\Gamma_2}-\frac{\boldsymbol\lambda^*_{\Gamma_2}}{\sigma}$, which, along with the second equality in (4.8) and the third equality in (4.10), implies that $\boldsymbol\omega^*_{\Gamma_2}=\mathbf{0}_{\Gamma_2}$. On the other hand, $\boldsymbol\omega^*_{\Gamma_0\cup\Gamma_1}=\mathbf{0}_{\Gamma_0\cup\Gamma_1}$ follows from the first equation in (4.8). Therefore, $\boldsymbol\omega^*=\mathbf{0}$. This, along with (4.9) and (4.11), leads to $\boldsymbol\eta^*=\mathbf{u}^*-\frac{\boldsymbol\lambda^*}{\sigma}$ and
$$\mathbf{u}^*+\mathrm{diag}(\mathbf{y})K\mathbf{c}^*+b^*\mathbf{y}=\mathbf{1}. \qquad(4.12)$$
Notice that $\boldsymbol\eta^*=\mathbf{u}^*-\frac{\boldsymbol\lambda^*}{\sigma}$ and $\frac{C}{\sigma}<2$. By (4.10), (4.3) and (2.2), one has
$$\mathbf{u}^* = \mathrm{Prox}_{\frac{C}{\sigma}L_r}(\boldsymbol\eta^*) = \mathrm{Prox}_{\frac{C}{\sigma}L_r}\Big(\mathbf{u}^*-\frac{\boldsymbol\lambda^*}{\sigma}\Big). \qquad(4.13)$$
Similarly, taking the limit along $J$ in (4.6), we have
$$b^* = \frac{\langle\mathbf{y},\,\mathbf{1}-\mathbf{u}^*-\mathrm{diag}(\mathbf{y})K\mathbf{c}^*-\frac{\boldsymbol\lambda^*}{\sigma}\rangle}{n} \overset{(4.12)}{=} \frac{\langle\mathbf{y},\,b^*\mathbf{y}-\frac{\boldsymbol\lambda^*}{\sigma}\rangle}{n} = b^*-\frac{\langle\mathbf{y},\boldsymbol\lambda^*\rangle}{n\sigma},$$
which means that
$$\langle\mathbf{y},\boldsymbol\lambda^*\rangle = 0. \qquad(4.14)$$
Taking the limit along $J$ in (4.5) and using (4.12), we arrive at
$$(K+\sigma KK)\mathbf{c}^* = \sigma K\,\mathrm{diag}(\mathbf{y})\Big(\mathrm{diag}(\mathbf{y})K\mathbf{c}^*-\frac{\boldsymbol\lambda^*}{\sigma}\Big),$$
which implies that
$$K\mathbf{c}^*+K\,\mathrm{diag}(\mathbf{y})\boldsymbol\lambda^*=\mathbf{0}. \qquad(4.15)$$
Therefore, $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ is a P-stationary point by equations (4.15), (4.14), (4.12), (4.13) and Definition 3.2.
4.2. Support matrices
Let $(\mathbf{c}^*, b^*, \mathbf{u}^*)$ be a P-stationary point of R-KSMM (2.7). Then, by Definition 3.2, there exist a Lagrangian multiplier $\boldsymbol\lambda^*\in\mathbb{R}^n$ and a constant $\gamma>0$ such that (3.2a)-(3.2d) hold. By Theorem 3.1, it is also a KKT point of R-KSMM (2.7), and then $\mathbf{0}\in C\,\partial L_r(\mathbf{u}^*)+\boldsymbol\lambda^*$. This, along with Lemma 2.1, yields that $\lambda_i^*=0$ for any $i\notin\mathcal{S}$, where $\mathcal{S}:=\{i\in[n]: 0\le u_i^*\le 1\}$.
Now, let the kernel $\kappa$ be strictly positive definite on $\mathbb{R}^{p\times q}$; see [24, Subsection 2.2.1] for the generic definition on an arbitrary non-empty set. From (3.2a), it follows that $K\mathbf{c}^*=-K\,\mathrm{diag}(\mathbf{y})\boldsymbol\lambda^*$ and then $\mathbf{c}^*=-\mathrm{diag}(\mathbf{y})\boldsymbol\lambda^*$. Along with (2.5), we arrive at
$$w^* = -\sum_{i=1}^{n} y_i\lambda_i^*\Phi(X_i) = \sum_{i\in\mathcal{S}} y_i(-\lambda_i^*)\Phi(X_i). \qquad(4.16)$$
In view of (4.16), each $X_i$ with $i\in\mathcal{S}$ is called a support matrix. In addition, again by (4.16), the R-KSMM decision function is given by
$$h(X) = \langle w^*,\Phi(X)\rangle_{\mathcal{F}}+b^* = -\sum_{i\in\mathcal{S}} y_i\lambda_i^*\kappa(X_i,X)+b^*,\qquad \forall\, X\in\mathbb{R}^{p\times q}.$$
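As an illustration of the decision rule above, the following Python sketch (with our own function name rksmm_predict) evaluates $h(X)$ from multipliers such as those returned by Algorithm 1, approximating the support set $\mathcal{S}$ by the indices with nonzero multipliers, since $\lambda_i^*=0$ for every $i\notin\mathcal{S}$.

```python
import numpy as np

def rksmm_predict(X, X_train, y, lam_star, b_star, kernel):
    """Sketch of the R-KSMM decision rule
    h(X) = -sum_{i in S} y_i * lambda_i^* * kappa(X_i, X) + b^*.
    The support set S is approximated here by the nonzero multipliers."""
    support = np.flatnonzero(lam_star)
    h = b_star - sum(y[i] * lam_star[i] * kernel(X_train[i], X) for i in support)
    return np.sign(h), h
```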
5. Numerical experiments
In this section, we evaluate the performance of R-KSMM on real datasets in binary
classification tasks. To this end, we shall compare the proposed R-KSMM (1.4) with
the kernel SMM with the hinge loss (KSMM) [23], SMM with spectral elastic net
regularization (SMM) [16], standard kernel SVM (KSVM) [25], and sparse SMM
(SSMM) [34]. All numerical experiments are implemented in Python 3.12 on a
laptop (11th Gen Intel(R) Core(TM) i7-11850H 2.50 GHz, 32 GB RAM). All codes
and numerical results are available at https://github.com/Rongrong-Lin/RKSMM.
We shall specify kernel functions for SVM and SMM. Two matrix-input kernels
employed for R-KSMM and KSMM are the incomplete polynomial kernel [23] and
CNTK [1], which have been mentioned in the introduction. Given a matrix A, let
Aij be its (i, j)-th entry. The incomplete polynomial kernel constructed in [23] takes
the form:
$$\kappa_s^{d_1,d_2}(A,B) := \Big(\sum_{j=1}^{q}\sum_{i=1}^{p}\big[(A\odot B)*Z\big]_{ij}^{d_1}\Big)^{d_2}, \qquad(5.1)$$
where $A,B\in\mathbb{R}^{p\times q}$ are two input matrices, $d_1, d_2, s$ are all positive integers, and $Z:=[(s-\max\{|i-s|,|j-s|\})_+: i,j\in[2s-1]]$ denotes a two-dimensional spatial filter kernel of size $(2s-1)\times(2s-1)$. The symbol $\odot$ denotes the entry-wise product and $*$ represents a two-dimensional convolution operation. CNTK is an extension
of the neural tangent kernel for convolutional neural networks, where each layer of
the network is viewed as a kernel function transformation, and the composition of
these kernel functions ultimately forms a complex kernel function. There are only
two convolutional layers for CNTK. The size of the first convolutional layer is half
the number of rows in the input matrix, and the size of the second convolutional
layer is half the number of columns in the input matrix. A detailed derivation of
CNTK can be found in [1]. For kernel SVM, each matrix-type sample is vectorized and the Gaussian kernel $\exp(-\rho\|\mathbf{x}-\mathbf{z}\|_2^2)$, $\mathbf{x},\mathbf{z}\in\mathbb{R}^d$, is chosen as its kernel function, where $\rho$ is set to $1/d$.
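For illustration, the incomplete polynomial kernel (5.1) can be evaluated with a standard two-dimensional convolution routine; the sketch below uses scipy.signal.convolve2d, and the 'same'-size boundary handling is our assumption, since (5.1) specifies only the filter $Z$ and the exponents $d_1, d_2$.

```python
import numpy as np
from scipy.signal import convolve2d

def pyramid_filter(s):
    """Spatial filter Z with Z_ij = (s - max{|i-s|, |j-s|})_+ for i, j in [2s-1]."""
    idx = np.arange(1, 2 * s)                        # 1-based indices 1, ..., 2s-1
    I, J = np.meshgrid(idx, idx, indexing="ij")
    return np.maximum(s - np.maximum(np.abs(I - s), np.abs(J - s)), 0).astype(float)

def incomplete_poly_kernel(A, B, d1, d2, s):
    """Sketch of the incomplete polynomial kernel (5.1):
    kappa(A, B) = ( sum_{i,j} [(A . B) * Z]_ij^{d1} )^{d2},
    where . is the entrywise product and * a 2-D convolution."""
    Z = pyramid_filter(s)
    local = convolve2d(A * B, Z, mode="same")        # local correlations
    return float(np.sum(local ** d1) ** d2)
```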
To assess the performance of the aforementioned multiple classifiers, we adopt
three evaluation criteria: Accuracy (Test ACC), F1-score, and Area Under the ROC
Curve (AUC). The accuracy is defined as the proportion of all correctly predicted
samples to all samples. The higher the F1-score and AUC, the better the classi-
fication performance of the model. For more details on classification performance
measures, refer to [32, Chapter 22].
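All three criteria can be computed with standard library routines; the following snippet is a minimal illustration using scikit-learn with toy labels and decision values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels and decision values, purely for illustration.
y_true = np.array([1, -1, 1, 1, -1, -1])
scores = np.array([0.8, -0.3, 0.4, -0.1, -0.9, 0.2])   # real-valued h(X)
y_pred = np.sign(scores)

acc = accuracy_score(y_true, y_pred)   # proportion of correctly predicted samples
f1  = f1_score(y_true, y_pred)         # F1-score of the positive class
auc = roc_auc_score(y_true, scores)    # area under the ROC curve
print(acc, f1, auc)
```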
We evaluate the performance of R-KSMM on six real-world datasets: MNIST (a), Fashion-MNIST (b), ORL (c), INRIA (d), CIFAR-10 (e), and EEG alcoholism (f). Table 1 describes the basic information of these datasets.
(a) https://yann.lecun.com/exdb/mnist/
(b) https://github.com/zalandoresearch/fashion-mnist
(c) https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
(d) https://pascal.inrialpes.fr/data/human
(e) https://www.cs.toronto.edu/~kriz/cifar.html
(f) http://kdd.ics.uci.edu/databases/eeg/eeg.html
Table 1. Summary of six datasets.
Dataset          Type              Classes   Size    Dimension
MNIST            grayscale image   10        70000   28 × 28
Fashion-MNIST    grayscale image   10        70000   28 × 28
INRIA            color image       2         3634    96 × 160
ORL              grayscale image   10        400     112 × 92
CIFAR-10         color image       10        70000   32 × 32
EEG alcoholism   EEG signal        2         120     256 × 64
Except for EEG alcoholism and INRIA, the datasets have multiple classes. For each multi-class dataset, we select the first two classes and their corresponding data to form a binary classification dataset; for example, the handwritten digits 0 and 1 are chosen for binary classification on MNIST. For all image datasets, each image's pixel values are first divided by 255, and each sample is then normalized to follow a standard normal distribution. In addition, color images are converted into grayscale images for training. The EEG dataset is normalized separately so that each feature follows a standard normal distribution.
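A minimal sketch of this image preprocessing, under our reading that each sample is standardized individually after scaling by 255 (the function name and the small epsilon guard are ours):

```python
import numpy as np

def preprocess_images(X):
    """Scale pixel values to [0, 1] by dividing by 255, then standardize each
    sample (image) to zero mean and unit variance.  X has shape (n, height, width)."""
    X = X.astype(np.float64) / 255.0
    mean = X.mean(axis=(1, 2), keepdims=True)
    std = X.std(axis=(1, 2), keepdims=True) + 1e-12   # guard against constant images
    return (X - mean) / std
```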
In our experiments, the parameters are tuned over $C\in\{2^{-2},\dots,2^{3},2^{4}\}$, $\sigma\in\{2^{-2},\dots,2^{3},2^{4}\}$, $d_1\in\{1,2,3,4\}$, $d_2\in\{1,2,3,4\}$, $s\in\{2,3,4,10\}$, and the dual step-size $\iota\in\{10^{-2},10^{-1},0.5,1,1.5\}$, with the maximum number of iterations $N=300$. To ensure the
reliability of the experimental results, each experiment is repeated five times.
The final results are listed in Table 2, where the best result among the seven methods on each dataset is highlighted in bold. First, R-KSMM (CNTK) achieves the best performance on all datasets in terms of Test Acc, F1-score, and AUC. Second, the CNTK with only two convolutional layers is more powerful than the incomplete polynomial kernel in both R-KSMM and KSMM. Finally, the ramp loss is more robust than the hinge loss in kernel SMM models.
6. Conclusion
In this paper, we developed the theoretical and algorithmic analysis of R-KSMM (2.7). Theoretically, the relationship between the proximal stationary point, the Karush-Kuhn-Tucker point and the local minimizer of R-KSMM has been established by means of the proximal operator and the limiting subdifferential of the ramp loss; see Theorems 3.1 and 3.2. Algorithmically, the ADMM method stated as Algorithm 1 is applied to solve R-KSMM, and it is shown that any limit point of the sequence produced by Algorithm 1 is a proximal stationary point of R-KSMM; see Theorem 4.1. Finally, the strength and robustness of the proposed algorithm have been verified on six real datasets in Table 2, which matches our theoretical and algorithmic analysis.
Acknowledgments
Lin was supported in part by the National Natural Science Foundation of China
(12371103) and the Center for Mathematics and Interdisciplinary Sciences, School
of Mathematics and Statistics, Guangdong University of Technology. Feng was
supported in part by the Research Grants Council of Hong Kong (11303821 and
11315522). Liu was supported in part by the Guangdong Basic and Applied Basic
Research Foundation (2023A1515012891).
ORCID
Shihai Chen - https://orcid.org/0009-0009-8259-3674
Han Feng - https://orcid.org/0000-0003-2933-6205
Rongrong Lin - https://orcid.org/0000-0002-6234-2183
Yulan Liu - https://orcid.org/0000-0003-2370-2323
References
[1] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov and R. Wang, On exact
computation with an infinitely wide neural net, Advances in Neural Information
Processing Systems 32 (2019).
[2] A. Barla, F. Odone and A. Verri, Histogram intersection kernel for image clas-
sification, in Proceedings 2003 International Conference on Image Processing, Vol. 3,
(Barcelona, Spain, 2003), pp. III 513–III 516.
Table 2. Results on real datasets.
Dataset          Method                         Train Acc(%)   Test Acc(%)   F1-score   AUC
MNIST            R-KSMM ($\kappa_3^{2,2}$)      100            99.60         99.50      99.61
                 R-KSMM (CNTK)                  100            99.70         100        99.70
                 KSMM ($\kappa_3^{2,2}$)        99.03          98.20         98.24      98.27
                 KSMM (CNTK)                    99.90          99.49         99.90      99.60
                 SSMM                           98.90          98.50         97.97      98.50
                 SMM                            100            99.50         99.50      99.50
                 SVM (Gaussian)                 100            98.60         98.62      98.83
Fashion-MNIST    R-KSMM ($\kappa_3^{2,2}$)      100            98.10         99.27      97.57
                 R-KSMM (CNTK)                  100            98.60         100        97.80
                 KSMM ($\kappa_3^{2,2}$)        100            97.40         97.56      97.53
                 KSMM (CNTK)                    100            97.90         98.75      97.61
                 SSMM                           100            96.50         96.55      96.54
                 SMM                            100            97.40         97.41      97.41
                 SVM (Gaussian)                 100            97.20         97.20      97.22
ORL              R-KSMM ($\kappa_{10}^{2,2}$)   100            95.00         100        96.67
                 R-KSMM (CNTK)                  100            95.00         100        96.67
                 KSMM ($\kappa_{10}^{2,2}$)     100            95.00         93.33      96.67
                 KSMM (CNTK)                    100            95.00         100        90.00
                 SSMM                           100            85.00         80.00      90.00
                 SMM                            100            80.00         66.67      73.33
                 SVM (Gaussian)                 100            85.00         80.00      90.00
INRIA            R-KSMM ($\kappa_{10}^{2,2}$)   100            80.33         96.55      82.29
                 R-KSMM (CNTK)                  100            86.67         100        88.53
                 KSMM ($\kappa_{10}^{2,2}$)     100            74.69         72.43      76.25
                 KSMM (CNTK)                    100            81.33         88.61      83.89
                 SSMM                           100            78.67         79.27      80.66
                 SMM                            100            78.67         78.53      79.37
                 SVM (Gaussian)                 100            80.66         79.27      80.66
CIFAR-10         R-KSMM ($\kappa_4^{2,2}$)      100            73.80         100        75.79
                 R-KSMM (CNTK)                  100            78.90         100        79.50
                 KSMM ($\kappa_4^{2,2}$)        100            67.40         67.03      67.43
                 KSMM (CNTK)                    100            73.30         82.18      75.54
                 SSMM                           100            68.50         67.38      74.62
                 SMM                            100            65.00         64.63      65.02
                 SVM (Gaussian)                 100            65.10         64.84      65.14
EEG alcoholism   R-KSMM ($\kappa_4^{2,2}$)      100            83.33         92.70      87.50
                 R-KSMM (CNTK)                  100            96.67         100        94.29
                 KSMM ($\kappa_4^{2,2}$)        100            67.40         67.03      67.43
                 KSMM (CNTK)                    100            95.00         100        95.24
                 SSMM                           100            83.33         82.35      87.50
                 SMM                            100            83.33         83.88      86.67
                 SVM (Gaussian)                 100            81.66         76.84      86.90
[3] A. Beck, First-order Methods in Optimization (SIAM, 2017).
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., Distributed optimization
and statistical learning via the alternating direction method of multipliers, Found.
Trends Mach. Learn. 3(1) (2011) 1–122.
[5] V. L. Brailovsky, O. Barzilay and R. Shahave, On global, local, mixed and neigh-
borhood kernels for support vector machines, Pattern Recogn. Lett. 20(11-13) (1999)
1183–1190.
[6] J. P. Brooks, Support vector machines with the ramp loss and the hard margin loss,
Operations Res. 59(2) (2011) 467–479.
[7] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-Mazahua and A. Lopez, A compre-
hensive survey on support vector machine classification: Applications, challenges and
trends, Neurocomput. 408 (2020) 189–215.
[8] R. Feng and Y. Xu, Support matrix machine with pinball loss for classification, Neural
Comput. Appl. 34 (2022) 18643–18661.
[9] M. Gu, J. Zheng, H. Pan and J. Tong, Ramp sparse support matrix machine and
its application in roller bearing fault diagnosis, Appl. Soft Comput. 113 (2021) p.
107928.
[10] X. Huang, L. Shi and J. A. Suykens, Ramp loss linear programming support vector
machine, J. Mach. Learn. Res. 15(1) (2014) 2185–2211.
[11] G. Kimeldorf and G. Wahba, Some results on Tchebycheffian spline functions, J.
Math. Anal. Appl. 33(1) (1971) 82–95.
[12] A. Kumari, M. Akhtar, R. Shah and M. Tanveer, Support matrix machine: A review,
Neural Netw. (2024) p. 106767.
[13] F. Lauer and G. Bloch, Incorporating prior knowledge in support vector machines
for classification: A review, Neurocomput. 71(7-9) (2008) 1578–1594.
[14] Z. Li, R. Wang, D. Yu, S. S. Du, W. Hu, R. Salakhutdinov and S. Arora, Enhanced
convolutional neural tangent kernels, arXiv:1911.00809 (2019).
[15] R. Lin, Y. Yao and Y. Liu, Kernel support vector machine classifiers with ℓ0-norm
hinge loss, Neurocomput. 589 (2024) p. 127669.
[16] L. Luo, Y. Xie, Z. Zhang and W.-J. Li, Support matrix machines, in International
Conference on Machine Learning, (Lille, France, 2015), pp. 938–947.
[17] H. Pan, H. Xu, J. Zheng, J. Tong and J. Cheng, Twin robust matrix machine for
intelligent fault identification of outlier samples in roller bearing, Knowl. Based. Syst.
252 (2022) p. 109391.
[18] N. Parikh, S. Boyd et al., Proximal algorithms, Found. Trends Optim. 1(3) (2014)
127–239.
[19] H. Pirsiavash, D. Ramanan and C. Fowlkes, Bilinear classifiers for visual recognition,
in Proceedings of the 22nd International Conference on Neural Information Process-
ing Systems, (Vancouver, Canada, 2009), pp. 1482–1490.
[20] C. Qian, Q. Tran-Dinh, S. Fu, C. Zou and Y. Liu, Robust multicategory support
matrix machines, Math. Program. 176 (2019) 429–463.
[21] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, volume 317 (Springer
Science & Business Media, 2009).
[22] B. Schölkopf, R. Herbrich and A. J. Smola, A generalized representer theorem, in
International Conference on Computational Learning Theory, (Amsterdam, Nether-
lands, 2001), pp. 416–426.
[23] B. Schölkopf, P. Simard, A. Smola and V. Vapnik, Prior knowledge in support vector
kernels, Advances in Neural Information Processing Systems 10 (1997).
[24] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond (MIT Press, Cambridge, 2001).
[25] I. Steinwart and A. Christmann, Support Vector Machines, Information Science and
Statistics (Springer, 2008).
[26] Y. Tan and H. Liu, How does a kernel based on gradients of infinite-width neural
networks come to be widely used: a review of the neural tangent kernel, Int. J.
Multimed. Inf. Retr. 13(1) (2024) p. 8.
[27] H. Wang and Y. Shao, Fast generalized ramp loss support vector machine for pattern
classification, Pattern Recogn. 146 (2024) p. 109987.
[28] H. Wang, Y. Shao and N. Xiu, Proximal operator and optimality conditions for ramp
loss svm, Optim. Lett. 16 (2020) 999–1014.
[29] L. Wolf, H. Jhuang and T. Hazan, Modeling appearances with low-rank SVM, in
2007 IEEE Conference on Computer Vision and Pattern Recognition, (Minneapo-
lis, America, 2007), pp. 1–6.
[30] L. Xu, K. Crammer and D. Schuurmans, Robust support vector machine training via
convex outlier ablation, in Proceedings of the 21st National Conference on Artificial
Intelligence, (Boston, America, 2006), pp. 536–542.
[31] Y. Ye, A nonlinear kernel support matrix machine for matrix learning, Int. J. Mach.
Learn. Cyb. 10(10, SI) (2019) 2725–2738.
[32] M. J. Zaki and W. Meira Jr, Data Mining and Machine Learning: Fundamental
Concepts and Algorithms (Cambridge University Press, 2020).
[33] Q. Zheng, F. Zhu and P.-A. Heng, Robust support matrix machine for single trial
EEG classification, Trans. Neural Syst. Rehab. Eng. 26(3) (2018) 551–562.
[34] Q. Zheng, F. Zhu, J. Qin, B. Chen and P.-A. Heng, Sparse support matrix machine,
Pattern Recogn. 76 (2018) 715–726.
[35] H. Zhou and L. Li, Regularized matrix regression, J. R. Stat. Soc., B: Stat. Methodol.
76(2) (2014) 463–483.