Sparsification of Indefinite Learning Models

Frank-Michael Schleif¹,², Christoph Raab¹, and Peter Tiňo²

¹ University of Applied Sciences Würzburg-Schweinfurt, Department of Computer Science,
97074 Würzburg, Germany
{frank-michael.schleif,christoph.raab}@fhws.de
² University of Birmingham, School of Computer Science, B15 2TT Birmingham, UK
{schleify,p.tino}@cs.bham.ac.uk
Abstract. The recently proposed Kreĭn space Support Vector Machine (KSVM) is an efficient classifier for indefinite learning problems, but it has a non-sparse decision function. This very dense decision function prevents practical applications due to a costly out-of-sample extension. In this paper we provide a post-processing technique to sparsify the obtained decision function of a Kreĭn space SVM and variants thereof. We evaluate the influence of different levels of sparsity and employ a Nyström approach to address large-scale problems. Experiments show that our algorithm is similarly efficient as the non-sparse Kreĭn space Support Vector Machine but with substantially lower costs, such that also large-scale problems can be processed.

Keywords: non-positive kernel · Kreĭn space · sparse model
1 Introduction
Learning of classification models for indefinite kernels has received substantial interest with the advent of domain-specific similarity measures. Indefinite kernels are a severe problem for most kernel-based learning algorithms because classical mathematical assumptions, such as positive definiteness, used in the underlying optimization frameworks are violated. As a consequence, e.g. the classical Support Vector Machine (SVM) [24] no longer has a convex solution - in fact, most standard solvers will not even converge for this problem [9]. Researchers in the fields of e.g. psychology [7], vision [17] and machine learning [2] have criticized the typical restriction to metric similarity measures. In fact, in [2] it is shown that many real-life problems are better addressed by e.g. kernel functions which are not restricted to be based on a metric. Non-metric measures (leading to kernels which are not positive semi-definite (non-psd)) are common in many disciplines. Divergence measures [20] are very popular for spectral data analysis in chemistry, the geosciences and the medical sciences [11], and are in general not metric. Also the popular Dynamic Time Warping (DTW) algorithm provides a non-metric alignment score which is often used as a proximity measure between two one-dimensional functions of different length. In image processing and shape retrieval
indefinite proximities are often obtained by means of the inner distance [8] - another non-metric measure. Further prominent examples of genuinely non-metric proximity measures can be found in the field of bioinformatics, where classical sequence alignment algorithms (e.g. the Smith-Waterman score [5]) produce non-metric proximity values. Multiple authors argue that the non-metric part of the data contains valuable information and should not be removed [17].
Furthermore, it has been shown [9, 18] that work-arounds such as eigenspectrum modifications are often inappropriate or undesirable, due to a loss of information and problems with the out-of-sample extension. A recent survey on indefinite learning is given in [18]. In [9] a stabilization approach was proposed to calculate a valid SVM model in the Kreĭn space which can be directly applied to indefinite kernel matrices. This approach has shown great promise in a number of learning problems but has intrinsically quadratic to cubic complexity and provides a dense decision model. The approach can also be used for the recently proposed indefinite Core Vector Machine (iCVM) [19], which has better scalability but still suffers from a dense model. The initial sparsification approach for the iCVM proposed in [19] is not always applicable, and we provide an alternative in this paper.
Another indefinite SVM formulation was provided in [1], but it is based on an empirical feature space technique, which changes the feature space representation. Additionally, the imposed input dimensionality scales with the number of input samples, which is unattractive for out-of-sample extensions.
The present paper improves on the work of [19] by providing a sparsification approach such that the otherwise very dense decision model becomes sparse again. The new decision function approximates the original one with high accuracy and makes the application of the model practical.

The principle of sparsity constitutes a common paradigm in nature-inspired learning, as discussed e.g. in the seminal work [12]. Interestingly, apart from an improved complexity, sparsity can often serve as a catalyst for the extraction of semantically meaningful entities from data. It is well known that the problem of finding the smallest subset of coefficients such that a set of linear equations can still be fulfilled constitutes an NP-hard problem, being directly related to NP-complete subset selection. We now review the main parts of the Kreĭn space SVM provided in [9], showing why the obtained α-vector is dense. The effect is the same for the Core Vector Machine, as shown in [19]. For details on the iCVM derivation we refer the reader to [19].
2 Kreĭn Space SVM
The Kreĭn Space SVM (KSVM) [9] replaces the classical SVM minimization problem by a stabilization problem in the Kreĭn space. The respective equivalence between the stabilization problem and a standard convex optimization problem was shown in [9]. Let $x_i \in \mathcal{X}$, $i \in \{1, \ldots, N\}$, be training points in the input space $\mathcal{X}$, with labels $y_i \in \{-1, 1\}$ representing the class of each point. The input space $\mathcal{X}$ is often considered to be $\mathbb{R}^d$, but can be any suitable space due to the kernel trick. For a given positive $C$, the SVM solution is the minimum of the following regularized empirical risk functional

  $J_C(f, b) = \min_{f \in \mathcal{H},\, b \in \mathbb{R}} \; \frac{1}{2} \|f\|_{\mathcal{H}}^2 + C\, H(f, b)$    (1)

  $H(f, b) = \sum_{i=1}^{N} \max\bigl(0,\, 1 - y_i (f(x_i) + b)\bigr)$

Writing the solution of Equation (1) as $(f^*_C, b^*_C) := \arg\min J_C(f, b)$, one can introduce $\tau = H(f^*_C, b^*_C)$ and the respective convex quadratic program (QP)

  $\min_{f \in \mathcal{H},\, b \in \mathbb{R}} \; \frac{1}{2} \|f\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \sum_{i=1}^{N} \max\bigl(0,\, 1 - y_i (f(x_i) + b)\bigr) \le \tau$    (2)
where we detail the notation in the following. This QP can also be seen as the problem of retrieving the orthogonal projection of the null function in a Hilbert space $\mathcal{H}$ onto the convex feasible set. The view as a projection will help to link the original SVM formulation in the Hilbert space to a KSVM formulation in the Kreĭn space. First we need a few definitions, largely following [9]. A Kreĭn space is an indefinite inner product space endowed with a Hilbertian topology.
Definition 1 (Inner products and inner product space). Let $\mathcal{K}$ be a real vector space. An indefinite inner product $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ on $\mathcal{K}$ is a bi-linear form where all $f, g, h \in \mathcal{K}$ and $\alpha \in \mathbb{R}$ obey the following conditions: symmetry, $\langle f, g \rangle_{\mathcal{K}} = \langle g, f \rangle_{\mathcal{K}}$; linearity, $\langle \alpha f + g, h \rangle_{\mathcal{K}} = \alpha \langle f, h \rangle_{\mathcal{K}} + \langle g, h \rangle_{\mathcal{K}}$; and $\langle f, g \rangle_{\mathcal{K}} = 0$ for all $g \in \mathcal{K}$ implies $f = 0$.
An inner product is positive definite if $\forall f \in \mathcal{K}$, $\langle f, f \rangle_{\mathcal{K}} \ge 0$, negative definite if $\forall f \in \mathcal{K}$, $\langle f, f \rangle_{\mathcal{K}} \le 0$, and indefinite otherwise. A vector space $\mathcal{K}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ is called an inner product space.
Definition 2 (Kreĭn space and pseudo-Euclidean space). An inner product space $(\mathcal{K}, \langle \cdot, \cdot \rangle_{\mathcal{K}})$ is a Kreĭn space if there exist two Hilbert spaces $\mathcal{H}_+$ and $\mathcal{H}_-$ spanning $\mathcal{K}$ such that $\forall f \in \mathcal{K}$, $f = f_+ + f_-$ with $f_+ \in \mathcal{H}_+$, $f_- \in \mathcal{H}_-$, and $\forall f, g \in \mathcal{K}$, $\langle f, g \rangle_{\mathcal{K}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} - \langle f_-, g_- \rangle_{\mathcal{H}_-}$. A finite-dimensional Kreĭn space is a so-called pseudo-Euclidean space (pE).

If $\mathcal{H}_+$ and $\mathcal{H}_-$ are reproducing kernel Hilbert spaces (RKHS), $\mathcal{K}$ is a reproducing kernel Kreĭn space (RKKS). For details on RKHS and RKKS see e.g. [15]. In this case the uniqueness of the functional decomposition (the nature of the RKHSs $\mathcal{H}_+$ and $\mathcal{H}_-$) is not guaranteed. In [13] the reproducing property is shown for an RKKS $\mathcal{K}$: there is a unique symmetric kernel $k(x, x')$ with $k(x, \cdot) \in \mathcal{K}$ such that the reproducing property holds (for all $f \in \mathcal{K}$, $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{K}}$) and $k = k_+ - k_-$, where $k_+$ and $k_-$ are the reproducing kernels of the RKHSs $\mathcal{H}_+$ and $\mathcal{H}_-$.
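For the finite-dimensional (pseudo-Euclidean) case this decomposition can be made explicit through the eigen-decomposition of the kernel matrix; the following lines are an illustrative sketch of this standard construction, not a formula taken from [13]:

  $K = U D U^\top = \underbrace{U D_+ U^\top}_{K_+} - \underbrace{U |D_-| U^\top}_{K_-}, \qquad D_+ = \max(D, 0), \;\; D_- = \min(D, 0),$

where $K_+$ and $K_-$ are both positive semi-definite and play the roles of $k_+$ and $k_-$, respectively.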
As shown in [13], for any symmetric non-positive kernel $k$ that can be decomposed as the difference of two positive kernels $k_+$ and $k_-$, an RKKS can be associated with it. In [9] it was shown how the classical SVM problem can be reformulated by means of a stabilization problem. This is necessary because a classical norm, as used in Eq. (2), does not exist in the RKKS; instead the norm is reinterpreted as a projection, which still holds in the RKKS and is used as a regularization technique [9]. This allows one to define the SVM in the RKKS (viewed as a Hilbert space) as the orthogonal projection of the null element onto the set [9]:

  $S = \{ f \in \mathcal{K},\, b \in \mathbb{R} \mid H(f, b) \le \tau \}$ and $0 \in \partial_b H(f, b)$,

where $\partial_b$ denotes the sub-differential with respect to $b$. The set $S$ leads to a unique solution for the SVM in a Kreĭn space [9]. As detailed in [9], one finally obtains a stabilization problem which allows one to formulate an SVM in a Kreĭn space:
  $\operatorname{stab}_{f \in \mathcal{K},\, b \in \mathbb{R}} \; \frac{1}{2} \langle f, f \rangle_{\mathcal{K}} \quad \text{s.t.} \quad \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i (f(x_i) + b)\bigr) \le \tau$    (3)
where stab means stabilize, as detailed in the following. In a classical SVM in an RKHS the solution is regularized by minimizing the norm of the function $f$. In Kreĭn spaces, however, minimizing such a norm is meaningless since the dot-product contains both positive and negative components. That is why the regularization of the original SVM through minimizing the norm of $f$ has to be transformed, in the case of Kreĭn spaces, into a min-max formulation, where we jointly minimize the positive part and maximize the negative part of the norm. The authors of [13] termed this operation the stabilization projection, or stabilization. Further mathematical details can be found in [6]. An example illustrating the relations between minimum, maximum and the projection/stabilization problem in the Kreĭn space is given in [9].
In [9] it is further shown that the stabilization problem Eq. (3) can be written as a minimization problem using a semi-definite kernel matrix. By defining a projection operator with transition matrices it is also shown how the dual RKKS problem for the SVM can be related to the dual in the RKHS. We refer the interested reader to [9]. One finally ends up with a flipping operator applied to the eigenvalues of the indefinite kernel matrix¹ $K$ as well as to the $\alpha$ parameters obtained from the stabilization problem in the Kreĭn space, which can be solved using classical optimization tools on the flipped kernel matrix. This permits applying the model obtained in the Kreĭn space directly to the non-positive input kernel without any further modifications. The algorithm is shown in Alg. 1. There are four major steps: 1) an eigen-decomposition of the full kernel matrix, with cubic costs (which can potentially be restricted to a few dominating eigenvalues - referred to as KSVM-L); 2) a flipping operation; 3) the solution of an SVM solver on the modified input matrix; 4) the application of the projection operator obtained from the eigen-decomposition to the $\alpha$ vector of the SVM model. $U$ in Alg. 1 contains the eigenvectors, $D$ is a diagonal matrix of the eigenvalues, and $S$ is a diagonal matrix containing only $\{-1, 1\}$ on the diagonal, as obtained from the sign function.
¹ Obtained by evaluating $k(x, y)$ for training points $x$, $y$.
Algorithm 1 Kreĭn Space SVM (KSVM) - adapted from [9].
  Kreĭn SVM:
  $[U, D]$ := EigenDecomposition($K$)
  $\hat{K} := U S D U^\top$ with $S := \operatorname{sign}(D)$
  $[\alpha, b]$ := SVMSolver($\hat{K}, Y, C$)
  $\tilde\alpha := U S U^\top \alpha$   (now $\tilde\alpha$ is dense)
  return $\tilde\alpha, b$
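To make the four steps concrete, the following Python lines sketch Alg. 1 for a precomputed, possibly indefinite, symmetric kernel matrix K and labels Y. This is an illustrative sketch only, not the reference implementation of [9]: it assumes scikit-learn's SVC with a precomputed kernel as the SVM solver, and the recovered coefficients are the signed dual coefficients $y_i \alpha_i$ as returned by that solver.

import numpy as np
from sklearn.svm import SVC

def ksvm_fit(K, Y, C=1.0):
    # Sketch of Alg. 1 on an indefinite kernel matrix K (N x N) with labels Y.
    # 1) eigen-decomposition of the full kernel matrix
    D, U = np.linalg.eigh(K)                 # K = U diag(D) U^T
    # 2) flipping operation
    S = np.sign(D)
    K_hat = U @ np.diag(S * D) @ U.T         # flipped (psd) kernel U S D U^T
    # 3) SVM solver on the modified (flipped) kernel matrix
    svm = SVC(C=C, kernel='precomputed').fit(K_hat, Y)
    alpha = np.zeros(K.shape[0])
    alpha[svm.support_] = svm.dual_coef_.ravel()   # signed dual coefficients
    b = float(svm.intercept_[0])
    # 4) projection operator applied to alpha -> dense alpha_tilde
    alpha_tilde = U @ np.diag(S) @ U.T @ alpha
    return alpha_tilde, b

New points can then be classified on the original indefinite kernel, e.g. via np.sign(K_test @ alpha_tilde + b), without flipping the test kernel.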
As pointed out in [9], this solver produces an exact solution for the stabilization problem. The main weakness of this algorithm is that it requires the user to pre-compute the whole kernel matrix and to decompose it into eigenvectors/eigenvalues. Further, today's SVM solvers have a theoretical worst-case complexity of $\approx O(N^2)$. The other point to mention is that the final solution $\tilde\alpha$ is not sparse. The iCVM from [19] has a similar derivation and leads to a related decision function, again with a dense $\tilde\alpha$, but with model fitting costs of $\approx O(N)$.
3 Sparsification of iCVM
3.1 Sparsification of iCVM by OMP
We can formalize the objective of approximating the decision function, defined by the $\tilde\alpha$ vector obtained by KSVM or iCVM (both are structurally identical), by a sparse alternative as the following mathematical problem:

  $\min \|\tilde\alpha\|_0 \quad \text{such that} \quad \sum_m \tilde\alpha_m \Phi(x_m)^\top \Phi(x) \approx f(x)$
It is well known that this problem is NP-hard in general, and a variety of approximate solution strategies exist in the literature. Here, we rely on a popular and very efficient approximation offered by orthogonal matching pursuit (OMP) [3, 14]. Given an acceptable error $\epsilon > 0$ or a maximum number $n$ of non-vanishing components of the approximation, a greedy approach is taken: the algorithm iteratively determines the most relevant direction and the optimal coefficient for this axis to minimize the remaining residual error.
Algorithm 2 Orthogonal Matching Pursuit to approximate the $\alpha$ vector.
 1: OMP:
 2: $I := \emptyset$
 3: $r := y := K \tilde\alpha$   % initial residuum (evaluated decision function)
 4: while $|I| < n$ do
 5:   $l_0 := \arg\max_l |[K r]_l|$   % find most relevant direction + index
 6:   $I := I \cup \{l_0\}$   % track relevant indices
 7:   $\tilde\gamma := (K_{\cdot I})^{+} \cdot y$   % restricted (inverse) projection
 8:   $r := y - (K_{\cdot I}) \cdot \tilde\gamma$   % residuum of the approximated decision function
 9: end while
10: return $\tilde\gamma$ (as the new sparse $\tilde\alpha$)
In line 3 of Alg. 2 we define the initial residuum to be the vector $K \tilde\alpha$, as part of the decision function. In line 5 we identify the most contributing dimension (assuming an empirical feature space representation of our kernel - it becomes the dictionary). Then in line 7 we find the current approximation of the sparse $\tilde\alpha$-vector - called $\tilde\gamma$ to avoid confusion - where $^{+}$ indicates the pseudo-inverse. In line 8 we update the residuum by removing the approximated $K \tilde\alpha$ from the original unapproximated one. A Nyström-based approximation of Algorithm 2 is straightforward using the concepts provided in [4].
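A direct NumPy transcription of Alg. 2 might look as follows. It is a sketch only: K is the (possibly Nyström-approximated) kernel matrix acting as dictionary, alpha_tilde is the dense coefficient vector, and np.linalg.pinv plays the role of the pseudo-inverse in line 7.

import numpy as np

def omp_sparsify(K, alpha_tilde, n_nonzero):
    # Sketch of Alg. 2: approximate the dense alpha vector by a sparse one.
    y = K @ alpha_tilde              # evaluated decision function (line 3)
    r = y.copy()                     # initial residuum
    I = []                           # selected indices (line 2)
    gamma = np.zeros(0)
    while len(I) < n_nonzero:
        l0 = int(np.argmax(np.abs(K @ r)))   # most relevant direction (line 5)
        if l0 in I:                          # numerical safeguard only
            break
        I.append(l0)                         # line 6
        gamma = np.linalg.pinv(K[:, I]) @ y  # restricted projection (line 7)
        r = y - K[:, I] @ gamma              # updated residuum (line 8)
    alpha_sparse = np.zeros_like(alpha_tilde)
    alpha_sparse[I] = gamma                  # the new sparse alpha (line 10)
    return alpha_sparse

The decision function then only requires the kernel columns of the few selected points, $f(x) \approx \sum_{m \in I} \tilde\gamma_m k(x, x_m) + b$.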
3.2 Sparsification of iCVM by late subsampling
The parameters $\tilde\alpha$ are dense, as already noticed in [9]. A naive sparsification using only the $\tilde\alpha_i$ with large absolute magnitude is not possible, as can easily be checked by counter-examples. One may instead approximate $\tilde\alpha$ by using the (for this scenario slightly modified) OMP algorithm from the former section, or by the following strategy; both are compared in the experiments.

As a second sparsification strategy we can use the approach suggested by Tiňo et al. [19], restricting the projection operator, and hence the transformation matrix of the iCVM, to a subset of the original training data. We refer to this approach as iCVM-sparse-sub.

To get a consistent solution we have to recalculate parts of the eigen-decomposition, as shown in Alg. 3. To obtain the respective subset of the training data we use the samples which are core vectors². The number of core vectors is guaranteed to be very small [22], and hence even for a larger number of classes the solution remains widely sparse. The suggested approach is given in Alg. 3.
Algorithm 3 Sparsification of iCVM by late subsampling
 1: Sparse iCVM:
 2: Apply iCVM - see [19]
 3: $\zeta$ - vector of projection points given by the core set points
 4: construct a reduced kernel $K'$ using the indices $\zeta$, denoted $\bar{K}$
 5: $[U, D]$ := EigenDecomposition($\bar{K}$)
 6: $\bar\alpha := U S U^\top \alpha$ with $S := \operatorname{sign}(D)$ and $U$ restricted to the core set indices
 7: $\tilde\alpha := 0$; $\tilde\alpha_\zeta := \bar\alpha$   % assign $\bar\alpha$ to $\tilde\alpha$ using the indices of $\zeta$
 8: $b := Y \tilde\alpha^\top$   % recalculate the bias using the (now) sparse $\tilde\alpha$
 9: return $\tilde\alpha, b$
We assume that the original projection function (line 6 of Algorithm 3, detailed in [9]) is smooth and can potentially be restricted to a small number of construction points with low error. We observed that, in general, few construction points are sufficient to keep high accuracy, as seen in the experiments.
² A similar strategy for KSVM may be possible but is much more complicated because typically quite many points are support vectors and special sparse SVM solvers would be necessary.
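Since Alg. 3 leaves some freedom in how the reduced eigen-decomposition and the bias are recomputed, the following Python lines give one possible reading of the late subsampling strategy; core_idx (the core vector indices) and the simple residual-mean bias update are our own assumptions here, not prescriptions from [19].

import numpy as np

def icvm_sparsify_sub(K, Y, alpha, core_idx):
    # One possible reading of Alg. 3 (sketch): restrict the projection to the core set.
    zeta = np.asarray(core_idx)                      # core vector indices (line 3)
    K_bar = K[np.ix_(zeta, zeta)]                    # reduced kernel (line 4)
    D, U = np.linalg.eigh(K_bar)                     # eigen-decomposition (line 5)
    S = np.sign(D)
    alpha_bar = U @ np.diag(S) @ U.T @ alpha[zeta]   # restricted projection (line 6)
    alpha_tilde = np.zeros(K.shape[0])
    alpha_tilde[zeta] = alpha_bar                    # sparse coefficients (line 7)
    # simple bias recalculation from the training residuals (our choice, line 8)
    b = float(np.mean(Y - K[:, zeta] @ alpha_bar))
    return alpha_tilde, b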
4 Experiments
This part contains a series of experiments showing that our approach leads to a substantially lower complexity while keeping a prediction accuracy similar to the non-sparse approach. To allow for large datasets without too much hassle we provide sparse results only for the iCVM. The modified OMP approach also works for a sparse KSVM, but the late subsampling sparsification is not well suited if many support vectors are present in the original model, which would require a sparse SVM implementation. We follow the experimental design given in [9]. Methods that require modifying the test data are excluded, as also done in [9]. Finally we compare the experimental complexity of the different solvers. The data sets used are summarized in Table 1.
Dataset        #samples   proximity measure and data source
Sonatas        1068       normalized compression distance on midi files [18]
Delft          1500       dynamic time warping [18]
a1a            1605       tanh kernel [10]
zongker        2000       template matching on handwritten digits [16]
prodom         2604       pairwise structural alignment on proteins [16]
PolydistH57    4000       Hausdorff distance [16]
chromo         4200       edit distance on chromosomes [16]
Mushrooms      8124       tanh kernel [21]
swiss-10k      10k        Smith-Waterman alignment on protein sequences [18]
checker-100k   100,000    tanh kernel (indefinite)
skin           245,057    tanh kernel (indefinite) [23]
checker        1 million  tanh kernel (indefinite)

Table 1. Overview of the different datasets. We provide the dataset size (N) and the origin of the indefiniteness. For vectorial data the indefiniteness is caused artificially by using the tanh kernel.
Additional larger data sets have been added to motivate our approach in the line of learning with large-scale indefinite kernels.
4.1 Experimental setting
For each dataset, we ran the following procedure 20 times: a random split to produce a training and a testing set, a 5-fold cross-validation to tune each parameter (the number of parameters depending on the method) on the training set, and the evaluation on the testing set. If $N > 1000$ we use $m = 200$ randomly chosen landmarks from the given classes. If the input data are vectorial, we used a tanh kernel with parameters [1, 1] to obtain an indefinite kernel.
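For the vectorial data sets the indefinite kernel can be computed as below; this sketch assumes that the parameters [1, 1] denote the scale and offset of the tanh (sigmoid) kernel.

import numpy as np

def tanh_kernel(X, Z=None, scale=1.0, offset=1.0):
    # Indefinite tanh (sigmoid) kernel: k(x, z) = tanh(scale * <x, z> + offset).
    Z = X if Z is None else Z
    return np.tanh(scale * (X @ Z.T) + offset)

# K_train = tanh_kernel(X_train); K_test = tanh_kernel(X_test, X_train)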
4.2 Results
Significant differences of iCVM with respect to the best result are indicated by a ⋆ (ANOVA, $p < 5\%$). In Table 2 we show the results for large-scale data (having at least 1000 points) using iCVM with sparsification. We observe much smaller models, especially for larger datasets, with often comparable prediction accuracy with respect to the non-sparse model. The runtimes are similar to the non-sparse case but in general slightly higher due to the extra eigen-decompositions on a reduced set of the data, as shown in Algorithm 3.
               iCVM (sparse-sub)   pts      iCVM (sparse-OMP)   iCVM (non-sparse)
Sonatas        12.64 ± 1.71        76.84%   22.56 ± 4.16        13.01 ± 3.82
Delft          16.53 ± 2.79        52.48%    3.27 ± 0.6          3.20 ± 0.84
a1a            39.50 ± 2.88         1.25%   27.85 ± 2.8         20.56 ± 1.34
zongker        29.20 ± 2.48        52.81%    7.50 ± 1.7          6.40 ± 2.11
prodom          2.89 ± 1.17        26.31%    3.12 ± 0.11         0.87 ± 0.64
PolydistH57     6.12 ± 1.38        12.92%   29.35 ± 8            0.70 ± 0.19
chromo         11.50 ± 1.17        33.76%    3.74 ± 0.58         6.10 ± 0.63
Mushrooms       7.84 ± 2.21         6.46%   18.39 ± 5.7          2.54 ± 0.56
swiss-10k      35.90 ± 2.52        17.03%    6.73 ± 0.72        12.08 ± 3.47
checker-100k    8.54 ± 2.35         2.26%   19.54 ± 2.1          9.66 ± 2.32
skin            9.38 ± 3.30         0.06%    9.43 ± 2.41         4.22 ± 1.11
checker         8.94 ± 0.84         0.24%    1.44 ± 0.3          9.38 ± 2.73

Table 2. Prediction errors on the test sets. The percentage of projection points (pts) is calculated using the unique set of core vectors over all classes in comparison to all training points. All sparse-OMP models use only 10 points in the final models. Best results are shown in bold. Best sparse results are underlined. Datasets with substantially reduced prediction accuracy are marked.
A typical result for the protein data set using the OMP-sparsity technique
and various values for sparsity is shown in Figure 1.
4.3 Complexity analysis
The original KSVM has runtime costs (with a full eigen-decomposition) of $O(N^3)$ and memory costs of $O(N^2)$, where $N$ is the number of points. The iCVM involves an extra Nyström approximation of the kernel matrix to obtain $K_{N,m}$ and $K^{-1}_{m,m}$, if not already given. If we have $m$ landmarks, $m \ll N$, this gives memory costs of $O(N \cdot m)$ for the first matrix and $O(m^3)$ for the second, due to the matrix inversion. Further, a Nyström-approximated eigen-decomposition has to be done to apply the eigenspectrum flipping operator. This leads to runtime costs of $O(N \cdot m^2)$. The runtime costs for the sparse iCVM are $O(N \cdot m^2)$, and the memory complexity is the same as for the iCVM. Due to the Nyström approximation the aforementioned costs only hold if $m \ll N$, which is the case for many datasets, as shown in the experiments.
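The Nyström building blocks referred to above can be sketched as follows; the random landmark choice and the use of a pseudo-inverse for the landmark block are assumptions of this illustration. Here kernel is any two-argument kernel function, e.g. the tanh_kernel sketched earlier.

import numpy as np

def nystroem_blocks(kernel, X, m, seed=0):
    # Sketch: Nystroem factors with memory O(N*m) + O(m^2) instead of O(N^2),
    # so that K is approximated as K_Nm @ K_mm_inv @ K_Nm.T.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)   # m landmarks
    K_Nm = kernel(X, X[idx])                               # N x m block, memory O(N*m)
    K_mm_inv = np.linalg.pinv(kernel(X[idx], X[idx]))      # m x m inverse, runtime O(m^3)
    return K_Nm, K_mm_inv

# a single approximated kernel value costs O(m^2):
# k_ij_approx = K_Nm[i] @ K_mm_inv @ K_Nm[j]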
The application of a new point to a KSVM or iCVM model requires the calculation of kernel similarities to all $N$ training points; for the sparse iCVM this holds only in the worst case. In general the sparse iCVM provides a simpler out-of-sample extension, as shown in Table 2, but this is data dependent.
[Figure 1: test accuracy (y-axis, 0.3-1) versus sparsity level (x-axis, 1-20) for the sparse and the non-sparse model.]
Fig. 1. Prediction results for the protein dataset using a varying level of sparsity and the OMP sparsity method. For comparison, the prediction accuracy of the non-sparse model is shown as a straight line.
The (i)CVM model generation has no more than $N$ iterations, or even a constant number of 59 points if the probabilistic sampling trick is used [22]. As shown in [22], the classical CVM has runtime costs of $O(1/\epsilon^2)$ for a given approximation precision $\epsilon$. The evaluation of a kernel function using the Nyström-approximated kernel can be done at a cost of $O(m^2)$, in contrast to constant costs if the full kernel is available. Accordingly, if we assume $m \ll N$, the overall runtime and memory complexity of the iCVM is linear in $N$; this is two orders of magnitude less than for KSVM for reasonably large $N$ and for low-rank input kernels.
5 Discussions and Conclusions
As discussed in [9], there is no good reason to enforce positive definiteness in kernel methods. A very detailed discussion of the reasons for using KSVM or iCVM is given in [9], explaining why a number of alternatives or pre-processing techniques are in general inappropriate. Our experimental results show that an appropriate Kreĭn space model provides very good prediction results, and using one of the proposed sparsification strategies this can, in most cases, also be achieved with a sparse model. The proposed iCVM-sparse-OMP is only slightly better than the iCVM-sparse-sub model with respect to the prediction accuracy, but it uses very few final modelling vectors while remaining at least competitive in the vast majority of data sets. As is the case for KSVM, the presented approach can be applied without the need to transform test points, which is a desirable property for practical applications. In future work we will analyse other indefinite kernel approaches such as kernel regression and one-class classification.
Acknowledgment: We would like to thank Gaelle Bonnet-Loosli for providing support with the Kreĭn Space SVM.
References

1. Ibrahim M. Alabdulmohsin, Moustapha Cissé, Xin Gao, and Xiangliang Zhang. Large margin classification with indefinite similarities. Machine Learning, 103(2):215–237, 2016.
2. Robert P. W. Duin and Elzbieta Pekalska. Non-Euclidean dissimilarities: Causes and informativeness. In SSPR&SPR 2010, pages 324–333, 2010.
3. Geoffrey M. Davis, Stéphane G. Mallat, and Zhifeng Zhang. Adaptive time-frequency decompositions. SPIE Journal of Optical Engineering, 33(1):2183–2191, 1994.
4. Andrej Gisbrecht and Frank-Michael Schleif. Metric and non-metric proximity transformations at linear costs. Neurocomputing, 167:643–657, 2015.
5. Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
6. Babak Hassibi. Indefinite metric spaces in estimation, control and adaptive filtering. PhD thesis, Stanford Univ., Dept. of Elec. Eng., Stanford, 1996.
7. C. J. Hodgetts and U. Hahn. Similarity-based asymmetries in perceptual matching. Acta Psychologica, 139(2):291–299, 2012.
8. Haibin Ling and David W. Jacobs. Shape classification using the inner-distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):286–299, 2007.
9. G. Loosli, S. Canu, and C. S. Ong. Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1204–1216, June 2016.
10. R. Luss and A. d'Aspremont. Support vector machine classification with indefinite kernels. Mathematical Programming Computation, 1(2-3):97–118, 2009.
11. E. Mwebaze, P. Schneider, F.-M. Schleif, et al. Divergence based classification in learning vector quantization. Neurocomputing, 74:1429–1435, 2010.
12. Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
13. Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with non-positive kernels. In ICML 2004, 2004.
14. Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proc. 27th Asilomar Conference on Signals, Systems & Computers, pages 40–44 vol. 1, November 1993.
15. E. Pekalska and R. Duin. The Dissimilarity Representation for Pattern Recognition. World Scientific, 2005.
16. E. Pekalska and B. Haasdonk. Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(6):1017–1031, 2009.
17. Walter J. Scheirer, Michael J. Wilber, Michael Eckmann, and Terrance E. Boult. Good recognition is non-metric. Pattern Recognition, 47(8):2721–2731, 2014.
18. Frank-Michael Schleif and Peter Tiňo. Indefinite proximity learning: A review. Neural Computation, 27(10):2039–2096, 2015.
19. Frank-Michael Schleif and Peter Tiňo. Indefinite core vector machine. Pattern Recognition, 71:187–195, 2017.
20. D. Schnitzer, A. Flexer, and G. Widmer. A fast audio similarity retrieval method for millions of music tracks. Multimedia Tools and Applications, 58(1):23–40, 2012.
21. A. Srisuphab and J. L. Mitrpanont. Gaussian kernel approx. algorithm for feedforward nn design. Appl. Math. and Comp., 215(7):2686–2693, 2009.
22. Ivor Wai-Hung Tsang, James Tin-Yau Kwok, and Jacek M. Zurada. Generalized core vector machines. IEEE TNN, 17(5):1126–1140, 2006.
23. UCI. Skin segmentation database, March 2016.
24. V. N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, 2000.