Gaussian Process Classification for
Segmenting and Annotating Sequences
Yasemin Altun altun@cs.brown.edu
Department of Computer Science, Brown University, Providence, RI 02912 USA
Thomas Hofmann th@cs.brown.edu
Department of Computer Science, Brown University, Providence, RI 02912 USA
Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany
Alexander J. Smola Alex.Smola@anu.edu.au
Machine Learning Group, RSISE, Australian National University, Canberra, ACT 0200, Australia
Abstract
Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with prediction of a sequence of labels for a sequence of observations. Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.
1. Introduction
Multiclass classification refers to the problem of assigning class labels to instances where labels belong to some finite set of elements. Often, however, the instances to be labeled do not occur in isolation, but rather in observation sequences. One is then interested in predicting the joint label configuration, i.e. the sequence of labels corresponding to a sequence of
observations, using models that take possible interdependencies between label variables into account. This scenario subsumes problems of sequence segmentation and annotation, which are ubiquitous in areas such as natural language processing, speech recognition, and computational biology.
The most common approach to sequence labeling is based on Hidden Markov Models (HMMs), which define a generative probabilistic model for labeled observation sequences. In recent years, the state-of-the-art method for sequence learning has been Conditional Random Fields (CRFs), introduced by Lafferty et al. (2001). In most general terms, CRFs define a conditional model over label sequences given an observation sequence in terms of an exponential family; they are thus a natural generalization of logistic regression to the problem of label sequence prediction. Other related work on this subject includes Maximum Entropy Markov models (McCallum et al., 2000) and the Markovian model of (Punyakanok & Roth, 2000). There have also been attempts to extend other discriminative methods such as AdaBoost (Altun et al., 2003a), perceptron learning (Collins, 2002), and Support Vector Machines (SVMs) (Altun et al., 2003b; Taskar et al., 2004) to the label sequence learning problem. The latter have compared favorably in experiments to other discriminative methods, including CRFs. Moreover, they have the conceptual advantage of being compatible with implicit data representations via kernel functions.
In this paper, we investigate the use of Gaussian Process (GP) classification (Gibbs & MacKay, 2000; Williams & Barber, 1998) for label sequences. The main motivation for pursuing this direction is to combine the best of both worlds from CRFs and SVMs. More specifically, we would like to preserve the main strength of CRFs, which we see in their rigorous probabilistic semantics. There are two important advantages of a probabilistic model. First, it is very intuitive to incorporate prior knowledge within a probabilistic framework. Second, in addition to predicting the best labels, one can compute posterior label probabilities and thus derive confidence scores for predictions. This is a valuable property, in particular for applications requiring a cascaded architecture of classifiers: confidence scores can be propagated to subsequent processing stages or used to abstain on certain predictions. The other design goal is the ability to use kernel functions in order to construct and learn in Reproducing Kernel Hilbert Spaces (RKHS), thereby overcoming the limitations of (finite-dimensional) parametric statistical models.
A second, independent objective of our work is to gain clarification with respect to two aspects on which CRFs and the SVM-based methods differ: the first aspect is the loss function (logistic loss vs. hinge loss), and the second is the mechanism used for constructing the hypothesis space (parametric vs. RKHS).
GPs are non-parametric tools for Bayesian inference which, like SVMs, make use of the kernel trick to work in high (possibly infinite) dimensional spaces. Like other discriminative methods, GPs predict single variables and do not take into account any dependency structure in the case of multiple label predictions. Our goal is to generalize GPs to predict label sequences. While computationally demanding, recent progress on sparse approximation methods for GPs, e.g. (Csató & Opper, 2002; Smola & Bartlett, 2000; Seeger et al., 2003), suggests that scalable GP label sequence learning may be an achievable goal. Exploiting the compositionality of the kernel function, we derive a gradient-based optimization method for GP sequence classification. Moreover, we present a column generation algorithm that performs a sparse approximation of the solution.
The rest of the paper is organized as follows: In Section 2, we introduce Gaussian Process classification. Then, we present our formulation of Gaussian Process sequence classification (GPSC) in Section 3 and describe the proposed optimization algorithms in Section 4. Finally, we report experimental results on real-world data for named entity classification and pitch accent prediction in Section 5.
2. Gaussian Process Classification
In supervised classification, we are given a training set of $n$ labeled instances or observations $(x_i, y_i)$ with $y_i \in \{1,\dots,m\}$, drawn i.i.d. from an unknown, but fixed, joint probability distribution $p(x, y)$. We denote the training observations and labels by $X = (x_1,\dots,x_n)$ and $\mathbf{y} = (y_1,\dots,y_n)$, respectively.
GP classification constructs a two-stage model for the conditional probability distribution $p(y|x)$ by introducing an intermediate, unobserved stochastic process $u \equiv (u(x, y))$, where $u(x, y)$ can be considered a compatibility measure of an observation $x$ and a label $y$. Given an instantiation of the stochastic process, we assume that the conditional probability $p(y|x, u)$ only depends on the values of $u$ at the input $x$ via a multinomial response model, i.e.

$$
p(y|x, u) = p(y|u(x, \cdot)) = \frac{\exp(u(x, y))}{\sum_{y'=1}^{m} \exp(u(x, y'))} \qquad (1)
$$
It is furthermore assumed that the stochastic process $u$ is a zero-mean Gaussian process with covariance function $C$, typically a kernel function. An additional assumption typically made in multiclass GP classification is that the processes $u(\cdot, y)$ and $u(\cdot, y')$ are uncorrelated for $y \neq y'$ (Williams & Barber, 1998).

For notational convenience, we will identify $u$ with the relevant restriction of $u$ to the training patterns $X$ and represent it as an $n \times m$ matrix. For simplicity we will (in slight abuse of notation) also think of $u$ as a vector with multi-index $(i, y)$. Moreover, we will denote by $K$ the kernel matrix with entries¹ $K_{(i,y),(j,y')} = C((x_i, y), (x_j, y'))$. Notice that under the above assumptions $K$ has a block-diagonal structure with blocks $K(y) = (K_{ij}(y))$, $K_{ij}(y) \equiv C_y(x_i, x_j)$, where $C_y$ is a class-specific covariance function.

¹ Here and below, we make extensive use of multi-indices. We put parentheses around a comma-separated list of indices to denote a multi-index and use two comma-separated multi-indices to refer to matrix elements.
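To make the setup above concrete, the following minimal sketch (our own illustration, not the authors' code) assembles a block-diagonal kernel matrix from a single covariance function shared across classes and evaluates the multinomial response of Eq. (1); the RBF covariance, the class-contiguous ordering of the multi-index, and all function names are our assumptions.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Plain RBF covariance between two sets of row-vector inputs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def block_diag_kernel(X, m, cov=rbf):
    """Block-diagonal K: one n x n block per class y (classes stored contiguously)."""
    n = X.shape[0]
    K = np.zeros((n * m, n * m))
    for y in range(m):
        K[y * n:(y + 1) * n, y * n:(y + 1) * n] = cov(X, X)
    return K

def softmax_response(u_x):
    """Eq. (1): p(y | u(x, .)) for one input, given the m values u(x, y)."""
    z = np.exp(u_x - u_x.max())          # subtract max for numerical stability
    return z / z.sum()
```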
Following a Bayesian approach, the prediction of a label for a new observation $x$ is obtained by computing the posterior probability distribution over labels and selecting the label that has the highest probability:

$$
p(y|X, \mathbf{y}, x) = \int p(y|u(x, \cdot))\, p(u|X, \mathbf{y})\, du \qquad (2)
$$

Thus, one needs to integrate out all $n \cdot m$ latent variables of $u$. Since this is in general intractable, it is common to perform a saddle-point approximation of the integral around the optimal point estimate, which is the maximum a posteriori (MAP) estimate: $p(y|X, \mathbf{y}, x) \approx p(y|u^{map}(x, \cdot))$, where $u^{map} = \arg\max_u \log p(u|X, \mathbf{y})$. Exploiting the conditional independence assumptions, the posterior of $u$ can, up to a multiplicative constant, be written as

$$
p(u|X, \mathbf{y}) \propto p(u) \prod_{i=1}^{n} p(y_i|u(x_i, \cdot)) \qquad (3)
$$

Combining the GP prior over $u$ and the conditional model in (1) yields the more specific expression

$$
\log p(u|X, \mathbf{y}) = \sum_{i=1}^{n} \left[ u(x_i, y_i) - \log \sum_{y} \exp(u(x_i, y)) \right] - \frac{1}{2} u^T K^{-1} u + \text{const.} \qquad (4)
$$
The Representer Theorem (Kimeldorf & Wahba, 1971) guarantees that the maximizer of (4) is of the form

$$
u^{map}(x_i, y) = \sum_{j=1}^{n} \sum_{y'=1}^{m} \alpha_{(j,y')}\, K_{(i,y),(j,y')} \qquad (5)
$$

with suitably chosen coefficients $\alpha$. In the block-diagonal case, $K_{(i,y),(j,y')} = 0$ for $y \neq y'$ and this reduces to the simpler form $u^{map}(x_i, y) = \sum_{j=1}^{n} \alpha_{(j,y)}\, C_y(x_i, x_j)$.
Using the representation in (5), we can rewrite the optimization problem as an objective $R$ parameterized by $\alpha$. Let $e_{(i,y)}$ be the $(i, y)$-th unit vector, so that $\alpha^T K e_{(i,y)} = \sum_{j,y'} \alpha_{(j,y')} K_{(i,y),(j,y')}$; the negative of Eq. (4) can then be written as follows:

$$
R(\alpha|X, \mathbf{y}) = \alpha^T K \alpha - \sum_{i=1}^{n} \log p(y_i|x_i, \alpha) \qquad (6)
$$
$$
= \alpha^T K \alpha - \sum_{i=1}^{n} \alpha^T K e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y} \exp\!\left(\alpha^T K e_{(i,y)}\right)
$$
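For concreteness, here is a small sketch of how the objective in (6) could be evaluated for given coefficients; the flattening of the multi-index $(i, y)$ as $i \cdot m + y$ and the function name are our own conventions, not part of the paper.

```python
import numpy as np

def objective_R(alpha, K, y, n, m):
    """Negative log-posterior of Eq. (6), up to constants.

    alpha and the rows/columns of K carry the multi-index (i, y),
    flattened here as i * m + y; y[i] in {0, ..., m-1} is the training label.
    """
    u = K @ alpha                          # u[(i, y)] = (K alpha)_(i,y), cf. Eq. (5)
    U = u.reshape(n, m)                    # row i holds u(x_i, .)
    quad = alpha @ K @ alpha               # alpha^T K alpha
    fit = U[np.arange(n), y].sum()         # sum_i u(x_i, y_i)
    m_i = U.max(axis=1, keepdims=True)     # stabilized log-sum-exp per instance
    logZ = (np.log(np.exp(U - m_i).sum(axis=1)) + m_i[:, 0]).sum()
    return quad - fit + logZ
```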
A comparison between (6) and a similar multiclass SVM formulation (Crammer & Singer, 2001; Weston & Watkins, 1999) clarifies the connection between GP classification and SVMs. Their difference lies primarily in the utilized loss functions: logistic loss vs. hinge loss. Because the hinge loss truncates values smaller than $\epsilon$ to 0, it enforces sparseness in terms of the $\alpha$ parameters. This is not the case for logistic regression, nor for various other choices of loss function.²

² Several studies have focused on finding sparse solutions of Eq. (6) or of optimization problems similar to Eq. (6) (Bennett et al., 2002; Girosi, 1997; Smola & Schölkopf, 2000).
For non-linear link functions like the one induced by Eq. (1), $u^{map}$ cannot be found analytically and one has to resort to approximate solutions. Various approximation schemes have been studied to that end: Laplace approximation (Williams & Barber, 1998; Williams & Seeger, 2000), variational methods (Jaakkola & Jordan, 1996), mean field approximations (Opper & Winther, 2000), and expectation propagation (Minka, 2001; Seeger et al., 2003). Performing these methods usually involves the computation of the Hessian matrix as well as the inversion of $K$, an $nm \times nm$ matrix, which is not tractable for large data sets (of size $n$) and/or large label sets (of size $m$). Several techniques have been proposed to approximate $K$ such that the inversion of the approximating matrix is tractable (cf. (Schölkopf & Smola, 2002) for references on such methods). One can also try to solve (6) using greedy optimization methods as proposed in (Bennett et al., 2002).
3. GP Sequence Classification (GPSC)
3.1. Sequence Labeling and GPC
In sequence classification, our goal is to learn a discriminant function for sequences, i.e. a mapping from observation sequences $X = (x^1, x^2, \dots, x^t, \dots, x^T)$ to label sequences $\mathbf{y} = (y^1, y^2, \dots, y^t, \dots, y^T)$. There exists a label $y^t \in \Sigma = \{1,\dots,r\}$ for every observation $x^t$ in the sequence. Thus, we have $T$ multiclass classification problems. Because of the sequence structure of the labels (i.e. every label $y^t$ depends on its neighboring labels), one needs to solve these $T$ classification problems jointly. Then, the problem can be considered as multiclass classification where, for an observation sequence of length $l$, the possible label set $\mathcal{Y}$ is of size $m = r^l$.³ We call $\Sigma$ the label set of observations, or micro-label set, and $\mathcal{Y}$ the set of label sequences of observation sequences, or macro-label set.

We assume that a training set of $n$ labeled sequences $Z \equiv \{(X_i, \mathbf{y}_i)\,|\, i = 1,\dots,n\}$ is available. Using the notation introduced in the context of GP classification, we define $p(\mathbf{y}_i|u(X_i))$ as in (1), treating every macro-label as a separate label in GP multiclass classification and using the whole sequence $X_i$ as the input.

³ For notational convenience we will assume that all training sequences are of the same length $l$.
3.2. Kernels for Labeled Sequences
The fundamental design decision is then the engineering of the kernel function $k$ that determines the kernel matrix $K$. Notice that the use of a block-diagonal kernel matrix is not an option in the current setting, since it would prohibit generalizing across label sequences that differ in as little as a single micro-label.
We define the kernel function for labeled sequences with respect to the feature representation. Inspired by HMMs, we use two types of features: features that capture the dependency of the micro-labels on the attributes of the observations, $\Phi(x^s)$, and features that capture the inter-dependency of micro-labels. As in other discriminative methods, $\Phi(x^s)$ can include overlapping attributes of $x^s$ as well as attributes of observations $x^t$ where $t \neq s$. Using stationarity, the inner product between the feature vectors of two observation sequences can be stated as $k = k_1 + k_2$, where

$$
k_1((X, \mathbf{y}), (\bar{X}, \bar{\mathbf{y}})) \equiv \sum_{s,t} [[y^s = \bar{y}^t]]\; \langle \Phi(x^s), \Phi(\bar{x}^t) \rangle \qquad (7a)
$$
$$
k_2((X, \mathbf{y}), (\bar{X}, \bar{\mathbf{y}})) \equiv \sum_{s,t} [[y^s = \bar{y}^t \wedge y^{s+1} = \bar{y}^{t+1}]] \qquad (7b)
$$

$k_1$ couples observations in both sequences that are classified with the same micro-labels at the respective positions. $k_2$ simply counts the number of consecutive label pairs both label sequences have in common (irrespective of the inputs). One can generalize (7) in various ways, e.g. by using higher-order terms between micro-labels in both contributions, without posing major conceptual challenges.
$k$ is a linear kernel function for labeled sequences. It can be generalized to non-linear kernel functions for labeled sequences by replacing $\langle \Phi(x^s), \Phi(\bar{x}^t) \rangle$ with a standard kernel function defined over input patterns.
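A minimal sketch of the two kernel contributions in (7), under the assumption that micro-labels are given as integer arrays and observation features as one-row-per-position matrices; the function names are ours.

```python
import numpy as np

def k1(Phi_X, y, Phi_Xbar, ybar):
    """Eq. (7a): observation term; Phi_* are (length, d) feature matrices."""
    y, ybar = np.asarray(y), np.asarray(ybar)
    inner = Phi_X @ Phi_Xbar.T                    # <Phi(x^s), Phi(xbar^t)>
    same = (y[:, None] == ybar[None, :])          # [[y^s == ybar^t]]
    return float((inner * same).sum())

def k2(y, ybar):
    """Eq. (7b): number of position pairs (s, t) sharing the same label bigram."""
    total = 0
    for s in range(len(y) - 1):
        for t in range(len(ybar) - 1):
            total += (y[s] == ybar[t]) and (y[s + 1] == ybar[t + 1])
    return total

def k_labeled(Phi_X, y, Phi_Xbar, ybar):
    """k = k1 + k2 from Eq. (7); a nonlinear variant would replace the inner
    product in k1 by a standard kernel on the observations."""
    return k1(Phi_X, y, Phi_Xbar, ybar) + k2(y, ybar)
```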
We can naively follow the same line of argumentation as in the GPC case of Section 2, invoke the Representer Theorem, and ultimately arrive at the objective in (6). Since we need it for subsequent derivations, we restate the objective here:

$$
R(\alpha|Z) = \alpha^T K \alpha - \sum_{i=1}^{n} \alpha^T K e_{(i,\mathbf{y}_i)} + \sum_{i=1}^{n} \log \sum_{\mathbf{y} \in \mathcal{Y}} \exp\!\left(\alpha^T K e_{(i,\mathbf{y})}\right) \qquad (8)
$$
Notice that in the third term, the sum ranges over the macro-label set $\mathcal{Y}$, which grows exponentially in the sequence length. Therefore, this view suffers from the large cardinality of $\mathcal{Y}$. In order to re-establish tractability of this formulation, we use a trick similar to the one deployed in (Taskar et al., 2004) and reparametrize the objective in terms of an equivalent lower-dimensional set of parameters. The crucial observation is that the definition of $k$ in (7) is homogeneous (or stationary); thus, the absolute positions of patterns and labels in the sequence are irrelevant. This observation can be exploited by re-arranging the sums inside the kernel function with the outer sums, i.e. the sums in the objective function.
3.3. Exploiting Kernel Structure
In order to carry out this reparameterization more formally, we proceed in two steps. The first step consists of finding an appropriate low-dimensional summary of $\alpha$; in particular, we are looking for a parameterization that does not scale with $m = r^l$. The second step consists of re-writing the objective function in terms of these new parameters.
As we will prove subsequently, the following linear map $\Lambda$ extracts the information in $\alpha$ that is relevant for solving (8):

$$
\gamma \equiv \Lambda \alpha, \qquad \Lambda \in \{0,1\}^{n \cdot l \cdot r^2 \times n \cdot m} \qquad (9)
$$

where

$$
\lambda_{(j,t,\sigma,\tau),(i,\mathbf{y})} \equiv \delta_{ij}\, [[y^t = \sigma \wedge y^{t+1} = \tau]] \qquad (10)
$$

Notice that each variable $\lambda_{(j,t,\sigma,\tau),(i,\mathbf{y})}$ encodes whether the input sequence is the $j$-th training sequence and whether the label sequence $\mathbf{y}$ contains micro-labels $\sigma$ and $\tau$ at positions $t$ and $t+1$, respectively. Hence, $\gamma_{(j,t,\sigma,\tau)}$ is simply the sum of all $\alpha_{(j,\mathbf{y})}$ over label sequences $\mathbf{y}$ that contain the $\sigma\tau$-motif at position $t$.

We define two reductions derived from $\gamma$ via further linear dimension reduction,

$$
\gamma^{(1)} \equiv P \gamma, \quad \text{with } P_{(i,s,\sigma),(j,t,\tau,\rho)} = \delta_{ij}\, \delta_{st}\, \delta_{\sigma\tau}, \qquad (11a)
$$
$$
\gamma^{(2)} \equiv Q \gamma, \quad \text{with } Q_{(i,\sigma,\zeta),(j,t,\tau,\rho)} = \delta_{ij}\, \delta_{\sigma\tau}\, \delta_{\zeta\rho}. \qquad (11b)
$$

Intuitively, $\gamma^{(2)}_{(i,\sigma,\tau)}$ is the sum of all $\alpha_{(i,\mathbf{y})}$ over every position in the sequence $\mathbf{y}$ that contains the $\sigma\tau$-motif. $\gamma^{(1)}_{(i,s,\sigma)}$, on the other hand, is the sum of all $\alpha_{(i,\mathbf{y})}$ that have micro-label $\sigma$ at position $s$ of the macro-label $\mathbf{y}$.
We can now show how to represent the kernel matrix using the previously defined matrices $\Lambda$, $P$, $Q$ and the gram matrix $G$ with $G_{(i,s),(j,t)} = g(x^s_i, x^t_j)$.

Proposition 1. With the definitions from above:

$$
K = \Lambda^T K' \Lambda, \qquad K' \equiv P^T H P + Q^T Q
$$

where $H = \mathrm{diag}(G, \dots, G)$.

Proof. By elementary comparison of coefficients.
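If $\gamma$ is stored densely as an array indexed $(i, t, \sigma, \tau)$, the reductions in (11) amount to plain axis sums; this storage layout is an assumption of the sketch below, not something prescribed by the paper.

```python
import numpy as np

def reduce_gamma(gamma):
    """Reductions of Eq. (11) for gamma stored as an (n, l, r, r) array
    indexed (i, t, sigma, tau).

    gamma1[i, s, sigma]   = sum_rho gamma[i, s, sigma, rho]   (Eq. 11a)
    gamma2[i, sigma, tau] = sum_t   gamma[i, t, sigma, tau]   (Eq. 11b)
    """
    gamma1 = gamma.sum(axis=3)   # shape (n, l, r)
    gamma2 = gamma.sum(axis=1)   # shape (n, r, r)
    return gamma1, gamma2
```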
We now have $r^2$ parameters for every observation $x^s$ in the training data ($n l r^2$ parameters in total) and we can rewrite the objective function in terms of these variables:

$$
R(\gamma|X, \mathbf{y}) = \gamma^T K' \gamma - \sum_{i=1}^{n} \gamma^T K' \Lambda e_{(i,\mathbf{y}_i)} + \sum_{i=1}^{n} \log \sum_{\mathbf{y} \in \mathcal{Y}} \exp\!\left(\gamma^T K' \Lambda e_{(i,\mathbf{y})}\right) \qquad (12)
$$
3.4. GPSC and Other Label Sequence Learning Methods
We now briefly point out the relationship between our approach and previous discriminative methods for sequence learning, in particular CRFs, HM-SVMs and Max-Margin Markov networks (MMMs).
CRFs are a natural generalization of logistic regression to label sequence learning. The probability distribution over label sequences given an observation sequence is given by Eq. (1), where $u(X, \mathbf{y}) = \langle \theta, \Psi(X, \mathbf{y}) \rangle$ is a linear discriminative function over some feature representation $\Psi$, parameterized by $\theta$. The objective of CRFs is the minimization of the negative conditional likelihood of the training data. To avoid overfitting, it is common to multiply the conditional likelihood by a Gaussian with zero mean and diagonal covariance matrix $K$, resulting in an additive term on the log scale:

$$
\log p(\theta|X, \mathbf{y}) = \sum_{i=1}^{n} \log p(\mathbf{y}_i|X_i, \theta) + \theta^T K \theta \qquad (13)
$$
From a Bayesian point of view, CRFs assume a uniform prior $p(u)$ if there is no regularization term. When regularized, CRFs define a Gaussian distribution over a finite-dimensional parameter space $\theta$. In GPSC, on the other hand, the prior is defined as a Gaussian distribution over a function space of possibly infinite dimension. Thus, GPSC generalizes CRFs by defining a more sophisticated prior on the discriminative function $u$. This prior makes it possible to use kernel functions in order to construct and learn over Reproducing Kernel Hilbert Spaces. Hence GPSC, a non-parametric Bayesian inference tool for sequence labeling, can overcome the limitations of CRFs, which are parametric (linear) statistical models. When the kernel that defines the covariance matrix $K$ in GPSC is linear, $u$ in the two models becomes equivalent.
The difference between the SVM and GP approaches to sequence learning is the loss function applied to the training data, i.e. hinge loss vs. log loss. The GPSC objective function parameterized by $\alpha$ (Eq. (8)) corresponds to HM-SVMs, where the number of parameters scales exponentially with the length of the sequences. The objective function parameterized by $\gamma$ (Eq. (12)) corresponds to MMMs, where the number of parameters scales only linearly.
4. GPSC Optimization Algorithm
4.1. A Dense Algorithm
Using the optimization methods described in Section 2 requires the computation of the Hessian matrix. In sequence labeling, this corresponds to computing the expectations of micro-labels within different cliques, which is not tractable to compute exactly for large training sets. In order to minimize $R$ with respect to $\gamma$, we propose a 1st-order exact optimization method, which we call Dense Gaussian Process Sequence Classification (DGPS).
It is well known that the derivative of the log partition function with respect to $\gamma$ is simply the expectation of the sufficient statistics:

$$
\nabla_\gamma \log \sum_{\mathbf{y} \in \mathcal{Y}} \exp\!\left(\gamma^T K' \Lambda e_{(i,\mathbf{y})}\right) = E_Y\!\left[\Lambda e_{(i,Y)}\right] \qquad (14)
$$

where $E_Y$ denotes an expectation with respect to the conditional distribution of the label sequence $\mathbf{y}$ given the observation sequence $X_i$. Then, the gradient of $R$ is given by:

$$
\nabla_\gamma R = 2 K' \gamma - \sum_{i=1}^{n} K' \Lambda e_{(i,\mathbf{y}_i)} + \sum_{i=1}^{n} K'\, E_Y\!\left[\Lambda e_{(i,Y)}\right] \qquad (15)
$$
The remaining challenge is to come up with an efficient way to compute the expectations. First of all, let us examine these quantities more explicitly:

$$
E_Y\!\left[(\Lambda e_{(i,Y)})_{(j,t,\sigma,\tau)}\right] = \delta_{ij}\, E_Y\!\left[[[Y^t = \sigma \wedge Y^{t+1} = \tau]]\right] \qquad (16)
$$

In order to compute the above expectations, one can once again exploit the structure of the kernel and is left with the problem of computing probabilities for every neighboring micro-label pair $(\sigma, \tau)$ at positions $(t, t+1)$ for all training sequences $X_i$. The latter can be accomplished by performing the forward-backward algorithm over the training data using the transition probability matrix $T$ and the observation probability matrices $O^{(i)}$, which are simply decompositions and reshapings of $K'$:

$$
\bar{\gamma}^{(2)} \equiv R\, \gamma^{(2)}, \quad \text{with } R_{(\sigma,\zeta),(i,\tau,\rho)} = \delta_{\sigma\tau}\, \delta_{\zeta\rho} \qquad (17a)
$$
$$
T \equiv [\bar{\gamma}^{(2)}]_{r,r} \qquad (17b)
$$
$$
O^{(i)} = G_{(i,\cdot),(\cdot,\cdot)}\, [\gamma^{(1)}]_{n \cdot l,\, r} \qquad (17c)
$$

where $[x]_{m,n}$ denotes the reshaping of a vector $x$ into an $m \times n$ matrix, $A_{I,J}$ denotes the $|I| \times |J|$ sub-matrix of $A$, and $(\cdot)$ denotes the set of all possible indices.
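The pairwise expectations of Eq. (16) can be obtained with a standard forward-backward pass once $T$ and $O^{(i)}$ are available. The sketch below treats them as log-potentials (unnormalized scores) for a single sequence, which is our simplification of the reshaped quantities in (17); the function name is ours and logsumexp is used for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def pairwise_marginals(logT, logO):
    """Forward-backward for one sequence.

    logT: (r, r) transition log-potentials; logO: (l, r) observation log-potentials.
    Returns log Z and P[t, sigma, tau] = p(Y^t = sigma, Y^{t+1} = tau | X), cf. Eq. (16).
    """
    l, r = logO.shape
    # forward pass
    a = np.zeros((l, r))
    a[0] = logO[0]
    for t in range(1, l):
        a[t] = logO[t] + logsumexp(a[t - 1][:, None] + logT, axis=0)
    # backward pass
    b = np.zeros((l, r))
    for t in range(l - 2, -1, -1):
        b[t] = logsumexp(logT + logO[t + 1] + b[t + 1], axis=1)
    logZ = logsumexp(a[-1])
    # pairwise marginals for each pair of neighboring positions
    P = np.zeros((l - 1, r, r))
    for t in range(l - 1):
        s = a[t][:, None] + logT + logO[t + 1][None, :] + b[t + 1][None, :]
        P[t] = np.exp(s - logZ)
    return logZ, P
```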
A single optimization step of DGPS is described in Algorithm 1. The complexity of one optimization step is $O(t^2)$, dominated by the forward-backward algorithm over all instances, where $t = n l r^2$. We propose to use a quasi-Newton method for the optimization process. Then, the overall complexity is given by $O(\eta t^2)$ where $\eta < t^2$. The memory requirement is given by the size of $\gamma$, $O(t)$.

Algorithm 1 One optimization step of Dense Gaussian Process Sequence Classification (DGPS)
Require: Training data $(X_i, \mathbf{y}_i)_{i=1:n}$; proposed parameter values $\gamma_c$
1: Initialize $\gamma^{(1)}_c$, $\gamma^{(2)}_c$ (Eq. (11)).
2: Compute $T$ w.r.t. $\gamma^{(2)}_c$ (Eq. (17a), Eq. (17b)).
3: for $i = 1,\dots,n$ do
4:   Compute $O^{(i)}$ w.r.t. $\gamma^{(1)}_c$ (Eq. (17c)).
5:   Compute $p(\mathbf{y}_i|X_i, \gamma_c)$ and $E_Y[[[Y^t = \sigma \wedge Y^{t+1} = \tau]]]$ for all $t, \sigma, \tau$ via the forward-backward algorithm using $O^{(i)}$ and $T$.
6: end for
7: Compute $\nabla_\gamma R$ (Eq. (15)).
During inference, one can find the most likely label sequence for an observation sequence $X$ by performing Viterbi decoding using the transition and observation probability matrices described above.
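A minimal Viterbi decoder under the same convention (log-scores $T$ and $O$) might look as follows; it is a generic sketch rather than the authors' implementation.

```python
import numpy as np

def viterbi(logT, logO):
    """Most likely micro-label sequence given (r, r) transition log-scores
    and (l, r) observation log-scores."""
    l, r = logO.shape
    delta = np.zeros((l, r))
    back = np.zeros((l, r), dtype=int)
    delta[0] = logO[0]
    for t in range(1, l):
        scores = delta[t - 1][:, None] + logT     # scores[i, j]: best path ending in i, moving to j
        back[t] = scores.argmax(axis=0)
        delta[t] = logO[t] + scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(l - 1, 0, -1):                 # trace back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```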
4.2. A Sparse Algorithm
While the above method is attractive for small data sets, the computation or the storage of $K'$ poses a serious problem when the data set is large. Also, classification of a new observation involves evaluating the covariance function at $nl$ data points, which is more than is acceptable for many applications. Hence, as in the case of standard Gaussian Process classification discussed in Section 2, one has to find a method that yields sparse solutions in terms of the $\gamma$ parameters in order to speed up the training and prediction stages.
We propose a sparse greedy method, Sparse Gaussian Process Sequence Classification (SGPS), that is similar to the method presented by Bennett et al. (2002). SGPS starts with an empty matrix $\hat{K}$. At each iteration, SGPS selects a training instance $X_i$ and computes the gradients of the parameters associated with $X_i$, $\gamma_{(i,\cdot)}$, to select the steepest descent direction(s) of $R$ over this subspace. Then $\hat{K}$ is augmented with these columns and SGPS performs optimization of the current problem using a quasi-Newton method. This process is repeated until the gradients vanish (i.e. they are smaller than a threshold value $\eta$) or a maximum number of $\gamma$ coordinates, $p$, has been selected (i.e. some sparseness level is achieved). Since the bottleneck of this method is the computation of the expectations $E_Y[[[Y^t = \sigma \wedge Y^{t+1} = \tau]]]$, we pick the steepest $d$ directions once the expectations are computed.
One has two options for computing the optimal $\gamma$ at every iteration: by updating all of the $\gamma$ parameters selected so far, or alternatively, by updating only the parameters selected in the last iteration. We prefer the latter because its iterations are less expensive. This approach is in the spirit of a boosting algorithm or of cyclic coordinate optimization.
Algorithm 2 Sparse Gaussian Process Sequence Classification (SGPS) algorithm
Require: Training data $(X_i, \mathbf{y}_i)_{i=1:n}$; maximum number of coordinates to be selected, $p$, $p < n l r^2$; threshold value $\eta$ for gradients
1: $\hat{K} \leftarrow [\,]$
2: for $i = 1,\dots,n$ do
3:   Compute $\nabla_{\gamma_{(i,\cdot)}} R$ (Eq. (15)).
4:   $s \leftarrow$ steepest $d$ directions of $\nabla_{\gamma_{(i,\cdot)}} R$
5:   $\hat{K} \leftarrow [\hat{K};\, K e_s]$
6:   Optimize $R$ w.r.t. $s$.
7:   Return if the gradients are smaller than $\eta$ or $p$ coordinates have been selected.
8: end for

SGPS is described in Algorithm 2. Its complexity is $O(p^2 t)$, where $p$ is the maximum number of coordinates allowed.
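The outer greedy loop of Algorithm 2 could be organized as in the skeleton below, where the per-sequence gradient computation and the inner quasi-Newton optimization are left as callables; every name here and the selection heuristic based on gradient magnitude are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def sgps_select(n, l, r, grad_block, optimize_block, d=1, p=100, eta=1e-3):
    """Skeleton of the greedy selection loop of Algorithm 2.

    grad_block(i, gamma) is assumed to return the gradient of R restricted to
    the coordinates of training sequence i, shaped (l, r, r); optimize_block
    re-optimizes R over the newly selected coordinates and returns the updated
    gamma. Both callables are placeholders.
    """
    gamma = np.zeros((n, l, r, r))
    active = []                                   # selected (i, t, sigma, tau) coordinates
    for i in range(n):
        g = grad_block(i, gamma)
        flat = np.abs(g).ravel()
        steepest = flat.argsort()[::-1][:d]       # d steepest directions for sequence i
        if flat[steepest[0]] < eta:
            break                                 # gradients have (numerically) vanished
        new = [(i,) + np.unravel_index(s, (l, r, r)) for s in steepest]
        active.extend(new)
        gamma = optimize_block(gamma, new)        # cheap: only the last-added coordinates
        if len(active) >= p:
            break                                 # sparseness budget reached
    return gamma, active
```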
5. Experiments
5.1. Pitch Accent Prediction
Pitch accent prediction is the task of identifying the more prominent words in a sentence. The micro-label set is of size 2: accented and not-accented. We used the phonetically hand-transcribed Switchboard corpus consisting of 1824 sentences (13K words) (Greenberg et al., 1996). We extracted probabilistic, acoustic and textual information from the current, previous and next words for every position in the training data. We used 1st-order Markov features to capture the dependencies between neighboring labels.
We compared the performance of CRFs and HM-SVMs with the GPSC dense and sparse methods according to their test accuracy in 5-fold cross validation. CRFs were regularized and optimized using limited-memory BFGS, a limited-memory quasi-Newton optimization method. In the DGPS experiments, we used polynomial kernels of different degrees (denoted by DGPSX in Figure 1a, where X ∈ {1, 2, 3} is the degree of the polynomial kernel). We used a third-degree polynomial kernel in HM-SVMs (denoted by SVM3 in Figure 1).
[Figure 1. Pitch accent prediction results. (a) Test accuracy over a window of size 3 using 5-fold cross validation. (b) Test accuracy of DGPS2 and SGPS2 with respect to the sparseness of the solution (accuracy vs. sparseness %). (c) Precision and recall curves for different threshold probabilities to abstain.]
As expected, CRFs and DGPS1 performed very similarly. When 2nd-order features were incorporated implicitly using a second-degree polynomial kernel (DGPS2), the performance increased dramatically. Extracting 2nd-order features explicitly results in a 12 million dimensional feature space, where CRFs slow down dramatically. We observed that 3rd-order features do not provide significant improvement over DGPS2. HM-SVM3 performs slightly worse than DGPS2.
To investigate how the sparsity of SGPS affects its performance, we report the test accuracy with respect to the sparseness of the SGPS solution in Figure 1b. Sparseness is measured by the percentage of the parameters selected by SGPS. The straight line is the performance of DGPS using a second-degree polynomial kernel. Using 1% of the parameters, SGPS achieves 75% accuracy (1.48% less than the accuracy of DGPS). When 7.8% of the parameters are selected, the accuracy is 76.18%, which is not significantly different from the performance of DGPS (76.48%). We observed that these parameters were related to 6.2% of the observations along with 1.13 label pairs on average. Thus, during inference one needs to evaluate the kernel function at only 6% of the observations, which reduces the inference time dramatically.
In order to experimentally verify how useful the predictive probabilities are as confidence scores, we forced DGPS to abstain from predicting a label when the probability of a micro-label was lower than a threshold value. In Figure 1c, we plot precision-recall values for different thresholds. We observed that the error rate of DGPS decreased by 8.54% when abstaining on 14.93% of the test data. The improvement in the error rate shows the validity of the probabilities generated by DGPS.
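The abstention rule described above reduces, per position, to thresholding the posterior micro-label probability; a hypothetical helper illustrating this decision rule is sketched below (the function name and the default threshold are our own).

```python
import numpy as np

def predict_or_abstain(marginals, threshold=0.8):
    """Per-position decision rule for the abstention experiment (our sketch):
    marginals is an (l, r) array of posterior micro-label probabilities; return
    the argmax label per position, or None when its probability is below threshold."""
    labels = marginals.argmax(axis=1)
    conf = marginals.max(axis=1)
    return [int(y) if c >= threshold else None for y, c in zip(labels, conf)]
```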
5.2. Named Entity Recognition
Named Entity Recognition (NER), a subtask of Information Extraction, is the task of finding phrases containing names in a sentence. The micro-label set consists of the beginning and continuation of person, location, organization and miscellaneous names, as well as non-name. We used a Spanish newswire corpus, which was provided for the Special Session of CoNLL 2002 on NER, from which we randomly selected 1000 sentences (21K words). We used the word and its spelling properties for the current, previous and next observations.
Table 1. Test error of NER over a window of size 3 using 5-fold cross validation.

         DGPS1   DGPS2   SGPS2   CRF    CRF-B
Error    4.58    4.39    4.48    4.92   4.56
The experimental setup was similar to the pitch accent prediction task. We compared the performance of CRFs with and without the regularizer term (CRF-R, CRF) with the GPSC dense and sparse methods. Qualitatively, the behavior of the different optimization methods is comparable to the pitch accent prediction task. The results are summarized in Table 1. Second-degree polynomial DGPS outperformed the other methods. We set the sparseness parameter of SGPS to 25%, i.e. $p = 0.25\, n l r^2$, where $r = 9$ and $nl = 21\mathrm{K}$ on average. SGPS with 25% sparseness achieves an accuracy that is only 0.1% below DGPS. We observed that 19% of the observations are selected along with 1.32 label pairs on average, which means that one needs to compute only one fifth of the gram matrix.
We also tried a sparse algorithm that does not exploit the kernel structure and optimizes Equation 8 to obtain sparse solutions in terms of observation sequences $X$ and label sequences $\mathbf{y}$, as opposed to SGPS, where the sparse solution is in terms of observations and label pairs. This method achieved 92.7% accuracy and hence was clearly outperformed by all the other methods.
6. Conclusion and Future Work
We presented GPSC, a generalization of Gaussian Process classification to the label sequence learning problem. This method retains the advantages of the rigorous probabilistic semantics of CRFs while overcoming the curse of dimensionality by using kernels to construct and learn over RKHS. The experiments on named entity recognition demonstrate the competitiveness of our approach, and the experiments on pitch accent prediction its superiority in terms of the achieved error rate. We also experimentally verified the usefulness of the probabilities obtained from GPSC.
References
Altun, Y., Hofmann, T., & Johnson, M. (2003a). Discrim-
inative learning for label sequences via boosting. Ad-
vances in Neural Information Processing Systems.
Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003b). Hidden Markov support vector machines. 20th International Conference on Machine Learning (ICML).
Bennett, K., Momma, M., & Embrechts, J. (2002). Mark:
A boosting algorithm for heterogeneous kernel mod-
els. Proceedings of SIGKDD International Conference
on Knowledge Discovery and Data Mining.
Collins, M. (2002). Discriminative training methods for
Hidden Markov Models: Theory and experiments with
perceptron algorithms. Empirical Methods of Natural
Language Processing (EMNLP).
Crammer, K., & Singer, Y. (2001). On the algorithmic im-
plementation of multiclass kernel-based vector machines.
Journal of Machine Learning Research,2, 265–292.
Csató, L., & Opper, M. (2002). Sparse on-line Gaussian Processes. Neural Computation, 14, 641–668.
Gibbs, M. N., & MacKay, D. J. C. (2000). Variational
Gaussian Process Classifiers. IEEE-NN,11, 1458.
Girosi, F. (1997). An equivalence between sparse approxi-
mation and support vector machines (Technical Report
AIM-1606).
Greenberg, S., Ellis, D., & Hollenback, J. (1996). Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. ICSLP96 (pp. S24–27). Philadelphia, PA.
Jaakkola, T. S., & Jordan, M. I. (1996). Computing upper
and lower bounds on likelihoods in intractable networks.
In Proceedings of the Twelfth Conference on Uncertainty
in AI.
Kimeldorf, G., & Wahba, G. (1971). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Math. Stat., 41(2), 495–502.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Con-
ditional Random Fields: Probabilistic models for seg-
menting and labeling sequence data. Proc. 18th Interna-
tional Conf. on Machine Learning (pp. 282–289). Mor-
gan Kaufmann, San Francisco, CA.
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum
Entropy Markov Models for Information Extraction and
Segmentation. Machine Learning: Proceedings of the
Seventeenth International Conference (ICML 2000) (pp.
591–598). Stanford, California.
Minka, T. (2001). A family of algorithms for approximate
Bayesian inference. PhD thesis, MIT Media Lab.
Opper, M., & Winther, O. (2000). Gaussian Processes for
classification: Mean-field algorithms. Neural Computa-
tion,12, 2655–2684.
Punyakanok, V., & Roth, D. (2000). The use of classifiers
in sequential inference. Advances in Neural Information
Processing Systems (pp. 995–1001).
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.
Seeger, M., Lawrence, N. D., & Herbrich, R. (2003). Fast
sparse Gaussian Process methods: The informative vec-
tor machine. Advances in Neural Information Processing
Systems.
Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy Gaus-
sian Process regression. Advances in Neural Information
Processing Systems (pp. 619–625).
Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proc. 17th International Conf. on Machine Learning (pp. 911–918). Morgan Kaufmann, San Francisco, CA.
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems.
Weston, J., & Watkins, C. (1999). Support vector ma-
chines for multi-class pattern recognition. Proceedings
European Symposium on Artificial Neural Networks.
Williams, C. K. I., & Barber, D. (1998). Bayesian clas-
sification with Gaussian Processes. IEEE Transactions
on Pattern Analysis and Machine Intelligence,20, 1342–
1351.
Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems.