Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 477–485, Suntec, Singapore, 2-7 August 2009. © 2009 ACL and AFNLP
Stochastic Gradient Descent Training for
L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka†‡  Jun'ichi Tsujii†‡∗  Sophia Ananiadou†‡
† School of Computer Science, University of Manchester, UK
‡ National Centre for Text Mining (NaCTeM), UK
∗ Department of Computer Science, University of Tokyo, Japan
{yoshimasa.tsuruoka,j.tsujii,sophia.ananiadou}@manchester.ac.uk
Abstract

Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1-regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to cumulative values for L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
1 Introduction
Log-linear models (a.k.a. maximum entropy models) are one of the most widely used probabilistic models in the field of natural language processing (NLP). The applications range from simple classification tasks such as text classification and history-based tagging (Ratnaparkhi, 1996) to more complex structured prediction tasks such as part-of-speech (POS) tagging (Lafferty et al., 2001), syntactic parsing (Clark and Curran, 2004) and semantic role labeling (Toutanova et al., 2005). Log-linear models have a major advantage over other discriminative machine learning models such as support vector machines: their probabilistic output allows the information on the confidence of the decision to be used by other components in the text processing pipeline.
The training of log-linear models is typically performed based on the maximum likelihood criterion, which aims to obtain the weights of the features that maximize the conditional likelihood of the training data. In maximum likelihood training, regularization is normally needed to prevent the model from overfitting the training data.
The two most common regularization methods are called L1 and L2 regularization. L1 regularization penalizes the weight vector for its L1-norm (i.e. the sum of the absolute values of the weights), whereas L2 regularization uses its L2-norm. There is usually not a considerable difference between the two methods in terms of the accuracy of the resulting model (Gao et al., 2007), but L1 regularization has a significant advantage in practice. Because many of the weights of the features become zero as a result of L1-regularized training, the size of the model can be much smaller than that produced by L2-regularization. Compact models require less space in memory and storage, and enable the application to start up quickly. These merits can be of vital importance when the application is deployed in resource-tight environments such as cell-phones.
A common way to train a large-scale L1-regularized model is to use a quasi-Newton method. Kazama and Tsujii (2003) describe a method for training an L1-regularized log-linear model with a bound-constrained version of the BFGS algorithm (Nocedal, 1980). Andrew and Gao (2007) present an algorithm called Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), which can work on the BFGS algorithm without bound constraints and achieves faster convergence.
An alternative approach to training a log-linear model is to use stochastic gradient descent (SGD) methods. SGD uses approximate gradients estimated from subsets of the training data and updates the weights of the features in an online fashion; the weights are updated much more frequently than in batch training algorithms. This learning framework is attracting attention because it often requires much less training time in practice than batch training algorithms, especially when the training data is large and redundant. SGD was recently used for NLP tasks including machine translation (Tillmann and Zhang, 2006) and syntactic parsing (Smith and Eisner, 2008; Finkel et al., 2008). Also, SGD is very easy to implement because it does not need to use the Hessian information on the objective function. The implementation could be as simple as the perceptron algorithm.
Although SGD is a very attractive learning framework, the direct application of L1 regularization in this learning framework does not result in efficient training. The first problem is the inefficiency of applying the L1 penalty to the weights of all features. In NLP applications, the dimension of the feature space tends to be very large; it can easily reach several million, so the application of the L1 penalty to all features significantly slows down the weight updating process. The second problem is that the naive application of the L1 penalty in SGD does not always lead to compact models, because the approximate gradient used at each update is very noisy, so the weights of the features can easily be moved away from zero by those fluctuations.
In this paper, we present a simple method for solving these two problems in SGD learning. The main idea is to keep track of the total penalty and the penalty that has been applied to each weight, so that the L1 penalty is applied based on the difference between those cumulative values. That way, the application of the L1 penalty is needed only for the features that are used in the current sample, and the effect of the noisy gradient is also smoothed away.
We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging. We show that our enhanced SGD learning method can produce compact and accurate models much more quickly than the OWL-QN algorithm.
This paper is organized as follows. Section 2 provides a general description of log-linear models used in NLP. Section 3 describes our stochastic gradient descent method for L1-regularized log-linear models. Experimental results are presented in Section 4. Some related work is discussed in Section 5. Section 6 gives some concluding remarks.
2 Log-Linear Models
In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization.

A log-linear model defines the following probabilistic distribution over possible structures y for input x:

p(y|x) = \frac{1}{Z(x)} \exp \sum_i w_i f_i(y, x),

where f_i(y, x) is a function indicating the occurrence of feature i, w_i is the weight of the feature, and Z(x) is a partition (normalization) function:

Z(x) = \sum_y \exp \sum_i w_i f_i(y, x).
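As an illustration, the distribution above can be sketched in Python for a toy classification setting (the feature templates, labels, and weights below are hypothetical examples, not the ones used in the paper):

```python
import math

def log_linear_prob(weights, features, labels, x):
    """p(y|x) = exp(sum_i w_i f_i(y, x)) / Z(x) for a toy classifier."""
    scores = {y: sum(weights.get(f, 0.0) for f in features(y, x)) for y in labels}
    z = sum(math.exp(s) for s in scores.values())  # partition function Z(x)
    return {y: math.exp(scores[y]) / z for y in scores}

# Hypothetical binary indicator features: a (label, word) pair fires when
# the word occurs in the input.
def features(y, x):
    return [(y, w) for w in x.split()]

weights = {("POS", "good"): 2.0, ("NEG", "bad"): 2.0}
p = log_linear_prob(weights, features, ["POS", "NEG"], "a good movie")
# p is a proper distribution: its values sum to one
```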
If the structure is a sequence, the model is called a linear-chain CRF model, and the marginal probabilities of the features and the partition function can be efficiently computed by using the forward-backward algorithm. The model is used for a variety of sequence labeling tasks such as POS tagging, chunking, and named entity recognition.
If the structure is a tree, the model is called a tree CRF model, and the marginal probabilities can be computed by using the inside-outside algorithm. The model can be used for tasks like syntactic parsing (Finkel et al., 2008) and semantic role labeling (Cohn and Blunsom, 2005).
2.1 Training
The weights of the features in a log-linear model are optimized in such a way that they maximize the regularized conditional log-likelihood of the training data:

L_w = \sum_{j=1}^{N} \log p(y_j|x_j; w) - R(w),   (1)

where N is the number of training samples, y_j is the correct output for input x_j, and R(w) is the regularization term which prevents the model from overfitting the training data. In the case of L1 regularization, the term is defined as:

R(w) = C \sum_i |w_i|,

where C is the meta-parameter that controls the degree of regularization, which is usually tuned by cross-validation or using the heldout data.

In what follows, we denote by L(j, w) the conditional log-likelihood of each sample, \log p(y_j|x_j; w). Equation 1 is rewritten as:

L_w = \sum_{j=1}^{N} L(j, w) - C \sum_i |w_i|.   (2)
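For concreteness, Equation 2 can be evaluated directly once the per-sample log-likelihoods are available (a minimal sketch; the weight and likelihood values here are made up):

```python
def l1_objective(weights, log_likelihoods, C):
    """Equation (2): sum of per-sample log-likelihoods minus C times the
    L1-norm of the weight vector."""
    return sum(log_likelihoods) - C * sum(abs(w) for w in weights.values())

obj = l1_objective({"f1": 0.5, "f2": -2.0}, [-0.1, -0.3], C=1.0)
# (-0.1 - 0.3) - 1.0 * (0.5 + 2.0) = -2.9
```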
3 Stochastic Gradient Descent
SGD uses a small randomly selected subset of the training samples to approximate the gradient of the objective function given by Equation 2. The number of training samples used for this approximation is called the batch size. When the batch size is N, SGD training simply translates into gradient descent (and hence is very slow to converge). By using a small batch size, one can update the parameters more frequently than gradient descent and speed up the convergence. The extreme case is a batch size of 1, which gives the maximum frequency of updates and leads to a very simple perceptron-like algorithm, which we adopt in this work.¹
Apart from using a single training sample to approximate the gradient, the optimization procedure is the same as simple gradient descent,² so the weights of the features are updated at training sample j as follows:

w^{k+1} = w^k + \eta_k \frac{\partial}{\partial w}\left( L(j, w) - \frac{C}{N} \sum_i |w_i| \right),

where k is the iteration counter and \eta_k is the learning rate, which is normally designed to decrease as the iteration proceeds. The actual learning rate scheduling methods used in our experiments are described later in Section 3.3.
¹ In the actual implementation, we randomly shuffled the training samples at the beginning of each pass, and then picked them up sequentially.
² What we actually do here is gradient ascent, but we stick to the term "gradient descent".
3.1 L1 regularization
The update equation for the weight of each feature i is as follows:

w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i}\left( L(j, w) - \frac{C}{N} |w_i| \right).

The difficulty with L1 regularization is that the last term on the right-hand side of the above equation is not differentiable when the weight is zero. One straightforward solution to this problem is to consider a subgradient at zero and use the following update equation:

w_i^{k+1} = w_i^k + \eta_k \frac{\partial L(j, w)}{\partial w_i} - \frac{C}{N} \eta_k \, \mathrm{sign}(w_i^k),

where sign(x) = 1 if x > 0, sign(x) = -1 if x < 0, and sign(x) = 0 if x = 0. In this paper, we call this weight updating method "SGD-L1 (Naive)".
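A sketch of this naive update in Python (assuming the per-sample gradient is supplied as a sparse dict; note that the penalty loop touches every feature weight, which is exactly the inefficiency the paper points out):

```python
def sgd_l1_naive_update(w, grad, eta, C, N):
    """One SGD-L1 (Naive) step: gradient ascent on L(j, w) plus a
    subgradient of the L1 term applied to ALL feature weights.

    w: dict feature -> weight; grad: dict feature -> dL(j, w)/dw_i for the
    current sample (features absent from grad have zero gradient)."""
    sign = lambda x: (x > 0) - (x < 0)
    for i in w:  # iterates over the full feature space, not just the sample
        w[i] = w[i] + eta * grad.get(i, 0.0) - (C / N) * eta * sign(w[i])
    return w

w = sgd_l1_naive_update({"a": 0.5, "b": -0.5, "c": 0.0},
                        {"a": 1.0}, eta=0.1, C=1.0, N=10)
# "a": 0.5 + 0.1 - 0.01 = 0.59; "b": -0.5 + 0.01 = -0.49; "c" stays 0.0
```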
This naive method has two serious problems. The first problem is that, at each update, we need to apply the L1 penalty to all features, including the features that are not used in the current training sample. Since the dimension of the feature space can be very large, this can significantly slow down the weight update process. The second problem is that it does not produce a compact model, i.e. most of the weights of the features do not become zero as a result of training. Note that the weight of a feature does not become zero unless it happens to fall on zero exactly, which rarely happens in practice.
Carpenter (2008) describes an alternative approach. The weight updating process is divided into two steps. First, the weight is updated without considering the L1 penalty term. Then, the L1 penalty is applied to the weight to the extent that it does not change its sign. In other words, the weight is clipped when it crosses zero. The weight update procedure is as follows:

w_i^{k+\frac{1}{2}} = w_i^k + \eta_k \left. \frac{\partial L(j, w)}{\partial w_i} \right|_{w=w^k},

if w_i^{k+\frac{1}{2}} > 0 then
    w_i^{k+1} = \max\left(0, \; w_i^{k+\frac{1}{2}} - \frac{C}{N} \eta_k\right),
else if w_i^{k+\frac{1}{2}} < 0 then
    w_i^{k+1} = \min\left(0, \; w_i^{k+\frac{1}{2}} + \frac{C}{N} \eta_k\right).
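A sketch of the clipping update (only the features appearing in the current sample are touched; the gradient dict is assumed sparse):

```python
def sgd_l1_clipping_update(w, grad, eta, C, N):
    """Carpenter-style clipping: take the gradient step, then shrink the
    weight toward zero by (C/N)*eta, but never across zero."""
    penalty = (C / N) * eta
    for i, g in grad.items():  # only features used in the current sample
        half = w.get(i, 0.0) + eta * g          # w_i^{k+1/2}
        if half > 0:
            w[i] = max(0.0, half - penalty)
        elif half < 0:
            w[i] = min(0.0, half + penalty)
        else:
            w[i] = 0.0
    return w

w = sgd_l1_clipping_update({"a": 0.05}, {"a": -1.0}, eta=0.1, C=1.0, N=10)
# half-step gives -0.05; the 0.01 penalty pulls it back toward zero: -0.04
```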
In this paper, we call this update method "SGD-L1 (Clipping)". It should be noted that this method is actually a special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009).

[Figure 1: An example of weight updates. The plot traces a single feature's weight (y-axis, -0.1 to 0.1) over 6,000 updates (x-axis).]
The obvious advantage of using this method is that we can expect many of the weights of the features to become zero during training. Another merit is that it allows us to apply the L1 penalty in a lazy fashion, so that we do not need to update the weights of the features that are not used in the current sample, which leads to much faster training when the dimension of the feature space is large. See the aforementioned papers for the details. In this paper, we call this efficient implementation "SGD-L1 (Clipping + Lazy-Update)".
3.2 L1 regularization with cumulative penalty
Unfortunately, the clipping-at-zero approach does not solve all problems. Still, we often end up with many features whose weights are not zero. Recall that the gradient used in SGD is a crude approximation to the true gradient and is very noisy. The weight of a feature is, therefore, easily moved away from zero when the feature is used in the current sample.

Figure 1 gives an illustrative example in which the weight of a feature fails to become zero. The figure shows how the weight of a feature changes during training. The weight goes up sharply when it is used in the sample and then is pulled back toward zero gradually by the L1 penalty. Therefore, the weight fails to become zero if the feature is used toward the end of training, which is the case in this example. Note that the weight would become zero if the true (fluctuationless) gradient were used: at each update the weight would go up a little and be pulled back to zero straightaway.
Here, we present a different strategy for applying the L1 penalty to the weights of the features. The key idea is to smooth out the effect of fluctuating gradients by considering the cumulative effects from the L1 penalty.
Let u_k be the absolute value of the total L1 penalty that each weight could have received up to the point. Since the absolute value of the L1 penalty does not depend on the weight and we are using the same regularization constant C for all weights, it is simply accumulated as:

u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t.   (3)
At each training sample, we update the weights of the features that are used in the sample as follows:

w_i^{k+\frac{1}{2}} = w_i^k + \eta_k \left. \frac{\partial L(j, w)}{\partial w_i} \right|_{w=w^k},

if w_i^{k+\frac{1}{2}} > 0 then
    w_i^{k+1} = \max\left(0, \; w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1})\right),
else if w_i^{k+\frac{1}{2}} < 0 then
    w_i^{k+1} = \min\left(0, \; w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1})\right),

where q_i^k is the total L1 penalty that w_i has actually received up to the point:

q_i^k = \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right).   (4)
This weight updating method penalizes the weight according to the difference between u_k and q_i^{k-1}. In effect, it forces the weight to receive the total L1 penalty that would have been applied if the weight had been updated by the true gradients, assuming that the current weight vector resides in the same orthant as the true weight vector.

It should be noted that this method is basically equivalent to the "SGD-L1 (Clipping + Lazy-Update)" method if we were able to use the true gradients instead of the stochastic gradients.
In this paper, we call this weight updating method "SGD-L1 (Cumulative)". The implementation of this method is very simple. Figure 2 shows the whole SGD training algorithm with this strategy in pseudo-code.
 1: procedure TRAIN(C)
 2:   u ← 0
 3:   Initialize w_i and q_i with zero for all i
 4:   for k = 0 to MaxIterations
 5:     η ← LEARNINGRATE(k)
 6:     u ← u + ηC/N
 7:     Select sample j randomly
 8:     UPDATEWEIGHTS(j)
 9:
10: procedure UPDATEWEIGHTS(j)
11:   for i ∈ features used in sample j
12:     w_i ← w_i + η ∂L(j, w)/∂w_i
13:     APPLYPENALTY(i)
14:
15: procedure APPLYPENALTY(i)
16:   z ← w_i
17:   if w_i > 0 then
18:     w_i ← max(0, w_i − (u + q_i))
19:   else if w_i < 0 then
20:     w_i ← min(0, w_i + (u − q_i))
21:   q_i ← q_i + (w_i − z)

Figure 2: Stochastic gradient descent training with cumulative L1 penalty. z is a temporary variable.
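The pseudo-code in Figure 2 translates almost line-for-line into Python. The sketch below assumes a caller-supplied `gradient(sample, w)` that returns a sparse dict of partial derivatives for the current sample, and an `eta_of(k)` learning-rate schedule; everything else follows the figure:

```python
import random

def train_sgd_l1_cumulative(samples, gradient, eta_of, C, max_iter):
    """Sketch of Figure 2: SGD training with the cumulative L1 penalty.

    samples: list of training examples; gradient(sample, w) returns a dict
    feature -> dL(j, w)/dw_i for that sample (assumed caller-supplied);
    eta_of(k) is the learning-rate schedule."""
    N = len(samples)
    w, q = {}, {}   # weights, and per-feature penalty actually received
    u = 0.0         # total penalty each weight could have received
    for k in range(max_iter):
        eta = eta_of(k)
        u += eta * C / N
        j = random.randrange(N)
        for i, g in gradient(samples[j], w).items():
            w[i] = w.get(i, 0.0) + eta * g       # gradient step
            z = w[i]
            if w[i] > 0:                          # ApplyPenalty(i)
                w[i] = max(0.0, w[i] - (u + q.get(i, 0.0)))
            elif w[i] < 0:
                w[i] = min(0.0, w[i] + (u - q.get(i, 0.0)))
            q[i] = q.get(i, 0.0) + (w[i] - z)
    return w
```

With C = 0 the penalty is a no-op and the weight just accumulates gradient steps; with a large C the cumulative penalty keeps the weight clipped at zero, which is exactly the sparsity-inducing behavior the method is designed for.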
3.3 Learning Rate
The scheduling of learning rates often has a major impact on the convergence speed in SGD training. A typical choice of learning rate scheduling can be found in (Collins et al., 2008):

\eta_k = \frac{\eta_0}{1 + k/N},   (5)

where \eta_0 is a constant. Although this scheduling guarantees ultimate convergence, the actual speed of convergence can be poor in practice (Darken and Moody, 1990).
In this work, we also tested simple exponential decay:

\eta_k = \eta_0 \alpha^{k/N},   (6)

where \alpha is a constant. In our experiments, we found this scheduling more practical than that given in Equation 5. This is mainly because exponential decay sweeps the range of learning rates more smoothly: the learning rate given in Equation 5 drops too fast at the beginning and too slowly at the end.
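Both schedules are one-liners; as a quick comparison (a sketch, with k counted in update steps and N the number of training samples):

```python
def eta_collins(k, eta0, N):
    """Equation (5): eta_k = eta0 / (1 + k/N)."""
    return eta0 / (1.0 + k / N)

def eta_exponential(k, eta0, alpha, N):
    """Equation (6): eta_k = eta0 * alpha**(k/N)."""
    return eta0 * alpha ** (k / N)

# After one full pass (k = N), Equation 5 has halved the rate, while
# Equation 6 with alpha = 0.85 still retains 85% of it.
```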
It should be noted that exponential decay is not a good choice from a theoretical point of view, because it does not satisfy one of the necessary conditions for convergence: the sum of the learning rates must diverge to infinity (Spall, 2005). However, this is probably not a big issue for practitioners because normally the training has to be terminated at a certain number of iterations in practice.³
4 Experiments
We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging.
To compare our algorithm with the state-of-the-art, we present the performance of the OWL-QN algorithm on the same data. We used the publicly available OWL-QN optimizer developed by Andrew and Gao.⁴ The meta-parameters for learning were left unchanged from the default settings of the software: the convergence tolerance was 1e-4, and the L-BFGS memory parameter was 10.
4.1 Text Chunking
The first set of experiments used the text chunking data set provided for the CoNLL 2000 shared task.⁵ The training data consists of 8,936 sentences in which each token is annotated with "IOB" tags representing text chunks such as noun and verb phrases. We separated 1,000 sentences from the training data and used them as the heldout data. The test data provided by the shared task was used only for the final accuracy report.
The features used in this experiment were unigrams and bigrams of neighboring words, and unigrams, bigrams and trigrams of neighboring POS tags.
To avoid giving any advantage to our SGD algorithms over the OWL-QN algorithm in terms of the accuracy of the resulting model, the OWL-QN algorithm was used when tuning the regularization parameter C. The tuning was performed in such a way that it maximized the likelihood of the heldout data. The learning rate parameters for SGD were then tuned in such a way that they maximized the value of the objective function in 30 passes. We first determined \eta_0 by testing 1.0, 0.5, 0.2, and 0.1. We then determined \alpha by testing 0.9, 0.85, and 0.8 with the fixed \eta_0.
³ This issue could also be sidestepped by, for example, adding a small O(1/k) term to the learning rate.
⁴ Available from the original developers' websites: http://research.microsoft.com/en-us/people/galena/ or http://research.microsoft.com/en-us/um/people/jfgao/
⁵ http://www.cnts.ua.ac.be/conll2000/chunking/
                                          Passes   L_w/N    # Features   Time (sec)   F-score
OWL-QN                                      160    -1.583      18,109        598        93.62
SGD-L1 (Naive)                               30    -1.671     455,651      1,117        93.64
SGD-L1 (Clipping + Lazy-Update)              30    -1.671      87,792        144        93.65
SGD-L1 (Cumulative)                          30    -1.653      28,189        149        93.68
SGD-L1 (Cumulative + Exponential-Decay)      30    -1.622      23,584        148        93.66

Table 1: CoNLL-2000 Chunking task. Training time and accuracy of the trained model on the test data.
[Figure 3: CoNLL 2000 chunking task: Objective. The objective function per sample (y-axis, -2.4 to -1.6) over 50 passes for OWL-QN, SGD-L1 (Clipping), SGD-L1 (Cumulative), and SGD-L1 (Cumulative + ED).]

[Figure 4: CoNLL 2000 chunking task: Number of active features. The number of active features (y-axis, 0 to 200,000) over 50 passes for the same four methods.]
Figures 3 and 4 show the training process of the model. Each figure contains four curves representing the results of the OWL-QN algorithm and three SGD-based algorithms. "SGD-L1 (Cumulative + ED)" represents the results of our cumulative penalty-based method that uses exponential decay (ED) for learning rate scheduling.

Figure 3 shows how the value of the objective function changed as the training proceeded. The SGD-based algorithms show much faster convergence than the OWL-QN algorithm. Notice also that "SGD-L1 (Cumulative)" improves the objective slightly faster than "SGD-L1 (Clipping)". The result of "SGD-L1 (Naive)" is not shown in this figure, but the curve was almost identical to that of "SGD-L1 (Clipping)".
Figure 4 shows the numbers of active features (the features whose weights are not zero). It is clearly seen that the clipping-at-zero approach fails to reduce the number of active features, while our algorithms succeeded in reducing the number of active features to the same level as OWL-QN.
We then trained the models using the whole training data (including the heldout data) and evaluated the accuracy of the chunker on the test data. The number of passes performed over the training data in SGD was set to 30. The results are shown in Table 1. The second column shows the number of passes performed in the training. The third column shows the final value of the objective function per sample. The fourth column shows the number of resulting active features. The fifth column shows the training time. The last column shows the f-score (harmonic mean of recall and precision) of the chunking results. There was no significant difference between the models in terms of accuracy. The naive SGD training took much longer than OWL-QN because of the overhead of applying the L1 penalty to all dimensions.
Our SGD algorithms finished training in 150 seconds on Xeon 2.13GHz processors. CRF++ version 0.50, a popular CRF library developed by Taku Kudo,⁶ is reported to take 4,021 seconds on Xeon 3.0GHz processors to train the model using a richer feature set.⁷ CRFsuite version 0.4, a much faster library for CRFs, is reported to take 382 seconds on Xeon 3.0GHz, using the same feature set as ours.⁸ That library uses the OWL-QN algorithm for optimization. Although direct comparison of training times is not important due to the differences in implementation and hardware platforms, these results demonstrate that our algorithm can actually result in a very fast implementation of a CRF trainer.

⁶ http://crfpp.sourceforge.net/
⁷ http://www.chokkan.org/software/crfsuite/benchmark.html
⁸ Ibid.
4.2 Named Entity Recognition
The second set of experiments used the named entity recognition data set provided for the BioNLP/NLPBA 2004 shared task (Kim et al., 2004).⁹ The training data consist of 18,546 sentences in which each token is annotated with "IOB" tags representing biomedical named entities such as the names of proteins and RNAs.

The training and test data were preprocessed by the GENIA tagger,¹⁰ which provided POS tags and chunk tags. We did not use any information on the named entity tags output by the GENIA tagger. For the features, we used unigrams of neighboring chunk tags, substrings (shorter than 10 characters) of the current word, and the shape of the word (e.g. "IL-2" is converted into "AA-#"), on top of the features used in the text chunking experiments.
The results are shown in Figure 5 and Table 2. The trend in the results is the same as that of the text chunking task: our SGD algorithms show much faster convergence than the OWL-QN algorithm and produce compact models.

Okanohara et al. (2006) report an f-score of 71.48 on the same data, using semi-Markov CRFs.
4.3 Part-Of-Speech Tagging
The third set of experiments used the POS tagging data in the Penn Treebank (Marcus et al., 1994). Following (Collins, 2002), we used sections 0-18 of the Wall Street Journal (WSJ) corpus for training, sections 19-21 for development, and sections 22-24 for final evaluation. The POS tags were extracted from the parse trees in the corpus. All experiments for this work, including the tuning of features and parameters for regularization, were carried out using the training and development sets. The test set was used only for the final accuracy report.
It should be noted that training a CRF-based POS tagger using the whole WSJ corpus is not a trivial task and was once even deemed impractical in previous studies. For example, Wellner and Vilain (2006) abandoned maximum likelihood training because it was "prohibitive" (7-8 days for sections 0-18 of the WSJ corpus).

⁹ The data is available for download at http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html
¹⁰ http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

[Figure 5: NLPBA 2004 named entity recognition task: Objective. The objective function per sample (y-axis, -3.8 to -2.2) over 50 passes for OWL-QN, SGD-L1 (Clipping), SGD-L1 (Cumulative), and SGD-L1 (Cumulative + ED).]

[Figure 6: POS tagging task: Objective. The objective function per sample (y-axis, -2.8 to -1.8) over 50 passes for the same four methods.]
For the features, we used unigrams and bigrams of neighboring words, prefixes and suffixes of the current word, and some characteristics of the word. We also normalized the current word by lowercasing capital letters and converting all the numerals into '#', and used the normalized word as a feature.
The results are shown in Figure 6 and Table 3. Again, the trend is the same. Our algorithms finished training in about 30 minutes, producing accurate models that are as compact as that produced by OWL-QN.
Shen et al. (2007) report an accuracy of 97.33% on the same data set using a perceptron-based bidirectional tagging model.
5 Discussion
An alternative approach to producing compact models for log-linear models is to reformulate the
                                          Passes   L_w/N    # Features   Time (sec)   F-score
OWL-QN                                      161    -2.448      30,710      2,253        71.76
SGD-L1 (Naive)                               30    -2.537   1,032,962      4,528        71.20
SGD-L1 (Clipping + Lazy-Update)              30    -2.538     279,886        585        71.20
SGD-L1 (Cumulative)                          30    -2.479      31,986        631        71.40
SGD-L1 (Cumulative + Exponential-Decay)      30    -2.443      25,965        631        71.63

Table 2: NLPBA 2004 Named entity recognition task. Training time and accuracy of the trained model on the test data.

                                          Passes   L_w/N    # Features   Time (sec)   Accuracy
OWL-QN                                      124    -1.941      50,870      5,623       97.16%
SGD-L1 (Naive)                               30    -2.013   2,142,130     18,471       97.18%
SGD-L1 (Clipping + Lazy-Update)              30    -2.013     323,199      1,680       97.18%
SGD-L1 (Cumulative)                          30    -1.987      62,043      1,777       97.19%
SGD-L1 (Cumulative + Exponential-Decay)      30    -1.954      51,857      1,774       97.17%

Table 3: POS tagging on the WSJ corpus. Training time and accuracy of the trained model on the test data.
problem as an L1-constrained problem (Lee et al., 2006), where the conditional log-likelihood of the training data is maximized under a fixed constraint on the L1-norm of the weight vector. Duchi et al. (2008) describe efficient algorithms for projecting a weight vector onto the L1-ball. Although L1-regularized and L1-constrained learning algorithms are not directly comparable because the objective functions are different, it would be interesting to compare the two approaches in terms of practicality. It should be noted, however, that the efficient algorithm presented in (Duchi et al., 2008) needs to employ a red-black tree and is rather complex.
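For reference, the simple O(n log n) sort-based projection onto the L1-ball (also described by Duchi et al. (2008); this sketch is not their more involved red-black-tree variant) can be written as:

```python
def project_onto_l1_ball(v, z):
    """Euclidean projection of v onto {w : ||w||_1 <= z}, via the simple
    sort-based algorithm: soft-threshold every coordinate by a common
    theta chosen so the result lands exactly on the ball."""
    if sum(abs(x) for x in v) <= z:
        return list(v)  # already feasible
    u = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - z) / j
        if uj - t > 0:       # largest j with positive residual defines theta
            theta = t
    return [max(abs(x) - theta, 0.0) * (1 if x > 0 else -1) for x in v]

w = project_onto_l1_ball([0.5, -1.5, 0.2], 1.0)
# theta = 0.5, so the projection is [0.0, -1.0, 0.0] with L1-norm exactly 1
```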
In SGD learning, the need for tuning the meta-parameters for learning rate scheduling can be annoying. In the case of exponential decay, the setting of \alpha = 0.85 turned out to be a good rule of thumb in our experiments: it always produced near-best results in 30 passes, but the other parameter \eta_0 needed to be tuned. It would be very useful if those meta-parameters could be tuned in a fully automatic way.
There are some sophisticated algorithms for adaptive learning rate scheduling in SGD learning (Vishwanathan et al., 2006; Huang et al., 2007). However, those algorithms use second-order information (i.e. Hessian information) and thus need access to the weights of the features that are not used in the current sample, which should slow down the weight updating process for the same reason discussed earlier. It would be interesting to investigate whether those sophisticated learning scheduling algorithms can actually result in fast training in large-scale NLP tasks.
6 Conclusion
We have presented a new variant of SGD that can efficiently train L1-regularized log-linear models. The algorithm is simple and extremely easy to implement.

We have conducted experiments using CRFs and three NLP tasks, and demonstrated empirically that our training algorithm can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularization.
Acknowledgments
We thank N. Okazaki, N. Yoshinaga, D. Okanohara and the anonymous reviewers for their useful comments and suggestions. The work described in this paper has been funded by the Biotechnology and Biological Sciences Research Council (BBSRC; BB/E004431/1). The research team is hosted by the JISC/BBSRC/EPSRC-sponsored National Centre for Text Mining.
References
Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of ICML, pages 33–40.
Bob Carpenter. 2008. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, Alias-i.

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In Proceedings of COLING 2004, pages 103–110.

Trevor Cohn and Philip Blunsom. 2005. Semantic role labeling with tree conditional random fields. In Proceedings of CoNLL, pages 169–172.

Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. 2008. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. The Journal of Machine Learning Research (JMLR), 9:1775–1822.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP, pages 1–8.

Christian Darken and John Moody. 1990. Note on learning rate schedules for stochastic optimization. In Proceedings of NIPS, pages 832–838.
John Duchi and Yoram Singer. 2008. Online and batch learning using forward-looking subgradients. In NIPS Workshop: OPT 2008 Optimization for Machine Learning.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of ICML, pages 272–279.
Jenny Rose Finkel, Alex Kleeman, and Christopher D.
Manning. 2008. Efficient, feature-based, condi-
tional random field parsing. In Proceedings of ACL-
08:HLT, pages 959–967.
Jianfeng Gao, Galen Andrew, Mark Johnson, and
Kristina Toutanova. 2007. A comparative study of
parameter estimation methods for statistical natural
language processing. In Proceedings of ACL, pages
824–831.
Han-Shen Huang, Yu-Ming Chang, and Chun-Nan
Hsu. 2007. Training conditional random fields by
periodic step size adaptation for large-scale text min-
ing. In Proceedings of ICDM, pages 511–516.
Jun’ichi Kazama and Jun’ichi Tsujii. 2003. Evalua-
tion and extension of maximum entropy models with
inequality constraints. In Proceedings of EMNLP
2003.
J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Col-
lier. 2004. Introduction to the bio-entity recognition
task at JNLPBA. In Proceedings of the International
Joint Workshop on Natural Language Processing in
Biomedicine and its Applications (JNLPBA), pages
70–75.
John Lafferty, Andrew McCallum, and Fernando
Pereira. 2001. Conditional random fields: Prob-
abilistic models for segmenting and labeling se-
quence data. In Proceedings of ICML, pages 282–
289.
John Langford, Lihong Li, and Tong Zhang. 2009.
Sparse online learning via truncated gradient. The
Journal of Machine Learning Research (JMLR),
10:777–801.
Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y.
Ng. 2006. Efficient l1 regularized logistic regres-
sion. In Proceedings of AAAI-06, pages 401–408.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Jorge Nocedal. 1980. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782.
Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. 2006. Improving the scalability of semi-Markov conditional random fields for named entity recognition. In Proceedings of COLING/ACL, pages 465–472.
Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP 1996, pages 133–142.
Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classification. In Proceedings of ACL, pages 760–767.
David Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proceedings of EMNLP, pages 145–156.
James C. Spall. 2005. Introduction to Stochastic Search and Optimization. Wiley-IEEE.
Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical MT. In Proceedings of COLING/ACL, pages 721–728.
Kristina Toutanova, Aria Haghighi, and Christopher Manning. 2005. Joint learning improves semantic role labeling. In Proceedings of ACL, pages 589–596.
S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. 2006. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of ICML, pages 969–976.
Ben Wellner and Marc Vilain. 2006. Leveraging machine readable dictionaries in discriminative sequence models. In Proceedings of LREC 2006.