Supertagging with LSTMs
Ashish Vaswani1, Yonatan Bisk1, Kenji Sagae2, and Ryan Musa3
1University of Southern California, 2Kitt.ai
3University of Illinois at Urbana-Champaign
vaswani@usc.edu, ybisk@isi.edu
sagae@kitt.ai, ramusa2@illinois.edu
Abstract
In this paper we present new state-of-the-art
performance on CCG supertagging and pars-
ing. Our model outperforms existing ap-
proaches by an absolute gain of 1.5%. We an-
alyze the performance of several neural mod-
els and demonstrate that while feed-forward
architectures can compete with bidirectional
LSTMs on POS tagging, models that encode
the complete sentence are necessary for the
long range syntactic information encoded in
supertags.
1 Introduction
Morphosyntactic labels for words are commonly
used in a variety of NLP applications. For this rea-
son, part-of-speech (POS) tagging and supertagging
have drawn significant attention from the commu-
nity. Combinatory Categorial Grammar (CCG) is a lexical-
ized grammar formalism that is widely used for syn-
tactic and semantic parsing. Supertagging (Clark,
2002; Bangalore and Joshi, 2010) assigns complex
syntactic labels to words to enable fast and accurate
parsing. Correctly labeling a word with one of over 1,200 CCG labels is difficult compared to choosing one of the 45 POS labels in the Penn Treebank (Marcus et al., 1993). In
addition to the large label space of CCG supertags,
labeling a word correctly depends on knowledge of
syntactic phenomena arbitrarily far in the sentence
(Hockenmaier and Steedman, 2007). This is be-
cause supertags encode highly specific syntactic in-
formation (e.g. types and locations of arguments)
about a word’s usage in a sentence.
In this paper, we show that Bidirectional Long
Short-Term Memory recurrent neural networks (bi–
LSTMs) (Graves, 2013; Zaremba et al., 2014),
which can use information from the entire sentence,
are a natural and powerful architecture for CCG su-
pertagging. In addition to the bi–LSTM, we create
a simple yet novel model that outperforms the pre-
vious state-of-the-art RNN model that uses hand-
crafted features (Xu et al., 2015) by 1.5%. Concurrently with this work, Lewis et al. (2016) introduced a different training methodology for bi–LSTMs for supertagging. We provide a detailed analysis of the quality of various LSTM architectures (forward, backward, and bidirectional), shedding light on the ability of the bi–LSTM to exploit the rich sentential context necessary for supertagging. We also
show that a baseline feed-forward neural network
(NN) architecture significantly outperforms previ-
ous feed-forward NN baselines, with slightly fewer
features, achieving better accuracy than the RNN
model of Xu et al. (2015).
Recently, bi–LSTMs have achieved high accu-
racies in a simpler sequence labeling task: part-
of-speech tagging (Wang et al., 2015; Ling et al.,
2015) on the Penn treebank, with small improve-
ments over local models. However, we achieve accuracies competitive with those of Wang et al. (2015) using a feed-forward neural network model trained on local context, showing that this task does not require
bi–LSTMs. Our strong feed-forward NN baselines
show the power of feed-forward NNs for some tasks.
Our main contributions are the introduction of
a new bi–LSTM model for CCG supertagging that
achieves state-of-the-art results on both CCG supertagging
and parsing, and a detailed analysis of our results,
including a comparison of bi–LSTMs and simpler
feed forward NN models for supertagging and POS
tagging, which suggests that the added complexity
of bi–LSTMs may not be necessary for POS tagging,
where local contexts suffice to a much greater extent
than in supertagging.
2 Models And Training
We use feed-forward neural network models and
bidirectional LSTM (bi–LSTM) based models in
this work.
2.1 Feed-Forward
For both POS tagging and our baseline supertagging
model, we use feed-forward neural networks with
two hidden layers of rectified linear units (Nair and
Hinton, 2010). For supertagging, we use a slightly
smaller feature set than Lewis and Steedman (2014a), us-
ing a left and right 3-word window with suffix and
capitalization features for the center word. However,
unlike them, we train on the full set of supertag cat-
egories observed during training.
In POS tagging, when tagging word $w_i$, we consider only features from a window of five words, with $w_i$ at the center. For each $w_j$ with $i-2 \le j \le i+2$, we add $w_j$ lowercased and a string that encodes the basic "word shape" of $w_j$. This is computed by replacing all sequences of uppercase letters with A, all sequences of lowercase letters with a, all sequences of digits with 9, and all sequences of other characters with a fourth symbol. Finally, we add two- and three-letter suffixes and the two-letter prefix for $w_i$ only.
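As an illustration of this word-shape feature, here is a minimal Python sketch of the computation described above; the '#' used for other characters is our own stand-in, since the replacement symbol is not specified here:

```python
def word_shape(word):
    """Collapse runs of uppercase letters to 'A', lowercase letters to 'a',
    digits to '9', and any other characters to a placeholder symbol."""
    shape = []
    for ch in word:
        if ch.isupper():
            sym = "A"
        elif ch.islower():
            sym = "a"
        elif ch.isdigit():
            sym = "9"
        else:
            sym = "#"  # stand-in for the unspecified "other characters" symbol
        # Keep one symbol per run of same-class characters.
        if not shape or shape[-1] != sym:
            shape.append(sym)
    return "".join(shape)

print(word_shape("Vinken"))   # -> "Aa"
print(word_shape("IBM-360"))  # -> "A#9"
print(word_shape("won't"))    # -> "a#a"
```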
2.2 LSTM models
We experiment with two kinds of bi–LSTM models.
We train a basic bi–LSTM where the forward and backward LSTMs take input words $w_i$ and produce hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$. For each position, we produce $\tilde{h}_i$, where
$$\tilde{h}_i = \sigma\left(W_{\overrightarrow{h}}\,\overrightarrow{h}_i^{T} + W_{\overleftarrow{h}}\,\overleftarrow{h}_i^{T}\right), \qquad (1)$$
where $\sigma(x) = \max(0, x)$ is a rectifier nonlinearity, and where $W_{\overrightarrow{h}}$ and $W_{\overleftarrow{h}}$ are parameters to be learned. The unnormalized likelihood of an output supertag is computed using supertag embeddings $D_{t_i}$ and biases $b_{t_i}$ as $p(t_i \mid \tilde{h}_i) = D_{t_i}\tilde{h}_i^{T} + b_{t_i}$. The final softmax layer computes normalized supertag probabilities.
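For concreteness, the following numpy sketch computes Equation 1 and the unnormalized supertag scores for a single position; the shapes, random initialization, and stand-in LSTM states are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

hidden, n_tags = 512, 1284  # hidden size (Sec. 2.4) and number of CCG categories (Sec. 3)
rng = np.random.default_rng(0)

# Combiner and output-layer parameters (randomly initialized for illustration).
W_fwd = rng.normal(scale=0.01, size=(hidden, hidden))  # weights for the forward state
W_bwd = rng.normal(scale=0.01, size=(hidden, hidden))  # weights for the backward state
D = rng.normal(scale=0.01, size=(n_tags, hidden))      # supertag embeddings D_t
b = np.zeros(n_tags)                                   # supertag biases b_t

# Stand-ins for the forward and backward LSTM hidden states at position i.
h_fwd_i = rng.normal(size=hidden)
h_bwd_i = rng.normal(size=hidden)

# Equation 1: rectified linear combination of the two directional states.
h_tilde_i = np.maximum(0.0, W_fwd @ h_fwd_i + W_bwd @ h_bwd_i)

# Unnormalized supertag likelihoods, then the final softmax.
scores = D @ h_tilde_i + b
p_tags = softmax(scores)
print(p_tags.shape, round(p_tags.sum(), 6))  # (1284,) 1.0
```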
Although bidirectional LSTMs can capture long-distance interactions between words, each output label is predicted independently. To explicitly model supertag interactions, our next model combines two models, the bi–LSTM and an LSTM language model (LM) over the supertags (Figure 1). At position $i$, the LM accepts an input supertag $t_{i-1}$, producing hidden state $h^{LM}_i$, and a second combiner layer, parametrized by matrices $W_{LM}$ and $W_{\tilde{h}}$, transforms $\tilde{h}_i$ and $h^{LM}_i$ into $h_i$, similarly to the combiner for $\tilde{h}_i$ (Equation 1). Output supertag probabilities are computed just as before, replacing $\tilde{h}_i$ with $h_i$. We refer to this model as bi–LSTM–LM. For all our LSTM models, we only use words as input features.

Figure 1: We add a language model between supertags. (The diagram shows forward and backward LSTMs over the input words, combiner nodes, an LSTM LM over the output supertags, a second set of combiner nodes, and the output tags, for the example input "eat sushi with tuna".)
2.3 Training
We train our models to maximize the log-likelihood
of the data with minibatch gradient ascent. Gradi-
ents of the models are computed with backpropa-
gation (Chauvin and Rumelhart, 1995). Since gold
supertags are available during training time and not
while decoding, a bi–LSTM–LM trained on gold su-
pertags might not recover from errors caused by us-
ing incorrectly predicted supertags. This results in
the bi–LSTM–LM slightly underperforming the bi–
LSTM (we refer to training with gold supertags as
g–train in Table 1). To bridge this gap between training and testing, we also experiment with a scheduled sampling training regime in addition to training on gold supertags.
Scheduled sampling: Following (Bengio et al., 2015; Ranzato et al., 2015), for each output token, with some probability $p$ we use the most likely predicted supertag ($\arg\max_{t_i} P(t_i \mid h_i)$) from the model at position $i-1$ as input to the supertag LSTM LM at position $i$, and we use the gold supertag with probability $1-p$. We denote this training as ss–train–1. We also experiment with using the 5-best predicted supertags from the output distribution at position $i-1$ and feeding them to the LM as input at position $i$ as a bit vector.
Additionally, we use their probabilities (re-normalized over the 5-best tags) and scale the input supertag embeddings by their re-normalized probabilities during look-up. We refer to this setting as ss–train–5. In this work, we use an inverse sigmoid schedule to compute $p$,
$$p = \frac{k}{k + e^{s/k}},$$
where $s$ is the epoch number and $k$ is a hyperparameter that is tuned.¹ In Figure 2, we see that for the development set, training with scheduled sampling improves the perplexity of the gold supertag sequence when using predicted supertags, indicating better recovery from conditioning on erroneous supertags.

Figure 2: Scheduled sampling improves the perplexity of the gold sequence under predicted tags. We see that the perplexity of the gold supertag sequence when using predicted tags for the LM is lower for ss–train–1 and ss–train–5 than with g–train. (The plot shows development-set perplexity over 25 training epochs for g–train, ss–train–1, and ss–train–5.)

For both ss–train and g–train, we use gold supertags for the output layer and train the model to maximize the log-likelihood of the data.²
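For illustration, a short sketch of the inverse sigmoid schedule and the per-token choice made under ss–train–1; the value of k below is arbitrary, since the paper tunes it as a hyperparameter:

```python
import math
import random

def sampling_prob(epoch, k=10.0):
    """Inverse sigmoid schedule: p = k / (k + exp(epoch / k))."""
    return k / (k + math.exp(epoch / k))

def lm_input_tag(gold_prev_tag, predicted_prev_tag, epoch, k=10.0):
    """ss-train-1: with probability p feed the model's own previous prediction
    to the supertag LM, otherwise feed the gold previous supertag."""
    p = sampling_prob(epoch, k)
    return predicted_prev_tag if random.random() < p else gold_prev_tag

for epoch in (1, 5, 15, 25):
    print(epoch, round(sampling_prob(epoch), 3))
```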
2.4 Architectures
Our feed-forward models use 2048 rectifier units in the first hidden layer, 50 and 128 rectifier units in the second hidden layer for POS tagging and supertagging respectively, and 64-dimensional input embeddings. Our LSTM-based models use 512 hidden states.
We pre-train our word embeddings with a 7-gram feed-forward neural language model using the NPLM toolkit³ on a concatenation of the BLLIP corpus (Charniak et al., 2000) and WSJ sections 02–21 of the Penn Treebank.
¹ The reader should refer to Bengio et al. (2015) for details.
² We use dropout for all our feed-forward (Srivastava, 2013) and bi–LSTM based models (Zaremba et al., 2014). We carry out a grid search over dropout probabilities and sampling schedules. We train the LSTMs for 25 epochs and the feed-forward models for 30 epochs, tuning on the development data.
³ http://nlg.isi.edu/software/nplm/
Model | Supertag Acc. (All) | Seen | Novel | %P
Lewis et al. (2014) | 91.30 | – | – | –
Wenduan et al. (2015) | 93.07 | – | – | –
Feed Forward + g–train | 93.29 | 93.77 | 91.53 | 70.3
Forward LSTM + g–train | 83.70 | 85.76 | 46.22 | 20.7
Backward LSTM + g–train | 88.82 | 90.06 | 66.22 | 40.6
bi–LSTM | 94.08 | 95.03 | 76.36 | 81.1
bi–LSTM–LM + g–train | 93.89 | 94.93 | 76.83 | 96.5
bi–LSTM–LM + ss–train–1 | 94.24 | 95.22 | 76.70 | 87.8
bi–LSTM–LM + ss–train–5 | 94.23 | 95.20 | 76.62 | 94.5
Table 1: Accuracies on the development section. The language model provides a boost in performance, and large gains on the parseability of the sequence (%P). The numbers for bi–LSTM–LM + ss–train–1 and + g–train are with beam decoding. All others use greedy decoding. Interestingly, greedy decoding with ss–train–5 works as well as beam decoding with ss–train–1.
2.5 Decoding
We perform greedy decoding. For each position i,
we select the most probable supertag from the output
distribution. For the bi–LSTM–LM models trained
with g–train and ss–train–1, we feed the most likely
supertag from the output distribution as LM input
in the next position. We decode with beam search
(size 12) for bi–LSTM–LMs trained with g–train
and ss–train–1. For the bi–LSTM–LMs trained with
ss–train–5, we perform greedy decoding similar to training, feeding the k-best supertags from the output supertag distribution at position $i-1$ as input to the LM at position $i$, along with the renormalized probabilities. We don't perform beam decoding for ss–train–5, as the previous k-best inputs already capture different paths through the network.⁴
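The ss–train–5 input described above can be read as a probability-weighted embedding look-up; the sketch below reflects our reading of that procedure, with assumed array shapes and an assumed supertag-embedding dimension:

```python
import numpy as np

def kbest_lm_input(prev_tag_scores, tag_embeddings, k=5):
    """Renormalize the k-best tag probabilities from position i-1 and use them
    to scale and sum the corresponding supertag embeddings, giving the LM input
    vector for position i."""
    probs = np.exp(prev_tag_scores - prev_tag_scores.max())
    probs /= probs.sum()                    # softmax over all supertags
    top = np.argsort(-probs)[:k]            # indices of the k most likely supertags
    renorm = probs[top] / probs[top].sum()  # renormalize over the k-best
    return renorm @ tag_embeddings[top]     # weighted sum of their embeddings

rng = np.random.default_rng(0)
scores = rng.normal(size=1284)        # unnormalized supertag scores at position i-1
emb = rng.normal(size=(1284, 64))     # supertag embedding table (dimension assumed)
print(kbest_lm_input(scores, emb).shape)  # (64,)
```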
3 Data
For supertagging, experiments were run with the
standard splits of CCGbank. Unlike previous work, no features were extracted for the LSTM models, and rare categories were not thresholded. Words were lowercased and digits were replaced with @.
CCGbank's training section contains 1,284 lexical categories (394 in Dev). The distribution of categories has a long tail, with only a third of those categories having a frequency count of at least 10 (the threshold used by existing literature). Following Lewis and Steedman (2014b), we allow the model to predict all categories for a word, not just those with which the word was observed to co-occur in the training data. Accuracies on these unseen (word, cat) pairs are presented in the third column of Table 1.

⁴ Code and supertags for our models can be downloaded here: https://bitbucket.org/ashish_vaswani/lstm_supertagger

Supertag | Feed-Forward | Forward LSTM | Backward LSTM | bi–LSTM | +LM (g–train) | ss–train–1 | ss–train–5
(NP\NP)/NP | 90.00 | 88.89 | 81.91 | 92.09 | 92.18 | 91.72 | 92.31
((S\NP)\(S\NP))/NP | 75.75 | 69.53 | 61.60 | 80.38 | 78.21 | 79.91 | 78.77
S[dcl]\NP | 77.29 | 61.14 | 58.52 | 84.28 | 83.41 | 82.97 | 80.35
(S[dcl]\NP)/NP | 91.39 | 56.58 | 69.86 | 92.34 | 92.46 | 92.46 | 92.82
((S[dcl]\NP)/PP)/NP | 42.30 | 30.77 | 42.31 | 56.41 | 64.10 | 62.82 | 60.26
(S[dcl]\NP)/(S[adj]\NP) | 86.80 | 22.84 | 83.25 | 87.31 | 88.83 | 87.82 | 86.80
((S[dcl]\NP)/(S[to]\NP))/NP | 86.49 | 56.76 | 75.68 | 94.59 | 91.89 | 91.89 | 91.89
Table 2: Prediction accuracy for our models on several common and difficult supertags.

System | Architecture | Test Acc
Ling et al. (2015) | Bi-LSTM | 97.36
Wang et al. (2015) | Bi-LSTM | 97.78
Søgaard (2011) | SCNN | 97.50
This work | Feed-Forward | 97.40
Table 3: Our new POS tagging results show a strong Feed-Forward baseline can perform as well as or better than more sophisticated models (e.g. Bi-LSTMs).
4 Results
Table 3 presents our Feed-Forward POS tagging results. We achieve 97.28% on the development set and 97.4% on test. Although slightly below the state of the art, we approach existing work with bi–LSTMs, and our models are much simpler and faster to train.⁵
Table 1 shows a steady increase in performance
as the model is provided additional context. The for-
ward and backward models are presented with infor-
mation that may be arbitrarily far away in the sen-
tence, but only in a specific direction. This yields
weaker results than the Feed Forward model which
can see in both directions within a small window.
The real gains are achieved by the Bidirectional
LSTM which incorporates knowledge from the en-
tire sentence. Our addition of a language model
and changes to training further improve the performance. Our final model (bi–LSTM–LM + ss–train–1 with beam decoding) has a test accuracy of 94.5%, 1.5% above the previous state of the art.

⁵ We use train, dev, and test splits of WSJ sections 00–18, 19–21, and 22–24, respectively, for POS tagging.

Model | Dev F1 | Test F1
Wenduan et al. (2015) | 86.25 | 87.04
+ new POS Tags & C&C | 86.99 | 87.50
bi–LSTM–LM + ss–train–1 | 87.75 | 88.32
Table 4: Parsing at 100% coverage with our new Feed-Forward POS tagger and the Java implementation of C&C. We show both the published and improved results for Wenduan et al.
4.1 Parsing
Our primary goal in this paper was to demonstrate
how a bi–LSTM captures new and different in-
formation from uni-directional or feed-forward ap-
proaches. This advantage also translates to gains
in parsing. Table 4 presents new state-of-the-art parsing results for both Xu et al. (2015) and our bi–LSTM–LM + ss–train–1. These results were attained using our part-of-speech tags (Table 3) and the Java implementation (Clark et al., 2015) of the C&C parser (Clark and Curran, 2007).⁶
4.2 Error Analysis
Our analysis indicates that the information following a word is more informative than the information preceding it. Table 2 compares how well our models recover common and syntactically interesting supertags. In particular, the Forward and Backward models motivate the need for a bidirectional approach.
⁶ Results are presented on the standard development and test splits (Sections 00 and 23), and with a beam threshold of 10⁻⁶. For a fair comparison to prior work we report results without the skimmer, so no partial credit is given to parse failures. The skimmer boosts performance to 87.91/88.39 for Dev and Test.
Nearest neighbors of (S[dcl]\NP)/(S[adj]\NP):
Forward: ((S[dcl]\NP)/PP)/(S[adj]\NP); ((S[dcl]\NP)/(S[to]\NP))/(S[adj]\NP); ((S[dcl]\NP)/PP)/PP; (S[dcl]\NP)/S; (S[dcl]\NP)/(S[pss]\NP)
Backward: ((S[dcl]\NP)/PP)/(S[adj]\NP); ((S[b]\NP)\NP)/(S[adj]\NP); (S[dcl]\S[qem])/(S[adj]\NP); ((S[dcl]\NP)/(S[to]\NP))/(S[adj]\NP); ((S[dcl]\NP)/(S[adj]\NP))/(S[adj]\NP)
Bidirectional: (S[dcl]\NP)/(S[pss]\NP); ((S[dcl]\NP)/PP)/(S[adj]\NP); ((S[b]\NP)\NP)/(S[adj]\NP); ((S[dcl]\NP)/(S[to]\NP))/(S[adj]\NP); ((S[dcl]\NP)/(S[adj]\NP))/(S[adj]\NP)
Table 5: "Neighbor" categories as determined by embedding-based vector similarity for each class of model. As expected for this category, the Backward model captures the argument preference while the Forward model correctly predicts the result.
The first two rows show prepositional phrase at-
tachment decisions (noun and verb attaching cate-
gories are in rows one and two, respectively). Here
the forward model outperforms the backward model,
presumably because knowing the word to be modified and the preposition is more important than ob-
serving the object of the prepositional phrase (the
information available to the backward model).
Conversely, the backward model outperforms the
forward model in most of the remaining categories.
(Di-)transitive verbs (lines 4 & 5) require knowledge
of future arguments in the sentence (e.g. separated
by a relative clause). Because English has strict SVO
word-order, the presence of a subject is more pre-
dictable than the presence of an (in-)direct object. It
is therefore not surprising that the backward model
is often comparable to the Feed Forward model.
If the information missing from either the forward or backward model were local, the bidirectional model would perform the same as the Feed-Forward model; instead, it surpasses it, often by a large margin. This implies that long-range information is necessary for choosing a supertag.
Embeddings In addition, we can visualize the in-
formation captured by our models by investigating
a category’s nearest neighbors based on the learned
embeddings. Table 5 shows nearest neighbor cate-
gories for (S[dcl]\NP)/(S[adj]\NP) under the Forward, Backward, and Bidirectional models.
We see that the forward model clusters categories that share internal structure with the query category, but the list of arguments is nearly random. In contrast, the backward model clusters categories primarily based on the final argument, perhaps sharing similarities in the subject argument only because of the predictable SVO nature of English text. However, due to its lack of forward context, the model incorrectly associates categories with less-common first arguments (e.g. S[qem]). Finally, the bidirectional embeddings appear to cleanly capture the strengths of both the forward and backward models.
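The notion of "neighbor" categories above is based on embedding similarity; the paper does not state the exact similarity function, so the sketch below assumes cosine similarity over the learned category embeddings and uses toy data:

```python
import numpy as np

def nearest_categories(query, cat_names, cat_embeddings, n=5):
    """Return the n categories whose embeddings are most cosine-similar
    to the query category's embedding (excluding the query itself)."""
    E = cat_embeddings / np.linalg.norm(cat_embeddings, axis=1, keepdims=True)
    sims = E @ E[cat_names.index(query)]
    order = np.argsort(-sims)
    return [cat_names[i] for i in order if cat_names[i] != query][:n]

# Toy usage with random vectors; in the paper these would be the learned supertag embeddings.
rng = np.random.default_rng(0)
names = ["(S[dcl]\\NP)/(S[adj]\\NP)", "(S[dcl]\\NP)/NP", "(NP\\NP)/NP", "S[dcl]\\NP", "NP"]
emb = rng.normal(size=(len(names), 64))
print(nearest_categories("(S[dcl]\\NP)/(S[adj]\\NP)", names, emb, n=3))
```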
Consistency and Internal Structure Because supertags are highly structured, their co-occurrence in a sentence must be permitted by the combinators of CCG. Without encoding this explicitly, the language model dramatically increases the percentage of predicted sequences that result in a valid parse, by up to 15% (last column of Table 1).
Sparsity One consideration of our approach is that
we do not threshold rare categories or use any tag
dictionaries; our models are presented with the full
space of CCG categories, despite the long tail. This
did not hurt performance, and the models
learned to successfully use several categories which
were outside the set of traditionally-thresholded fre-
quent categories. Additionally, the total number of
categories used correctly at least once by the bi-
directional models was substantially higher than for the
other models (270 vs. 220 of 394), though the
large number of unused categories (120) indicates
that there is still substantial room for improvement.
5 Conclusions and Future Work
Because bi–LSTMs with a language model encode
an entire sentence at decision time, we demonstrated
large gains in supertagging and parsing. Future work
will investigate improving performance on rare cat-
egories.
Acknowledgements
This work was supported by the U.S. DARPA
LORELEI Program No. HR0011-15-C-0115. We
would like to thank Wenduan Xu for his help.
References
Srinivas Bangalore and Aravind K. Joshi. 2010. Supertagging: Using Complex Lexical Descriptions in Natural Language Processing. The MIT Press.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, and Mark Johnson. 2000. BLLIP 1987–89 WSJ Corpus Release 1. Linguistic Data Consortium, Philadelphia, 36.
Yves Chauvin and David E. Rumelhart. 1995. Backpropagation: Theory, Architectures, and Applications. Psychology Press.
Stephen Clark and James Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493–552.
Stephen Clark, Darren Foong, Luana Bulat, and Wenduan Xu. 2015. The Java Version of the C&C Parser: Version 0.95. Technical report, University of Cambridge Computer Laboratory, August.
Stephen Clark. 2002. Supertagging for combinatory categorial grammar. In Proceedings of the 6th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+6), pages 19–24.
A. Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank. Computational Linguistics, 33:355–396, September.
Mike Lewis and Mark Steedman. 2014a. A* CCG parsing with a supertag-factored model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2014).
Mike Lewis and Mark Steedman. 2014b. Improved CCG parsing with semi-supervised supertagging. Transactions of the Association for Computational Linguistics, 2:327–338.
Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG Parsing. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330, June.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, pages 807–814.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence Level Training with Recurrent Neural Networks. arXiv preprint arXiv:1511.06732.
Anders Søgaard. 2011. Semisupervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Volume 2, pages 48–52. Association for Computational Linguistics.
Nitish Srivastava. 2013. Improving neural networks with dropout. Ph.D. thesis, University of Toronto.
Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168.
Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. Volume 2: Short Papers, page 250.
W. Zaremba, I. Sutskever, and O. Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.