Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967,
Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics
Neural Morphological Analysis: Encoding-Decoding Canonical Segments
Katharina Kann
Center for Information and Language Processing
LMU Munich, Germany
Ryan Cotterell
Department of Computer Science
Johns Hopkins University, USA
Hinrich Schütze
Center for Information and Language Processing
LMU Munich, Germany
Abstract

Canonical morphological segmentation aims
to divide words into a sequence of stan-
dardized segments. In this work, we
propose a character-based neural encoder-
decoder model for this task. Additionally,
we extend our model to include morpheme-
level and lexical information through a neural
reranker. We set the new state of the art for
the task, improving previous results by up to
21% accuracy. Our experiments cover three
languages: English, German and Indonesian.
1 Introduction
Morphological segmentation aims to divide words
into morphemes, meaning-bearing sub-word units.
Indeed, segmentations have found use in a diverse
set of NLP applications, e.g., automatic speech
recognition (Afify et al., 2006), keyword spot-
ting (Narasimhan et al., 2014), machine transla-
tion (Clifton and Sarkar, 2011) and parsing (Seeker
and Çetinoğlu, 2015). In the literature, most research has traditionally focused on surface segmentation, whereby a word w is segmented into a sequence of substrings whose concatenation is the entire word; see Ruokolainen et al. (2016) for a survey. In contrast, we consider canonical segmentation: w is divided into a sequence of standardized
segments. To make the difference concrete, con-
sider the following example: the surface segmen-
tation of the complex English word achievability is
achiev+abil+ity, whereas its canonical segmenta-
tion is achieve+able+ity, i.e., we restore the alter-
ations made during word formation.
Canonical versions of morphological segmenta-
tion have been introduced multiple times in the lit-
erature (Kay, 1977; Naradowsky and Goldwater,
2009; Cotterell et al., 2016). Canonical segmen-
tation has several representational advantages over
surface segmentation, e.g., whether two words share
a morpheme is no longer obfuscated by orthogra-
phy. However, it also introduces a hard algorith-
mic challenge: in addition to segmenting a word,
we must reverse orthographic changes, e.g., mapping achievability → achieveableity.
Computationally, canonical segmentation can be
seen as a sequence-to-sequence problem: we must
map a word form to a canonicalized version with
segmentation boundaries. Inspired by the re-
cent success of neural encoder-decoder models
(Sutskever et al., 2014) for sequence-to-sequence
problems in NLP, we design a neural architecture
for the task. However, a naïve application of the
encoder-decoder model ignores much of the linguis-
tic structure of canonical segmentation—it cannot
directly model the individual canonical segments,
e.g., it cannot easily produce segment-level embed-
dings. To solve this, we use a neural reranker on
top of the encoder-decoder, allowing us to embed
both characters and entire segments. The combined
approach outperforms the state of the art by a wide
margin (up to 21% accuracy) in three languages: En-
glish, German and Indonesian.
2 Neural Canonical Segmentation
We begin by formally describing the canonical
segmentation task. Given a discrete alphabet Σ (e.g., the 26 letters of the English alphabet), our goal is to map a word w ∈ Σ* (e.g., w = achievability) to a canonical segmentation c ∈ Ω* (e.g., c = achieve+able+ity). We define Ω = Σ ∪ {+}, where + is a distinguished separation symbol. Additionally, we will write the segmented form as c = σ1+σ2+...+σn, where each segment σi ∈ Σ* and n is the number of canonical segments.

Figure 1: Detailed view of the attention mechanism of the neural encoder-decoder.
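To make the notation concrete, here is a small sketch (our own illustrative snippet, not part of the system described in the paper) of how a canonical segmentation decomposes under this definition:

```python
# Sketch of the representation defined above: w ∈ Σ*, c ∈ Ω* with Ω = Σ ∪ {+}.
SEP = "+"  # the distinguished separation symbol

w = "achievability"        # surface word
c = "achieve+able+ity"     # its canonical segmentation

segments = c.split(SEP)    # c = σ1 + σ2 + ... + σn
n = len(segments)          # number of canonical segments

# Each segment σi is drawn from Σ* (no separator inside a segment).
assert all(SEP not in seg for seg in segments)
```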
We take a probabilistic approach and, thus, at-
tempt to learn a distribution p(c|w). Our model
consists of two parts. First, we apply an encoder-
decoder recurrent neural network (RNN) (Bahdanau
et al., 2014) to the sequence of characters of the
input word to obtain candidate canonical segmen-
tations. Second, we define a neural reranker that
allows us to embed individual morphemes and
chooses the final answer from within a set of can-
didates generated by the encoder-decoder.
2.1 Neural Encoder-Decoder
Our encoder-decoder is based on the neural machine translation model of Bahdanau et al. (2014). The encoder is a bidirectional gated RNN (GRU) (Cho et al., 2014b). Given a word w ∈ Σ*, the input to the encoder is the sequence of characters of w, represented as one-hot vectors. The decoder defines
a conditional probability distribution over c ∈ Ω* given w:

p_ED(c | w) = ∏_{t=1}^{|c|} p(c_t | c_1, ..., c_{t−1}, w) = ∏_{t=1}^{|c|} g(c_{t−1}, s_t, a_t),
where g is a nonlinear activation function, s_t is the state of the decoder at timestep t, and a_t is a weighted sum of the |w| states of the encoder. The state of the encoder for w_i is the concatenation of the forward and backward hidden states h_i for w_i. An overview of how the attention weights and the weighted sum a_t are included in the architecture is given in Figure 1. The attention weights α_{t,i} at each timestep t are computed based on the respective encoder state and the decoder state s_t. See Bahdanau et al. (2014) for further details.
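As a rough numerical illustration of the attention computation, consider the following sketch (our own; the dimensions, random initialization and the additive scoring function are assumptions in the spirit of Bahdanau et al. (2014), not the paper's exact parameterization):

```python
import math
import random

random.seed(0)

T, d = 6, 8  # |w| encoder positions, decoder hidden size
H = [[random.gauss(0, 1) for _ in range(2 * d)] for _ in range(T)]  # states h_i
s_t = [random.gauss(0, 1) for _ in range(d)]                        # decoder state

# Additive attention score e_{t,i} = v^T tanh(W_a s_t + U_a h_i)
W_a = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
U_a = [[random.gauss(0, 1) for _ in range(2 * d)] for _ in range(d)]
v = [random.gauss(0, 1) for _ in range(d)]

def score(h_i):
    pre = [sum(W_a[k][j] * s_t[j] for j in range(d)) +
           sum(U_a[k][j] * h_i[j] for j in range(2 * d)) for k in range(d)]
    return sum(v[k] * math.tanh(pre[k]) for k in range(d))

e = [score(h_i) for h_i in H]

# Attention weights α_{t,i}: a softmax over the encoder positions
m = max(e)
exp_e = [math.exp(x - m) for x in e]
Z = sum(exp_e)
alpha = [x / Z for x in exp_e]

# Context vector a_t: the weighted sum of encoder states fed to the decoder
a_t = [sum(alpha[i] * H[i][j] for i in range(T)) for j in range(2 * d)]
```

The weights α_{t,i} sum to one over the input positions, so a_t is a convex combination of the encoder states.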
2.2 Neural Reranker
The encoder-decoder, while effective, predicts each output character sequentially. It does not use explicit representations for entire segments and is incapable of incorporating simple lexical information, e.g., does this canonical segment occur as an independent word in the lexicon? Therefore, we extend
our model with a reranker.
The reranker rescores canonical segmentations from a candidate set, which in our setting is sampled from p_ED. Let the sample set be S_w = {k^(i)}_{i=1}^N, where k^(i) ~ p_ED(c | w). We define the neural reranker as

p_θ(c | w) = exp(u^T tanh(W v_c) + τ log p_ED(c | w)) / Z_θ(w),

where v_c = Σ_{i=1}^n v_{σ_i} (recall c = σ1+σ2+...+σn) and v_{σ_i} is a one-hot morpheme embedding of σ_i with an additional binary dimension marking whether σ_i occurs independently as a word in the language.2 The partition function is Z_θ(w) and the parameters are θ = {u, W, τ}. The parameters W and u
2 To determine if a canonical segment is in the lexicon, we check its occurrence in ASPELL. Alternatively, one could ask whether it occurs in a large corpus, e.g., Wikipedia.
are projection and hidden layers, respectively, of a multi-layered perceptron, and τ can be seen as a temperature parameter that anneals the encoder-decoder model p_ED (Kirkpatrick, 1984). We define the partition function over the sample set S_w:

Z_θ(w) = Σ_{k ∈ S_w} exp(u^T tanh(W v_k) + τ log p_ED(k | w)).
The reranking model’s ability to embed mor-
phemes is important for morphological segmenta-
tion since we often have strong corpus-level signals.
The reranker also takes into account the character-
level information through the score of the encoder-
decoder model. Due to this combination we expect
stronger performance.
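The scoring and normalization described above can be sketched as follows (a toy, self-contained version with made-up dimensions and candidates; in the real model u, W and τ are learned and v_c comes from the morpheme vocabulary):

```python
import math
import random

random.seed(1)

d_h, d_v = 8, 6                 # hidden size, morpheme-vocabulary size
u = [random.gauss(0, 1) for _ in range(d_h)]
W = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(d_h)]
tau = 0.5                        # temperature on the encoder-decoder score

def score(v_c, log_p_ed):
    """u^T tanh(W v_c) + tau * log p_ED(c | w)."""
    hidden = [math.tanh(sum(W[k][j] * v_c[j] for j in range(d_v)))
              for k in range(d_h)]
    return sum(u[k] * hidden[k] for k in range(d_h)) + tau * log_p_ed

# Toy sample set S_w: a bag-of-morphemes vector v_c and log p_ED(c|w) per candidate
S_w = [
    ([1, 1, 1, 0, 0, 0], math.log(0.6)),   # e.g. achieve+able+ity
    ([0, 0, 1, 1, 1, 0], math.log(0.3)),   # e.g. achiev+abil+ity
    ([0, 1, 0, 0, 1, 1], math.log(0.1)),
]

scores = [score(v_c, lp) for v_c, lp in S_w]
Z = sum(math.exp(s) for s in scores)       # partition over the sample set only
p_rr = [math.exp(s) / Z for s in scores]   # reranker distribution p_θ(c | w)
best = max(range(len(S_w)), key=lambda i: p_rr[i])
```

Note that the normalizer is computed over the sampled candidates only, mirroring the definition of Z_θ(w) over S_w.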
3 Related Work
Various approaches to morphological segmentation
have been proposed in the literature. In the un-
supervised realm, most work has been based on
the principle of minimum description length (Cover
and Thomas, 2012), e.g., LINGUISTICA (Goldsmith,
2001; Lee and Goldsmith, 2016) or MORFESSOR
(Creutz and Lagus, 2002; Creutz et al., 2007; Poon
et al., 2009). MORFESSOR was later extended to a
semi-supervised version by Kohonen et al. (2010).
Supervised approaches have also been considered.
Most notably, Ruokolainen et al. (2013) developed
a supervised approach for morphological segmen-
tation based on conditional random fields (CRFs)
which they later extended to work also in a semi-
supervised way (Ruokolainen et al., 2014) using
letter successor variety features (Hafer and Weiss,
1974). Similarly, Cotterell et al. (2015) improved
performance with a semi-Markov CRF.
More recently, Wang et al. (2016) achieved state-
of-the-art results on surface morphological segmen-
tation using a window LSTM. Even though Wang et
al. (2016) also employ a recurrent neural network,
we distinguish our approach, in that we focus on
canonical morphological segmentation, rather than
surface morphological segmentation.
Naturally, our approach is also relevant to other
applications of recurrent neural network transduc-
tion models (Sutskever et al., 2014; Cho et al.,
2014a). In addition to machine translation (Bah-
danau et al., 2014), these models have been success-
fully applied to many areas of NLP, including pars-
ing (Vinyals et al., 2015), morphological reinflec-
tion (Kann and Schütze, 2016) and automatic speech
recognition (Graves and Schmidhuber, 2005; Graves
et al., 2013).
4 Experiments
To enable comparison to earlier work, we use a
dataset that was prepared by Cotterell et al. (2016)
for canonical segmentation.
4.1 Languages
The dataset we work on covers three languages: English, German and Indonesian. English and German are West Germanic languages, the former being an official language in nearly 60 different states and the latter being mainly spoken in Western Europe. Indonesian, or Bahasa Indonesia, is the official language of Indonesia.
Cotterell et al. (2016) report the best experimental
results for Indonesian, followed by English and fi-
nally German. The high error rate for German might
be caused by it being rich in orthographic changes. In contrast, Indonesian morphology is comparatively simple.

4.2 Corpora
The data for the English language was extracted
from segmentations derived from the CELEX
database (Baayen et al., 1993). The German data
was extracted from DerivBase (Zeller et al., 2013),
which provides a collection of derived forms to-
gether with the transformation rules, which were
used to create the canonical segmentations. Finally,
the data for Bahasa Indonesia was collected by us-
ing the output of the MORPHIND analyzer (Larasati
et al., 2011), together with an open-source corpus of
Indonesian. For each language we used the 10,000
forms that were selected at random by Cotterell et
al. (2016) from a uniform distribution over types to
form the corpus. Following them, we perform our
experiments on 5 splits of the data into 8000 train-
ing forms, 1000 development forms and 1000 test
forms and report averages.
4.3 Training
We train an ensemble of five encoder-decoder mod-
els. The encoder and decoder RNNs each have
100 hidden units. Embedding size is 300. We use
ADADELTA (Zeiler, 2012) with a minibatch size of
20. We initialize all weights (encoder, decoder, em-
beddings) to the identity matrix and the biases to
zero (Le et al., 2015). All models are trained for 20
epochs. The hyperparameter values are taken from Kann and Schütze (2016) and kept unchanged for the application to canonical segmentation described here.
To train the reranking model, we first gather the sample set S_w on the training data. We take 500 individual samples, but (as we often sample the same form multiple times) |S_w| ≈ 5. We optimize the log-likelihood of the training data using ADADELTA. For generalization, we employ L2 regularization and we perform grid search to determine the coefficient λ ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}. To decode the model, we again take 500 samples to populate S_w and select the best segmentation.
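The sample-and-rerank decoding loop can be sketched like this (our own toy version: the fixed candidate distribution stands in for full decoder runs, and the scoring function is a placeholder for the learned reranker):

```python
import random
from collections import Counter

random.seed(0)

# Stand-in for the trained encoder-decoder p_ED(c | w): in the real system each
# draw is a full character-by-character decode; here we sample from a fixed list.
candidates = ["achieve+able+ity", "achiev+abil+ity", "achieve+ability"]
weights = [0.6, 0.3, 0.1]

samples = random.choices(candidates, weights=weights, k=500)
S_w = Counter(samples)          # duplicates collapse, so |S_w| is small

def rerank_score(c):
    # Placeholder for u^T tanh(W v_c) + tau * log p_ED(c | w); here we simply
    # approximate log p_ED(c | w) by the sample frequency (illustration only).
    return S_w[c]

prediction = max(S_w, key=rerank_score)
```

Because duplicates collapse into a small candidate set, rescoring all of S_w is cheap even with 500 draws.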
Baselines. Our first baseline is the joint transduction
and segmentation model (JOINT) of Cotterell et al.
(2016). It is the current state of the art on the datasets
we use and the task of canonical segmentation in
general. This model uses a jointly trained, separate
transduction and segmentation component. Impor-
tantly, the joint model of Cotterell et al. (2016) al-
ready contains segment-level features. Thus, rerank-
ing this baseline would not provide a similar boost.
Our second baseline is a weighted finite-state
transducer (WFST) (Mohri et al., 2002) with a log-
linear parameterization (Dreyer et al., 2008), again,
taken from Cotterell et al. (2016). The WFST
baseline is particularly relevant because, like our
encoder-decoder, it formulates the problem directly
as a string-to-string transduction.
Evaluation Metrics. We follow Cotterell et al.
(2016) and use the following evaluation measures:
error rate, edit distance and morpheme F1. Error
rate is defined as 1 minus the proportion of guesses
that are completely correct. Edit distance is the Lev-
enshtein distance between guess and gold standard.
For this, guess and gold are each represented as one
string with a distinguished character denoting the
segment boundaries. Morpheme F1 compares the
Error rate
     RR          ED          JOINT       WFST        UB
en   .19 (.01)*  .25 (.01)   .27 (.02)   .63 (.01)   .06 (.01)
de   .20 (.01)*  .26 (.02)   .41 (.03)   .74 (.01)   .04 (.01)
id   .05 (.01)*  .09 (.01)   .10 (.01)   .71 (.01)   .02 (.01)

Edit distance
en   .21 (.02)*  .47 (.02)   .98 (.34)   1.35 (.01)  .10 (.02)
de   .29 (.02)*  .51 (.03)   1.01 (.07)  4.24 (.20)  .06 (.01)
id   .05 (.00)*  .12 (.01)   .15 (.02)   2.13 (.01)  .02 (.01)

F1
en   .82 (.01)*  .78 (.01)   .76 (.02)   .53 (.02)   .96 (.01)
de   .87 (.01)*  .86 (.01)   .76 (.02)   .59 (.02)   .98 (.00)
id   .96 (.01)*  .93 (.01)   .80 (.01)   .62 (.02)   .99 (.00)

Table 1: Error rate (top), edit distance (middle), F1 (bottom) for canonical segmentation. Standard deviations in parentheses. Best result on each line (excluding UB) marked with *. RR: encoder-decoder+reranker. ED: encoder-decoder. JOINT, WFST: baselines (see text). UB: upper bound, the maximum score our reranker could obtain, i.e., considering the best sample in the predictions of ED.
morphemes in guess and gold. Precision (resp. re-
call) is the proportion of morphemes in guess (resp.
gold) that occur in gold (resp. guess).
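These three measures can be implemented in a few lines. The sketch below is our own (using a standard dynamic-programming Levenshtein distance and a multiset morpheme overlap), not the evaluation code of Cotterell et al. (2016):

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def morpheme_f1(guess, gold):
    """F1 over the multisets of morphemes in guess and gold."""
    g, t = Counter(guess.split("+")), Counter(gold.split("+"))
    overlap = sum((g & t).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(g.values())   # precision
    r = overlap / sum(t.values())   # recall
    return 2 * p * r / (p + r)

guesses = ["achieve+able+ity", "achiev+abil+ity"]
golds = ["achieve+able+ity", "achieve+able+ity"]

error_rate = sum(g != t for g, t in zip(guesses, golds)) / len(golds)
distances = [edit_distance(g, t) for g, t in zip(guesses, golds)]
```

Here the guess and gold strings already contain the distinguished boundary character, so the edit distance penalizes both spelling and segmentation differences.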
5 Results
The results of the canonical segmentation experi-
ment in Table 1 show that both of our models im-
prove over all baselines. The encoder-decoder alone
has a .02 (English), .15 (German) and .01 (Indonesian) lower error rate than the best baseline. The
encoder-decoder improves most for the language for
which the baselines did worst. This suggests that, for
more complex languages, a neural network model
might be a good choice.
The reranker achieves an additional improvement
of .04 to .06 in error rate. This is likely due
to the additional information the reranker has access
to: morpheme embeddings and existing words.
Also important is the upper bound we report. It shows the maximum performance the reranker could achieve, i.e., it evaluates the best solution that appears in the set of candidate answers for the reranker. The right answer is contained in the sample set for 94% of words. Note
that, even though the upper bound goes up with the
number of samples we take, there is no guarantee
for any finite number of samples that they will con-
tain the true answer. Thus, we would need to take
an infinite number of samples to get a perfect upper
bound. However, as the current upper bound is quite
high, the encoder-decoder proves to be an appropri-
ate model for the task. Due to the large gap between
the performance of the encoder-decoder and the up-
per bound, a better reranker could further increase
performance. We will investigate ways to improve
the reranker in future work.
Error analysis. For representative examples, we give the erroneous analysis produced by our method (E) and the correct analysis (G, for gold).
We first analyze cases in which the right answer does not appear at all in the samples drawn from the encoder-decoder. Those include problems with umlauts in German (G: verflüchtigen → ver+flüchten+ig, E: verflucht+ig) and orthographic changes at morpheme boundaries (G: cutter → cut+er, E: cutter or cutt+er, sampled with similar frequency). There are also errors that are due to problems with the annotation, e.g., the following two gold segmentations are arguably incorrect: tec → detective and syrerin → syr+er+in (syr is neither a word nor an affix in German).
In other cases, the encoder-decoder does find the right solution (G), but gives a higher probability to an incorrect analysis (E). Examples are a wrong split into adjectives or nouns instead of verbs (G: fügen+sam+keit, E: füg+sam+keit),
or the other way around (G: zähler → zähl+er, E: zählen+er), cases where the wrong morphemes are chosen (G: precognition → pre+cognition, E: precognit+ion), difficult cases where letters have to be inserted (G: redolence → redolent+ence, E: re+dolence) or words the model does not split up, even though they should be (G: additive → addition+ive, E: additive).
Based on its access to lexical information and
morpheme embeddings, the reranker is able to
correct some of the errors made by the encoder-decoder. Examples are G: geschwisterpärchen → geschwisterpaar+chen, E: geschwisterpar+chen (geschwisterpaar is a word in German but geschwisterpar is not) or G: zickig → zicken+ig, E: zick+ig (with zicken, but not zick, being a German word).
Finally, we want to know if segments that appear
in the test set without being present in the training
set are a source of errors. In order to investigate
that, we split the test samples into two groups: The
first group contains the samples for which our sys-
tem finds the right answer. The second one contains
all other samples. We compare the percentage of segments that do not appear in the training data for both groups. We use the German data as an example; the results are shown in Table 2.

     wrong samples    right samples
     27.33 (.02)      36.60 (.01)

Table 2: Percentage of segments in the solutions for the test data that do not appear in the training set, split by samples that our system does or does not get right. We use the German data and average over the 5 splits. Standard deviation in parentheses.

First, it can be seen that, very roughly, about a third of all segments do not appear in the training data. This
is mainly due to unseen lemmas as their stems are
naturally unknown to the system. However, the cor-
rectly solved samples contain nearly 10% more un-
seen segments. As the average number of segments
per word for wrong and right solutions — 2.44 and
2.11, respectively — does not differ by much, it
seems unlikely that many errors are caused by un-
known segments.
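The statistic in Table 2 amounts to the following computation (with toy segment inventories of our own; the real analysis uses the German training and test splits):

```python
# Segments observed in the training set (toy inventory for illustration)
train_segments = {"achieve", "able", "ity", "cut", "er", "ver", "ig"}

# Predicted canonical segmentations on test data, flagged as right or wrong
predictions = [
    ("achieve+able+ity", True),
    ("cut+er", True),
    ("redolent+ence", False),   # e.g. an unseen stem in a wrong solution
]

def unseen_percentage(items):
    """Percentage of segments in the given solutions not seen in training."""
    segs = [s for c, _ in items for s in c.split("+")]
    return 100 * sum(s not in train_segments for s in segs) / len(segs)

pct_right = unseen_percentage([p for p in predictions if p[1]])
pct_wrong = unseen_percentage([p for p in predictions if not p[1]])
```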
6 Conclusion and Future Work
We developed a model consisting of an encoder-
decoder and a neural reranker for the task of canoni-
cal morphological segmentation. Our model com-
bines character-level information with features on
the morpheme level and external information about
words. It defines a new state of the art, improv-
ing over baseline models by up to .21 accuracy, 16
points F1 and .77 Levenshtein distance.
We found that 94% of correct segmentations
are in the sample set drawn from the encoder-
decoder model, demonstrating the upper bound on
the performance of our reranker is quite high; in fu-
ture work, we hope to develop models to exploit this.
Acknowledgments

We gratefully acknowledge the financial support of Siemens for this research.
References

Mohamed Afify, Ruhi Sarikaya, Hong-Kwang Jeff Kuo,
Laurent Besacier, and Yuqing Gao. 2006. On the use
of morphological analysis for dialectal Arabic speech
recognition. In Proc. of INTERSPEECH.
R. H. Baayen, R. Piepenbrock, and H. Van Rijn. 1993.
The CELEX lexical data base on CD-ROM.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
gio. 2014. Neural machine translation by jointly
learning to align and translate. arXiv preprint
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bah-
danau, and Yoshua Bengio. 2014a. On the proper-
ties of neural machine translation: Encoder-decoder
approaches. arXiv preprint arXiv:1409.1259.
Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre,
Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. 2014b. Learning phrase repre-
sentations using RNN encoder–decoder for statistical
machine translation. In Proc. of EMNLP.
Ann Clifton and Anoop Sarkar. 2011. Combin-
ing morpheme-based machine translation with post-
processing morpheme prediction. In Proc. of ACL.
Ryan Cotterell, Thomas Müller, Alexander Fraser, and Hinrich Schütze. 2015. Labeled morphological segmentation with semi-Markov models. In Proc. of CoNLL.
Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016.
A joint model of orthography and morphological seg-
mentation. In Proc. of NAACL.
Thomas M Cover and Joy A Thomas. 2012. Elements of
Information Theory. John Wiley & Sons.
Mathias Creutz and Krista Lagus. 2002. Unsupervised
discovery of morphemes. In Proc. of the ACL-02
Workshop on Morphological and Phonological Learning.
Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, and Andreas
Stolcke. 2007. Morph-based speech recognition
and modeling of out-of-vocabulary words across lan-
guages. ACM Transactions on Speech and Language
Processing, 5(1):3:1–3:29.
Markus Dreyer, Jason R. Smith, and Jason Eisner. 2008.
Latent-variable modeling of string transductions with
finite-state methods. In Proc. of EMNLP.
John Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computational
Linguistics, 27(2):153–198.
Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Net-
and other neural network architectures. Neural Net-
works, 18(5):602–610.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey
Hinton. 2013. Speech recognition with deep recurrent
neural networks. In Proc of. ICASSP.
Margaret A. Hafer and Stephen F. Weiss. 1974. Word
segmentation by letter successor varieties. Information Storage and Retrieval, 10(11):371–385.
Katharina Kann and Hinrich Schütze. 2016. Single-
model encoder-decoder with explicit morphological
representation for reinflection. In Proc. of ACL.
Martin Kay. 1977. Morphological and syntactic analysis.
Linguistic Structures Processing, 5:131–234.
Scott Kirkpatrick. 1984. Optimization by simulated an-
nealing: Quantitative studies. Journal of Statistical
Physics, 34(5-6):975–986.
Oskar Kohonen, Sami Virpioja, and Krista Lagus. 2010.
Semi-supervised learning of concatenative morphol-
ogy. In Proc. of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology.
Septina Dian Larasati, Vladislav Kuboň, and Daniel Ze-
man. 2011. Indonesian morphology tool (MorphInd): towards an Indonesian corpus. In Proc. of SFCM.
Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hin-
ton. 2015. A simple way to initialize recurrent
networks of rectified linear units. arXiv preprint
Jackson L. Lee and John A. Goldsmith. 2016. Linguis-
tica 5: Unsupervised learning of linguistic structure.
In Proc. of NAACL.
Mehryar Mohri, Fernando Pereira, and Michael Ri-
ley. 2002. Weighted finite-state transducers in
speech recognition. Computer Speech & Language,
Jason Naradowsky and Sharon Goldwater. 2009. Im-
proving morphology induction by learning spelling
rules. In Proc. of IJCAI.
Karthik Narasimhan, Damianos Karakos, Richard
Schwartz, Stavros Tsakalidis, and Regina Barzilay.
2014. Morphological segmentation for keyword spot-
ting. In Proc. of EMNLP.
Hoifung Poon, Colin Cherry, and Kristina Toutanova.
2009. Unsupervised morphological segmentation with
log-linear models. In Proc. of NAACL.
Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and
Mikko Kurimo. 2013. Supervised morphological seg-
mentation in a low-resource learning setting using con-
ditional random fields. In Proc. of CoNLL.
Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2014. Painless semi-supervised
morphological segmentation using conditional random
fields. In Proc. of EACL.
Teemu Ruokolainen, Oskar Kohonen, Kairit Sirts, Stig-Arne Grönroos, Mikko Kurimo, and Sami Virpioja.
2016. Comparative study of minimally supervised
morphological segmentation. Computational Linguis-
tics, 42(1):91–120.
Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-
based lattice dependency parser for joint morphologi-
cal segmentation and syntactic analysis. TACL, 3:359–
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
Sequence to sequence learning with neural networks.
In Proc. of NIPS.
Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov,
Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar
as a foreign language. In Proc. of NIPS.
Linlin Wang, Zhu Cao, Yu Xia, and Gerard de Melo.
2016. Morphological segmentation with window
LSTM neural networks. In Proc. of AAAI.
Matthew D Zeiler. 2012. Adadelta: an adaptive learning
rate method. arXiv preprint arXiv:1212.5701.
Britta Zeller, Jan Šnajder, and Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German. In Proc. of ACL.
... More recently, the task of canonical segmentation was casted as a sequence transduction problem and tackled with supervised methods: conditional random fields (Cotterell et al., 2015;Cotterell, Vieira, and Schütze, 2016;Cotterell and Schütze, 2018) and neural ED model (Kann, Cotterell, and Schütze, 2016). As in unsupervised setting, the former approaches build on the previous method for surface segmentation and both build on the supervised CRF-MORPH system of Ruokolainen et al. (2013). ...
... As a reference, we compare our results to the joint transduction and segmentation model of Cotterell, Vieira, and Schütze (2016) and the state-of-the-art neural reranker model of Kann, Cotterell, and Schütze (2016). Note, however, that the results cannot be directly compared to these two systems since both models use extra training material in the form of external dictionaries. ...
... Table 4.1: Performance on the task of canonical segmentation (Word accuracy and standard deviation averaged over 5 splits, the rounding schemes of previously published results are applied.). RR* -neural reranker model of Kann, Cotterell, and Schütze (2016). Joint* -joint transduction and segmentation model of Cotterell, Vieira, and Schütze (2016). ...
... However, on widely accepted cross-lingual benchmarks as UD, their performance on languages with 2 Following (More et al., 2019;Goldberg and Elhadad, 2013;Nivre et al., 2020;Shao et al., 2017), we use the term segmentation for the task of extracting word-units from tokens. This task is different from canonical segmentation in Kann et al. (2016), where canonical segments refer to morphemes. complex ambiguous tokens (see §4) lags behind. ...
... complex ambiguous tokens (see §4) lags behind. On top of that, recent prominent works on canonical segmentation of morphologically-complex languages (Kann et al., 2016;Qi et al., 2020;Shao et al., 2018) utilized character-level sequence to sequence frameworks, yet lacked the critical disambiguiating context of the tokens, as required by cases of extreme token-internal ambiguity. ...
... Baselines We use three kinds of baselines: (i) No-Contextualization Baslines: To examine the contribution of the pre-trained token embeddings, we test our model with non-contextualized token embeddings (initialized either by Zeros or using FastText (FT)) trained with the main task (essentially falling back on standard canonical segmentation architecture as in (Kann et al., 2016)). ...
Full-text available
Tokenizing raw texts into word units is an essential pre-processing step for critical tasks in the NLP pipeline such as tagging, parsing, named entity recognition, and more. For most languages, this tokenization step straightforward. However, for languages with high token-internal complexity, further token-to-word segmentation is required. Previous canonical segmentation studies were based on character-level frameworks, with no contextualised representation involved. Contextualized vectors a la BERT show remarkable results in many applications, but were not shown to improve performance on linguistic segmentation per se. Here we propose a novel neural segmentation model which combines the best of both worlds, contextualised token representation and char-level decoding, which is particularly effective for languages with high token-internal complexity and extreme morphological ambiguity. Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art, and leads to further improvements on downstream tasks such as Part-of-Speech Tagging, Dependency Parsing and Named-Entity Recognition, over existing pipelines. When comparing our segmentation-first pipeline with joint segmentation and labeling in the same settings, we show that, contrary to pre-neural studies, the pipeline performance is superior.
... Lemmatization can be considered as a phologylanguage igm Cell Ackerman it is that produce thout ever xample, a d in 2,263 mber, and is unlikely all forms al item. It e found in cted forms st be able y produce ave never Figure 1 e different domly seection tauent word resembles finally (3) cfp-data discussed in Section 2. This allows us to train the reinflection system in a manner reminiscent of denoising autoencoders (Vincent et al., 2008 Related Work Neural models have recently been shown to be highly competitive in many different tasks of learning supervised morphological inflection (Faruqui et al., 2016;Kann and Schütze, 2016;Makarov et al., 2017;Aharoni and Goldberg, 2017) and derivation . Most current architectures are based on encoderdecoder models (Sutskever et al., 2014), and usually contain an attention component (Bahdanau et al., 2015). ...
... Our system is an RNN Encoder-Decoder network heavily influenced by Kann and Schütze (2016). ...
... the model proposed by Kann and Schütze (2016) only with regard to minor details. The high-level intuition of the system is conveyed by Figure 2. The system takes a sequence of lemma characters and morphological features as input (for examples d, o, g, N, PL) and produces a sequence of word form characters as output (d, o, g, s). ...
Neural network approaches have been applied to computational morphology with great success, improving the performance of most tasks by a large margin and providing new perspectives for modeling. This paper starts with a brief introduction to computational morphology, followed by a review of recent work on computational morphology with neural network approaches, to provide an overview of the area. In the end, we will analyze the advantages and problems of neural network approaches to computational morphology, and point out some directions to be explored by future research and study.
... Neural models have shown to perform well on this task when large amounts of training data are available (Kann et al., 2016;Ruzsics and Samardzic, 2017). Nevertheless, datasets with morphological annotations are difficult to obtain, since they require expert annotators. ...
... Therefore, restoring morphemes to their canonical form was previously discussed in linguistics (Kay, 1977) as well as in the NLP literature. Previous approaches include unsupervised (Naradowsky and Goldwater, 2009), as well as joint models for segmentation and transduction (Cotterell et al., 2016b) and neural encoder-decoder models (Kann et al., 2016;Ruzsics and Samardzic, 2017). However, up to now, supervised models have only been explored in the high-resource setting. ...
... In recent years, the area of morphological generation has experienced substantial progress, with a variety of methods that can be used for the canonical segmentation task. Kann et al. (2016) used a sequence-to-sequence model to inflect a word given a set of morphological tags. Sharma et al. (2018a) proposed a pointer-generator model, which was more suitable for the low-resource setting. ...
... Another motivation for our experiments lies in the fact that previous research on morphological segmentation has mostly concentrated on Indo-European languages in high-resource settings (Goldsmith, 2001;Cotterell et al., 2016b), sometimes relying on external large-scale corpora in order to derive morpheme or lexical frequency information (Cotterell et al., 2015;Ruokolainen et al., 2014;Lindén et al., 2009). By contrast, work on morphological segmentation of augmented low-resource settings or truly underresourced languages is lacking in general (Kann et al., 2016). Hence demonstrations of what model architecture and training settings could be beneficial with data sets of very small size would be informative to other researchers whose work shares similar goals and ethical considerations as ours. ...
... Cotterell et al. (2016b) extended a previous semi-CRF (Cotterell et al., 2015) for surface segmentation to jointly predict morpheme boundaries and orthographic changes, leading to improved results for German and Indonesian. With the same datasets, Kann et al. (2016) adopted character-based neural sequence models coupled with a neural reranker, presenting further improvement over Cotterell et al. (2016b). There has, however, been some work on unsupervised induction of canonical segmentation (see Hammarström and Borin (2011) for a thorough review). ...
... For other languages this may be done using models for canonical segmentation as in (Kann et al., 2016). ...
Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically rich languages (MRLs) pose a challenge to this basic formulation, as the boundaries of named entities do not necessarily coincide with token boundaries, rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental questions, namely, what are the basic units to be labeled, and how can these units be detected and classified in realistic settings (i.e., where no gold morphology is available). We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich-and-ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.
... While this enables a relatively easy application to out-of-vocabulary (OOV) words, a more detailed and fine-grained notation could potentially add further benefit. For instance, canonical morphology [17] maps each detected unit to one of a standardised set. Take acquirability: its surface representation in Unisyn is <a{cquir}>abil>ity>, but canonical segments would be more consistent: <{acquire}>able>ity>, and would thus further increase the morpheme frequencies in the curve in Figure 1. ...
... For sequence-to-sequence models we interpret the process of transforming a word into its segmented form as a character-level sequence transduction problem, which has previously been shown to be effective for other languages (Wang et al., 2016; Shao, 2017; Ruzsics and Samardžić, 2017). Sequence-to-sequence models can deal with input and output sequences of differing lengths, and can therefore handle canonical segmentation, where a morpheme may not be identical to the written segment of the word to which it corresponds (Kann et al., 2016). The CRFs, on the other hand, are suitable for surface segmentation, where the morphemes are a pure segmentation of the orthography of the word. ...
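The character-level transduction framing described above can be sketched in a few lines. This is a toy illustration: the separator symbol and the example segmentation are illustrative choices, and no model is involved; a trained sequence-to-sequence model would learn to map the source character sequence to the target one.

```python
# Canonical segmentation as character-level sequence transduction:
# the source is the word's character sequence, the target is the
# character sequence of the segmented canonical form with an
# explicit separator symbol.

SEP = "|"  # illustrative separator symbol

def to_transduction_pair(word, canonical_segments):
    src = list(word)
    tgt = list(SEP.join(canonical_segments))
    return src, tgt

src, tgt = to_transduction_pair("achievability", ["achieve", "able", "ity"])
print("".join(src), "->", "".join(tgt))
# The lengths differ (13 source characters vs. 16 target characters):
# exactly the situation sequence-to-sequence models handle and
# fixed-alignment character taggers do not.
assert len(src) != len(tgt)
```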
Morphological Segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRF) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
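The CRF formulation of surface segmentation mentioned above reduces to per-character sequence labeling. One common encoding (an illustrative choice, not necessarily the scheme used in the paper) tags each character with its position inside a morpheme: B (begin), M (middle), E (end), or S (single-character morpheme). Deriving such tags from a gold surface segmentation can be sketched as:

```python
# Derive BMES position tags from a gold surface segmentation.
# A CRF would predict these tags from character features; here we
# only show the label encoding.

def bmes_tags(segments):
    tags = []
    for seg in segments:
        if len(seg) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return tags

tags = bmes_tags(["achiev", "abil", "ity"])
# One tag per character of the original word.
assert len(tags) == len("achievability")
print(list(zip("achievability", tags)))
```

Because surface segments are substrings of the word, the tag sequence always has exactly one label per character, which is what makes the CRF formulation applicable.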
Morphological reinflection is the task of generating a target form given a source form, a source tag and a target tag. We propose a new way of modeling this task with neural encoder-decoder models. Our approach reduces the amount of required training data for this architecture and achieves state-of-the-art results, making encoder-decoder models applicable to morphological reinflection even for low-resource languages. We further present a new automatic correction method for the outputs based on edit trees.
This paper introduces Linguistica 5, software for unsupervised learning of linguistic structure. It is a descendant of Goldsmith's (2001, 2006) Linguistica. Open-source and written in Python, the new Linguistica 5 is both a graphical user interface software and a Python library. While Linguistica 5 inherits its predecessors' strength in unsupervised learning of natural language morphology, it incorporates significant improvements in multiple ways. Notable new features include tools for data visualization as well as straightforward extensions for both its components and embedding in other programs.
We explore the impact of morphological segmentation on keyword spotting (KWS). Despite potential benefits, state-of-the-art KWS systems do not use morphological information. In this paper, we augment a state-of-the-art KWS system with sub-word units derived from supervised and unsupervised morphological segmentations, and compare with phonetic and syllabic segmentations. Our experiments demonstrate that morphemes improve overall performance of KWS systems. Syllabic units, however, rival the performance of morphological units when used in KWS. By combining morphological, phonetic and syllabic segmentations, we demonstrate substantial performance gains.
Morphological segmentation, which aims to break words into meaning-bearing morphemes, is an important task in natural language processing. Most previous work relies heavily on linguistic preprocessing. In this paper, we instead propose novel neural network architectures that learn the structure of input sequences directly from raw input words and are subsequently able to predict morphological boundaries. Our architectures rely on Long Short-Term Memory (LSTM) units to accomplish this, but exploit windows of characters to capture more contextual information. Experiments on multiple languages confirm the effectiveness of our models on this task.
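The character-window idea in the abstract above can be sketched as a simple preprocessing step: instead of feeding a single character per time step, each step sees a fixed-width window of surrounding characters. The window radius and padding symbol below are illustrative choices, not those of the paper.

```python
# Build a character window around each position of a word, padding
# at the edges, so a recurrent model sees local context at each step.

def char_windows(word, radius=1, pad="#"):
    padded = pad * radius + word + pad * radius
    # One window per original character, each of width 2*radius + 1.
    return [padded[i:i + 2 * radius + 1] for i in range(len(word))]

windows = char_windows("cats")
assert windows == ["#ca", "cat", "ats", "ts#"]
print(windows)
```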
Space-delimited words in Turkish and Hebrew text can be further segmented into meaningful units, but syntactic and semantic context is necessary to predict segmentation. At the same time, predicting correct syntactic structures relies on correct segmentation. We present a graph-based lattice dependency parser that operates on morphological lattices to represent different segmentations and morphological analyses for a given input sentence. The lattice parser predicts a dependency tree over a path in the lattice and thus solves the joint task of segmentation, morphological analysis, and syntactic parsing. We conduct experiments on the Turkish and the Hebrew treebank and show that the joint model outperforms three state-of-the-art pipeline systems on both data sets. Our work corroborates findings from constituency lattice parsing for Hebrew and presents the first results for full lattice parsing on Turkish.
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
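The source-reversal trick described in the abstract above amounts to a one-line preprocessing step, sketched here with toy token lists and no model involved:

```python
# Reverse the source sequence before encoding; leave the target
# sequence untouched. This shortens the distance between the first
# source token and the first target token the decoder must emit.

def prepare_pair(source_tokens, target_tokens):
    return list(reversed(source_tokens)), target_tokens

src, tgt = prepare_pair(["je", "suis", "etudiant"],
                        ["i", "am", "a", "student"])
print(src, tgt)
# "je" now sits adjacent to the decoder's first prediction "i",
# creating the short-term dependencies the authors credit for
# making optimization easier.
assert src == ["etudiant", "suis", "je"]
```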
We present labeled morphological segmentation—an alternative view of morphological processing that unifies several tasks. We introduce a new hierarchy of morphotactic tagsets and CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. For morphological segmentation our method shows absolute improvements of 2-6 points F1 over a strong baseline.
This article presents a comparative study of a subfield of morphology learning referred to as minimally supervised morphological segmentation. In morphological segmentation, word forms are segmented into morphs, the surface forms of morphemes. In the minimally supervised data-driven learning setting, segmentation models are learned from a small number of manually annotated word forms and a large set of unannotated word forms. In addition to providing a literature survey on published methods, we present an in-depth empirical comparison on three diverse model families, including a detailed error analysis. Based on the literature survey, we conclude that the existing methodology contains substantial work on generative morph lexicon-based approaches and methods based on discriminative boundary detection. As for which approach has been more successful, both the previous work and the empirical evaluation presented here strongly imply that the current state of the art is yielded by the discriminative boundary detection methodology.
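The discriminative boundary-detection view surveyed above treats segmentation as a binary decision at each internal position between adjacent characters: boundary or no boundary. Deriving such labels from a gold surface segmentation can be sketched as follows (a toy illustration; a discriminative model would predict these labels from character-context features):

```python
# Convert a gold surface segmentation into per-position binary
# boundary labels: one label for each position between characters.

def boundary_labels(segments):
    word = "".join(segments)
    cuts = set()
    pos = 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    # Position i is the gap between characters i-1 and i.
    return [1 if i in cuts else 0 for i in range(1, len(word))]

labels = boundary_labels(["achiev", "abil", "ity"])
assert sum(labels) == 2                       # two internal boundaries
assert len(labels) == len("achievability") - 1
print(labels)
```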