Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 654–663, Avignon, France, April 23–27, 2012. © 2012 Association for Computational Linguistics
Probabilistic Hierarchical Clustering of
Morphological Paradigms
Burcu Can
Department of Computer Science
University of York
Heslington, York, YO10 5GH, UK
burcucan@gmail.com
Suresh Manandhar
Department of Computer Science
University of York
Heslington, York, YO10 5GH, UK
suresh@cs.york.ac.uk
Abstract
We propose a novel method for learning
morphological paradigms that are struc-
tured within a hierarchy. The hierarchi-
cal structuring of paradigms groups mor-
phologically similar words close to each
other in a tree structure. This allows detect-
ing morphological similarities easily lead-
ing to improved morphological segmen-
tation. Our evaluation on the Morpho Challenge datasets (Kurimo et al., 2011a; Kurimo et al., 2011b) shows that our method performs competitively when compared with current state-of-the-art systems.
1 Introduction
Unsupervised morphological segmentation of a
text involves learning rules for segmenting words
into their morphemes. Morphemes are the small-
est meaning bearing units of words. The learn-
ing process is fully unsupervised, using only raw
text as input to the learning system. For example, the word respectively is split into the morphemes respect, ive, and ly. Many fields, such as machine
translation, information retrieval, speech recog-
nition etc., require morphological segmentation
since new words are always created and storing
all the word forms will require a massive dictio-
nary. The task is even more complex when morphologically complicated languages (e.g., agglutinative languages) are considered, and the sparsity problem is more severe for such languages. Applying morphological segmentation mitigates data sparsity by tackling the issue of out-of-vocabulary (OOV) words.
In this paper, we propose a paradigmatic ap-
proach. A morphological paradigm is a pair
(StemList, SuffixList) such that each concatenation of Stem+Suffix (where Stem ∈ StemList and Suffix ∈ SuffixList) is a valid word form. The
learning of morphological paradigms is not novel
as there has already been existing work in this area
such as Goldsmith (2001), Snover et al. (2002),
Monson et al. (2009), Can and Manandhar (2009)
and Dreyer and Eisner (2011). However, none of
these existing approaches address learning of the
hierarchical structure of paradigms.
Hierarchical organisation of words helps cap-
ture morphological similarities between words in
a compact structure by factoring these similarities
through stems, suffixes or prefixes. Our inference
algorithm simultaneously infers latent variables
(i.e. the morphemes) along with their hierarchical
organisation. Most hierarchical clustering algorithms are single-pass: once the hierarchical structure is built, it does not change further.
The paper is structured as follows: section 2 gives the related work, section 3 describes the probabilistic hierarchical clustering scheme, section 4 embeds the morphological segmentation model into the clustering scheme and describes the inference algorithm along with how morphological segmentation is performed, section 5 presents the experimental settings along with the evaluation scores, and finally section 6 presents a discussion and a comparison with other systems that participated in Morpho Challenge 2009 and 2010.
2 Related Work
We propose a Bayesian approach for learning of
paradigms in a hierarchy. If we ignore the hierar-
chical aspect of our learning algorithm, then our method is similar to the Dirichlet Process (DP) based model of Goldwater et al. (2006). From this perspective, our method can be understood as adding a hierarchical structure learning layer on top of the DP based learning method proposed in Goldwater et al. (2006). Dreyer and Eisner (2011) propose an infinite Dirichlet mixture model for capturing paradigms; however, they do not address learning of a hierarchy.

[Figure 1: A sample tree structure. Leaves hold the words walk, walking, talked, talks, quick, and quickly; internal nodes hold the paradigms {walk}{0, ing}, {talk}{ed, s}, {quick}{0, ly}, {walk, talk}{0, ed, ing, s}, and the root paradigm {walk, talk, quick}{0, ed, ing, ly, s}.]
The method proposed in Chan (2006) also
learns within a hierarchical structure where La-
tent Dirichlet Allocation (LDA) is used to find
stem-suffix matrices. However, their work is su-
pervised, as true morphological analyses of words
are provided to the system. In contrast, our pro-
posed method is fully unsupervised.
3 Probabilistic Hierarchical Model
The hierarchical clustering proposed in this work
is different from existing hierarchical clustering
algorithms in two aspects:
- It is not single-pass, as the hierarchical structure changes.
- It is probabilistic and is not dependent on a distance metric.
3.1 Mathematical Definition
In this paper, a hierarchical structure is a binary
tree in which each internal node represents a clus-
ter.
Let a data set be D = {x1, x2, . . . , xn} and let T be the entire tree, where each data point xi is located at one of the leaf nodes (see Figure 2). Here, Dk denotes the data points in the branch Tk. Each node defines a probabilistic model for the words that the cluster acquires.
[Figure 2: A segment of a tree with internal nodes Di, Dj, Dk having data points {x1, x2, x3, x4}. The subtree below the internal node Di is called Ti, the subtree below Dj is Tj, and the subtree below Dk is Tk.]
The probabilistic model can be denoted as p(xi | θ), where θ denotes the parameters of the probabilistic model.

The marginal probability of the data in any node can be calculated as:

$$p(D_k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta \quad (1)$$

The likelihood of the data under any subtree is defined as follows:

$$p(D_k \mid T_k) = p(D_k)\, p(D_l \mid T_l)\, p(D_r \mid T_r) \quad (2)$$

where the probability is defined in terms of the left subtree Tl and the right subtree Tr. Equation 2 provides a recursive decomposition of the likelihood in terms of the likelihoods of the left and right subtrees, until the leaf nodes are reached. We use the marginal probability (Equation 1) as prior information, since it reflects the probability of having the data from the left and right subtrees within a single cluster.
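To make the recursion concrete, here is a minimal sketch of Equation 2 in Python, computed in log space for numerical stability; the `Node` class and the precomputed `log_marginal` field (the log of Equation 1) are our own illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A node of the binary tree; leaves hold single data points."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    log_marginal: float = 0.0  # log p(D_k) from Equation 1, assumed precomputed

def log_likelihood(node: Node) -> float:
    """Recursive decomposition of Equation 2 in log space:
    log p(D_k | T_k) = log p(D_k) + log p(D_l | T_l) + log p(D_r | T_r)."""
    if node.left is None and node.right is None:
        return node.log_marginal  # recursion bottoms out at the leaves
    return (node.log_marginal
            + log_likelihood(node.left)
            + log_likelihood(node.right))
```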
4 Morphological Segmentation
In our model, data points are words to be clus-
tered and each cluster represents a paradigm. In
the hierarchical structure, words will be organised
in such a way that morphologically similar words
will be located close to each other to be grouped
in the same paradigms. Morphological similarity
refers to at least one common morpheme between
words. However, we do not make a distinction be-
tween morpheme types. Instead, we assume that
each word is organised as a stem+suffix combina-
tion.
4.1 Model Definition
Let a dataset D consist of words to be analysed, where each word wi has a latent variable, which is the split point that analyses the word into its stem si and suffix mi:

$$D = \{w_1 = s_1 + m_1, \ldots, w_n = s_n + m_n\}$$

The marginal likelihood of the words in node k is defined such that:

$$p(D_k) = p(S_k)\, p(M_k) = p(s_1, s_2, \ldots, s_n)\, p(m_1, m_2, \ldots, m_n)$$
The words in each cluster represent a paradigm that consists of stems and suffixes. The hierarchical model puts words sharing the same stems or suffixes close to each other in the tree. Each word is part of all the paradigms on the path from the leaf node containing that word to the root, and it can share either its stem or its suffix with other words in the same paradigm. Hence, a considerable number of words that may not be seen in the corpus can be generated through this approach.
We postulate that stems and suffixes are gen-
erated independently from each other. Thus, the
probability of a word becomes:
$$p(w = s + m) = p(s)\, p(m) \quad (3)$$

We define two Dirichlet processes to generate stems and suffixes independently:

$$G_s \mid \beta_s, P_s \sim \mathrm{DP}(\beta_s, P_s)$$
$$G_m \mid \beta_m, P_m \sim \mathrm{DP}(\beta_m, P_m)$$
$$s \mid G_s \sim G_s$$
$$m \mid G_m \sim G_m$$

where DP(βs, Ps) denotes a Dirichlet process that generates stems. Here, βs is the concentration parameter, which determines the number of stem types generated by the Dirichlet process. The smaller the value of the concentration parameter, the less likely the process is to generate new stem types; the larger its value, the more likely the process is to generate new stem types, yielding a more uniform distribution over stem types. Choosing βs < 1 favours sparse stems and yields a more skewed distribution. To support a small number of stem types in each cluster, we chose βs < 1.
Here, Ps is the base distribution, which we use as a prior probability distribution over morpheme lengths.
[Figure 3: The plate diagram of the model, representing the generation of a word wi from the stem si and the suffix mi, which are generated from Dirichlet processes. Solid boxes denote that the process is repeated the number of times given at the corner of each box.]
We model morpheme lengths implicitly through the morpheme letters:

$$P_s(s_i) = \prod_{c_i \in s_i} p(c_i) \quad (4)$$

where ci denotes the letters, which are distributed uniformly. Modelling morpheme letters is a way of modelling morpheme length, since shorter morphemes are favoured in order to have fewer factors in Equation 4 (Creutz and Lagus, 2005b).
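Concretely, with letters distributed uniformly over an alphabet of size |Σ|, Equation 4 reduces to Ps(s) = (1/|Σ|)^|s|, so longer morphemes are penalised exponentially. A minimal sketch, assuming an alphabet of 26 letters for English:

```python
def base_prob(morpheme: str, alphabet_size: int = 26) -> float:
    """Base distribution of Equation 4: the product of uniform letter
    probabilities, which implicitly favours shorter morphemes."""
    return (1.0 / alphabet_size) ** len(morpheme)
```

For example, base_prob("ed") = (1/26)^2, whereas base_prob("ment") = (1/26)^4.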
The Dirichlet process, DP (βm, Pm), is defined
for suffixes analogously. The graphical represen-
tation of the entire model is given in Figure 3.
Once the probability distributions G = {Gs, Gm} are drawn from both Dirichlet processes, words can be generated by drawing a stem from Gs and a suffix from Gm. However, we do not attempt to estimate the distributions G; instead, G is integrated out. The joint probability of stems is calculated by integrating out Gs:

$$p(s_1, s_2, \ldots, s_L) = \int p(G_s) \prod_{i=1}^{L} p(s_i \mid G_s)\, dG_s \quad (5)$$

where L denotes the number of stem tokens. The joint probability distribution of stems can be treated as a Chinese restaurant process, which introduces dependencies between stems. Hence, the joint probability of the stems S = {s1, . . . , sL} becomes:

$$p(s_1, s_2, \ldots, s_L) = p(s_1)\, p(s_2 \mid s_1) \cdots p(s_L \mid s_1, \ldots, s_{L-1}) = \frac{\Gamma(\beta_s)}{\Gamma(L + \beta_s)}\; \beta_s^{K} \prod_{i=1}^{K} P_s(s_i) \prod_{i=1}^{K} (n_{s_i} - 1)! \quad (6)$$

where K denotes the number of stem types. In the equation, the second and third factors correspond to the case where novel stems are generated for the first time; the last factor corresponds to the case in which a stem that has already been generated nsi times is generated again. The first factor consists of all the denominators from both cases.
The integration process is applied analogously for the probability distribution Gm over suffixes. Hence, the joint probability of the suffixes M = {m1, . . . , mN} becomes:

$$p(m_1, m_2, \ldots, m_N) = p(m_1)\, p(m_2 \mid m_1) \cdots p(m_N \mid m_1, \ldots, m_{N-1}) = \frac{\Gamma(\beta_m)}{\Gamma(N + \beta_m)}\; \beta_m^{T} \prod_{i=1}^{T} P_m(m_i) \prod_{i=1}^{T} (n_{m_i} - 1)! \quad (7)$$

where T denotes the number of suffix types and nmi is the number of instances of the suffix mi that have already been generated.
Following the joint probability distribution of stems, the conditional probability of a stem given the previously generated stems can be derived as:

$$p(s_i \mid S^{-s_i}, \beta_s, P_s) = \begin{cases} \dfrac{n_{s_i}^{S^{-s_i}}}{L - 1 + \beta_s} & \text{if } s_i \in S^{-s_i} \\[2ex] \dfrac{\beta_s P_s(s_i)}{L - 1 + \beta_s} & \text{otherwise} \end{cases} \quad (8)$$

where $n_{s_i}^{S^{-s_i}}$ denotes the number of instances of the stem si that have been previously generated, and $S^{-s_i}$ denotes the stem set excluding the new instance of si.
The conditional probability of a suffix given the other suffixes that have been previously generated is defined similarly:

$$p(m_i \mid M^{-m_i}, \beta_m, P_m) = \begin{cases} \dfrac{n_{m_i}^{M^{-m_i}}}{N - 1 + \beta_m} & \text{if } m_i \in M^{-m_i} \\[2ex] \dfrac{\beta_m P_m(m_i)}{N - 1 + \beta_m} & \text{otherwise} \end{cases} \quad (9)$$

where $n_{m_i}^{M^{-m_i}}$ is the number of instances of the suffix mi that have been generated previously, and $M^{-m_i}$ is the set of suffixes excluding the new instance of mi.
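The shared predictive rule of Equations 8 and 9 can be sketched with a simple count table; the `Counter`-based representation and the function signature below are our own illustrative choices, not the paper's code.

```python
from collections import Counter

def crp_predictive(item: str, counts: Counter, total: int,
                   beta: float, base_prob) -> float:
    """CRP predictive probability (Equations 8 and 9): a previously seen
    type is drawn in proportion to its count; a novel type in proportion
    to beta * P0(item). `total` is the number of tokens generated so far
    (L - 1 for stems, N - 1 for suffixes, in the paper's notation)."""
    if counts[item] > 0:
        return counts[item] / (total + beta)
    return beta * base_prob(item) / (total + beta)
```

For example, `crp_predictive("ed", suffix_counts, n_suffix_tokens, 0.002, base_prob)` would score the suffix ed against the suffix counts accumulated so far.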
[Figure 4: A portion of a sample tree, with leaves plugg+ed, skew+ed, exclaim+ed, borrow+s, borrow+ed, liken+s, liken+ed, consist+s, and consist+ed.]
A portion of a tree is given in Figure 4. As can be seen in the figure, all words are located at leaf nodes. Therefore, the root node of this subtree consists of the words {plugg+ed, skew+ed, exclaim+ed, borrow+s, borrow+ed, liken+s, liken+ed, consist+s, consist+ed}.
4.2 Inference
The initial tree is constructed by randomly choosing words from the corpus and adding each at a randomly chosen position in the tree. When constructing the initial tree, the latent variables are also assigned randomly, i.e. each word is split at a random position (see Algorithm 1).
We use the Metropolis-Hastings algorithm (Hastings, 1970), an instance of Markov Chain Monte Carlo (MCMC) methods, to infer the optimal hierarchical structure along with the morphological segmentation of words (given in Algorithm 2). During each iteration i, a leaf node Di = {wi = si + mi} is drawn from the current tree structure. The drawn leaf node is removed from the tree. Next, a node Dk is drawn uniformly from the tree
Algorithm 1 Creating the initial tree.
1: input: data D = {w1 = s1 + m1, . . . , wn = sn + mn}
2: initialise: root D1 where D1 = {w1 = s1 + m1}
3: initialise: c ← n − 1
4: while c >= 1 do
5:   Draw a word wj from the corpus.
6:   Split the word randomly such that wj = sj + mj
7:   Create a new node Dj where Dj = {wj = sj + mj}
8:   Choose a sibling node Dk for Dj
9:   Merge Dnew ← Dj ∪ Dk
10:  Remove wj from the corpus
11:  c ← c − 1
12: end while
13: output: Initial tree
to make it a sibling node to Di. In addition to a sibling node, a split point wi = s′i + m′i is drawn uniformly. Next, the node Di = {wi = s′i + m′i} is inserted as a sibling node to Dk. After updating all probabilities along the path to the root, the new tree structure is either accepted or rejected by applying the Metropolis-Hastings update rule. The likelihood of the data under the given tree structure is used as the sampling probability.
We use a simulated annealing schedule to update the acceptance probability PAcc:

$$P_{Acc} = \left( \frac{p_{next}(D \mid T)}{p_{cur}(D \mid T)} \right)^{\frac{1}{\gamma}} \quad (10)$$

where γ denotes the current temperature, pnext(D | T) denotes the marginal likelihood of the data under the new tree structure, and pcur(D | T) denotes the marginal likelihood of the data under the latest accepted tree structure. If pnext(D | T) > pcur(D | T), the update is accepted (see line 9, Algorithm 2); otherwise, the tree structure is still accepted with probability PAcc (see line 14, Algorithm 2). In our experiments (see section 5) we set γ to 2. The system temperature is reduced in each iteration of the Metropolis-Hastings algorithm:

$$\gamma \leftarrow \gamma - \eta \quad (11)$$
Algorithm 2 Inference algorithm
1: input: data D = {w1 = s1 + m1, . . . , wn = sn + mn}, initial tree T, initial temperature of the system γ, target temperature of the system κ, temperature decrement η
2: initialise: i ← 1, w ← wi = si + mi, pcur(D | T) ← p(D | T)
3: while γ > κ do
4:   Remove the leaf node Di that has the word wi = si + mi
5:   Draw a split point for the word such that wi = s′i + m′i
6:   Draw a sibling node Dj
7:   Dm ← Di ∪ Dj
8:   Update pnext(D | T)
9:   if pnext(D | T) >= pcur(D | T) then
10:    Accept the new tree structure
11:    pcur(D | T) ← pnext(D | T)
12:  else
13:    random ∼ Normal(0, 1)
14:    if random < (pnext(D | T) / pcur(D | T))^(1/γ) then
15:      Accept the new tree structure
16:      pcur(D | T) ← pnext(D | T)
17:    else
18:      Reject the new tree structure
19:      Re-insert the node Di at its previous position with the previous split point
20:    end if
21:  end if
22:  w ← wi+1 = si+1 + mi+1
23:  γ ← γ − η
24: end while
25: output: A tree structure where each node corresponds to a paradigm.
Most tree structures are accepted in the earlier stages of the algorithm; however, as the temperature decreases, only tree structures that lead to a considerable improvement in the marginal probability p(D | T) are accepted.
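The accept/reject step of Algorithm 2, with the annealed acceptance probability of Equation 10, can be sketched as follows. We compute the ratio in log space and draw from Uniform(0, 1), the standard Metropolis-Hastings choice, whereas line 13 of the paper's pseudocode states Normal(0, 1); the helper names are our own.

```python
import math
import random

def accept_tree(log_p_next: float, log_p_cur: float, gamma: float) -> bool:
    """Annealed Metropolis-Hastings update (Equation 10): improvements are
    accepted outright; otherwise the proposal is accepted with probability
    (p_next / p_cur)^(1/gamma), computed here from log likelihoods."""
    if log_p_next >= log_p_cur:
        return True
    log_p_acc = (log_p_next - log_p_cur) / gamma
    return random.random() < math.exp(log_p_acc)

# Cooling schedule (Equation 11), with the settings reported in section 5:
gamma, kappa, eta = 2.0, 0.01, 0.0001
while gamma > kappa:
    # ... propose a move and call accept_tree(...), then cool down
    gamma -= eta
```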
An illustration of sampling a new tree structure is given in Figures 5 and 6. Figure 5 shows that D0 will be removed from the tree in order to sample a new position on the tree, along with a new split point of the word. Once the leaf node is removed from the tree, its parent node D5 is also removed, as it would otherwise consist of only one child. Figure 6 shows that D8 is sampled to be the sibling node of D0. Subsequently, the two nodes are merged within a new cluster that introduces a new node D9.

[Figure 5: D0 will be removed from the tree.]

[Figure 6: D8 is sampled to be the sibling of D0.]
4.3 Morphological Segmentation
Once the optimal tree structure is inferred, along
with the morphological segmentation of words,
any novel word can be analysed. For the segmen-
tation of novel words, the root node is used as it
contains all stems and suffixes which are already
extracted from the training data. Morphological
segmentation is performed in two ways: segmen-
tation at a single point and segmentation at multi-
ple points.
4.3.1 Single Split Point
In order to find a single split point for the morphological segmentation of a word, the split point yielding the maximum probability given the inferred stems and suffixes is chosen as the final analysis of the word:

$$\arg\max_{j} \; p(w_i = s_j + m_j \mid D_{root}, \beta_m, P_m, \beta_s, P_s) \quad (12)$$

where Droot refers to the root of the entire tree. Here, the probability of a segmentation of a given word given Droot is calculated as:

$$p(w_i = s_j + m_j \mid D_{root}, \beta_m, P_m, \beta_s, P_s) = p(s_j \mid S_{root}, \beta_s, P_s)\, p(m_j \mid M_{root}, \beta_m, P_m) \quad (13)$$
where Sroot denotes all the stems in Droot and Mroot denotes all the suffixes in Droot. Here p(sj | Sroot, βs, Ps) is calculated as:

$$p(s_j \mid S_{root}, \beta_s, P_s) = \begin{cases} \dfrac{n_{s_j}^{S_{root}}}{L + \beta_s} & \text{if } s_j \in S_{root} \\[2ex] \dfrac{\beta_s P_s(s_j)}{L + \beta_s} & \text{otherwise} \end{cases} \quad (14)$$

Similarly, p(mj | Mroot, βm, Pm) is calculated as:

$$p(m_j \mid M_{root}, \beta_m, P_m) = \begin{cases} \dfrac{n_{m_j}^{M_{root}}}{N + \beta_m} & \text{if } m_j \in M_{root} \\[2ex] \dfrac{\beta_m P_m(m_j)}{N + \beta_m} & \text{otherwise} \end{cases} \quad (15)$$
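Putting Equations 12–15 together, segmenting a novel word amounts to scoring every split point against the stem and suffix counts stored at the root node. Below is a sketch under the same illustrative count-table representation as above; the names and the empty-suffix handling are our own assumptions.

```python
from collections import Counter

def segment_single(word: str, stem_counts: Counter, suffix_counts: Counter,
                   L: int, N: int, beta_s: float, beta_m: float, base_prob):
    """Single split point segmentation: choose the split maximising
    Equation 13, the product of the stem probability (Equation 14)
    and the suffix probability (Equation 15) at the root node."""
    def stem_p(s):
        if stem_counts[s] > 0:
            return stem_counts[s] / (L + beta_s)
        return beta_s * base_prob(s) / (L + beta_s)

    def suffix_p(m):
        if suffix_counts[m] > 0:
            return suffix_counts[m] / (N + beta_m)
        return beta_m * base_prob(m) / (N + beta_m)

    # candidate splits; j = len(word) encodes the null suffix "0"
    splits = [(word[:j], word[j:]) for j in range(1, len(word) + 1)]
    return max(splits, key=lambda sm: stem_p(sm[0]) * suffix_p(sm[1]))
```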
4.3.2 Multiple Split Points
In order to discover words with multiple split points, we propose a hierarchical segmentation in which each segment is split further. The rules for generating multiple split points are given by the following context-free grammar:

w → s1 m1 | s2 m2   (16)
s1 → s m | s s   (17)
s2 → s   (18)
m1 → m m   (19)
m2 → s m | m m   (20)
Here, s is a pre-terminal node that generates all the stems in the root node and, similarly, m is a pre-terminal node that generates all the suffixes in the root node. First, using Equation 16, the word (e.g. housekeeper) is split into s1 m1 (e.g. housekeep+er) or s2 m2 (e.g. house+keeper). The first segment is regarded as a stem, and the second segment is either a stem or a suffix, accounting for the possibility of a compound word; Equation 12 is used to decide whether the second segment is a stem or a suffix. At the second segmentation level, each segment is split once more. If the first production rule was followed at the first segmentation level, the first segment s1 can be analysed as s m (e.g. housekeep+) or s s (e.g. house+keep) (Equation 17).

[Figure 7: An example showing how the word housekeeper can be analysed further to find more split points.]
The decision to choose which production rule to apply is made using:
$$s_1 \rightarrow \begin{cases} s\; s & \text{if } p(s \mid S, \beta_s, P_s) > p(m \mid M, \beta_m, P_m) \\ s\; m & \text{otherwise} \end{cases} \quad (21)$$

where S and M denote all the stems and suffixes in the root node.
Following the same production rule, the second segment m1 can only be analysed as m m (e.g. er+). We postulate that words cannot have more than two stems and that suffixes always follow stems; we do not allow any prefixes, circumfixes, or infixes. Therefore, the first production rule can output two different analyses: s m m m and s s m m (e.g. housekeep+er and house+keep+er).
On the other hand, if the word is analysed as s2 m2 (e.g. house+keeper), then s2 cannot be analysed further (e.g. house). The second segment m2 can be analysed further, either as s m (stem+suffix) (e.g. keep+er, keeper+) or as m m (suffix+suffix). The decision to choose which production rule to apply is made as follows:

$$m_2 \rightarrow \begin{cases} s\; m & \text{if } p(s \mid S, \beta_s, P_s) > p(m \mid M, \beta_m, P_m) \\ m\; m & \text{otherwise} \end{cases} \quad (22)$$
Thus, the second production rule yields two different analyses: s s m and s m m (e.g. house+keep+er or house+keeper).
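The two-level procedure induced by rules 16–22 can be sketched as follows; `stem_p` and `suffix_p` are assumed to score a string against the root node's stems and suffixes (Equations 14 and 15), the stem-versus-suffix comparison mirrors Equations 21 and 22, and the remaining scaffolding (helper names, empty strings encoding the null suffix) is our own illustration rather than the paper's exact procedure.

```python
def segment_multiple(word: str, stem_p, suffix_p) -> list:
    """Two-level segmentation following the grammar in Equations 16-20.
    The word is split once, then each half is split once more; every
    sub-segment is read as a stem or a suffix, whichever is more probable
    (Equations 21 and 22). Empty right parts encode the null suffix."""
    def any_p(seg):
        # a segment may act as a stem or a suffix; take the better reading
        return max(stem_p(seg), suffix_p(seg)) if seg else 1.0

    def split_once(seg, left_score, right_score):
        if not seg:
            return "", ""
        cands = [(seg[:j], seg[j:]) for j in range(1, len(seg) + 1)]
        return max(cands, key=lambda ab: left_score(ab[0]) * right_score(ab[1]))

    stem, rest = split_once(word, stem_p, any_p)   # Equation 16
    s_l, s_r = split_once(stem, stem_p, any_p)     # Equations 17/18
    m_l, m_r = split_once(rest, any_p, suffix_p)   # Equations 19/20
    return [seg for seg in (s_l, s_r, m_l, m_r) if seg]
```

For housekeeper, this sketch can yield analyses such as house+keep+er when the sub-splits score higher than the whole segments.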
5 Experiments & Results
Two sets of experiments were performed for the evaluation of the model. In the first set of experiments, each word is split at a single point, giving a single stem and a single suffix. In the second set of experiments, potentially multiple split points
!"##$%
&'(#)%
*
+,
%*-
,-.
,+/
0**
*%,
*.-
1*.
/%/
/.*
+1,
.,-
...
21/
%-,*
%-2,
%%/-
%,,.
%,2/
%0/*
%*0,
%1--
%1/.
%/0/
%+-*
3%4.-56--+
3%4/-56--+
3%4*-56--+
3%4,-56--+
3%4--56--+
3.4--56--/
3/4--56--/
3*4--56--/
3,4--56--/
-4--5 6---
%/7
,,7
8$#9'$:;<=
>'9(:<'?)?:@#?:";;A
Figure 8: Marginal likelihood convergence for datasets
of size 16K and 22K words.
are generated, by splitting each stem and suffix
once more, if it is possible to do so.
Morpho Challenge (Kurimo et al., 2011b) provides a well-established evaluation framework that additionally allows comparing our model across a range of languages. In both sets of experiments, the Morpho Challenge 2010 dataset is used (Kurimo et al., 2011b). Experiments are performed for English, where the dataset consists of 878,034 words. Although the dataset provides word frequencies, we have not used any frequency information, except that for training our model we only chose words with frequency greater than 200.
In our experiments, we used dataset sizes of 10K, 16K, and 22K words; for the final evaluation, we trained our models on 22K words. We were unable to complete experiments with larger training datasets due to memory limitations and plan to address this in future work. Once
the tree is learned by the inference algorithm, the
final tree is used for the segmentation of the entire
dataset. Several experiments are performed for
each setting where the setting varies with the tree
size and the model parameters. The model parameters are the concentration parameters β = {βs, βm} of the Dirichlet processes; the values used in the experiments are 0.1, 0.2, 0.02, 0.001, and 0.002.
In all experiments, the initial temperature of the system is set to γ = 2 and reduced to γ = 0.01 with decrements of η = 0.0001. Figure 8 shows how the log likelihoods of
trees of size 16K and 22K converge in time (where
the time axis refers to sampling iterations).
Since different training sets lead to different tree structures, each experiment is repeated three times with the same experimental setting.
Data Size P(%) R(%) F(%) βs, βm
10K 81.48 33.03 47.01 0.1, 0.1
16K 86.48 35.13 50.02 0.002, 0.002
22K 89.04 36.01 51.28 0.002, 0.002
Table 1: Highest evaluation scores of single split point
experiments obtained from the trees with 10K, 16K,
and 22K words.
Data Size P(%) R(%) F(%) βs, βm
10K 62.45 57.62 59.98 0.1, 0.1
16K 67.80 57.72 62.36 0.002, 0.002
22K 68.71 62.56 62.56 0.001, 0.001
Table 2: Evaluation scores of multiple split point ex-
periments obtained from the trees with 10K, 16K, and
22K words.
5.1 Experiments with Single Split Points
In the first set of experiments, words are split into
a single stem and suffix. During the segmentation,
Equation 12 is used to determine the split position
of each word. Evaluation scores are given in Ta-
ble 1. The highest F-measure obtained is 51.28%
with the dataset of 22K words. The scores are no-
ticeably higher with the largest training set.
5.2 Experiments with Multiple Split Points
The evaluation scores of experiments with mul-
tiple split points are given in Table 2. The high-
est F-measure obtained is 62.56% with the dataset
with 22K words. As for single split points, the
scores are noticeably higher with the largest train-
ing set.
For both single and multiple segmentation, the same inferred tree was used.
5.3 Comparison with Other Systems
For all our evaluation experiments using Morpho Challenge 2010 (English and Turkish) and Morpho Challenge 2009 (English), we used 22K words for training. For each evaluation, we randomly chose 22K words for training and ran our MCMC inference procedure to learn the model. We generated three different models by choosing three different randomly generated training sets, each consisting of 22K words, and report the best results over these three models. We report the best rather than the average results because of the small (22K-word) datasets used; use of larger datasets would have resulted in less variation and better results.
System P(%) R(%) F(%)
Allomorf (1) 68.98 56.82 62.31
Morf. Base. (2) 74.93 49.81 59.84
PM-Union (3) 55.68 62.33 58.82
Lignos (4) 83.49 45.00 58.48
Prob. Clustering (multiple) 57.08 57.58 57.33
PM-mimic (3) 53.13 59.01 55.91
MorphoNet (5) 65.08 47.82 55.13
Rali-cof (6) 68.32 46.45 55.30
CanMan (7) 58.52 44.82 50.76
(1) Virpioja et al. (2009)
(2) Creutz and Lagus (2002)
(3) Monson et al. (2009)
(4) Lignos et al. (2009)
(5) Bernhard (2009)
(6) Lavallée and Langlais (2009)
(7) Can and Manandhar (2009)
Table 3: Comparison with other unsupervised systems that participated in Morpho Challenge 2009 for English.
We compare our system with the other participating systems in Morpho Challenge 2010. Results are given in Table 6 (Virpioja et al., 2011). Since the model is evaluated on the official (hidden) Morpho Challenge 2010 evaluation dataset, for which we submitted our system to the organisers, the scores differ from the ones presented in Table 1 and Table 2.
We also report experiments with the Morpho Challenge 2009 English dataset, which consists of 384,904 words. Our results and the results of other participating systems in Morpho Challenge 2009 are given in Table 3 (Kurimo et al., 2009). It should be noted that we only present the top systems that participated in Morpho Challenge 2009; if all the systems are considered, our system comes 5th out of 16 systems.
Morphologically rich languages are not the priority of this research; nevertheless, we provide evaluation scores on Turkish. The Turkish dataset consists of 617,298 words. We chose words with frequency greater than 50 for Turkish, since the Turkish dataset is not large enough. The results for Turkish are given in Table 4. Our system comes 3rd out of 7 systems.
6 Discussion
The model can easily capture common suffixes
such as -less, -s, -ed, -ment, etc. Some sample tree
nodes obtained from trees are given in Table 6.
System P(%) R(%) F(%)
Morf. CatMAP 79.38 31.88 45.49
Aggressive Comp. 55.51 34.36 42.45
Prob. Clustering (multiple) 72.36 25.81 38.04
Iterative Comp. 68.69 21.44 32.68
Nicolas 79.02 19.78 31.64
Morf. Base. 89.68 17.78 29.67
Base Inference 72.81 16.11 26.38
Table 4: Comparison with other unsupervised systems
that participated in Morpho Challenge 2010 for Turk-
ish.
regard+less, base+less, shame+less, bound+less, harm+less, regard+ed, relent+less
solve+d, high+-priced, lower+s, lower+-level, high+-level, lower+-income, histor+ians
pre+mise, pre+face, pre+sumed, pre+, pre+gnant
base+ment, ail+ment, over+looked, predica+ment, deploy+ment, compart+ment, embodi+ment
anti+-fraud, anti+-war, anti+-tank, anti+-nuclear, anti+-terrorism, switzer+, anti+gua, switzer+land
sharp+ened, strength+s, tight+ened, strength+ened, black+ened
inspir+e, inspir+ing, inspir+ed, inspir+es, earn+ing, ponder+ing
downgrade+s, crash+ed, crash+ing, lack+ing, blind+ing, blind+, crash+, compris+ing, compris+es, stifl+ing, compris+ed, lack+s, assist+ing, blind+ed, blind+er
Table 5: Sample tree nodes obtained from various trees.
As seen from the table, morphologically similar
words are grouped together. Morphological sim-
ilarity refers to at least one common morpheme
between words. For example, the words high-
priced and lower-level are grouped in the same
node through the word high-level which shares
the same stem with high-priced and the same end-
ing with lower-level.
As seen from the sample nodes, prefixes can also be identified, for example anti+fraud, anti+war, anti+tank, and anti+nuclear. This illustrates the flexibility of the model in capturing similarities through stems, suffixes, or prefixes. However, as mentioned above, the model does not discriminate between different types of morphological forms during training. As the prefix pre- appears at the beginning of words, it is identified as a stem; however, identifying pre- as a stem does not change the resulting morphological analysis of the word.
System P(%) R(%) F(%)
Base Inference (1) 80.77 53.76 64.55
Iterative Comp. (1) 80.27 52.76 63.67
Aggressive Comp. (1) 71.45 52.31 60.40
Nicolas (2) 67.83 53.43 59.78
Prob. Clustering (multiple) 57.08 57.58 57.33
Morf. Baseline (3) 81.39 41.70 55.14
Prob. Clustering (single) 70.76 36.51 48.17
Morf. CatMAP (4) 86.84 30.03 44.63
(1) Lignos (2010)
(2) Nicolas et al. (2010)
(3) Creutz and Lagus (2002)
(4) Creutz and Lagus (2005a)
Table 6: Comparison of our model with other unsupervised systems that participated in Morpho Challenge 2010 for English.
Sometimes similarities do not yield a valid analysis of words. For example, the prefix pre- leads the words pre+mise, pre+sumed, and pre+gnant to be analysed wrongly, whereas pre- is a valid prefix for the word pre+face. Another useful feature of the model is that compounds are easily captured through common stems: e.g. doubt+fire, bon+fire, gun+fire, clear+cut.
7 Conclusion & Future Work
In this paper, we present a novel probabilis-
tic model for unsupervised morphology learn-
ing. The model adopts a hierarchical structure
in which words are organised in a tree so that
morphologically similar words are located close
to each other.
In hierarchical clustering, tree-cutting would be very useful, but it is not addressed in the current paper: we used just the root node as a morpheme lexicon for segmentation. Clearly, adding tree-cutting would improve the accuracy of the segmentation and help identify paradigms more accurately. However, the segmentation accuracy obtained without tree-cutting is a useful indicator of whether the approach is promising, and the experimental results show that this is indeed the case.
In the current model, we did not use any syn-
tactic information, only words. POS tags can be
utilised to group words which are both morpho-
logically and syntactically similar.
References
Delphine Bernhard. 2009. Morphonet: Exploring the
use of community structure for unsupervised mor-
pheme analysis. In Working Notes for the CLEF
2009 Workshop, September.
Burcu Can and Suresh Manandhar. 2009. Cluster-
ing morphological paradigms using syntactic cate-
gories. In Working Notes for the CLEF 2009 Work-
shop, September.
Erwin Chan. 2006. Learning probabilistic paradigms
for morphology in a latent class model. In Proceed-
ings of the Eighth Meeting of the ACL Special Inter-
est Group on Computational Phonology and Mor-
phology, SIGPHON ’06, pages 69–78, Stroudsburg,
PA, USA. Association for Computational Linguis-
tics.
Mathias Creutz and Krista Lagus. 2002. Unsu-
pervised discovery of morphemes. In Proceed-
ings of the ACL-02 workshop on Morphological
and phonological learning - Volume 6, MPL ’02,
pages 21–30, Stroudsburg, PA, USA. Association
for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2005a. Induc-
ing the morphological lexicon of a natural language
from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), pages 106–113.
Mathias Creutz and Krista Lagus. 2005b. Unsu-
pervised morpheme segmentation and morphology
induction from text corpora using Morfessor 1.0.
Technical Report A81.
Markus Dreyer and Jason Eisner. 2011. Discover-
ing morphological paradigms from plain text using
a Dirichlet process mixture model. In Proceedings
of the 2011 Conference on Empirical Methods in
Natural Language Processing, pages 616–627, Ed-
inburgh, Scotland, UK., July. Association for Com-
putational Linguistics.
John Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computational
Linguistics, 27(2):153–198.
Sharon Goldwater, Thomas L. Griffiths, and Mark
Johnson. 2006. Interpolating between types and to-
kens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, page 18.
W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.
Mikko Kurimo, Sami Virpioja, Ville T. Turunen,
Graeme W. Blackwood, and William Byrne. 2009.
Overview and results of morpho challenge 2009.
In Proceedings of the 10th cross-language eval-
uation forum conference on Multilingual infor-
mation access evaluation: text retrieval experi-
ments, CLEF’09, pages 578–597, Berlin, Heidel-
berg. Springer-Verlag.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and
Ville Turunen. 2011a. Morpho challenge
2009. http://research.ics.tkk.fi/
events/morphochallenge2009/, June.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and
Ville Turunen. 2011b. Morpho challenge
2010. http://research.ics.tkk.fi/
events/morphochallenge2010/, June.
Jean-François Lavallée and Philippe Langlais. 2009.
Morphological acquisition by formal analogy. In
Working Notes for the CLEF 2009 Workshop,
September.
Constantine Lignos, Erwin Chan, Mitchell P. Marcus,
and Charles Yang. 2009. A rule-based unsuper-
vised morphology learning framework. In Working
Notes for the CLEF 2009 Workshop, September.
Constantine Lignos. 2010. Learning from unseen
data. In Mikko Kurimo, Sami Virpioja, Ville Tu-
runen, and Krista Lagus, editors, Proceedings of the
Morpho Challenge 2010 Workshop, pages 35–38,
Aalto University, Espoo, Finland.
Christian Monson, Kristy Hollingshead, and Brian
Roark. 2009. Probabilistic paramor. In Pro-
ceedings of the 10th cross-language evaluation fo-
rum conference on Multilingual information access
evaluation: text retrieval experiments, CLEF’09,
September.
Lionel Nicolas, Jacques Farré, and Miguel A. Molinero. 2010. Unsupervised learning of concate-
native morphology based on frequency-related form
occurrence. In Mikko Kurimo, Sami Virpioja, Ville
Turunen, and Krista Lagus, editors, Proceedings of
the Morpho Challenge 2010 Workshop, pages 39–
43, Aalto University, Espoo, Finland.
Matthew G. Snover, Gaja E. Jarosz, and Michael R.
Brent. 2002. Unsupervised learning of morphol-
ogy using a novel directed search algorithm: Taking
the first step. In Proceedings of the ACL-02 Work-
shop on Morphological and Phonological Learn-
ing, pages 11–20, Morristown, NJ, USA. ACL.
Sami Virpioja, Oskar Kohonen, and Krista Lagus.
2009. Unsupervised morpheme discovery with al-
lomorfessor. In Working Notes for the CLEF 2009
Workshop, September.
Sami Virpioja, Ville T. Turunen, Sebastian Spiegler,
Oskar Kohonen, and Mikko Kurimo. 2011. Em-
pirical comparison of evaluation methods for unsu-
pervised learning of morphology. Traitement Automatique des Langues.