Content uploaded by Burcu Can
Author content
All content in this area was uploaded by Burcu Can on May 21, 2014
Content may be subject to copyright.
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 654–663,
Avignon, France, April 23 - 27 2012. c
2012 Association for Computational Linguistics
Probabilistic Hierarchical Clustering of
Morphological Paradigms
Burcu Can
Department of Computer Science
University of York
Heslington, York, YO10 5GH, UK
burcucan@gmail.com
Suresh Manandhar
Department of Computer Science
University of York
Heslington, York, YO10 5GH, UK
suresh@cs.york.ac.uk
Abstract
We propose a novel method for learning
morphological paradigms that are struc-
tured within a hierarchy. The hierarchi-
cal structuring of paradigms groups mor-
phologically similar words close to each
other in a tree structure. This allows detect-
ing morphological similarities easily lead-
ing to improved morphological segmen-
tation. Our evaluation using (Kurimo et
al., 2011a; Kurimo et al., 2011b) dataset
shows that our method performs competi-
tively when compared with current state-of-
art systems.
1 Introduction
Unsupervised morphological segmentation of a
text involves learning rules for segmenting words
into their morphemes. Morphemes are the small-
est meaning bearing units of words. The learn-
ing process is fully unsupervised, using only raw
text as input to the learning system. For example,
the word respectively is split into morphemes re-
spect,ive and ly. Many fields, such as machine
translation, information retrieval, speech recog-
nition etc., require morphological segmentation
since new words are always created and storing
all the word forms will require a massive dictio-
nary. The task is even more complex, when mor-
phologically complicated languages (i.e. agglu-
tinative languages) are considered. The sparsity
problem is more severe for more morphologically
complex languages. Applying morphological seg-
mentation mitigates data sparsity by tackling the
issue with out-of-vocabulary (OOV) words.
In this paper, we propose a paradigmatic ap-
proach. A morphological paradigm is a pair
(StemList, SuffixList) such that each concatena-
tion of Stem+Suffix (where Stem ∈StemList and
Suffix ∈SuffixList) is a valid word form. The
learning of morphological paradigms is not novel
as there has already been existing work in this area
such as Goldsmith (2001), Snover et al. (2002),
Monson et al. (2009), Can and Manandhar (2009)
and Dreyer and Eisner (2011). However, none of
these existing approaches address learning of the
hierarchical structure of paradigms.
Hierarchical organisation of words help cap-
ture morphological similarities between words in
a compact structure by factoring these similarities
through stems, suffixes or prefixes. Our inference
algorithm simultaneously infers latent variables
(i.e. the morphemes) along with their hierarchical
organisation. Most hierarchical clustering algo-
rithms are single-pass, where once the hierarchi-
cal structure is built, the structure does not change
further.
The paper is structured as follows: section 2
gives the related work, section 3 describes the
probabilistic hierarchical clustering scheme, sec-
tion 4 explains the morphological segmenta-
tion model by embedding it into the clustering
scheme and describes the inference algorithm
along with how the morphological segmentation
is performed, section 5 presents the experiment
settings along with the evaluation scores, and fi-
nally section 6 presents a discussion with a com-
parison with other systems that participated in
Morpho Challenge 2009 and 2010 .
2 Related Work
We propose a Bayesian approach for learning of
paradigms in a hierarchy. If we ignore the hierar-
chical aspect of our learning algorithm, then our
654
walk walking talked talks
{walk}{0,ing} {talk}{ed,s} {quick}{0,ly}
quick quickly
{walk, talk, quick}{0,ed,ing,ly, s}
{walk, talk}{0,ed,ing,s}
Figure 1: A sample tree structure.
method is similar to the Dirichlet Process (DP)
based model of Goldwater et al. (2006). From
this perspective, our method can be understood
as adding a hierarchical structure learning layer
on top of the DP based learning method proposed
in Goldwater et al. (2006). Dreyer and Eisner
(2011) propose an infinite Diriclet mixture model
for capturing paradigms. However, they do not
address learning of hierarchy.
The method proposed in Chan (2006) also
learns within a hierarchical structure where La-
tent Dirichlet Allocation (LDA) is used to find
stem-suffix matrices. However, their work is su-
pervised, as true morphological analyses of words
are provided to the system. In contrast, our pro-
posed method is fully unsupervised.
3 Probabilistic Hierarchical Model
The hierarchical clustering proposed in this work
is different from existing hierarchical clustering
algorithms in two aspects:
•It is not single-pass as the hierarchical struc-
ture changes.
•It is probabilistic and is not dependent on a
distance metric.
3.1 Mathematical Definition
In this paper, a hierarchical structure is a binary
tree in which each internal node represents a clus-
ter.
Let a data set be D={x1, x2, . . . , xn}and
Tbe the entire tree, where each data point xiis
located at one of the leaf nodes (see Figure 2).
Here, Dkdenotes the data points in the branch
Tk. Each node defines a probabilistic model for
words that the cluster acquires. The probabilistic
Di
Dk
Dj
X1X2X3X4
Figure 2: A segment of a tree with with internal nodes
Di, Dj, Dkhaving data points {x1, x2, x3, x4}. The
subtree below the internal node Diis called Ti, the
subtree below the internal node Djis Tj, and the sub-
tree below the internal node Dkis Tk.
model can be denoted as p(xi|θ)where θdenotes
the parameters of the probabilistic model.
The marginal probability of data in any node
can be calculated as:
p(Dk) = p(Dk|θ)p(θ|β)dθ(1)
The likelihood of data under any subtree is de-
fined as follows:
p(Dk|Tk) = p(Dk)p(Dl|Tl)p(Dr|Tr)(2)
where the probability is defined in terms of left Tl
and right Trsubtrees. Equation 2 provides a re-
cursive decomposition of the likelihood in terms
of the likelihood of the left and the right sub-
trees until the leaf nodes are reached. We use the
marginal probability (Equation 1) as prior infor-
mation since the marginal probability bears the
probability of having the data from the left and
right subtrees within a single cluster.
4 Morphological Segmentation
In our model, data points are words to be clus-
tered and each cluster represents a paradigm. In
the hierarchical structure, words will be organised
in such a way that morphologically similar words
will be located close to each other to be grouped
in the same paradigms. Morphological similarity
refers to at least one common morpheme between
words. However, we do not make a distinction be-
tween morpheme types. Instead, we assume that
each word is organised as a stem+suffix combina-
tion.
4.1 Model Definition
Let a dataset D
D
Dconsist of words to be analysed,
where each word wihas a latent variable which is
655
the split point that analyses the word into its stem
siand suffix mi:
D
D
D={w1=s1+m1, . . . , wn=sn+mn}
The marginal likelihood of words in the node k
is defined such that:
p(Dk) = p(Sk)p(Mk)
=p(s1, s2, . . . , sn)p(m1, m2, . . . , mn)
The words in each cluster represents a
paradigm that consists of stems and suffixes. The
hierarchical model puts words sharing the same
stems or suffixes close to each other in the tree.
Each word is part of all the paradigms on the
path from the leaf node having that word to the
root. The word can share either its stem or suffix
with other words in the same paradigm. Hence,
a considerable number of words can be generated
through this approach that may not be seen in the
corpus.
We postulate that stems and suffixes are gen-
erated independently from each other. Thus, the
probability of a word becomes:
p(w=s+m) = p(s)p(m)(3)
We define two Dirichlet processes to generate
stems and suffixes independently:
Gs|βs, Ps∼DP (βs, Ps)
Gm|βm, Pm∼DP (βm, Pm)
s|Gs∼Gs
m|Gm∼Gm
where DP (βs, Ps)denotes a Dirichlet process
that generates stems. Here, βsis the concentration
parameter, which determines the number of stem
types generated by the Dirichlet process. The
smaller the value of the concentration parameter,
the less likely to generate new stem types the pro-
cess is. In contrast, the larger the value of concen-
tration parameter, the more likely it is to generate
new stem types, yielding a more uniform distribu-
tion over stem types. If βs<1, sparse stems are
supported, it yields a more skewed distribution.
To support a small number of stem types in each
cluster, we chose βs<1.
Here, Psis the base distribution. We use the
base distribution as a prior probability distribu-
tion for morpheme lengths. We model morpheme
βsβm
PsPm
GsGm
simi
wi
L N
n
Figure 3: The plate diagram of the model, representing
the generation of a word wifrom the stem siand the
suffix mithat are generated from Dirichlet processes.
In the representation, solid-boxes denote that the pro-
cess is repeated with the number given on the corner
of each box.
lengths implicitly through the morpheme letters:
Ps(si) =
ci∈si
p(ci)(4)
where cidenotes the letters, which are distributed
uniformly. Modelling morpheme letters is a way
of modelling the morpheme length since shorter
morphemes are favoured in order to have fewer
factors in Equation 4 (Creutz and Lagus, 2005b).
The Dirichlet process, DP (βm, Pm), is defined
for suffixes analogously. The graphical represen-
tation of the entire model is given in Figure 3.
Once the probability distributions G=
{Gs, Gm}are drawn from both Dirichlet pro-
cesses, words can be generated by drawing a stem
from Gsand a suffix from Gm. However, we do
not attempt to estimate the probability distribu-
tions G; instead, Gis integrated out. The joint
probability of stems is calculated by integrating
out Gs:
p(s1, s2, . . . , sM)
=p(Gs)
L
i=1
p(si|Gs)dGs(5)
where Ldenotes the number of stem tokens. The
joint probability distribution of stems can be tack-
led as a Chinese restaurant process. The Chi-
nese restaurant process introduces dependencies
between stems. Hence, the joint probability of
656
stems S={s1, . . . , sL}becomes:
p(s1, s2, . . . , sL)
=p(s1)p(s2|s1). . . p(sM|s1, . . . , sM−1)
=Γ(βs)
Γ(L+βs)βK−1
s
K
i=1
Ps(si)
K
i=1
(nsi−1)!
(6)
where Kdenotes the number of stem types. In
the equation, the second and the third factor corre-
spond to the case where novel stems are generated
for the first time; the last factor corresponds to the
case in which stems that have already been gener-
ated for nsitimes previously are being generated
again. The first factor consists of all denominators
from both cases.
The integration process is applied for proba-
bility distributions Gmfor suffixes analogously.
Hence, the joint probability of suffixes M=
{m1, . . . , mN}becomes:
p(m1, m2, . . . , mN)
=p(m1)p(m2|m1). . . p(mN|m1, . . . , mN−1)
=Γ(α)
Γ(N+α)αT
T
i=1
Pm(mi)
T
i=1
(nmi−1)!
(7)
where Tdenotes the number of suffix types and
nmiis the number of stem types miwhich have
been already generated.
Following the joint probability distribution of
stems, the conditional probability of a stem given
previously generated stems can be derived as:
p(si|S−si, βs, Ps)
=
nS−si
si
L−1+βsif si∈S−si
βs∗Ps(si)
L−1+βsotherwise
(8)
where nS−si
sidenotes the number of stem in-
stances sithat have been previously generated,
where S−sidenotes the stem set excluding the
new instance of the stem si.
The conditional probability of a suffix given the
other suffixes that have been previously generated
is defined similarly:
p(mi|M−mi, βm, Pm)
=
nM−mi
mi
N−1+βmif mi∈M−mi
βm∗Pm(mi)
N−1+βmotherwise
(9)
where nM−i
k
miis the number of instances mithat
have been generated previously where M−miis
plugg+ed skew+ed
exclaim+ed
borrow+s borrow+ed
liken+s liken+ed
consist+s consist+ed
Figure 4: A portion of a sample tree.
the set of suffixes, excluding the new instance of
the suffix mi.
A portion of a tree is given in Figure 4. As
can be seen on the figure, all words are lo-
cated at leaf nodes. Therefore, the root node
of this subtree consists of words {plugg+ed,
skew+ed, exclaim+ed, borrow+s, borrow+ed,
liken+s, liken+ed, consist+s, consist+ed}.
4.2 Inference
The initial tree is constructed by randomly choos-
ing a word from the corpus and adding this into a
randomly chosen position in the tree. When con-
structing the initial tree, latent variables are also
assigned randomly, i.e. each word is split at a ran-
dom position (see Algorithm 1).
We use Metropolis Hastings algorithm (Hast-
ings, 1970), an instance of Markov Chain Monte
Carlo (MCMC) algorithms, to infer the optimal
hierarchical structure along with the morphologi-
cal segmentation of words (given in Algorithm 2).
During each iteration i, a leaf node Di={wi=
si+mi}is drawn from the current tree structure.
The drawn leaf node is removed from the tree.
Next, a node Dkis drawn uniformly from the tree
657
Algorithm 1 Creating initial tree.
1: input: data D={w1=s1+m1, . . . , wn=
sn+mn},
2: initialise: root ←D1where
D1={w1=s1+m1}
3: initialise: c←n−1
4: while c >= 1 do
5: Draw a word wjfrom the corpus.
6: Split the word randomly such that wj=
sj+mj
7: Create a new node Djwhere Dj=
{wj=sj+mj}
8: Choose a sibling node Dkfor Dj
9: Merge Dnew ←Dj⊎Dk
10: Remove wjfrom the corpus
11: c←c−1
12: end while
13: output: Initial tree
to make it a sibling node to Di. In addition to a
sibling node, a split point wi=s′
i+m′
iis drawn
uniformly. Next, the node Di={wi=s′
i+m′
i}
is inserted as a sibling node to Dk. After updating
all probabilities along the path to the root, the new
tree structure is either accepted or rejected by ap-
plying the Metropolis-Hastings update rule. The
likelihood of data under the given tree structure is
used as the sampling probability.
We use a simulated annealing schedule to up-
date PAcc:
PAcc =pnext(D|T)
pcur(D|T)1
γ(10)
where γdenotes the current temperature,
pnext(D|T)denotes the marginal likelihood
of the data under the new tree structure, and
pcur(D|T)denotes the marginal likelihood of
data under the latest accepted tree structure. If
(pnext(D|T)> pcur (D|T)) then the update is
accepted (see line 9, Algorithm 2), otherwise, the
tree structure is still accepted with a probability
of pAcc (see line 14, Algorithm 2). In our
experiments (see section 5) we set γto 2. The
system temperature is reduced in each iteration
of the Metropolis Hastings algorithm:
γ←γ−η(11)
Most tree structures are accepted in the earlier
stages of the algorithm, however, as the tempera-
Algorithm 2 Inference algorithm
1: input: data D={w1=s1+m1, . . . , wn=
sn+mn}, initial tree T, initial temperature
of the system γ, the target temperature of the
system κ, temperature decrement η
2: initialise: i←1,w←wi=si+mi,
pcur(D|T)←p(D|T)
3: while γ > κ do
4: Remove the leaf node Dithat has the
word wi=si+mi
5: Draw a split point for the word such that
wi=s′
i+m′
i
6: Draw a sibling node Dj
7: Dm←Di⊎Dj
8: Update pnext(D|T)
9: if pnext(D|T)>=pcur (D|T)then
10: Accept the new tree structure
11: pcur(D|T)←pnext (D|T)
12: else
13: random ∼N ormal(0,1)
14: if random < pnext (D|T)
pcur(D|T)1
γthen
15: Accept the new tree structure
16: pcur(D|T)←pnext (D|T)
17: else
18: Reject the new tree structure
19: Re-insert the node Diat its pre-
vious position with the previous
split point
20: end if
21: end if
22: w←wi+1 =si+1 +mi+1
23: γ←γ−η
24: end while
25: output: A tree structure where each node
corresponds to a paradigm.
ture decreases only tree structures that lead lead to
a considerable improvement in the marginal prob-
ability p(D|T)are accepted.
An illustration of sampling a new tree structure
is given in Figure 5 and 6. Figure 5 shows that
D0will be removed from the tree in order to sam-
ple a new position on the tree, along with a new
split point of the word. Once the leaf node is re-
moved from the tree, the parent node is removed
from the tree, as the parent node D5will consist
of only one child. Figure 6 shows that D8is sam-
pled to be the sibling node of D0. Subsequently,
the two nodes are merged within a new cluster that
658
D5
D1
D6
D2D3D4
D0
D7
D8
Figure 5: D0will be removed from the tree.
D9
D1
D6
D2D3D4D0
D7
D8
Figure 6: D8is sampled to be the sibling of D0.
introduces a new node D9.
4.3 Morphological Segmentation
Once the optimal tree structure is inferred, along
with the morphological segmentation of words,
any novel word can be analysed. For the segmen-
tation of novel words, the root node is used as it
contains all stems and suffixes which are already
extracted from the training data. Morphological
segmentation is performed in two ways: segmen-
tation at a single point and segmentation at multi-
ple points.
4.3.1 Single Split Point
In order to find single split point for the mor-
phological segmentation of a word, the split point
yielding the maximum probability given inferred
stems and suffixes is chosen to be the final analy-
sis of the word:
arg max
j
p(wi=sj+mj|Droot, βm, Pm, βs, Ps)
(12)
where Droot refers to the root of the entire tree.
Here, the probability of a segmentation of a
given word given Droot is calculated as given be-
low:
p(wi=sj+mj|Droot, βm, Pm, βs, Ps) =
p(sj|Sroot, βs, Ps)p(mj|Mroot, βm, Pm)
(13)
where Sroot denotes all the stems in Droot and
Mroot denotes all the suffixes in Droot. Here
p(sj|Sroot, βs, Ps)is calculated as given below:
p(si|Sroot, βs, Ps) =
nSroot
si
L+βsif si∈Sroot
βs∗Ps(si)
L+βsotherwise
(14)
Similarly, p(mj|Mroot, βm, Pm)is calculated
as:
p(mi|Mroot, βm, Pm) =
nMroot
mi
N+βmif mi∈Mroot
βm∗Pm(mi)
N+βmotherwise
(15)
4.3.2 Multiple Split Points
In order to discover words with multiple split
points, we propose a hierarchical segmentation
where each segment is split further. The rules for
generating multiple split points is given by the fol-
lowing context free grammar:
w←s1m1|s2m2(16)
s1←s m|s s (17)
s2←s(18)
m1←m m (19)
m2←s m|m m (20)
Here, s is a pre-terminal node that generates all
the stems from the root node. And similarly, m is
a pre-terminal node that generates all the suffixes
from the root node. First, using Equation 16, the
word (e.g. housekeeper) is split into s1m1(e.g.
housekeep+er) or s2m2(house+keeper). The first
segment is regarded as a stem, and the second
segment is either a stem or a suffix, consider-
ing the probability of having a compound word.
Equation 12 is used to decide whether the sec-
ond segment is a stem or a suffix. At the sec-
ond segmentation level, each segment is split once
more. If the first production rule is followed in
the first segmentation level, the first segment s1
can be analysed as s m (e.g. housekeep+∅) or s s
659
!"#$%&%%'%(
!"#$% &%%'%(
!"#$% ) &%%' %(
Figure 7: An example that depicts how the word
housekeeper can be analysed further to find more split
points.
(e.g. house+keep) (Equation 17). The decision
to choose which production rule to apply is made
using:
s1←{s s if p(s|S, βs, Ps)> p(m|M , βm, Pm)
s m otherwise (21)
where S and M denote all the stems and suffixes
in the root node.
Following the same production rule, the second
segment m1can only be analysed as m m (er+∅).
We postulate that words cannot have more than
two stems and suffixes always follow stems. We
do not allow any prefixes, circumfixes, or infixes.
Therefore, the first production rule can output two
different analyses: smmmand ssmm(e.g.
housekeep+er and house+keep+er).
On the other hand, if the word is analysed as
s2m2(e.g. house+keeper), then s2cannot be
analysed further. (e.g. house). The second seg-
ment m2can be analysed further, such that s m
(stem+suffix) (e.g. keep+er, keeper+∅) or m m
(suffix+suffix). The decision to choose which pro-
duction rule to apply is made as follows:
m2←{s m if p(s|S, βs, Ps)> p(m|M , βm, Pm)
m m otherwise (22)
Thus, the second production rule yields two
different analyses: ssmand s m m (e.g.
house+keep+er or house+keeper).
5 Experiments & Results
Two sets of experiments were performed for the
evaluation of the model. In the first set of exper-
iments, each word is split at single point giving a
single stem and a single suffix. In the second set
of experiments, potentially multiple split points
!"##$%
&'(#)%
*
+,
%*-
,-.
,+/
0**
*%,
*.-
1*.
/%/
/.*
+1,
.,-
...
21/
%-,*
%-2,
%%/-
%,,.
%,2/
%0/*
%*0,
%1--
%1/.
%/0/
%+-*
3%4.-56--+
3%4/-56--+
3%4*-56--+
3%4,-56--+
3%4--56--+
3.4--56--/
3/4--56--/
3*4--56--/
3,4--56--/
-4--5 6---
%/7
,,7
8$#9'$:;<=
>'9(:<'?)?:@#?:";;A
Figure 8: Marginal likelihood convergence for datasets
of size 16K and 22K words.
are generated, by splitting each stem and suffix
once more, if it is possible to do so.
Morpho Challenge (Kurimo et al., 2011b) pro-
vides a well established evaluation framework
that additionally allows comparing our model in
a range of languages. In both sets of experiments,
the Morpho Challenge 2010 dataset is used (Ku-
rimo et al., 2011b). Experiments are performed
for English, where the dataset consists of 878,034
words. Although the dataset provides word fre-
quencies, we have not used any frequency infor-
mation. However, for training our model, we only
chose words with frequency greater than 200.
In our experiments, we used dataset sizes of
10K, 16K, 22K words. However, for final eval-
uation, we trained our models on 22K words. We
were unable to complete the experiments with
larger training datasets due to memory limita-
tions. We plan to report this in future work. Once
the tree is learned by the inference algorithm, the
final tree is used for the segmentation of the entire
dataset. Several experiments are performed for
each setting where the setting varies with the tree
size and the model parameters. Model parameters
are the concentration parameters β={βs, βm}
of the Dirichlet processes. The concentration pa-
rameters, which are set for the experiments, are
0.1,0.2,0.02,0.001,0.002.
In all experiments, the initial temperature of the
system is assigned as γ= 2 and it is reduced to
the temperature γ= 0.01 with decrements η=
0.0001. Figure 8 shows how the log likelihoods of
trees of size 16K and 22K converge in time (where
the time axis refers to sampling iterations).
Since different training sets will lead to differ-
ent tree structures, each experiment is repeated
three times keeping the experiment setting the
same.
660
Data Size P(%) R(%) F(%) βs, βm
10K 81.48 33.03 47.01 0.1, 0.1
16K 86.48 35.13 50.02 0.002, 0.002
22K 89.04 36.01 51.28 0.002, 0.002
Table 1: Highest evaluation scores of single split point
experiments obtained from the trees with 10K, 16K,
and 22K words.
Data Size P(%) R(%) F(%) βs, βm
10K 62.45 57.62 59.98 0.1, 0.1
16K 67.80 57.72 62.36 0.002, 0.002
22K 68.71 62.56 62.56 0.001 0.001
Table 2: Evaluation scores of multiple split point ex-
periments obtained from the trees with 10K, 16K, and
22K words.
5.1 Experiments with Single Split Points
In the first set of experiments, words are split into
a single stem and suffix. During the segmentation,
Equation 12 is used to determine the split position
of each word. Evaluation scores are given in Ta-
ble 1. The highest F-measure obtained is 51.28%
with the dataset of 22K words. The scores are no-
ticeably higher with the largest training set.
5.2 Experiments with Multiple Split Points
The evaluation scores of experiments with mul-
tiple split points are given in Table 2. The high-
est F-measure obtained is 62.56% with the dataset
with 22K words. As for single split points, the
scores are noticeably higher with the largest train-
ing set.
For both, single and multiple segmentation, the
same inferred tree has been used.
5.3 Comparison with Other Systems
For all our evaluation experiments using Mor-
pho Challenge 2010 (English and Turkish) and
Morpho Challenge 2009 (English), we used 22k
words for training. For each evaluation, we ran-
domly chose 22k words for training and ran our
MCMC inference procedure to learn our model.
We generated 3 different models by choosing 3
different randomly generated training sets each
consisting of 22k words. The results are the best
results over these 3 models. We are reporting the
best results out of the 3 models due to the small
(22k word) datasets used. Use of larger datasets
would have resulted in less variation and better
results.
System P(%) R(%) F(%)
Allomorf168.98 56.82 62.31
Morf. Base.274.93 49.81 59.84
PM-Union355.68 62.33 58.82
Lignos483.49 45.00 58.48
Prob. Clustering (multiple) 57.08 57.58 57.33
PM-mimic353.13 59.01 55.91
MorphoNet565.08 47.82 55.13
Rali-cof668.32 46.45 55.30
CanMan758.52 44.82 50.76
1Virpioja et al. (2009)
2Creutz and Lagus (2002)
3Monson et al. (2009)
4Lignos et al. (2009)
5Bernhard (2009)
6Lavall´
ee and Langlais (2009)
7Can and Manandhar (2009)
Table 3: Comparison with other unsupervised systems
that participated in Morpho Challenge 2009 for En-
glish.
We compare our system with the other partici-
pant systems in Morpho Challenge 2010. Results
are given in Table 6 (Virpioja et al., 2011). Since
the model is evaluated using the official (hidden)
Morpho Challenge 2010 evaluation dataset where
we submit our system for evaluation to the organ-
isers, the scores are different from the ones that
we presented Table 1 and Table 2.
We also demonstrate experiments with Morpho
Challenge 2009 English dataset. The dataset con-
sists of 384,904 words. Our results and the re-
sults of other participant systems in Morpho Chal-
lenge 2009 are given in Table 3 (Kurimo et al.,
2009). It should be noted that we only present
the top systems that participated in Morpho Chal-
lenge 2009. If all the systems are considered, our
system comes 5th out of 16 systems.
The problem of morphologically rich lan-
guages is not our priority within this research.
Nevertheless, we provide evaluation scores on
Turkish. The Turkish dataset consists of 617,298
words. We chose words with frequency greater
than 50 for Turkish since the Turkish dataset is not
large enough. The results for Turkish are given in
Table 4. Our system comes 3rd out of 7 systems.
6 Discussion
The model can easily capture common suffixes
such as -less, -s, -ed, -ment, etc. Some sample tree
nodes obtained from trees are given in Table 6.
661
System P(%) R(%) F(%)
Morf. CatMAP 79.38 31.88 45.49
Aggressive Comp. 55.51 34.36 42.45
Prob. Clustering (multiple) 72.36 25.81 38.04
Iterative Comp. 68.69 21.44 32.68
Nicolas 79.02 19.78 31.64
Morf. Base. 89.68 17.78 29.67
Base Inference 72.81 16.11 26.38
Table 4: Comparison with other unsupervised systems
that participated in Morpho Challenge 2010 for Turk-
ish.
regard+less, base+less, shame+less, bound+less,
harm+less, regard+ed, relent+less
solve+d, high+-priced, lower+s, lower+-level,
high+-level, lower+-income, histor+ians
pre+mise, pre+face, pre+sumed, pre+, pre+gnant
base+ment, ail+ment, over+looked, predica+ment,
deploy+ment, compart+ment, embodi+ment
anti+-fraud, anti+-war, anti+-tank, anti+-nuclear,
anti+-terrorism, switzer+, anti+gua, switzer+land
sharp+ened, strength+s, tight+ened, strength+ened,
black+ened
inspir+e, inspir+ing, inspir+ed, inspir+es, earn+ing,
ponder+ing
downgrade+s, crash+ed, crash+ing, lack+ing,
blind+ing, blind+, crash+, compris+ing, com-
pris+es, stifl+ing, compris+ed, lack+s, assist+ing,
blind+ed, blind+er,
Table 5: Sample tree nodes obtained from various
trees.
As seen from the table, morphologically similar
words are grouped together. Morphological sim-
ilarity refers to at least one common morpheme
between words. For example, the words high-
priced and lower-level are grouped in the same
node through the word high-level which shares
the same stem with high-priced and the same end-
ing with lower-level.
As seen from the sample nodes, prefixes
can also be identified, for example anti+fraud,
anti+war, anti+tank, anti+nuclear. This illus-
trates the flexibility in the model by capturing the
similarities through either stems, suffixes or pre-
fixes. However, as mentioned above, the model
does not consider any discrimination between dif-
ferent types of morphological forms during train-
ing. As the prefix pre- appears at the beginning of
words, it is identified as a stem. However, identi-
fying pre- as a stem does not yield a change in the
morphological analysis of the word.
System P(%) R(%) F(%)
Base Inference180.77 53.76 64.55
Iterative Comp.180.27 52.76 63.67
Aggressive Comp.171.45 52.31 60.40
Nicolas267.83 53.43 59.78
Prob. Clustering (multiple) 57.08 57.58 57.33
Morf. Baseline381.39 41.70 55.14
Prob. Clustering (single) 70.76 36.51 48.17
Morf. CatMAP486.84 30.03 44.63
1Lignos (2010)
2Nicolas et al. (2010)
3Creutz and Lagus (2002)
4Creutz and Lagus (2005a)
Table 6: Comparison of our model with other unsuper-
vised systems that participated in Morpho Challenge
2010 for English.
Sometimes similarities may not yield a valid
analysis of words. For example, the prefix pre-
leads the words pre+mise, pre+sumed, pre+gnant
to be analysed wrongly, whereas pre- is a valid
prefix for the word pre+face. Another nice fea-
ture about the model is that compounds are easily
captured through common stems: e.g. doubt+fire,
bon+fire, gun+fire, clear+cut.
7 Conclusion & Future Work
In this paper, we present a novel probabilis-
tic model for unsupervised morphology learn-
ing. The model adopts a hierarchical structure
in which words are organised in a tree so that
morphologically similar words are located close
to each other.
In hierarchical clustering, tree-cutting would be
a very useful thing to do but it is not addressed
in the current paper. We used just the root node
as a morpheme lexicon to apply segmentation.
Clearly, adding tree cutting would improve the ac-
curacy of the segmentation and will help us iden-
tify paradigms with higher accuracy. However,
the segmentation accuracy obtained without us-
ing tree cutting provides a very useful indicator
to show whether this approach is promising. And
experimental results show that this is indeed the
case.
In the current model, we did not use any syn-
tactic information, only words. POS tags can be
utilised to group words which are both morpho-
logically and syntactically similar.
662
References
Delphine Bernhard. 2009. Morphonet: Exploring the
use of community structure for unsupervised mor-
pheme analysis. In Working Notes for the CLEF
2009 Workshop, September.
Burcu Can and Suresh Manandhar. 2009. Cluster-
ing morphological paradigms using syntactic cate-
gories. In Working Notes for the CLEF 2009 Work-
shop, September.
Erwin Chan. 2006. Learning probabilistic paradigms
for morphology in a latent class model. In Proceed-
ings of the Eighth Meeting of the ACL Special Inter-
est Group on Computational Phonology and Mor-
phology, SIGPHON ’06, pages 69–78, Stroudsburg,
PA, USA. Association for Computational Linguis-
tics.
Mathias Creutz and Krista Lagus. 2002. Unsu-
pervised discovery of morphemes. In Proceed-
ings of the ACL-02 workshop on Morphological
and phonological learning - Volume 6, MPL ’02,
pages 21–30, Stroudsburg, PA, USA. Association
for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2005a. Induc-
ing the morphological lexicon of a natural language
from unannotated text. In In Proceedings of the
International and Interdisciplinary Conference on
Adaptive Knowledge Representation and Reasoning
(AKRR 2005, pages 106–113.
Mathias Creutz and Krista Lagus. 2005b. Unsu-
pervised morpheme segmentation and morphology
induction from text corpora using morfessor 1.0.
Technical Report A81.
Markus Dreyer and Jason Eisner. 2011. Discover-
ing morphological paradigms from plain text using
a dirichlet process mixture model. In Proceedings
of the 2011 Conference on Empirical Methods in
Natural Language Processing, pages 616–627, Ed-
inburgh, Scotland, UK., July. Association for Com-
putational Linguistics.
John Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computational
Linguistics, 27(2):153–198.
Sharon Goldwater, Thomas L. Griffiths, and Mark
Johnson. 2006. Interpolating between types and to-
kens by estimating power-law generators. In In Ad-
vances in Neural Information Processing Systems
18, page 18.
W. K. Hastings. 1970. Monte carlo sampling meth-
ods using markov chains and their applications.
Biometrika, 57:97–109.
Mikko Kurimo, Sami Virpioja, Ville T. Turunen,
Graeme W. Blackwood, and William Byrne. 2009.
Overview and results of morpho challenge 2009.
In Proceedings of the 10th cross-language eval-
uation forum conference on Multilingual infor-
mation access evaluation: text retrieval experi-
ments, CLEF’09, pages 578–597, Berlin, Heidel-
berg. Springer-Verlag.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and
Ville Turunen. 2011a. Morpho challenge
2009. http://research.ics.tkk.fi/
events/morphochallenge2009/, June.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and
Ville Turunen. 2011b. Morpho challenge
2010. http://research.ics.tkk.fi/
events/morphochallenge2010/, June.
Jean Franc¸ois Lavall´
ee and Philippe Langlais. 2009.
Morphological acquisition by formal analogy. In
Working Notes for the CLEF 2009 Workshop,
September.
Constantine Lignos, Erwin Chan, Mitchell P. Marcus,
and Charles Yang. 2009. A rule-based unsuper-
vised morphology learning framework. In Working
Notes for the CLEF 2009 Workshop, September.
Constantine Lignos. 2010. Learning from unseen
data. In Mikko Kurimo, Sami Virpioja, Ville Tu-
runen, and Krista Lagus, editors, Proceedings of the
Morpho Challenge 2010 Workshop, pages 35–38,
Aalto University, Espoo, Finland.
Christian Monson, Kristy Hollingshead, and Brian
Roark. 2009. Probabilistic paramor. In Pro-
ceedings of the 10th cross-language evaluation fo-
rum conference on Multilingual information access
evaluation: text retrieval experiments, CLEF’09,
September.
Lionel Nicolas, Jacques Farr´
e, and Miguel A. Mo-
linero. 2010. Unsupervised learning of concate-
native morphology based on frequency-related form
occurrence. In Mikko Kurimo, Sami Virpioja, Ville
Turunen, and Krista Lagus, editors, Proceedings of
the Morpho Challenge 2010 Workshop, pages 39–
43, Aalto University, Espoo, Finland.
Matthew G. Snover, Gaja E. Jarosz, and Michael R.
Brent. 2002. Unsupervised learning of morphol-
ogy using a novel directed search algorithm: Taking
the first step. In Proceedings of the ACL-02 Work-
shop on Morphological and Phonological Learn-
ing, pages 11–20, Morristown, NJ, USA. ACL.
Sami Virpioja, Oskar Kohonen, and Krista Lagus.
2009. Unsupervised morpheme discovery with al-
lomorfessor. In Working Notes for the CLEF 2009
Workshop. September.
Sami Virpioja, Ville T. Turunen, Sebastian Spiegler,
Oskar Kohonen, and Mikko Kurimo. 2011. Em-
pirical comparison of evaluation methods for unsu-
pervised learning of morphology. In Traitement Au-
tomatique des Langues.
663