Maximum Entropy Model for Disambiguation of Rich Morphological Tags

Mārcis Pinnis and Kārlis Goba
Tilde,
Vienibas 75a, LV-1004 Riga, Latvia
{marcis.pinnis,karlis.goba}@tilde.lv
http://www.tilde.com
Abstract. In this work we describe a statistical morphological tagger for
Latvian, Lithuanian and Estonian based on morphological tag disambiguation.
These languages have rich tagsets and very high rates of morphological
ambiguity. We model the distribution of possible tags with an exponential
probabilistic model, which allows us to select and use features from the
surrounding context. Results show a significant improvement in error rates
over the baseline, in line with the results reported for Czech. In comparison
with the simplified parameter estimation method applied for Czech, we show
that maximum entropy weight estimation achieves considerably better results.
Keywords: tagger, maximum entropy, inflective languages, Estonian,
Latvian, Lithuanian
1 Introduction
The scope of this work covers three languages: Estonian, Latvian and
Lithuanian, all of which have rich nominal and verbal morphology. While
inflections in Estonian are formed agglutinatively, Latvian and Lithuanian
share a similar fusional morphology. All three languages exhibit high
ambiguity among the possible morphological analyses of a word. In Latvian and
Lithuanian this can be explained by their fusional nature, with several
inflections sharing the same morphemes. In Estonian, some agglutinative
morphemes are shared between several inflections, producing homonymous
surface forms.
1.1 Morphological Tagging
Morphological tagging can be viewed as a classification problem for a given
word sequence (typically a sentence), where each word is assigned a single tag
describing its morphological properties. In this work, all three languages are
processed within the same framework. The morphological analysis of a word (or,
in general, a token) is encoded in a single tag consisting of a fixed number
of subtags corresponding to certain morphological categories (e.g., part of
speech, gender, number, etc.).
As in similar work for Czech [2], we take a two-step approach to tagging:
a token is first analyzed for possible morphological tags and then
disambiguated in a separate step. The morphological analyzer is based on a
lemma lexicon and inflectional rules, and produces one or several analyses for
a given word. The tagger then disambiguates the analyses by estimating the
probability of each individual analysis and selecting the most probable one.
In this work, we used a unified morphological analyzer consisting of a
rule-based analysis module for Latvian and Lithuanian (developed by Tilde) and
a separate analysis module for Estonian [6] (developed by Filosoft).
1.2 Morphological Tagset
The notion of a tagset includes the set of valid combinations of subtags. Some
subtags are mutually independent (e.g., a noun can decline in number and case
independently), while others are valid only in certain contexts (e.g., tense
is only valid for verbs).
The morphological tagset used for all three languages is similar to the
MULTEXT-East format [7] and consists of 28 categories. Each category is
represented as a single-character subtag (see Fig. 1 for an example tag), with
‘0’ corresponding to no value. While each language uses its own subset of all
categories and their values, the category positions within the morphological
tag and their meanings remain fixed.
POS            adjective
GENDER         masculine
NUMBER         plural
CASE           nominative
DEGREE         positive
DEFINITENESS   indefinite
Fig. 1. Example of a morphological tag, a0mpnp0n000000000000000000l0, for the
Latvian word pašsaprotami (lit. self-evident).
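The positional encoding can be illustrated with a minimal Python sketch. The category positions below are not taken from a published specification: they are inferred by aligning the values listed in Fig. 1 with the characters of the example tag, and the name `CATEGORY_POSITIONS` and the function `decode_tag` are hypothetical.

```python
# 0-based positions inferred from Fig. 1; an assumption, not the full 28-category layout.
CATEGORY_POSITIONS = {
    "POS": 0,
    "GENDER": 2,
    "NUMBER": 3,
    "CASE": 4,
    "DEGREE": 5,
    "DEFINITENESS": 7,
}

def decode_tag(tag):
    """Return the subtags of the known categories that carry a value ('0' = no value)."""
    return {
        name: tag[pos]
        for name, pos in CATEGORY_POSITIONS.items()
        if pos < len(tag) and tag[pos] != "0"
    }

example = decode_tag("a0mpnp0n000000000000000000l0")
```

Because every category keeps a fixed position, the same lookup works for all three languages; only the set of values found at each position differs.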
2 Training Data
The training data (see Fig. 2 for a sample) for the morphological tagger
consists of multiple lines, where each line represents a token and a sequence
of possible tags. Sentences are separated by an empty line. The sequence of
tags is given by the morphological analyzer of the particular language. The
first tag is always the correct (manually annotated) tag. If the morphological
analyzer does not recognize a token, it returns an empty tag. We assume that
the morphological analyzer has recognized all tokens; thus the morphological
tagger does not process unknown words, and the tagging task is reduced to a
morphological disambiguation task for known tokens.
Viss p0msn0000000g0000000000000f0
bija vs0000300i0000000000100000l0 vs0000300i0000000000700000l0
galā n0msl000000000n00000000000l0 r0000p00000000000000000000l0
. t000000000000000000000000000
Fig. 2. Latvian training data excerpt (lit. all was at end).
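The file layout described above can be read with a few lines of Python. This is a sketch under two assumptions: fields are separated by whitespace, and the first tag on a line is both the gold tag and a member of the candidate list. The function name `read_sentences` is hypothetical.

```python
def read_sentences(lines):
    """Yield sentences as lists of (token, gold_tag, candidate_tags)."""
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:                      # an empty line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tags = fields[0], fields[1:]
        gold = tags[0] if tags else ""      # the first tag is the annotated one
        sentence.append((token, gold, tags))
    if sentence:                            # last sentence without a trailing empty line
        yield sentence

sample = """Viss p0msn0000000g0000000000000f0
bija vs0000300i0000000000100000l0 vs0000300i0000000000700000l0
galā n0msl000000000n00000000000l0 r0000p00000000000000000000l0
. t000000000000000000000000000

""".splitlines()
sentences = list(read_sentences(sample))
```

Tokens with a single candidate tag need no disambiguation; only lines with two or more tags contribute training events.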
We use morphologically disambiguated corpora for each of the three languages
(Estonian, Latvian and Lithuanian) to train and test the morphological tagger.
Internal corpora were used for Latvian and Lithuanian. They consist of
fiction, newspaper articles, scientific papers, business reports and letters,
government documents, legal documents, student essays and theses, IT documents
(such as manuals and web site information) and forum comments. The Latvian and
Lithuanian corpora were pre-tagged using a morphological analyzer and then
given to annotators for manual disambiguation. Due to budget limitations, each
token was disambiguated by only one annotator, which lowers the corpus quality
and introduces noise into the corpora.
For the Estonian tagger, a freely available morphologically disambiguated
corpus [9] was used, which consists of fiction, legal, newspaper and
scientific texts. In this corpus, each word has been annotated by two
annotators and disagreements have been resolved by a third annotator, thereby
increasing the corpus quality. The Estonian corpus tagset differs from our
unified tagset. It was therefore first converted to the MULTEXT-East tagset
using a one-to-one transformation and then from MULTEXT-East to our unified
tagset, with some minor adjustments to match our unified morphological
analyzer. To create the training data for the morphological tagger, the
ambiguous tag sequences also had to be generated, so the corpus was
additionally processed with our morphological analyzer.
After disambiguation, the corpora were split into training and test data so
that none of the test sentences would be present in the training data. The
final corpus statistics are shown in Table 1.
2.1 Ambiguity Classes
Following the work for Czech [2], we use the notion of an ambiguity class to
describe the possible morphological ambiguities within a subtag. For example,
the ambiguity class POSan describes part-of-speech ambiguity between noun and
adjective.
In total, the Latvian, Lithuanian and Estonian training corpora contain 216,
250 and 259 ambiguity classes across 22, 20 and 14 ambiguous morphological
categories respectively.
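As a rough illustration, an ambiguity class for one category can be derived from the candidate tags of a token by collecting the distinct subtag values at that category's position. The sketch below assumes that values are concatenated in alphabetical order (consistent with class names such as ‘an’) and that ‘0’ is treated like any other value; the function name `ambiguity_class` is hypothetical.

```python
def ambiguity_class(candidate_tags, position):
    """Sorted distinct subtag values at `position`, or None if the category is unambiguous."""
    values = sorted({tag[position] for tag in candidate_tags})
    return "".join(values) if len(values) > 1 else None

candidates = ["n0msl000000000n00000000000l0", "r0000p00000000000000000000l0"]
pos_class = ambiguity_class(candidates, 0)     # part of speech: noun vs. adverb
second_class = ambiguity_class(candidates, 1)  # both tags have '0' here
```

Here `pos_class` is the string "nr", while `second_class` is `None` because both candidates agree at that position.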
Table 1. Training and test corpora
                      Estonian    Latvian    Lithuanian
Total tokens           419,137    117,362        71,460
Sentences               31,266      6,564         4,201
Ambiguous words, %       32.4%      48.5%         36.0%
Word OOV rate             1.5%       3.0%          2.3%
Distinct tags              268      1,401         1,052
Tag perplexity           48.86     184.46        125.60
Test data, %                6%        10%           10%
Test tokens             26,366     12,826         8,103
3 Model
The tagging model is based on the exponential probabilistic model used for
Czech [2]. We assume that the individual subtags
$\{y_{POS}, y_{TENSE}, y_{GENDER}, \ldots\}$ are independent, and model the
probability of a candidate tag as a product of the individual subtag
probabilities:
\[ p(y) = \prod_{c \in CAT} p(y_c). \tag{1} \]
The subtag probabilities are modeled separately within each ambiguity class
AC. The probability of an event $y$ in context $x$ is modeled as the
exponential of a weighted sum of feature functions [1]:
\[ p_\Lambda(y \mid x) = \frac{\exp \sum_i \lambda_i f_i(y, x)}{Z(x)}, \tag{2} \]
where $f_i(y, x)$ are binary-valued feature functions predicting event $y$ in
context $x$, $\lambda_i$ are their weights, and $Z(x)$ is the normalization
factor. Here, events correspond to subtag values in a corresponding
morphological category, and features describe the surrounding context of a
word in a sentence.
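A minimal Python sketch of the exponential model for one ambiguity class follows. It assumes that the normalization factor sums the exponentiated scores over all events of the class, which is the standard maximum entropy formulation [1]. The names `subtag_distribution`, `active_features` and the toy weight are hypothetical and not part of the tagger described here.

```python
import math

def subtag_distribution(events, active_features, weights):
    """p(y | x) for each event y, with binary feature functions.

    active_features(y) returns the identifiers of the features f_i with f_i(y, x) = 1.
    weights maps a feature identifier to lambda_i (unknown features weigh 0).
    """
    scores = {
        y: math.exp(sum(weights.get(f, 0.0) for f in active_features(y)))
        for y in events
    }
    z = sum(scores.values())                 # normalization factor Z(x)
    return {y: s / z for y, s in scores.items()}

# Toy example: ambiguity class 'an' with one weighted feature firing for event 'n'.
weights = {("L POS p", "n"): math.log(3.0)}
dist = subtag_distribution("an", lambda y: [("L POS p", y)], weights)
```

In this toy case the event ‘n’ receives a score of 3 and ‘a’ a score of 1, so the distribution is 0.75 against 0.25.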
4 Training
4.1 Feature Selection
The performance of the morphological tagger relies heavily on the feature set
used in training and tagging, as can be seen in the results section. We use
binary feature functions, each consisting of a context address, a function
type and a value of the function type. Function types can be simple (for
instance, part of speech, gender, number, or the token itself) or complex (for
instance, gender, number and case agreement with the token whose category is
being predicted). Values are, for example, ‘a’ for part of speech or ‘kas’ for
a token in Latvian. The value ‘==’ states that the function type of the token
at the given address equals the function type of the token whose category is
being predicted. The first line of the feature excerpt in Fig. 3 is therefore
read as follows: if the next token is either a conjunction or a comma, the
gender, number and case of the second token to the right have to agree with
the gender, number and case of the predicted token.
RC GenderNumberCase ==
RC POS v
R POS c
L POS p
Fig. 3. The first four features for the Latvian part-of-speech ambiguity class ‘qsv’.
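The triple of context address, function type and value can be sketched in Python as follows. This is a simplified illustration: the address is encoded as a relative offset (so ‘L’ would be -1 and ‘R’ would be +1), only the POS and token function types are handled, context tokens are assumed to carry a single tag already, and the conditional ‘RC’ address is not modelled. The names `Feature` and `fires` are hypothetical.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    offset: int   # context address as a relative position (assumed encoding)
    kind: str     # "POS" or "TOKEN" in this sketch
    value: str    # expected value, e.g. 'p' for a pronoun

def fires(feature, sentence, index):
    """True if the binary feature holds for the token at `index`.

    `sentence` is a list of (token, tag) pairs with positional tag strings.
    """
    j = index + feature.offset
    if not 0 <= j < len(sentence):
        return False                       # address points outside the sentence
    token, tag = sentence[j]
    observed = tag[0] if feature.kind == "POS" else token
    return observed == feature.value

sentence = [("Viss", "p0msn0000000g0000000000000f0"),
            ("bija", "vs0000300i0000000000100000l0")]
left_is_pronoun = fires(Feature(-1, "POS", "p"), sentence, 1)
right_is_conjunction = fires(Feature(1, "POS", "c"), sentence, 1)
```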
Our morphological tagger uses a different feature set for each of the
ambiguity classes in the training corpus. A feature selection algorithm was
therefore used to select the features that best describe each ambiguity class.
Before the selection algorithm was applied, the initial feature set was
generated using all possible categories, events, context position indicators
(up to three tokens to the left and right) and some trigger words
(conjunctions, prepositions, particles and adverbs) extracted from the
training corpus. Although the trigger words increased the precision, the
increase was very small (on the order of $10^{-2}$ of a percent). This might
be because the part-of-speech feature functions already express the
characteristics of the trigger words. The feature generation resulted in
10017, 3801 and 3045 initial features for Estonian, Latvian and Lithuanian
respectively.
When the initial feature set was created, a simple feature selection algorithm
based on maximal mutual information was used to select the set of feature
functions with the highest score for each ambiguity class. The mutual
information of a feature function in an ambiguity class is
\[ I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}, \tag{3} \]
where $X = \{0, 1\}$ corresponds to the binary value of the feature function,
$Y$ is the set of possible events in the ambiguity class being processed (for
instance, {‘a’, ‘n’} for the ambiguity class ‘an’), $p(x)$ is the probability
of the feature function receiving the value $x$ in the context of the
ambiguity class, $p(y)$ is the probability of the event $y$ in the ambiguity
class, and $p(x, y)$ is the probability of the feature function receiving the
value $x$ while the event is simultaneously $y$ in the ambiguity class.
All probabilities are computed as normalized frequency distributions. For the
best exponential models, the feature selection algorithm selected a total of
1684, 775 and 742 feature functions across all ambiguity classes for Estonian,
Latvian and Lithuanian respectively (applying a maximum of 150 feature
functions per ambiguity class for Estonian, 100 for Latvian and 50 for
Lithuanian).
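The selection step can be sketched in Python as follows, assuming that each feature function is represented by the list of (feature value, event) pairs observed for one ambiguity class. The natural logarithm is an assumption; the choice of base does not affect the ranking. The names `mutual_information` and `select_features` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(observations):
    """I(X;Y) from (x, y) pairs, with probabilities as relative frequencies."""
    n = len(observations)
    joint = Counter(observations)
    px = Counter(x for x, _ in observations)
    py = Counter(y for _, y in observations)
    # Unobserved (x, y) combinations contribute nothing (0 log 0 = 0).
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

def select_features(samples_by_feature, limit):
    """Keep the `limit` features with the highest mutual information."""
    ranked = sorted(
        samples_by_feature,
        key=lambda f: mutual_information(samples_by_feature[f]),
        reverse=True,
    )
    return ranked[:limit]

informative = [(1, "n"), (1, "n"), (0, "a"), (0, "a")]   # feature value determines the event
useless = [(1, "n"), (0, "n"), (1, "a"), (0, "a")]       # feature value independent of the event
chosen = select_features({"f1": useless, "f2": informative}, limit=1)
```

Each feature is scored on its own, so two features that carry the same information receive equally high scores and may both be selected.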
4.2 Model Parameters
We use a maximum entropy library developed at the Tsujii Laboratory of the
University of Tokyo [8] to train the models of each ambiguity class. The
library implements LMVM (Limited Memory Variable Metric) parameter estimation
[5], which converges significantly faster than iterative scaling algorithms
such as IIS (Improved Iterative Scaling) [4]. In our tests, IIS was up to 30
times slower on the Latvian corpus using 150 features. The estimated weights,
together with the feature sets of all ambiguity classes, are combined into a
single tagging model, which is used in the tagging process.
When disambiguating a token, we use the exponential model (2) to predict all
events $y$ in the context $x$ for each ambiguity class of the token. We then
combine the probabilities of the separate event predictions for each possible
tag using a slightly modified version of formula (1) [2]:
\[ p(y \mid x) = \prod_{c \in CAT} \left[ (1 - \alpha)\, p_{AC_c}(y_c \mid x) + \alpha\, p_{AC_c}(y_c) \right], \tag{4} \]
where, as a smoothing method, we linearly interpolate the model probability
with the probability of the event $y_c$ in the ambiguity class $AC_c$ (which
is, in fact, the frequency distribution of the event in the ambiguity class).
The $\alpha$ weights were tuned manually to obtain the highest precision on
the training corpus. Linear smoothing with the frequency distribution of an
event in an ambiguity class increased the overall precision by 0.2–0.3%.
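The interpolated combination can be sketched in Python as follows. The example uses toy two-character tags and made-up probabilities purely for illustration; the names `tag_score` and `disambiguate` are hypothetical, and a single α is used for all categories, which is a simplification.

```python
def tag_score(tag, model_probs, prior_probs, alpha):
    """Product over ambiguous categories of (1 - alpha) * p_model + alpha * p_prior.

    model_probs and prior_probs map a tag position to {subtag: probability}
    within the ambiguity class of that position.
    """
    score = 1.0
    for position, model in model_probs.items():
        y = tag[position]
        prior = prior_probs[position]
        score *= (1 - alpha) * model.get(y, 0.0) + alpha * prior.get(y, 0.0)
    return score

def disambiguate(candidates, model_probs, prior_probs, alpha):
    """Return the candidate tag with the highest interpolated score."""
    return max(candidates, key=lambda t: tag_score(t, model_probs, prior_probs, alpha))

candidates = ["nl", "r0"]                                   # toy tags: POS and case only
model = {0: {"n": 0.8, "r": 0.2}, 1: {"l": 0.6, "0": 0.4}}  # made-up model probabilities
prior = {0: {"n": 0.3, "r": 0.7}, 1: {"l": 0.5, "0": 0.5}}  # made-up class frequencies
best = disambiguate(candidates, model, prior, alpha=0.1)
frequency_only = disambiguate(candidates, model, prior, alpha=1.0)
```

With α = 0.1 the context-sensitive model dominates and the noun reading wins; with α = 1 only the frequency distribution is used and the adverb reading wins.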
5 Results
We compare the error rates of the exponential model trained on Estonian,
Latvian and Lithuanian data (Table 2) with HunPos, an HMM trigram tagger [3].
We trained the exponential model both with maximum entropy parameter
estimation and with the simplified parameter estimation described in [2]. The
baseline error rate is computed using only the category label statistics (with
$\alpha = 1$). The HunPos tagger was run in guided mode, with the possible
morphological tags provided for each token.
We trained and evaluated the exponential maximum entropy models with various
numbers of selected features (using the maximal mutual information feature
selection method). The best test results were achieved using 150, 100 and 50
features for Estonian, Latvian and Lithuanian respectively.
Table 2. Error rates
Experiment                            Estonian   Latvian   Lithuanian
Baseline                                  9.72     14.00         7.47
HunPos                                    8.51      6.67        14.55
Exponential; simplified estimation        6.98     12.76         6.82
Exponential; ME estimation                4.04      8.49         5.65
Exponential; training data                3.07      5.32         3.76
Feature functions                          150       100           50
Based on the best exponential maximum entropy models, we also evaluated the
individual subtag error rates over all test tokens (Table 3). The results
suggest that the error rate distribution is fairly similar for all languages
(with the exception of Estonian, in which gender is not used). More precisely,
the categories with the most misclassifications are part of speech, gender,
number and case, with case being the most difficult to predict.
Table 3. Error rates within categories
Category       Estonian   Latvian   Lithuanian
POS                2.11      1.91         2.33
GENDER                -      2.67         2.06
NUMBER             1.31      3.34         2.23
CASE               2.15      4.53         2.64
PERSON             0.29      0.37         0.37
TENSE              0.58      0.82         0.88
MODE               0.31      0.48         0.67
VOICE              0.60      0.51         0.63
REFLEX                -      0.05         0.09
NEGATIVE           0.30      0.00         0.05
DEGREE             0.50      0.80         1.01
DEFINITENESS          -      0.68         0.96
DIMINUTIVE            -      0.40         0.05
PREPNUMBER            -      0.60
PREPCASE           0.30      0.63         0.12
PREPTYPE           0.30      0.16         0.07
NUMTYPE            0.24      0.02         0.06
PRONCLASS             -      0.22         0.57
PARTTYPE           0.02      0.06         0.08
VERBTYPE              -      0.43         0.15
ADVTYPE               -      0.03
CONJTYPE           0.27      0.58         0.88
5.1 Error Analysis
We performed error analysis on the Estonian, Latvian and Lithuanian
exponential models with 150, 100 and 50 feature functions respectively. For
better interpretation of the tagging errors, we grouped the errors by the
differences between the correct and the predicted tags. The cumulative error
rates of the most common error types for each language (Table 4) show that the
top six errors cover approximately 50% of all errors in each language training
corpus.
Table 4. Top six error types
Language     Correct → Wrong (Category)      Error Coverage
Estonian     n → g (case)                             13.22
             g → n (case)                             24.85
             p → s (number)                           32.25
             s → p (number)                           39.36
             p → g (case)                             45.01
             r → c (part of speech)                   49.52
Latvian      pg → sa (number & case)                  14.53
             m → f (gender)                           26.31
             f → m (gender)                           33.68
             sa → pg (number & case)                  39.83
             pn → sg (number & case)                  45.96
             a → n (case)                             50.86
Lithuanian   pn → sg (number & case)                  14.89
             f → m (gender)                           28.16
             m → f (gender)                           34.12
             q → c (part of speech)                   39.08
             sg → pn (number & case)                  44.01
             a → n (part of speech)                   48.53
An error type such as ‘n → g (case)’ in Table 4 means that instead of the
case n (nominative), the case g (genitive) was selected as being more
probable. Other error types in the table involve number (s - singular, p -
plural), case (p - partitive, a - accusative) and part of speech (a -
adjective, c - conjunction, n - noun, q - particle, r - adverb).
When analyzing the top six errors of the Latvian morphological tagger, it can
be seen that the errors are fairly regular. For instance, for the error type
‘m → f (gender)’ (as well as for the opposite), a common misclassification
involves the pronoun ‘to’. This is expected, as its gender can be determined
from the sentence context (for instance, in noun phrases), by anaphora
resolution, or not at all when the context is too small. As the feature
functions do not consider anaphora resolution for pronouns of this type and
the context may not reveal the correct gender, the statistical morphological
tagger makes misclassifications. Another common misclassification occurs in
noun phrases where adjuncts are used; consider, for instance, the error type
‘sa → pg (number & case)’. In most cases, the number and case of the adjunct
can be identified by observing the context to the right, but the tagger still
makes a misclassification. This suggests that for specific ambiguity classes
either the wrong feature functions have been prioritized or more complex
feature functions would have to be generated to address the misclassification.
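The grouping of errors by tag differences and the cumulative coverage can be sketched in Python as follows. The sketch identifies an error type by the differing tag positions rather than by category names, and the shortened tags are made up for illustration; the names `error_type` and `cumulative_coverage` are hypothetical.

```python
from collections import Counter

def error_type(gold, predicted):
    """Describe an error by the differing positions and their subtag values."""
    diff = [(i, g, p) for i, (g, p) in enumerate(zip(gold, predicted)) if g != p]
    positions = tuple(i for i, _, _ in diff)
    correct = "".join(g for _, g, _ in diff)
    wrong = "".join(p for _, _, p in diff)
    return positions, f"{correct} -> {wrong}"

def cumulative_coverage(pairs, top=6):
    """Cumulative share (%) of all errors covered by the `top` most common types."""
    errors = Counter(error_type(g, p) for g, p in pairs if g != p)
    total = sum(errors.values())
    covered, rows = 0, []
    for kind, count in errors.most_common(top):
        covered += count
        rows.append((kind, 100.0 * covered / total))
    return rows

# (gold, predicted) pairs with shortened toy tags; the last pair is tagged correctly.
pairs = [("n0msn", "n0msg"), ("n0msn", "n0msg"), ("n0fsn", "n0msn"), ("a0msn", "a0msn")]
rows = cumulative_coverage(pairs)
```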
6 Conclusions
The results of applying maximum entropy modeling to Estonian, Latvian and
Lithuanian confirm the suitability of this method for morphologically rich
languages and correspond well to the results for Czech [2]. The exponential
tagger performs significantly better than the baseline and, in two cases,
significantly better than the HMM tagger. In the case of Latvian, we observed
an interesting deviation in favor of the HMM tagger. The high tagset
perplexity for Latvian also indicates that a careful investigation of the
training data quality is necessary.
The feature selection algorithm used in our training and evaluation
experiments does not consider inter-feature relations. This lowers the final
tagging precision, because features that perform well in combination may not
be selected, while features that perform poorly in combination may be
selected. A better approach would therefore be iterative feature selection as
explained in [2]. As we use the maximum entropy training method, iterative
feature selection would require large computing resources. An interesting
experiment would be to run the iterative feature selection based on the
simplified weight estimation algorithm and compare the results to the model
acquired by maximum entropy training on the features selected by the iterative
feature selection.
The tagger model could be extended to handle unknown words, which would avoid
the shortcomings of the lexicon-based analyzer. In this case, the ambiguity
class is unknown and the model needs to be adjusted. One possibility would be
to combine subtag classifiers trained on the whole data (as opposed to
classifiers conditioned on the ambiguity class). Some model of valid subtag
combinations would then be needed to avoid predicting invalid tags.
The combination of subtag models currently treats all subtags equally. This
combination could be parameterized by weighting the individual subtag
probabilities in a log-linear fashion, effectively treating subtag
probabilities as feature values. This approach would allow the parameters to
be tuned and would enable minimum error rate training. More features (such as
subtag classifiers over all training data) could also be added.
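One possible form of such a log-linear combination, given here only as an illustrative sketch with hypothetical per-category weights $w_c$ and a normalization term $Z_w(x)$ over the candidate tags of the token, is:

```latex
\[
  \log \tilde{p}(y \mid x) = \sum_{c \in CAT} w_c \log p(y_c \mid x) - \log Z_w(x)
\]
```

Setting every $w_c = 1$ and dropping the normalization term gives back the unweighted product of subtag probabilities in formula (1), so the current model would be a special case of this parameterization.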
7 Acknowledgements
The research within the project ACCURAT leading to these results has received
funding from the European Union Seventh Framework Programme (FP7/2007–2013),
grant agreement no 248347.
References
1. Berger, A., Della Pietra, S., Della Pietra, V.: A Maximum Entropy Approach to
Natural Language Processing. In: Computational Linguistics, 22(1), March 1996
2. Hajič, J., Vidová-Hladká, B.: Tagging Inflective Languages: Prediction of
Morphological Categories for a Rich, Structured Tagset. In: Proceedings of the
COLING-ACL Conference, Montreal, Canada, pp. 483–490 (1998)
3. Halácsy, P., Kornai, A., Oravecz, C.: HunPos – an Open Source Trigram Tagger.
In: Proceedings of the 45th Annual Meeting of the Association for Computational
Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.
Association for Computational Linguistics, Prague, Czech Republic, pp. 209–212 (2007)
4. Malouf, R.: A Comparison of Algorithms for Maximum Entropy Parameter
Estimation. In: Proceedings of CoNLL-2002, pp. 49–55 (2002)
5. Benson, S., Moré, J.: A Limited Memory Variable Metric Method in Subspaces and
Bound Constrained Optimization Problems. In: Technical Report
ANL/MCS-P909-0901, Argonne National Laboratory (2001)
6. Kaalep, H-J.: An Estonian Morphological Analyser and the Impact of a Corpus on
its Development. In: Computers and the Humanities, 31: 115–133 (1997)
7. MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern
European Languages. http://nl.ijs.si/ME/
8. A Simple C++ Library for Maximum Entropy Classification.
http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent
9. Morphologically Disambiguated Estonian Corpus.
http://www.cl.ut.ee/korpused/morfkorpus
... The pattern lists for Latvian and Lithuanian, for instance, contain 120 different patterns. Initially, these patterns were automatically extracted from morphologically tagged (Pinnis and Goba 2011) texts in which terms were marked by human annotators. Since this initial list contained patterns for specific cases and not general language rules, the obtained patterns were then manually revised and generalised. ...
... The human annotated data and unlabelled data that are used in NER model training or tagging are pre-processed using morpho-syntactic taggers. For Latvian and Lithuanian, we used the maximum entropy-based tagger by Pinnis and Goba (2011). After morpho-syntactic tagging, positional information is added in order to trace every token from the tab-separated document back to its positions in the plaintext input document. ...
Chapter
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) identify terms, named entities (NEs), and other lexical units in comparable corpora, and (2) to cross-lingually map the identified single-word and multi-word phrases in order to create automatically extracted bilingual dictionaries that can be further utilised in machine translation, question answering, indexing, and other areas where bilingual dictionaries can be useful.
... В [37] приведены аналогичные данные для румынского языка. Работа [38] также использует классы неоднозначности, применяемые для исследования размеров и свойств этих классов в эстонском, литовском и латышском языках. В отличие от большинства предыдущих работ, в наших статьях [39,40] было рассмотрено распределение типов омонимии для нескольких языков. ...
Article
Full-text available
During our previous research, we found that the grammatical ambiguity of most frequent words of European languages has a different distribution in comparison with less frequent ones. In the current research, we investigate in more details the reasons of such a phenomenon; we pay a special attention to the first thousand of most frequent tokens. Our investigation of modern disambiguation systems demonstrated that the increase of language diversity, we had found for most frequent words, leads to increase of number of mistakes made by those systems..
... Multinomial logistic was first applied to machine translation at IBM (Berger et al., 1996), and then was applied to other NLP tasks like part-of-speech tagging (Pinnis and Goba, 2011), text classification (Cuong et al., 2006), sentiment analysis (Lee and Renganathan, 2011;Rao et al., 2015) and discourse relations recognition (Lin et al., 2009;Keskes et al., 2014). ...
... Multinomial logistic was first applied to machine translation at IBM (Berger et al., 1996), and then was applied to other NLP tasks like part-of-speech tagging (Pinnis and Goba, 2011), text classification (Cuong et al., 2006), sentiment analysis (Lee and Renganathan, 2011;Rao et al., 2015) and discourse relations recognition (Lin et al., 2009;Keskes et al., 2014). ...
Article
Abstract: Rhetorical relations between two text segments are crucial information and have been proven useful for many natural language processing (NLP) applications. In this paper, we propose a supervised approach for automatic identifying of rhetorical relations in Arabic texts. To the best of our knowledge, this is the first model that attempts to identify both implicit and explicit rhetorical relations between elementary discourse units within the rhetorical structure theory (RST). To carry out this research, we developed a discourse annotated corpus following the RST framework with high reliability.Relations annotation was done using a set of 23 fine-grained relations enriched with nuclearity annotation. To automatically learn these relations, we reuse some state of the arts featuresand contribute new lexical and semantics' features. The experimental results, on fine-grained and coarse-grained relations, showed that our model achieved the best performance relative to all baselines Keywords: Rhetorical relations; Arabic language; Rhetorical structure theory.
... Therefore, the source data were further factored using a language-specific tag- Table 1: Training data statistics (sentence counts) for SMT and NMT systems before and after filtering ger or parser. For Latvian, we used an averaged perceptron-based morpho-syntactic tagger (Nikiforovs, 2014) that was trained on the data from Pinnis and Goba (2011). For English, we used the lexicalized probabilistic parser (Klein et al., 2002) from the Stanford CoreNLP toolkit (Manning et al., 2014). ...
... For Latvian and Lithuanian the term patterns have been created in a semi-automatic manner. At first, morpho-syntactic tag sequences were automatically extracted from morpho-syntactically tagged texts (Pinnis & Goba, 2011) in which terms were marked by human annotators. Then, the obtained morpho-syntactic tag sequences were manually revised and generalised into patterns. ...
Thesis
Full-text available
The aim of this doctoral thesis is to research methods and develop tools that allow successfully integrating bilingual terminology into statistical machine translation systems so that the translation quality of terminology would increase and that the overall translation quality of the source text would increase. The author presents novel methods for terminology integration in SMT systems during training (through static integration) and during translation (through dynamic integration). The work focusses not only on the SMT integration techniques, but also on methods for acquisition of linguistic resources necessary for different tasks involved in workflows for terminology integration in SMT systems. The thesis describes and evaluates methods designed and implemented by the author for: 1) monolingual term identification in SMT system training data as well as documents submitted for translation, 2) term normalisation for acquisition of canonical forms of terms from terms in different inflected forms, 3) cross-lingual term mapping in parallel and comparable corpora collected from the Web, 4) probabilistic dictionary filtering in order to acquire resources for cross-lingual term mapping, 5) development of character-based SMT transliteration systems from probabilistic dictionaries, 6) inflected form generation for terms through rule-based morphological synthesis or monolingual corpus look-up, and other methods involved in the workflows for static and dynamic terminology integration in SMT systems. The terminology integration methods have been evaluated using the Moses SMT system and the LetsMT platform. The evaluation efforts show that the methods for monolingual term identification and cross-lingual term mapping allow achieving state-of-the-art performance, which has been also validated by third party (independent) evaluation efforts. 
The static terminology integration methods allow achieving a cumulative SMT quality improvement by up to 28.1% (or 3.56 absolute BLEU points) over an initial baseline system for the English-Latvian language pair. However, the most impressive achievement of the author’s work is the dynamic terminology integration method in SMT systems using a source text pre-processing workflow. In almost all experiments performed in the scope of the thesis the methods allowed achieving SMT quality improvements. Automatic evaluation for four investigated language pairs in the automotive domain shows SMT quality improvements by up to 26.9% (or 3.41 absolute BLEU points) over baseline systems. Manual comparative evaluation performed for seven language pairs in the information technology domain shows that the proportion of correctly translated terms increases for all language pairs by up to +52.6%.
... One of the gaps in basic technologies for Latvian was the lack of a morphological tagger, which was not available till 2010. However now there are three different Latvian taggers available based on perceptron 17 and other machine learning algorithms [ [23], [24]]. ...
Conference Paper
Full-text available
Although human language technologies have a long history in Latvia, the Latvian language still belongs to under-resourced languages, as there are many gaps in basic language technologies and tools. However, despite difficulties, some of these gaps for both, resources and tools, have been filled in the last five years. The main goal of this paper is to report on recent achievements in language resources and technologies (LRT) for Latvian and to describe the current situation.
Chapter
This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm requiring only minor tweaking to adapt it for any particular language, a property which makes it potentially useful for morphologically rich languages with no morphological analysers available. As a key part of the proposed algorithm we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. We conducted experiments with two different neural machine translation systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, they were relatively minor, and our experiments show that machine translation with inflected languages remains challenging, especially with translation direction towards a highly inflected language.
Article
In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
Article
In the world of non-proprietary NLP software the standard, and perhaps the best, HMM-based POS tagger is TnT (Brants, 2000). We argue here that some of the criticism aimed at HMM performance on languages with rich morphology should more properly be directed at TnT's peculiar license, free but not open source, since it is those details of the implementation which are hidden from the user that hold the key for improved POS tagging across a wider variety of languages. We present HunPos, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.
Conference Paper
The paper puts forward a quasi-dependency model for structural analysis of Chinese baseNPs and an MDL-based algorithm for quasi-dependency-strength acquisition. The experiments show that the proposed model is more suitable for Chinese baseNP analysis ...
Conference Paper
Conditional maximum entropy (ME) models provide a general purpose machine learning technique which has been successfully applied to fields as diverse as computer vision and econometrics, and which is used for a wide variety of classification problems in natural language processing. However, the flexibility of ME models is not without cost. While parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are very large, and may well contain many thousands of free parameters. In this paper, we consider a number of algorithms for estimating the parameters of ME models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods. Surprisingly, the standardly used iterative scaling algorithms perform quite poorly in comparison to the others, and for all of the test problems, a limited-memory variable metric algorithm outperformed the other choices.
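To make the estimation problem in the abstract above concrete, the following is a minimal sketch of a conditional ME model trained by plain gradient ascent on the log-likelihood (the simplest of the method families named there, not the limited-memory variable metric method the abstract favours). The toy data, the feature indices, and the names `label_probs` and `train` are illustrative assumptions, not taken from the cited paper.

```python
import math

def label_probs(weights, feats_per_label):
    """p(y | x) proportional to exp(sum_i w_i * f_i(x, y)) over candidate labels."""
    scores = [sum(weights[i] * v for i, v in f.items()) for f in feats_per_label]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train(data, n_feats, steps=200, lr=0.5):
    """Gradient ascent on the average conditional log-likelihood.

    data: list of (feats_per_label, gold_index); each feats_per_label entry is
    a sparse {feature_index: value} dict for one candidate label.
    """
    w = [0.0] * n_feats
    for _ in range(steps):
        grad = [0.0] * n_feats
        for feats_per_label, gold in data:
            probs = label_probs(w, feats_per_label)
            # Observed feature counts for the gold label ...
            for i, v in feats_per_label[gold].items():
                grad[i] += v
            # ... minus the counts expected under the current model.
            for p, f in zip(probs, feats_per_label):
                for i, v in f.items():
                    grad[i] -= p * v
        w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]
    return w

# Hypothetical toy data: two contexts, two candidate labels each.
data = [
    ([{0: 1.0}, {1: 1.0}], 0),  # the first candidate is correct
    ([{2: 1.0}, {3: 1.0}], 1),  # the second candidate is correct
]
w = train(data, n_feats=4)
p_first = label_probs(w, data[0][0])
p_second = label_probs(w, data[1][0])
```

The gradient is the difference between observed and model-expected feature counts, which is the quantity every estimation method listed in the abstract works with; the methods differ only in how they use it to update the weights.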
Article
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
Article
The paper describes a morphological analyser for Estonian and how using a text corpus influenced the process of creating it and the resulting program itself. The influence is not limited to the lexicon, but is also noticeable in the resulting algorithm and implementation. When work on the analyser began, there was no computational treatment of Estonian derivatives and compounds. After some cycles of development and testing on the corpus, we came up with an acceptable algorithm for their treatment. Both the morphological analyser and the speller based on it have been successfully marketed.
Article
We describe an algorithm for solving nonlinear optimization problems with lower and upper bounds that constrain the variables. The algorithm uses projected gradients to construct a limited memory BFGS matrix and determine a step direction. The algorithm has been implemented and distributed as part of the Toolkit for Advanced Optimization (TAO). We include numerical results that demonstrate its effectiveness on a set of large test problems and its scalability to multiple processors.
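The role of projection under bound constraints can be illustrated with a much simpler method than the one in the abstract above. The sketch below is plain projected gradient descent with a fixed step size; it does not build a limited-memory BFGS matrix and is not the TAO implementation. The objective, bounds, and the names `project` and `projected_gradient` are assumptions chosen for illustration.

```python
def project(x, lower, upper):
    """Clip each coordinate into its [lower, upper] interval."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]

def projected_gradient(grad, x0, lower, upper, step=0.1, iters=500):
    """Take a gradient step, then project back onto the box, repeatedly."""
    x = project(x0, lower, upper)
    for _ in range(iters):
        g = grad(x)
        x = project([xi - step * gi for xi, gi in zip(x, g)], lower, upper)
    return x

# Hypothetical test problem: minimise sum((x_i - t_i)^2) subject to 0 <= x_i <= 1.
target = [2.0, -1.0, 0.3]

def grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

solution = projected_gradient(grad, [0.5, 0.5, 0.5], [0.0] * 3, [1.0] * 3)
```

For this separable quadratic the constrained minimiser is the target clipped to the box, so the first two coordinates end on their bounds and the third stays in the interior.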
Article
The major obstacle in morphological (sometimes called morpho-syntactic, or extended POS) tagging of highly inflective languages, such as Czech or Russian, is, given the resources possibly available, the tagset size. Typically, it is in the order of thousands. Our method uses an exponential probabilistic model based on automatically selected features. The parameters of the model are computed using simple estimates (which makes training much faster than when one uses Maximum Entropy) to directly minimize the error rate on training data. The results obtained so far not only show good performance on disambiguation of most of the individual morphological categories, but they also show a significant improvement on the overall prediction of the resulting combined tag over an HMM-based tag n-gram model, using even substantially less training data.