EXPLOITING MORPHOLOGY IN SPEECH TRANSLATION
WITH PHRASE-BASED FINITE-STATE TRANSDUCERS
Alicia Pérez, M. Inés Torres
Department of Electricity and Electronics
University of the Basque Country
manes.torres@ehu.es

Francisco Casacuberta
Instituto Tecnológico de Informática
Technical University of Valencia
fcn@iti.upv.es
ABSTRACT
This work implements a novel formulation for phrase-based translation models making use of morpheme-based translation units under a stochastic finite-state framework. This approach is of additional interest for speech translation tasks, since it leads to the integration of the acoustic and translation models.
As a further contribution, this is the first paper addressing a Basque-to-Spanish speech translation task. For this purpose, a morpheme-based finite-state recognition system is combined with a finite-state transducer that translates phrases of morphemes in the source language into usual sequences of words in the target language.
The proposed models were assessed on a limited-domain application task. Good performance was obtained for the proposed phrase-based finite-state translation model using morphemes as translation units, and notable improvements were also obtained in decoding time.
Index Terms: Speech Translation, Stochastic Finite-State Transducers, Morphology
1. INTRODUCTION
The use of morphological knowledge in machine translation (MT) is relatively recent and has mainly been pursued in tasks involving morphologically rich languages. In both transfer-based and example-based MT approaches, morphological analysis has been used in the source language to extract lemmas and split words into their constituents so as to predict word-forms in the target language [1, 2]. In [3], Moses [4], the state-of-the-art statistical MT system, was used to train phrase-based models at the morpheme level.
With respect to MT under the finite-state framework, in [5] a text-to-text translation paradigm was proposed by combining a phrase-based model dealing with running words with finite-state models including morphological knowledge. Specifically, the finite-state machine consisted of a composition of a word-to-stem statistical analyser in the source language, a stem-to-stem translation model from the source to the target language, and a stem-to-word statistical generation module in the target language, all the constituents being implemented with the AT&T tools. No morphemes other than stems were used.

(This work has been partially supported by the University of the Basque Country under grants 9/UPV 00224.310-15900/2004 and GIU07/57, by the Spanish CICYT under grant TIN2005-08660-C04-03, and by the Spanish program Consolider-Ingenio 2010 under grant CSD2007-00018.)
The contribution of this work is twofold: first, the formulation of speech translation based on morphemes under the finite-state framework, and second, its application to Basque-to-Spanish speech translation. We take advantage of all the constituents of a word, and not only of lemmas. We promote the use of finite-state models due to their decoding speed.
The Spanish and Basque languages pose many challenges for current machine translation systems. Since both languages are official in the Basque Country, there is a real demand for many documents to be bilingual. Although the two languages coexist in the same area, they differ enormously. To begin with, it is worth noting that they have different origins: while Spanish belongs to the set of Romance languages, Basque is a pre-Indo-European language. There are notable differences in both morphology and syntax. In contrast to Spanish, Basque is an extremely inflected language, with more than 17 declension cases that can be recursively combined. Inflection makes the size of the vocabulary (in terms of word-forms) grow. Hence, the number of occurrences of word n-grams within the data is much smaller than in the case of Spanish, and this leads to poor or even unreliable statistical estimates. By resorting to morpheme-based models we aim to tackle data sparsity and consequently obtain improved statistical distributions.
2. MORPHEME-BASED SPEECH TRANSLATION
The goal of statistical speech translation is to find the most likely translation, $\hat{\bar{t}}$, given the acoustic representation, $X$, of a speech signal from the source language:

$$\hat{\bar{t}} = \arg\max_{\bar{t}} P(\bar{t} \mid X) \qquad (1)$$
The transcription of the speech in the source language into a sequence of morphemes, $\bar{m}$, can be introduced as a hidden variable:

$$\hat{\bar{t}} = \arg\max_{\bar{t}} \sum_{\bar{m}} P(\bar{t}, \bar{m} \mid X) \qquad (2)$$
Applying Bayes' decision rule:

$$\hat{\bar{t}} = \arg\max_{\bar{t}} \sum_{\bar{m}} \frac{P(\bar{t}, \bar{m}) \, P(X \mid \bar{t}, \bar{m})}{P(X)} \qquad (3)$$
Let us assume that the probability of an utterance does not depend on its transcription in the other language, so that $P(X \mid \bar{t}, \bar{m}) = P(X \mid \bar{m})$. Moreover, the denominator $P(X)$ does not depend on the variable over which the optimisation is carried out and can be dropped. The decoding is thus carried out as follows:

$$\hat{\bar{t}} = \arg\max_{\bar{t}} \sum_{\bar{m}} P(\bar{t}, \bar{m}) \, P(X \mid \bar{m}) \qquad (4)$$
The search problem is thus driven by the contribution of two terms: 1) the acoustic model, $P(X \mid \bar{m})$, connecting a text string in terms of morphemes to its acoustic utterance; and 2) the joint translation model, $P(\bar{t}, \bar{m})$, connecting the source and target languages. Joint probability translation models are good candidates to be modelled by stochastic finite-state transducers (SFSTs).
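To make the role of the two terms concrete, the following minimal Python sketch (our illustration, not the actual decoder) scores translation hypotheses according to eq. (4), with the sum over morpheme transcriptions replaced by the usual maximum (Viterbi-style) approximation; the hypothesis set and the two scoring functions are placeholders.

    import math

    def decode(hypotheses, translation_logprob, acoustic_logprob):
        """Pick the best target sentence according to eq. (4).

        hypotheses: iterable of (target_sentence, morpheme_transcription) pairs,
                    e.g. enumerated by an SFST composed with the acoustic model.
        translation_logprob(t, m) ~ log P(t, m)   (joint translation model)
        acoustic_logprob(m)       ~ log P(X | m)  (acoustic model, utterance X fixed)
        """
        best_t, best_score = None, -math.inf
        for t, m in hypotheses:
            # max over (t, m) instead of summing over m: the Viterbi approximation
            score = translation_logprob(t, m) + acoustic_logprob(m)
            if score > best_score:
                best_t, best_score = t, score
        return best_t, best_score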
Some effort has recently been made to take efficient advantage of both the acoustic and the translation knowledge sources [6] by exploring different architectures. We have implemented the morpheme-based speech translation models under the two architectures described in [7]: a) an integrated architecture, implementing eq. (4) analogously to an automatic speech recognition (ASR) system in which the language model (LM) is replaced by the joint probability model; thanks to the nature of finite-state models, a tight integration is possible, in contrast to other kinds of integration; b) a decoupled architecture involving two stages: first, an ASR system transcribes the speech utterance, and then a text-to-text translation system translates the resulting transcription.
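As a rough sketch of the difference between the two architectures, the fragment below contrasts them; the function and method names (asr_decode, sfst_translate, joint_decode) are assumed interfaces rather than the actual implementation.

    def decoupled(speech, asr_decode, sfst_translate):
        """Two stages: an ASR system transcribes the utterance into source-language
        morphemes, and the SFST then translates that single transcription."""
        morphemes = asr_decode(speech)      # 1-best morpheme transcription
        return sfst_translate(morphemes)    # target-language word sequence

    def integrated(speech, acoustic_scorer, sfst):
        """Single decoding step: the SFST takes the place of the language model in
        the ASR search, so transcription and translation are obtained jointly,
        as in eq. (4)."""
        return sfst.joint_decode(speech, acoustic_scorer)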
Finally, an important point should be noted: this formulation for speech translation makes use of morphemes only in the source language, while using word-forms in the target language. The underlying motivation is simply that we consider speech translation from a morphologically rich language into one that does not inflect nouns, which is precisely our case when translating from Basque into Spanish.
2.1. Phrase-based stochastic finite-state transducers
An SFST is a finite-state machine that analyses strings in a source language and accordingly produces strings in a target language, along with the joint probability of both strings being translations of each other (for a formal definition see [6]). The characteristics defining an SFST are its topology and the probability distributions over the transitions and the states. These distinctive features can be automatically learnt from bilingual samples by efficient algorithms such as GIATI (Grammar Inference and Alignments for Transducer Inference) [7], which is applied in this work. As is well known, an outstanding feature of finite-state models is that they count on efficient standard decoding algorithms [8]. Indeed, it is the speed of the decoding stage that makes these models so attractive for speech translation.
In this work we deal with SFSTs based on phrases of morphemes. Previously, in [9], phrase-based SFSTs based on word-forms were presented (we will refer to this approach as PW-SFST). In such models the transitions consume a sequence of words. Here we propose the use of sequences of morphemes instead (PM-SFST). As far as the standard baseline SFST is concerned (referred to as W-SFST), the difference lies in the fact that its transitions consume isolated word-forms instead of sequences of either words or morphemes. In all cases, the transitions of the SFSTs produce a sequence of zero or more words in the target language and have an associated probability.
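As an illustration, the toy data structure below (our own sketch, not the GIATI output format) shows what such a phrase-based transition looks like: it consumes a phrase of source morphemes, emits zero or more target word-forms and carries a probability; the Basque-Spanish example pair is merely illustrative.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Transition:
        source: Tuple[str, ...]   # phrase of source morphemes (PM-SFST)
        target: Tuple[str, ...]   # zero or more target word-forms
        next_state: int
        prob: float               # contributes to the joint probability P(t, m)

    # In a W-SFST the source side of every transition is a single word-form;
    # in a PW-SFST it is a phrase of word-forms.
    example = Transition(source=("euri", "a"), target=("la", "lluvia"),
                         next_state=3, prob=0.12)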
2.2. Morphological analysis
In this work we deal with a morphologically rich language:
Basque. In Basque there is no freely available linguistic tool
that splits the words into proper morphemes. For this rea-
son, morpheme-like units were obtained by means of Morfes-
sor [10], a data-driven approach based on unsupervised learn-
ing of morphological word segmentation. For both ASR and
SMT it is convenient to keep a low morpheme-to-word ratio, in order to obtain better language modelling, acoustic separability and word generation, amongst others. Consequently, in a previous work [11], an approach based on decomposing the words into two morpheme-like units, a root and an ending, was presented. By default, Morfessor decomposes the words using three types of morphemes: prefixes, stems and suffixes. To convert these decompositions into the desired root-ending form, all the suffixes at the end of the word were joined to form the ending, and the root was built by joining all the remaining prefixes, stems and any suffixes occurring between stems. This procedure led to the vocabulary of 946 morphemes used in [11].
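A minimal sketch of that regrouping step is given below, assuming Morfessor-style labelled segments; the label names ('PRE', 'STM', 'SUF') and the example segmentation are hypothetical, chosen only to illustrate the rule described above.

    def to_root_ending(segments):
        """segments: list of (morph, label) pairs with label in {'PRE', 'STM', 'SUF'}.

        The trailing run of suffixes is merged into the ending; everything before it
        (prefixes, stems and suffixes occurring between stems) forms the root."""
        i = len(segments)
        while i > 0 and segments[i - 1][1] == 'SUF':
            i -= 1
        root = ''.join(morph for morph, _ in segments[:i])
        ending = ''.join(morph for morph, _ in segments[i:])
        return root, ending

    # Hypothetical segmentation of a Basque word-form:
    # to_root_ending([('etxe', 'STM'), ('ko', 'SUF'), ('a', 'SUF')])  ->  ('etxe', 'koa')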
3. EXPERIMENTAL RESULTS
Basque is a minority but official language in the Basque
Country (Spain). It counts on scarce linguistic resources and
database, in addition, it is a highly inflected language. As
a result, exploiting the morphology seems a good choice to
improve the reliance on statistics.
The models were assessed on the METEUS corpus, consisting of text and speech of weather forecast reports picked from those published on the Internet. As shown in Table 1, the corpus is divided into a training set and a training-independent test set of 500 sentences. Each sentence of the test set was uttered by at least 3 speakers, resulting in speech evaluation data of 1,800 utterances from 36 speakers. Note that the size of the Basque vocabulary is 38% bigger than the Spanish one due to its inflected nature.
                                   Basque     Spanish
  Training    Pairs of sentences        14,615
  (Text)      Different pairs            8,220
              Running words        154,778     168,722
              Vocabulary             1,097         675
              Average length          10.6        11.5
  Test        Utterances                 1,800
  (Speech)    Length (hours)            3.5         3.0

Table 1. Main features of the METEUS corpus.
The phrase-based SFST using morphemes proposed here, PM-SFST, was compared with the other two models previously mentioned, namely PW-SFST and W-SFST. The three models were trained on the corpus described in Table 1 making use of the so-called GIATI algorithm [7]. Speech translation was carried out using both the integrated and decoupled architectures. Besides, in order to explore the influence on the translation model of errors derived from the recognition process, a verbatim translation was also carried out. In this case, the input of the text-to-text translation system is the transcription of the speech free from errors (as if the recognition process had been flawless).
3.1. Computational cost and performance
The memory required for a model to be allocated in memory
along with the invested decoding time are two key parame-
ters to bear in mind when it comes to evaluating a speech
translation system. Table 2 shows the spatial cost (in terms
of number of transitions and branching factor) of each of the
three SFST models studied along with the relative decoding
time consumed. Regarding the time units, they are relative
to the baseline W-SF ST model, that is, given that the test was
translated in 1 time unit by W-SF ST , the time units required
by the PW-S FS T and PM -S FS T was picked up.
              Transitions     BF    <Time>
  W-SFST          114,531   3.27      1.00
  PW-SFST         121,265   3.25      0.76
  PM-SFST         127,312   3.21      0.71

Table 2. Spatial cost, in terms of the number of transitions and the branching factor (BF), and the relative amount of time required by each model for text-input translation (dimensionless magnitude).
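For clarity, the branching factor reported in Table 2 is the average number of transitions leaving a state; the sketch below shows how such a figure can be computed from a flat transition list (an assumed representation, not the toolkit internals).

    from collections import defaultdict

    def branching_factor(transitions):
        """transitions: iterable of (state, next_state) pairs; returns the average
        out-degree, i.e. the mean number of transitions leaving each state."""
        out_degree = defaultdict(int)
        for state, _next_state in transitions:
            out_degree[state] += 1
        return sum(out_degree.values()) / len(out_degree) if out_degree else 0.0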
Doubtless, it is the performance, measured in terms of translation accuracy or error rate, that counts for the evaluation of both speech and text translation. Translation results were assessed with the commonly used automatic evaluation metrics: the bilingual evaluation understudy (BLEU [12]) and the word error rate (WER). Table 3 shows speech translation results with the three approaches mentioned above and the different architectures. The recognition WER for the decoupled architecture was obtained through previous ASR experiments reported in [11] with the same set of morphemes. We would like to emphasize that speech translation with the integrated architecture gives both the transcription and the translation of the speech in the same decoding step and, as a result, each model yields its own recognition word error rate.
                          Recognition    Translation
                              WER       WER     BLEU
  Integrated   W-SFST         6.26      47.5    47.6
               PW-SFST        6.12      48.4    48.0
               PM-SFST        6.06      47.8    48.6
  Decoupled    W-SFST         4.93      46.9    47.3
               PW-SFST        4.93      48.5    49.0
               PM-SFST        4.93      47.8    49.3
  Verbatim     W-SFST         0         45.6    48.6
               PW-SFST        0         46.5    50.4
               PM-SFST        0         46.7    50.7

Table 3. Speech translation results provided by the different translation models (W-SFST, PW-SFST, PM-SFST) under either the integrated or the decoupled architecture. The verbatim translation is also shown as a baseline.
3.2. Discussion
Both the PM-SFST and PW-SFST models outperform the baseline W-SFST with 95% confidence under 1,000 bootstrap samples, following the statistical significance test described in [13] with the BLEU evaluation measure. Nevertheless, the differences between PM-SFST and PW-SFST are marginal.
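For illustration, a paired-bootstrap sketch in the spirit of the test cited above follows; it is not the exact procedure of [13], and corpus_score is a placeholder for any corpus-level metric such as BLEU: the test set is resampled with replacement and the fraction of samples in which one system beats the other is counted.

    import random

    def bootstrap_win_rate(segments_a, segments_b, corpus_score,
                           samples=1000, seed=0):
        """segments_a/b: parallel lists of per-sentence statistics for systems A and B.
        Returns the fraction of bootstrap samples in which A scores higher than B
        (a value above 0.95 suggests A is better at the 95% confidence level)."""
        rng = random.Random(seed)
        n, wins = len(segments_a), 0
        for _ in range(samples):
            idx = [rng.randrange(n) for _ in range(n)]
            if corpus_score([segments_a[i] for i in idx]) > \
               corpus_score([segments_b[i] for i in idx]):
                wins += 1
        return wins / samples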
Comparing the two architectures considered, the translation results are similar. Furthermore, taking into account that the LM used for speech transcription in the decoupled architecture and the SFST used to both recognize and translate speech were trained on the same amount of data, one could expect the parameters of the latter to be less reliably estimated and, accordingly, the recognition performance of the integrated architecture to be worse.
The differences in translation performance between speech translation with the decoupled architecture and the verbatim translation are small. Two factors contribute to this: on the one hand, the input of the speech translation was not very degraded; on the other hand, the transducer shows a certain capacity to deal with input errors through mechanisms such as smoothing.
With respect to the size and time-efficiency of the models (summarized in Table 2), the phrase-based models (both PM-SFST and PW-SFST) are, as expected, bigger than the W-SFST. Nevertheless, their branching factor is smaller, which indicates that the phrase-based models are more restrictive than the word-based one in that, on average, they allow a smaller number of transitions per state. Note that in the smoothed W-SFST all strings have non-zero probability, whereas in the phrase-based approaches only those strings built up from the existing phrases have non-zero probability. Regarding decoding time (Table 2), there is a correlation with the branching factor: the higher the branching factor, the higher the required time, and thus the PM-SFST model shows significant time reductions.
4. CONCLUDING REMARKS AND FUTURE WORK
For natural language processing applications in which the language under study is morphologically rich, it might be useful to exploit morphology. By using morpheme-like units, the statistics collected over a given database can be improved and, accordingly, so can the parameters describing the statistical models. As far as speech translation is concerned, there is a further interest in using morphemes as lexical units: the way in which the morphemes were extracted kept a low morpheme-to-word ratio, thus avoiding acoustic confusion.
In this work we have dealt with Basque-to-Spanish speech translation. Speech translation has been formulated in terms of morphemes within the finite-state framework. The models have been assessed on a limited-domain task, yielding improvements in both translation accuracy and decoding time.
As far as future work is concerned, the generation of target words from morphemes, given a source out-of-vocabulary word, is still an open problem that might also be explored from a statistical approach. That is, instead of performing analysis, as in our case, generation might be tackled.
5. REFERENCES
[1] G. Labaka, N. Stroppa, A. Way, and K. Sarasola,
“Comparing rule-based and data-driven approaches to
Spanish-to-Basque machine translation,” in Proc. Ma-
chine Translation Summit XI, 2007.
[2] E. Minkov, K. Toutanova, and H. Suzuki, “Generating complex morphology for machine translation,” in Proc. 45th Annual Meeting of the Association for Computational Linguistics, 2007, pp. 128–135.
[3] S. Virpioja, J. J. Väyrynen, M. Creutz, and M. Sadeniemi, “Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner,” in Proc. Machine Translation Summit XI, 2007, pp. 491–498.
[4] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch,
M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran,
R. Zens, et al., “Moses: Open source toolkit for sta-
tistical machine translation,” Proceedings of the 45th
Annual Meeting of the Association for Computational
Linguistics Companion, pp. 177–180, 2007.
[5] P. Karageorgakis, A. Potamianos, and I. Klasinas, “To-
wards incorporating language morphology into statisti-
cal machine translation systems,” in Proc. Automatic
Speech Recogn. and Underst. Workshop (ASRU), 2005.
[6] F. Casacuberta, M. Federico, H. Ney, and E. Vidal, “Recent efforts in spoken language translation,” IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 80–88, 2008.
[7] F. Casacuberta and E. Vidal, “Learning finite-state mod-
els for machine translation,” Machine Learning, vol. 66,
no. 1, pp. 69–91, 2007.
[8] M. Mohri, F. Pereira, and M. Riley, “AT&T FSM Library™ and Finite-State Machine Library,” 2003.
[9] A. Pérez, M. I. Torres, and F. Casacuberta, “Speech translation with phrase-based stochastic finite-state transducers,” in Proc. IEEE 32nd International Conference on Acoustics, Speech, and Signal Processing, 2007, vol. IV, pp. 113–116.
[10] M. Creutz and K. Lagus, “Inducing the morphological lexicon of a natural language from unannotated text,” in Proc. International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning, 2005.
[11] V. G. Guijarrubia, M. I. Torres, and R. Justo, “Morpheme-based automatic speech recognition of Basque,” in Proc. 4th Iberian Conference on Pattern Recognition and Image Analysis, 2009, pp. 386–393, Springer-Verlag.
[12] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[13] M. Bisani and H. Ney, “Bootstrap estimates for confidence intervals in ASR performance evaluation,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 1, pp. 409–412.
In this work, we focus on studying a morpheme-based speech recognition system for Basque, an highly inflected language that is official language in the Basque Country (northern Spain). Two different techniques are presented to decompose the words into their morphological units. The morphological units are then integrated into an Automatic Speech Recognition System, and those systems are then compared to a word-based approach in terms of accuracy and processing speed. Results show that whereas the morpheme-based approaches perform similarly from an accuracy point of view, they can be significantly faster than the word-based system when applied to a weather-forecast task.