Conference Paper

Unsupervised morphology rivals supervised morphology for Arabic MT


Abstract

If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.


... We evaluate our model on datasets in three languages: Arabic, English and Turkish. We compare our performance against five state-of-the-art unsupervised systems: Morfessor Baseline (Virpioja et al., 2013), Morfessor CatMAP (Creutz and Lagus, 2005), AGMorph (Sirts and Goldwater, 2013), the Lee Segmenter (Lee et al., 2011; Stallard et al., 2012) and the system of Poon et al. (2009). Our model consistently equals or outperforms these systems across the three languages. ...
... (2011) present a model that takes advantage of syntactic context to perform better morphological segmentation. Stallard et al. (2012) improve on this approach using the technique of Maximum Marginal decoding to reduce noise. Their best system considers entire sentences, while our approach (and the morphological analyzers described above) operates at the vocabulary level without regard to sentence context. ...
... We show results for the Compounding grammar, which we find has the best average performance over the languages. The Lee Segmenter (Lee et al., 2011), improved upon by using Maximum Marginal decoding in Stallard et al. (2012), has achieved excellent performance on the Arabic (ATB) dataset. We perform comparison experiments with the model 2 (M2) of the segmenter, which employs latent POS tags, and does not require sentence context, which is not available for other languages in the dataset. ...
Article
Most state-of-the-art systems today produce morphological analysis based only on orthographic patterns. In contrast, we propose a model for unsupervised morphological analysis that integrates orthographic and semantic views of words. We model word formation in terms of morphological chains, from base words to the observed words, breaking the chains into parent-child relations. We use log-linear models with morpheme and word-level features to predict possible parents, including their modifications, for each word. The limited set of candidate parents for each word render contrastive estimation feasible. Our model consistently matches or outperforms five state-of-the-art systems on Arabic, English and Turkish.
... Morphological analysis plays an increasingly important role in many language processing applications. Recent research has demonstrated that adding information about word structure increases the quality of translation systems and alleviates sparsity in language modeling (Chahuneau et al., 2013b; Habash, 2008; Kirchhoff et al., 2006; Stallard et al., 2012). ...
... Recent work has demonstrated that even morphological analyzers that use little or no supervision can help improve performance in language modeling and machine translation (Chahuneau et al., 2013b; Stallard et al., 2012). It has also been shown that segmentation lattices improve the quality of machine translation systems (Dyer, 2009). ...
Article
Full-text available
We explore the impact of morphological segmentation on keyword spotting (KWS). Despite potential benefits, state-of-the-art KWS systems do not use morphological information. In this paper, we augment a state-of-the-art KWS system with sub-word units derived from supervised and unsupervised morphological segmentations, and compare with phonetic and syllabic segmentations. Our experiments demonstrate that morphemes improve overall performance of KWS systems. Syllabic units, however, rival the performance of morphological units when used in KWS. By combining morphological, phonetic and syllabic segmentations, we demonstrate substantial performance gains.
... There are many works using minimally supervised to unsupervised models of morphology for connecting morphologically related words and identifying optimal (and at times application-dependent) segmentations (Smith and Eisner, 2005; Creutz and Lagus, 2005; Snyder and Barzilay, 2008; Poon et al., 2009; Dreyer and Eisner, 2011; Stallard et al., 2012; Sirts and Goldwater, 2013; Narasimhan et al., 2015; Sennrich et al., 2016; Eskander et al., 2016b; Ataman et al., 2017; Ataman and Federico, 2018; Eskander et al., 2018). In this paper, we compare to two popular language agnostic segmentation systems: MORFESSOR (Creutz and Lagus, 2005) and BPE (Sennrich et al., 2016). ...
Conference Paper
Full-text available
We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring language specific knowledge, but no direct supervision. Our technique involves creating a small grammar of closed-class affixes which can be written in a few hours. The grammar over generates analyses for word forms attested in a raw corpus which are disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morphosyntactic or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. We demonstrate the utility of de-lexical segmentation on several dialects of Arabic. We consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
... This allows us to use morphological analyzers that are based on unsupervised models that can be quickly trained on new languages using only monolingual data. As Stallard et al. (2012) have shown, MT improvements obtained from unsupervised analyzers can rival those obtained from language-specific analyzers that are based on extensive linguistic knowledge. ...
Article
Full-text available
We describe BBN’s contribution to the machine translation (MT) task in the LoReHLT 2016 evaluation, focusing on the techniques and methodologies employed to build the Uyghur–English MT systems in low-resource conditions. In particular, we discuss the data selection process, morphological segmentation of the source, neural network feature models, and our use of a native informant and related language resources. Our final submission for the evaluation was ranked first among all participants.
... A number of data-driven approaches have been proposed that learn to segment words into smaller units from data (Demberg, 2007; Sami Virpioja and Kurimo, 2013) and shown to improve phrase-based MT (Fishel and Kirik, 2010; Stallard et al., 2012). Recently, with the advent of neural MT, a few sub-word-based techniques have been proposed that segment words into smaller units to tackle the limited vocabulary and unknown word problems (Sennrich et al., 2016; Wu et al., 2016). ...
Article
Full-text available
Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, considerable research effort has been devoted to improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of machine translation and POS tagging, we found these methods to achieve close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source and target tokens, and that a ratio close to 1 or greater gives optimal performance.
... A number of data-driven approaches have been proposed that learn to segment words into smaller units from data (Demberg, 2007; Sami Virpioja and Kurimo, 2013) and shown to improve phrase-based MT (Fishel and Kirik, 2010; Stallard et al., 2012). Recently, with the advent of neural MT, a few sub-word-based techniques have been proposed that segment words into smaller units to tackle the limited vocabulary and unknown word problems (Sennrich et al., 2016; Wu et al., 2016). ...
... Baselines We compare our approach against the state-of-the-art unsupervised method of Narasimhan et al. (2015), which outperforms a number of alternative approaches (Creutz and Lagus, 2005; Virpioja et al., 2013; Sirts and Goldwater, 2013; Lee et al., 2011; Stallard et al., 2012; Poon et al., 2009). For this baseline, we report the results of the publicly available implementation of the technique (NBJ'15), as well as our own improved reimplementation (NBJ-Imp). ...
Article
This paper focuses on unsupervised modeling of morphological families, collectively comprising a forest over the language vocabulary. This formulation enables us to capture edgewise properties reflecting single-step morphological derivations, along with global distributional properties of the entire forest. These global properties constrain the size of the affix set and encourage formation of tight morphological families. The resulting objective is solved using Integer Linear Programming (ILP) paired with contrastive estimation. We train the model by alternating between optimizing the local log-linear model and the global ILP objective. We evaluate our system on three tasks: root detection, clustering of morphological families and segmentation. Our experiments demonstrate that our model yields consistent gains in all three tasks compared with the best published results.
... Character-based translation has also been investigated with phrase-based models, which proved especially successful for closely related languages (Vilar et al., 2007; Tiedemann, 2009; Neubig et al., 2012). The segmentation of morphologically complex words such as compounds is widely used for SMT, and various algorithms for morpheme segmentation have been investigated (Nießen and Ney, 2000; Koehn and Knight, 2003; Virpioja et al., 2007; Stallard et al., 2012). Segmentation algorithms commonly used for phrase-based SMT tend to be conservative in their splitting decisions, whereas we aim for an aggressive segmentation that allows for open-vocabulary translation with a compact network vocabulary, and without having to resort to back-off dictionaries. ...
... A common approach in such a case is to take several samples and report the average result. Maximum marginal decoding (MMD) (Johnson and Goldwater 2009; Stallard et al. 2012), which constructs a marginal distribution from several independent samples and returns their mean value, has been shown to improve the results of sampling-based models by about 1-2 percentage points. Although the AG model uses sampling for training, MMD is not applicable here because during test time the segmentations are obtained using parsing. ...
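The maximum marginal decoding idea described in this excerpt can be illustrated with a small sketch: run the sampler several times independently, then pick, for each word, the analysis whose estimated marginal probability (its frequency across runs) is highest. The run format (one dictionary per run mapping word to segmentation tuple) is an illustrative assumption, not the cited papers' actual data structures.

```python
from collections import Counter

def maximum_marginal_decode(runs):
    """Pick, for each word, the segmentation with the highest estimated
    marginal probability, i.e. the one appearing most often across
    independent sampler runs (each run maps word -> segmentation tuple)."""
    decoded = {}
    for word in runs[0]:
        counts = Counter(run[word] for run in runs)
        decoded[word] = counts.most_common(1)[0][0]
    return decoded
```

Voting over runs in this way also damps run-to-run variance, which matches the observation that MMD makes segmentation output more stable.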
Article
This article presents a comparative study of a subfield of morphology learning referred to as minimally supervised morphological segmentation. In morphological segmentation, word forms are segmented into morphs, the surface forms of morphemes. In the minimally supervised data-driven learning setting, segmentation models are learned from a small number of manually annotated word forms and a large set of unannotated word forms. In addition to providing a literature survey on published methods, we present an in-depth empirical comparison on three diverse model families, including a detailed error analysis. Based on the literature survey, we conclude that the existing methodology contains substantial work on generative morph lexicon-based approaches and methods based on discriminative boundary detection. As for which approach has been more successful, both the previous work and the empirical evaluation presented here strongly imply that the current state of the art is yielded by the discriminative boundary detection methodology.
... There is a large body of work studying the best form of segmentation when translating from a morphologically complex source language (Sadat and Habash, 2006; Stallard et al., 2012), where the segmentation can be used as a simple preprocessing step, or to create an input lattice (Dyer et al., 2008). Recently, there has been a growing interest in segmentation on the target side (Oflazer and Durgar El-Kahlout, 2007), which introduces a question of how to perform proper desegmentation (Badr et al., 2008). ...
... The segmentation of morphologically complex words such as compounds is widely used for SMT, and various algorithms for morpheme segmentation have been investigated (Nießen and Ney, 2000; Koehn and Knight, 2003; Virpioja et al., 2007; Stallard et al., 2012). Segmentation algorithms commonly used for phrase-based SMT tend to be conservative in their splitting decisions, whereas we aim for an aggressive segmentation that allows for open-vocabulary translation with a compact network vocabulary, and without having to resort to back-off dictionaries. ...
Article
Full-text available
Neural machine translation (NMT) models typically operate with a fixed vocabulary, so the translation of rare and unknown words is an open problem. Previous work addresses this problem through back-off dictionaries. In this paper, we introduce a simpler and more effective approach, enabling the translation of rare and unknown words by encoding them as sequences of subword units, based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 0.8 and 1.5 BLEU, respectively.
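The byte pair encoding segmentation described in this abstract is easy to sketch: repeatedly count adjacent symbol pairs in a frequency-weighted vocabulary and merge the most frequent pair. This is a minimal sketch of the learning loop only, not the full subword-nmt implementation (which adds vocabulary thresholds and a separate apply step):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency.
    vocab maps a space-separated symbol sequence to its corpus count."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn the num_merges most frequent merges."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

Words are initialized as character sequences with an end-of-word marker (here `</w>`), so that learned subwords cannot span word boundaries.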
Chapter
Neural machine translation (NMT) is a popular paradigm for automatically translating text from one language to another. One of the challenges of NMT is handling out-of-vocabulary (OOV) and low-frequency words. Recent work shows that splitting words into subword units outperforms the classic word-level practice. However, many subword segmentation methods have been proposed, and it is difficult to find the one best suited for a language pair. In this paper, we conduct experiments using different subword segmentation methods, including Byte Pair Encoding and Character Embedding, among others, in English-Vietnamese translation tasks. Experimental results on our dataset show that subword encoding yields better performance than regular encoding. We also analyze the results to suggest which subword method is best suited for English-Vietnamese NMT tasks.
Article
Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich and yet resource poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. We show comparable performance to state-of-the-art supervised methods for DA segmentation.
Conference Paper
Morphological models are used in many natural language processing tasks including machine translation and speech recognition. We investigated probabilistic and grouping methods to develop a morphological root identification model for Afaan Oromo. In this paper, we have experimentally shown that the proposed methods can improve the morphological root identification performance of some state-of-the-art methods.
Conference Paper
Full-text available
Arabic Dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialectal sentences are selected from a large corpus of Arabic web text, and translated using Amazon's Mechanical Turk. We use this data to build Dialectal Arabic MT systems, and find that small amounts of dialectal data have a dramatic impact on translation quality. When translating Egyptian and Levantine test sets, our Dialectal Arabic MT system performs 6.3 and 7.0 BLEU points higher than a Modern Standard Arabic MT system trained on a 150M-word Arabic-English parallel corpus.
Conference Paper
Full-text available
This paper describes a full two-level morphological description of Turkish word structures. The description has been implemented using the PC-KIMMO environment and is based on a root word lexicon of about 23,000 root words. The phonetic rules of contemporary Turkish (spoken in Turkey) have been encoded using 22 two-level rules while the morphotactics of the agglutinative word structures have been encoded as finite-state machines for verbal, nominal paradigms and other categories. Almost all the special cases of, and exceptions to phonological and morphological rules have been taken into account. In this paper, we describe the rules and the finite state machines along with examples and a discussion of how various special cases were handled. We also describe some known limitations and problems with this description.
Article
Full-text available
In this paper, we apply a method of unsupervised morphology learning to a state-of-the-art phrase-based statistical machine translation (SMT) system. In SMT, words are traditionally used as the smallest units of translation. Such a system generalizes poorly to word forms that do not occur in the training data. In particular, this is problematic for languages that are highly compounding, highly inflecting, or both. An alternative way is to use sub-word units, such as morphemes. We use the Morfessor algorithm to find statistical morpheme-like units (called morphs) that can be used to reduce the size of the lexicon and improve the ability to generalize. Translation and language models are trained directly on morphs instead of words. The approach is tested on three Nordic languages (Danish, Finnish, and Swedish) that are included in the Europarl corpus consisting of the Proceedings of the European Parliament. However, in our experiments we did not obtain higher BLEU scores for the morph model than for the standard word-based approach. Nonetheless, the proposed morph-based solution has clear benefits, as morphologically well motivated structures (phrases) are learned, and the proportion of words left untranslated is clearly reduced.
Conference Paper
Full-text available
We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.
Conference Paper
Full-text available
If two translation systems differ in performance on a test set, can we trust that this indicates a difference in true system quality? To answer this question, we describe bootstrap resampling methods to compute the statistical significance of test results, and validate them on the concrete example of the BLEU score. Even for small test sizes of only 300 sentences, our methods may give us assurances that test result differences are real.
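The bootstrap test described above can be sketched in a few lines, under one simplifying assumption: the sketch resamples per-sentence scores and compares their sums, whereas the paper resamples whole test sets and recomputes corpus-level BLEU (which does not decompose over sentences).

```python
import random

def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Estimate how often system A beats system B on bootstrap-resampled
    test sets. scores_a / scores_b are per-sentence scores for the same
    test sentences; each resample draws n sentence indices with
    replacement and compares the two systems on that virtual test set."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / num_samples
```

If system A wins on at least 95% of the resampled sets, the difference is conventionally reported as significant at the 95% level.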
Conference Paper
Full-text available
We propose a backoff model for phrase-based machine translation that translates unseen word forms in foreign-language text by hierarchical morphological abstractions at the word and the phrase level. The model is evaluated on the Europarl corpus for German-English and Finnish-English translation and shows improvements over state-of-the-art phrase-based models.
Conference Paper
Full-text available
We address the problem of translating from morphologically poor to morphologically rich languages by adding per-word linguistic information to the source language. We use the syntax of the source sentence to extract information for noun cases and verb persons and annotate the corresponding words accordingly. In experiments, we show improved performance for translating from English into Greek and Czech. For English-Greek, we reduce the error on the verb conjugation from 19% to 5.4% and noun case agreement from 9% to 6%.
Conference Paper
Full-text available
This paper extends the training and tuning regime for phrase-based statistical machine translation to obtain fluent translations into morphologically complex languages (we build an English to Finnish translation system). Our methods use unsupervised morphology induction. Unlike previous work we focus on morphologically productive phrase pairs -- our decoder can combine morphemes across phrase boundaries. Morphemes in the target language may not have a corresponding morpheme or word in the source language. Therefore, we propose a novel combination of post-processing morphology prediction with morpheme-based translation. We show, using both automatic evaluation scores and linguistically motivated analyses of the output, that our methods outperform previously proposed ones and provide the best known results on the English-Finnish Europarl translation task. Our methods are mostly language independent, so they should improve translation into other target languages with complex morphology.
Conference Paper
Full-text available
This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.
Conference Paper
Full-text available
Morphological segmentation breaks words into morphemes (the basic semantic units). It is a key component for natural language processing systems. Unsupervised morphological segmentation is attractive, because in every language there are virtually unlimited supplies of text, but very few labeled resources. However, most existing model-based systems for unsupervised morphological segmentation use directed generative models, making it difficult to leverage arbitrary overlapping features that are potentially helpful to learning. In this paper, we present the first log-linear model for unsupervised morphological segmentation. Our model uses overlapping features such as morphemes and their contexts, and incorporates exponential priors inspired by the minimum description length (MDL) principle. We present efficient algorithms for learning and inference by combining contrastive estimation with sampling. Our system, based on monolingual features only, outperforms a state-of-the-art system by a large margin, even when the latter uses bilingual information such as phrasal alignment and phonetic correspondence. On the Arabic Penn Treebank, our system reduces F1 error by 11% compared to Morfessor.
Article
Full-text available
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
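The maximum a posteriori objective behind Morfessor can be caricatured as a two-part description length: the cost of coding the morph lexicon plus the cost of coding the corpus given that lexicon. The sketch below is a deliberate simplification under stated assumptions (a flat per-character lexicon charge, no usage features), not Morfessor's actual cost function:

```python
import math
from collections import Counter

def description_length(segmented_corpus):
    """Two-part MDL-style cost in bits: cost(lexicon) + cost(corpus | lexicon).
    segmented_corpus is a list of word analyses, each a tuple of morphs."""
    morph_counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(morph_counts.values())
    # Corpus cost: -sum over morph tokens of log2 P(morph)
    corpus_cost = -sum(c * math.log2(c / total) for c in morph_counts.values())
    # Lexicon cost: flat per-character charge for each distinct morph,
    # plus one end-of-morph symbol; the 27-symbol alphabet is an assumption
    bits_per_char = math.log2(27)
    lexicon_cost = sum(len(m) + 1 for m in morph_counts) * bits_per_char
    return corpus_cost + lexicon_cost
```

On a toy corpus where many stems share the suffixes "ed"/"ing"/"s", the segmented analysis can obtain a lower total cost than the unsegmented one, which is the pressure that drives the induction toward reusable morphs.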
Article
We present our investigations on unsupervised Turkish morphological segmentation (TMS) for statistical machine translation (SMT), which has not been addressed in previous work (Table 1). We perform in vivo testing of the TMS performance in a phrase-based SMT system developed for the Turkish-English task at IWSLT. We compare unsupervised segmentation against two baselines: (1) no segmentation, i.e., word-based translation, and (2) a state-of-the-art supervised segmentation comprising morphological analysis + disambiguation + manually-crafted rules (Mermer et al., 2009) that performed very well in IWSLT 2010. We set out with an existing unsupervised segmentation tool, Morfessor (Creutz and Lagus, 2007). The original search algorithm of Morfessor aims to satisfy the MAP objective by greedily searching for the segmentation that results in the highest posterior probability according to the generative model. However, the greedy search in the high-dimensional combinatorial search space often gets stuck in local optima. We instead propose to approximate the posterior distribution of segmentations via Gibbs sampling. We decide the segmentation location for a word by drawing a sample from the distribution proportional to the posterior probability of the model given the existing state of segmentation for the rest of the words. We ran the Gibbs sampler for 2000 iterations over the (dynamic) sub-word vocabulary. Table 2 shows that Gibbs sampling is able to find better segmentations in terms of model scores (previously unattainable in greedy search). However, this search improvement does not translate over to the translation performance (Table 3). This suggests a model mismatch, which can be expected in this case since the segmentation model uses only monolingual observations. Hence we extend the generative model to incorporate both sides of the parallel corpus via translation:

P(e, f, M_f) = Σ_{f_seg} P(M_f) P(f_seg | M_f) P(f | f_seg) P(e | f_seg).

Here, e and f are the two sides of the parallel corpus and M_f is the segmentation model for f that results in the segmentation f_seg. Note that P(f | f_seg) is either 1 or 0, indicating legal segmentations of f. In searching for the MAP segmentation model M*_f, we approximate the summation with the max operation. We model the first two components as in the monolingual case, while for the translation component P(e | f_seg) we use IBM Model 1. To cope with the increased computational load, we propose a search algorithm that enables parallel computation instead of the original sequential search. We also devise a method of computing the approximate IBM Model-1 translation likelihood incrementally from an adjacent segmentation state to speed up the computation.
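The translation component used above is IBM Model 1, whose likelihood factorizes over target words. A minimal, non-incremental sketch follows; the lexical table format and the small floor probability for unseen pairs are assumptions for illustration:

```python
import math

def model1_log_likelihood(src_tokens, tgt_tokens, t):
    """IBM Model 1 log-likelihood log P(e | f) with a NULL source token:
    log P(e|f) = sum_j log( (1/(l+1)) * sum_i t(e_j | f_i) ),
    where t is a dict of lexical translation probabilities t[(e, f)]."""
    src = ['<NULL>'] + list(src_tokens)
    ll = 0.0
    for e in tgt_tokens:
        # Uniform alignment prior over the l+1 source positions,
        # with a tiny floor for pairs missing from the table
        p = sum(t.get((e, f), 1e-10) for f in src) / len(src)
        ll += math.log(p)
    return ll
```

A segmentation whose morphs align well with the English side raises this likelihood, which is how the bilingual signal enters the segmentation objective; the incremental computation mentioned above reuses the per-word inner sums between adjacent segmentation states rather than recomputing them.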
Conference Paper
The connection between part-of-speech (POS) categories and morphological properties is well-documented in linguistics but underutilized in text processing systems. This paper proposes a novel model for morphological segmentation that is driven by this connection. Our model learns that words with common affixes are likely to be in the same syntactic category and uses learned syntactic categories to refine the segmentation boundaries of words. Our results demonstrate that incorporating POS categorization yields substantial performance gains on morphological segmentation of Arabic.
Conference Paper
We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Submerging the proposed segmentation method in a SMT task from morphologically-rich Turkish to English does not exhibit the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.
Conference Paper
In this paper, we propose a novel string-to-dependency algorithm for statistical machine translation. With this new framework, we employ a target dependency language model during decoding to exploit long-distance word relations, which are unavailable with a traditional n-gram language model. Our experiments show that the string-to-dependency decoder achieves a 1.48-point improvement in BLEU and a 2.53-point improvement in TER compared to a standard hierarchical string-to-string system on the NIST 04 Chinese-English evaluation set.
Conference Paper
Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes, resulting in improved translation quality.
Conference Paper
We present four techniques for online handling of out-of-vocabulary words in phrase-based statistical machine translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis.
Conference Paper
One of the reasons nonparametric Bayesian inference is attracting attention in computational linguistics is because it provides a principled way of learning the units of generalization together with their probabilities. Adaptor grammars are a framework for defining a variety of hierarchical nonparametric Bayesian models. This paper investigates some of the choices that arise in formulating adaptor grammars and associated inference procedures, and shows that they can have a dramatic impact on performance in an unsupervised word segmentation task. With appropriate adaptor grammars and inference procedures we achieve an 87% word token f-score on the standard Brent version of the Bernstein-Ratner corpus, which is an error reduction of over 35% over the best previously reported results for this corpus.
Article
This article presents several techniques for integrating information from a rule-based machine translation (RBMT) system into a statistical machine translation (SMT) framework. These techniques are grouped into three parts that correspond to the type of information integrated: the morphological, lexical, and system levels. The first part presents techniques that use information from a rule-based morphological tagger to do morpheme splitting of the Arabic source text. We also compare with the results of using a statistical morphological tagger. In the second part, we present two ways of using Arabic diacritics to improve SMT results, both based on binary decision trees. The third part presents a system combination method that combines the outputs of the RBMT and the SMT systems, leveraging the strength of each. This article shows how language specific information obtained through a deterministic rule-based process can be used to improve SMT, which is mostly language-independent.