Article

Hybrid Arabic-French Machine Translation using Syntactic Re-ordering and Morphological Pre-processing


Abstract

Arabic is a highly inflected and morpho-syntactically complex language that differs in many respects from the languages most heavily studied in the field. It presents significant challenges for Natural Language Processing (NLP), and specifically for Machine Translation (MT), and may therefore require careful pre-processing. This paper examines how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach that couples an Arabic-French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level closer to that of the target language (French). Moreover, we introduce additional swapping rules for structural matching between the source and target languages. Two structural changes, involving the positions of pronouns and verbs in both the source and target languages, have been attempted. The results show an improvement in translation quality and a gain in BLEU score after introducing a pre-processing scheme for Arabic and applying these rules, which are based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms of BLEU score under both scarce- and large-resource conditions. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.
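The VS-into-SV re-ordering mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' rule set, which the abstract does not spell out. The tag names (`VERB`, `NOUN`, `PRON`, `ADJ`), the function name `reorder_vs_to_sv`, and the subject heuristic (one noun or pronoun directly after a clause-initial verb, plus any adjectives that follow it) are assumptions made purely for illustration. A real system would rely on a morphological analyzer or parser to find subject boundaries.

```python
from typing import List, Tuple

# A token is (word, coarse POS tag). The tag names are hypothetical.
Token = Tuple[str, str]


def reorder_vs_to_sv(tokens: List[Token]) -> List[Token]:
    """Move a clause-initial verb behind the subject that follows it.

    Assumed heuristic: the subject is a single NOUN or PRON token right
    after the verb, extended by any ADJ tokens that follow it (Arabic
    adjectives follow the noun they modify). Anything else is left as is.
    """
    if len(tokens) < 2:
        return list(tokens)
    verb, head = tokens[0], tokens[1]
    if verb[1] != "VERB" or head[1] not in ("NOUN", "PRON"):
        return list(tokens)
    end = 2
    while end < len(tokens) and tokens[end][1] == "ADJ":
        end += 1
    # Subject span first, then the verb, then the rest of the clause.
    return tokens[1:end] + [verb] + tokens[end:]


# Transliterated example: "kataba al-waladu risalatan" (wrote the-boy a-letter)
sentence = [("kataba", "VERB"), ("al-waladu", "NOUN"), ("risalatan", "NOUN")]
print(" ".join(word for word, _ in reorder_vs_to_sv(sentence)))
```

Applied to the source side of both training and test data, a rule of this kind makes the Arabic word order closer to French subject-verb order before alignment. The heuristic deliberately ignores noun-noun (idafa) subjects and would misplace the verb in such clauses.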


... Their syntactic reordering rules, however, did not result in any significant improvement. Mohamed and Sadat (2015) used handcrafted morphological reordering rules to reorder the source-side Arabic sentences in an Arabic-to-French translation task. Their rules attempt to reorder both the pronouns and verbs of the source-side Arabic sentences in a way that matches the target French language. ...
... Arabic-to-English: Chen et al. (2006), Habash (2007), Carpuat et al. (2010), Carpuat et al. (2012), Bisazza et al. (2012)
English-to-Arabic: Elming and Habash (2009), Elming (2008), Badr et al. (2009)
Arabic-Others: Sadat and Mohamed (2013), Mohamed and Sadat (2015), Alqudsi et al. (2019)
Word Alignment
Arabic-to-English: Ittycheriah and Roukos (2005), Fossum et al. (2008), Hermjakob (2009), Gao et al. (2010), Riesa and Marcu (2010), Khemakhem et al. (2015)
English-to-Arabic: Ellouze et al. (2018), Berrichi and Mazroui (2018)
Language Models
Arabic-to-English: Brants et al. (2007), Carter and Monz (2010), Niehues et al. (2011)
English-to-Arabic: Khemakhem et al. (2013)
Other Arabic SMT Studies
Arabic-to-English: Habash (2008), Marton et al. (2012)
English-to-Arabic: Toutanova et al.
Arabic & English: Sajjad et al. (2017), Oudah et al. (2019)
Arabic-Others: Aqlan et al. ...
Thesis
Given that Arabic is one of the most widely used languages in the world, the task of Arabic Machine Translation has recently received a great deal of attention from the research community. Indeed, the amount of focus that has been devoted to this task has led to some important achievements and improvements. However, the current state of Arabic Machine Translation systems has not reached the quality achieved for some other languages such as English and French. In this thesis, we are interested in the task of English-to-Arabic Machine Translation for which we propose several contributions: First, we propose a method that handles both long- and short-distance word reorderings in the context of English-to-Arabic Statistical Machine Translation (SMT). Secondly, we propose a method for named entity transliteration that can accurately transliterate English named entities into Arabic. Finally, our main contribution concerns re-ranking the n-best list in the context of English-to-Arabic Neural Machine Translation (NMT). Our solution uses a set of sophisticated features that cover lexical, syntactic, and semantic aspects of the n-best list candidates. All our contributions are evaluated carefully and the results obtained for the tests we have carried out show the effectiveness of our proposals.
... Arabic NLP has developed numerous tools using MT techniques to analyze the language in both written and spoken forms (Marie-Sainte, Alalyani, Alotaibi, Ghouzali, & Abunadi, 2019). There have been various proposals for processing Arabic, such as Leavitt's (1994) MORPHE, a morphological rule compiler; Soudi and Cavalli-Sforza's (2003) work on interfacing Arabic morphology with sentence generation systems; Habash's (2010) introduction to Arabic NLP; and Mohamed and Sadat's (2015) hybrid translation approach. This approach couples an Arabic-French statistical MT system using the Moses decoder with additional morphological rules to simplify the source language morphology. ...
Article
Full-text available
Ambiguity in some syntactic structures of the same language has always posed problems for the human translator and for machine translation. These problems become more complex in the Machine Translation of genetically unrelated languages such as Arabic, English and French. Arabic lexical ambiguity in Natural Language Processing (NLP) also poses problems when the semantic fields of Arabic words differ from those of English, for instance. This often occurs when two or more words from Arabic equate to a single word in English. Semantic gaps between the two languages are also a source of ambiguity in Natural Language Processing. We shall deal with some cases of ambiguity in machine translation from Arabic to English and French and vice versa. The questions addressed in this paper relate to segmentation, determination / non-determination, coordination and the issue of the word as a meaningful and functional unit. Some aspects of the segmentation of constituents into grammatical categories and their comparison with structures of Arabic, English and French are addressed in this paper.
... Several AMT research studies [91,52,78,81,2,65] showed that the use of hybrid techniques in MT improves translation quality. ...
Article
Full-text available
In a world where linguistic and logical barriers are dissolving and communication must be possible on demand and in a language one is familiar with, Machine Translation (MT) has become essential. This natural language processing sub-field is now the most exciting but also the most difficult topic to solve, as it automates the translation process and decreases the reliance on human translators. Researchers are attempting to translate English into their native language and vice versa, but achieving perfect MT has proven to be a difficult task for researchers all over the world. This paper provides a high-level overview of this subject reviewing the accomplishments and findings of various MT studies in the context of Arabic language, which is known by its richness and complexity. At the end of the survey, open challenges and issues are discussed. Therefore, this study might be useful for future research and for selecting the best method for a certain purpose.
... Arabic is a right-to-left Semitic language. It is cursive, agglutinative, highly inflectional and derivational (Abandah et al., 2015; AbdelRaouf, Higgins, Pridmore & Khalil, 2010; Mohamed & Sadat, 2015). ...
Article
In this paper, we propose to build a morpho-semantic knowledge graph from Arabic vocalized corpora. Our work focuses on classical Arabic, as it has not been deeply investigated in related works. We use a tool suite which allows analyzing and disambiguating Arabic texts, taking into account short diacritics to reduce ambiguities. At the morphological level, we combine the Ghwanmeh stemmer and MADAMIRA, which are adapted to extract a multi-level lexicon from Arabic vocalized corpora. At the semantic level, we infer semantic dependencies between tokens by exploiting contextual knowledge extracted by a concordancer. Both morphological and semantic links are represented through compressed graphs, which are accessed through lazy methods. These graphs are mined using a measure inspired by BM25 to compute one-to-many similarity. We propose to evaluate the morpho-semantic knowledge graph in the context of Arabic Information Retrieval (IR). Several scenarios of document indexing and query expansion are assessed. That is, we vary indexing units for Arabic IR based on different levels of morphological knowledge, a challenging issue which has not yet been resolved in previous research. We also experiment with several combinations of morpho-semantic query expansion. This allows us to validate our resource and to study its impact on IR based on state-of-the-art evaluation metrics.
... Their syntactic reordering rules, however, did not result in any significant improvement. Mohamed and Sadat [56] used handcrafted morphological reordering rules to reorder the source-side Arabic sentences in an Arabic-to-French translation task. Their rules attempt to reorder both the pronouns and verbs of the source-side Arabic sentences in a way that matches the target French language. ...
Article
Given that Arabic is one of the most widely used languages in the world, the task of Arabic Machine Translation (MT) has recently received a great deal of attention from the research community. Indeed, the amount of research that has been devoted to this task has led to some important achievements and improvements. However, the current state of Arabic MT systems has not reached the quality achieved for some other languages. Thus, much research work is still needed to improve it. This survey paper introduces the Arabic language, its characteristics, and the challenges involved in its translation. It provides the reader with a full summary of the important research studies that have been accomplished with regard to Arabic MT along with the most important tools and resources that are available for building and testing new Arabic MT systems. Furthermore, the survey paper discusses the current state of Arabic MT and provides some insights into possible future research directions.
... Both parsers gave f-scores higher than 90 for gold POS and on textual input, they give an average f-score of 87.6. The automatic prediction of syntactic information for a language is helpful to perform several NLP tasks including machine translation [5,6,7], automatic speech recognition [8], text to speech [9], text summarization [10] and social media text analysis [11]. The constituency parsing for Urdu was initiated to analyze syntax-prosody relationship for an annotated speech corpus. ...
Article
Full-text available
This paper presents an analysis of experiments with statistical and neural parsing techniques for Urdu, a widely spoken South Asian language. We demonstrate state-of-the-art constituency parsing results for an Urdu treebank. Urdu is morphologically rich and is characterized by free word order. Language representation (e.g. input type, lemmatization, word clusters), the part-of-speech tag set, phrase labels and the size of the training corpus are crucial for parsing such languages. In this paper, probabilistic context-free grammars, data-oriented parsing, and recursive neural network based models have been tested with several linguistic features which show improvements in the parsing results. Features include syntactic sub-categorization of POS tags, empirically learned horizontal and vertical markovizations and lexical head words. These features enable dependency information for case markers and add phrasal and lexical context to the parse trees. The data-oriented parsing and recursive neural network models give an f-score of 87.1 when gold POS tags are used in the test set; on textual input, they achieve f-scores of 83.4 and 84.2, respectively. To overcome the issue of data sparsity due to morphological richness, lemmatization and unsupervised word clustering have been performed. A treebank should cover the most probable word orders of the language so that models can learn the various orders accurately. To analyze the order coverage of the treebank and the learning capability of different parsers, a test set has been prepared conditioning on different word orders. This test set is evaluated with the best-performing parsing models: with gold POS tags, f-scores are above 90, and on textual input, the average f-score is 87.6.
... Several studies on AMT have shown that the use of HMT improves translation quality. Matusov et al. presented an MT system that combines five MT systems (multi-engine) to translate from Arabic to English with the goal of improving translation quality, achieving a BLEU score of 55 [75]. Mohamed & Sadat presented an HMT approach that adds morphological rules to SMT in order to reduce Arabic morphology to a level closer to that of French when translating from Arabic to French, providing better results [76]. Habash et al. conducted an evaluation of multiple systems for translation from Arabic to English using the HMT approach [77]. ...
Article
Full-text available
Interest in the Arabic language continues to increase worldwide for several reasons, including political, technological, social and cultural factors. The significant increase in Arabic electronic textual information, in conjunction with the boom in social networking and openness to different cultures, has provided a huge collective knowledge source. Such high demand and the urgent need for effective technologies and tools to process and translate information from and to Arabic have motivated researchers in Arabic Machine Translation (AMT) in both the Western and Arab worlds. This paper aims to explore AMT approaches, challenges and proposed solutions, providing a survey of the research activity conducted in this field and the evolution of existing MT solutions.
Chapter
Full-text available
Machine translation (MT) aims to remove linguistic barriers and enable communication by allowing languages to be automatically translated. The availability of a substantial parallel corpus determines the quality of translations produced by corpus-based MT systems. This paper aims to develop a corpus-based bidirectional statistical machine translation (SMT) system for the Punjabi-English, Punjabi-Hindi, and Hindi-English language pairs. To create a parallel corpus for English, Hindi, and Punjabi, the IIT Bombay Hindi-English parallel corpus is used. This paper discusses the preprocessing steps used to create the Hindi, Punjabi, and English corpus. This corpus is used to develop MT models. The accuracy of the MT system is evaluated using an automated metric: Bilingual Evaluation Understudy (BLEU). The BLEU scores reported are 17.79 and 19.78 for the Punjabi to English bidirectional MT system, 33.86 and 34.46.46 for the Punjabi to Hindi bidirectional MT system, and 23.68 and 23.78 for the Hindi to English bidirectional MT system.
Keywords: Machine translation, SMT, Corpus-based, Parallel corpus, BLEU
Conference Paper
Full-text available
In statistical machine translation, data sparsity is a challenging problem especially for languages with rich morphology and inconsistent orthography, such as Persian. We show that orthographic preprocessing and morphological segmentation of Persian verbs in particular improves the translation quality of Persian-English by 1.9 BLEU points on a blind test set.
Article
Full-text available
We describe a substitution-based system for hybrid machine translation (MT) that has been extended with machine learning components controlling its phrase selection. The approach is based on a rule-based MT (RBMT) system which creates template translations. Based on the rule-based generation parse tree and target-to-target alignments, we identify the set of "interesting" translation candidates from one or more translation engines which could be substituted into our translation templates. The substitution process is controlled either by the output of a binary classifier trained on feature vectors from the different MT engines, or by weights for the decision factors, which have been tuned using MERT. We are able to observe improvements in terms of BLEU scores over a baseline version of the hybrid system.
Article
Full-text available
Most of the existing, easily available parallel texts to train a statistical machine translation system are from international organizations that use a particular jargon. In this paper, we consider the automatic adaptation of such a translation model to the news domain. The initial system was trained on more than 200M words of UN bitexts. We then explore large amounts of in-domain monolingual texts to modify the probability distribution of the phrase-table and to learn new task-specific phrase-pairs. This procedure achieved an improvement of 3.5 points BLEU on the test set in an Arabic/French statistical machine translation system. This result compares favorably with other large state-of-the-art systems for this language pair.
Article
Full-text available
We describe two methods to improve SMT accuracy using shallow syntax information. First, we use chunks to refine the set of word alignments typically used as a starting point in SMT systems. Second, we extend an N-gram-based SMT system with chunk tags to better account for long-distance reorderings. Experiments are reported on an Arabic-English task showing significant improvements. A human error analysis indicates that long-distance reorderings are captured effectively.
Article
Full-text available
In this work, the creation of a large-scale Arabic to French statistical machine translation system is presented. We introduce all necessary steps, from corpus acquisition and data preprocessing to training and optimizing the system and its eventual evaluation. Since no corpora existed previously, we collected large amounts of data from the web. Arabic word segmentation was crucial to reduce the overall number of unknown words. We describe the phrase-based SMT system used for training and generation of the translation hypotheses. Results on the second CESTA evaluation campaign are reported. The setting was in the medical domain. The prototype reaches a favorable BLEU score of 40.8%.
Article
Full-text available
To date, there are no fully automated systems addressing the community's need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the SVM-TOK tokenizer achieves an Fβ=1 score of 99.12, the SVM-POS tagger achieves an accuracy of 95.49%, and the SVM-BP chunker yields an Fβ=1 score of 92.08.
Article
Full-text available
We present a novel morphological analysis technique which induces a morphological and syntactic symmetry between two languages with highly asymmetrical morphological structures to improve statistical machine translation qualities. The technique pre-supposes fine-grained segmentation of a word in the morphologically rich language into the sequence of prefix(es)-stem-suffix(es) and part-of-speech tagging of the parallel corpus. The algorithm identifies morphemes to be merged or deleted in the morphologically rich language to induce the desired morphological and syntactic symmetry. The technique improves Arabic-to-English translation qualities significantly when applied to IBM Model 1 and Phrase Translation Models trained on the training corpus size ranging from 3,500 to 3.3 million sentence pairs.
Article
Full-text available
Prague Arabic Dependency Treebank not only consists of multi-level linguistic annotations over the language of Modern Standard Arabic, but even provides a variety of unique software implementations designed for general use in Natural Language Processing (NLP). This paper delivers an overview of the recent and most interesting results, findings and innovations within the project.
Article
Full-text available
Abstract. We improve our recently proposed technique for integrating Arabic verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by distinguishing between matrix (or main clause) and non-matrix Arabic verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject) constructions are translated in inverted SV order, while non-matrix (subordinate clause) VS constructions are inverted in only half the cases. In addition, while detecting verbs and their subjects is a hard task, our syntactic parser detects VS constructions better in matrix than in non-matrix clauses. As a result, reordering only matrix VS for word alignment consistently improves translation quality over a phrase-based SMT baseline, and over reordering all VS constructions, in both medium- and large-scale settings. In fact, the improvements obtained by reordering matrix VS in the medium-scale setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER obtained with a word alignment training bitext that is 5 times larger.
Keywords: Morpho-syntactic analysis of Arabic, Statistical machine translation, VS, VSO.
Article
Full-text available
The Arabic language has far richer systems of inflection and derivation than English, which has very little morphology. This morphology difference causes a large gap between the vocabulary sizes in any given parallel training corpus. Segmentation of inflected Arabic words is a way to smooth its highly morphological nature. In this paper, we describe some statistically and linguistically motivated methods for Arabic word segmentation. Then, we show the efficiency of the proposed methods on the Arabic-English BTEC and NIST tasks.
Article
Full-text available
Phrase re-ordering is a well-known obstacle to robust machine translation for language pairs with significantly different word orderings. For Arabic-English, two languages that usually differ in the ordering of subject and verb, the subject and its modifiers must be accurately moved to produce a grammatical translation. This operation requires more than base phrase chunking and often defies current phrase-based statistical decoders. We present a conditional random field sequence classifier that detects the full scope of Arabic noun phrase subjects in verb-initial clauses at the Fβ=1 = 61.3% level, a 5.0% absolute improvement over a statistical parser baseline. We suggest methods for integrating the classifier output with a statistical decoder and present preliminary machine translation results.
Conference Paper
Full-text available
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.
Conference Paper
Full-text available
In this paper, we compare two novel methods for part-of-speech tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags without any word segmentation; the second approach is segmentation-based, using a machine learning segmenter. Surprisingly, word-based POS tagging yields the best results, with a word accuracy of 94.74%.
Conference Paper
Full-text available
In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Czech, this problem can be particularly severe. In addition, much of the morphological variation seen in Czech words is not reflected in either the morphology or syntax of a language like English. In this work, we show that using morphological analysis to modify the Czech input can improve a Czech-English machine translation system. We investigate several different methods of incorporating morphological information, and show that a system that combines these methods yields the best results. Our final system achieves a BLEU score of .333, as compared to .270 for the baseline word-to-word system.
Article
Full-text available
Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.
Article
Full-text available
We present and compare various methods for computing word alignments using statistical or heuristic models. We consider the five alignment models presented in Brown, Della Pietra, Della Pietra, and Mercer (1993), the hidden Markov alignment model, smoothing techniques, and refinements. These statistical models are compared with two heuristic models based on the Dice coefficient. We present different methods for combining word alignments to perform a symmetrization of directed statistical alignment models. As evaluation criterion, we use the quality of the resulting Viterbi alignment compared to a manually produced reference alignment. We evaluate the models on the German-English Verbmobil task and the French-English Hansards task. We perform a detailed analysis of various design decisions of our statistical alignment system and evaluate these on training corpora of various sizes. An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models. In the Appendix, we present an efficient training algorithm for the alignment models presented.
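The Dice-coefficient heuristic that this abstract uses as a point of comparison can be sketched in a few lines. The sketch below assumes the usual definition, dice(e, f) = 2·C(e, f) / (C(e) + C(f)), with counts taken once per sentence pair. The function name `dice_scores` and the toy bitext are illustrative and not taken from the paper.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple


def dice_scores(bitext: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Compute Dice association scores for all co-occurring word pairs.

    Assumption: a word is counted once per sentence, and a pair co-occurs
    when both words appear in the same aligned sentence pair.
    """
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for src_sentence, tgt_sentence in bitext:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (e, f): 2 * c / (src_count[e] + tgt_count[f])
        for (e, f), c in pair_count.items()
    }


# Toy English-French bitext, purely illustrative.
toy_bitext = [("the house", "la maison"), ("the car", "la voiture")]
scores = dice_scores(toy_bitext)
print(scores[("the", "la")], scores[("house", "maison")])
```

A heuristic aligner would then link each word to the candidate with the highest score. Scores of this kind are computed independently for each pair, which is one reason the abstract reports that statistical alignment models with first-order dependence and fertility perform significantly better.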
Article
Full-text available
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused.
Article
Arabic is a morphologically rich and complex language, which presents significant challenges for natural language processing and machine translation. In this paper, we describe an ongoing effort to build our first Arabic-French phrase-based machine translation system using the Moses decoder among other linguistic tools. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying some rules based on morphological variations of the source language. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.
Conference Paper
We present a Machine-Learning-based framework for hybrid Machine Translation. Our approach combines translation output from several black-box source systems. We define an extensible, total order on translation output and use this to decompose the n-best translations into pairwise system comparisons. Using joint, binarised feature vectors we train an SVM-based classifier and show how its classification output can be used to generate hybrid translations on the sentence level. Evaluations using automated metrics show promising results. An interesting finding in our experiments is that our approach allows us to leverage good translations from otherwise bad systems, as the combination decision is taken on the sentence level instead of the corpus level. We conclude by summarising our findings and by giving an outlook on future work, e.g., on probabilistic classification or the integration of manual judgements.
Article
We present translation results on the shared task "Exploiting Parallel Texts for Statistical Machine Translation" generated by a chart parsing decoder operating on phrase tables augmented and generalized with target language syntactic categories. We use a target language parser to generate parse trees for each sentence on the target side of the bilingual training corpus, matching them with phrase table lattices built for the corresponding source sentence. Considering phrases that correspond to syntactic categories in the parse trees, we develop techniques to augment (declare a syntactically motivated category for a phrase pair) and generalize (form mixed terminal and nonterminal phrases) the phrase table into a synchronous bilingual grammar. We present results on the French-to-English task for this workshop, representing significant improvements over the workshop's baseline system. Our translation system is available open-source under the GNU General Public License.
Article
We describe an approach to automatic source-language syntactic preprocessing in the context of Arabic-English phrase-based machine translation. Source-language labeled dependencies, that are word aligned with target language words in a parallel corpus, are used to automatically extract syntactic reordering rules in the same spirit of Xia and McCord (2004) and Zhang et al. (2007). The extracted rules are used to reorder the source-language side of the training and test data. Our results show that when using monotonic decoding and translations for unigram source-language phrases only, source-language reordering gives very significant gains over no reordering (25% relative increase in BLEU score). With decoder distortion turned on and with access to all phrase translations, the differences in BLEU scores are diminished. However, an analysis of sentence-level BLEU scores shows reordering outperforms no-reordering in over 40% of the sentences. These results suggest that the approach holds big promise but much more work on Arabic parsing may be needed.
Conference Paper
We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.
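The VS-to-SV reordering described in this abstract can be illustrated with a minimal sketch. It assumes the verb index and the subject span are already known (in practice they would come from a dependency parser, which the abstract notes is noisy); the function name `reorder_vs_to_sv` and the Buckwalter-style toy tokens are hypothetical and are not taken from the cited work.

```python
from typing import List, Tuple


def reorder_vs_to_sv(tokens: List[str], verb_idx: int,
                     subj_span: Tuple[int, int]) -> List[str]:
    """Move a post-verbal subject span in front of its verb.

    subj_span is (start, end) with end exclusive. If the span does not
    follow the verb, the sentence is returned unchanged.
    """
    start, end = subj_span
    if not (0 <= verb_idx < start < end <= len(tokens)):
        return list(tokens)
    # prefix + subject + (verb and anything between verb and subject) + rest
    return (tokens[:verb_idx] + tokens[start:end]
            + tokens[verb_idx:start] + tokens[end:])


# Toy VS sentence: "ktb Alwld Aldrs" (wrote the-boy the-lesson).
vs_sentence = ["ktb", "Alwld", "Aldrs"]
sv_sentence = reorder_vs_to_sv(vs_sentence, verb_idx=0, subj_span=(1, 2))
print(" ".join(sv_sentence))
```

In this sketch the reordered sentence would be used only when building word alignments, as the abstract proposes; the original word order is kept everywhere else.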
Conference Paper
In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Article
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and new ideas are constantly introduced. This survey presents a tutorial overview of the state of the art. We describe the context of the current research and then move to a formal problem description and an overview of the main subproblems: translation modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and a discussion of future directions.
Article
We have put together a corpus of 242 abstracts of Arabic documents using the Proceedings of the Saudi Arabian National Conferences as a source. All these abstracts involve computer science and information systems. We also designed and built an automatic information retrieval system from scratch to handle Arabic data. The system was implemented in the C language using the GCC compiler and runs on IBM/PCs and compatible microcomputers. We have implemented both automatic and manual indexing techniques for this corpus. A long series of experiments using measures of recall and precision has demonstrated that automatic indexing is at least as effective as manual indexing and more effective in some cases. Since automatic indexing is both cheaper and faster, our results suggest that we can achieve a wider coverage of the literature with less money and produce as good results as with manual indexing. We have also compared the retrieval results using words as index terms versus stems and roots, and confirmed the results obtained by Al-Kharashi and Abu-Salem with smaller corpora that root indexing is more effective than word indexing. © 1997 John Wiley & Sons, Inc.
Article
SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Is Arabic part of speech tagging feasible without word segmentation? Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL
  • E Mohamed
  • S Kübler
Mohamed, E., Kübler, S., 2010. Is Arabic part of speech tagging feasible without word segmentation? In: HLT/ACL 2010, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, CA, June, pp. 705–708.
Master Thesis in collaboration with Google Inc
  • S Sultan
Sultan, S., 2011. Applying Morphology to English-Arabic SMT. Master's thesis in collaboration with Google Inc., Swiss Federal Institute of Technology, Zurich, Switzerland, May 2011.
Sadat, F., Habash, N., 2006. Arabic preprocessing for Statistical Machine Translation: schemes and techniques. In: Proceedings of COLING-ACL 2006, Sydney, Australia, 17–21 July.
Sadat, F., Mohamed, E., 2013. Pre-processing and language analysis for Arabic to French Statistical Machine Translation. In: Proceedings of TALN 2013, Les Sables d'Olonne, 17–21 June.
Schwenk, H., Senellart, J., 2009. Translation model adaptation for an Arabic/French news translation system by lightly-supervised training. In: MT Summit.
Papineni, K., Roukos, S., Ward, T., Zhu, W., 2001. BLEU: A Method for Automatic Evaluation of Machine Translation. Technical Report RC22176 (W0109-022). IBM Research Division, Yorktown Heights, NY.
Hasan, S., El Isbihani, A., Ney, H., 2006. Creating a large-scale Arabic to French Statistical Machine Translation system. In: Proceedings of LREC 2006 (5th International Conference on Language Resources and Evaluation), Genoa, Italy, May, pp. 855–858.
Lopez, A., 2008. Statistical Machine Translation. In: ACM Comp. Surveys, Vol. 40, August.
Manning, C.D., Schuetze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Och, F.J., Ney, H., 2003. A systematic comparison of various statistical alignment models. Comput. Linguistics 29 (1), 19–51.
Habash, N., Rambow, O., Ryan, R., 2010. The MADA and TOKAN Manual.
Hajič, J., Zemánek, P., 2004. The Prague Arabic Dependency Treebank. Development in Data and Tools.
Mohamed, E. and Sadat, F. / Computer Speech and Language – Special Issue on Hybrid Machine Translation 00 (2014) 000–000
In Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics Annual Meeting (HLT-NAACL), Workshop on Statistical Machine Translation, New York City, pages 15-22 (2006).
Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks
  • M Diab
  • K Hacioglu
  • D Jurafsky
Diab, M., Hacioglu, K. and Jurafsky, D. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Boston, MA (2004).
NP Subject Detection in Verb-initial Arabic clauses
  • S Green
  • C Sathi
  • C D Manning
Green, S., Sathi, C., and Manning, C. D. NP Subject Detection in Verb-initial Arabic clauses. In Proceedings of the Third Workshop on Computational Approaches to Arabic Script-based Languages (CAASL3) (2009).
Syntactic pre-processing for statistical machine translation
  • N Habash
Habash, N. Syntactic pre-processing for statistical machine translation. In Proceedings of the Machine Translation Summit (MT-Summit), Copenhagen (2007).