Conference Paper

An Application of Lexicalized Grammars in English-Persian Translation.


Abstract

Increasing the domain of locality by using Tree Adjoining Grammars (TAG) has led some applications, such as machine translation, to employ it in the disambiguation process. Successful experiments with TAG in French-English and Korean-English machine translation encouraged us to use it for another language pair with very divergent properties: Persian and English. Using Synchronous TAG (S-TAG) for this pair of languages can exploit syntactic and semantic features for transferring the source into the target language. Here, we report our experiments in translating English into Persian. We also present a model for lexical selection disambiguation based on the notion of decision trees. An automatic method for learning the required decision trees from a sample data set is introduced as well.


... Amtrup et al. [6] developed a Persian-to-English MT system, which uses a knowledge-based architecture to disambiguate Persian lexical items within a unification-based CFG formalism. Faili et al. [11] use a similar idea with an enriched grammar model, namely tree adjoining grammar (TAG), whereas Mosavi et al. [28] demonstrate a corpus-based approach to the disambiguation phase of an English-Persian MT system. ...
... In the syntactic transfer phase, which contains all the node-to-node correspondences between syntactic elements of the S-TAG, the target syntactic model is built. This transfer is only a structural transfer, by which the correct corresponding structure of Persian is generated [11]. The second transfer phase is the lexical transfer, in which the lexical items of the input sentence are transferred into the corresponding Persian words using a WSD method. ...
... This information represents different attributes of the sentence, which are language dependent and are instantiated in the feature-based lexicalized tree adjoining grammar (FB-LTAG) framework [8]. These attributes get their correct values through a unification process performed during parsing [11]. The defined attributes are features that contain syntactic as well as semantic information about the input sentences. ...
Conference Paper
Full-text available
In this paper, we demonstrate an experiment with a machine translation (MT) system for two very different languages, English and Persian. We also describe a model for the word sense disambiguation (WSD) task inside the MT system, which uses decision trees automatically learned from a training data set as its disambiguation formalism. Our evaluations fall into two categories: evaluation of the whole MT system and evaluation of the WSD component. The experiments on the whole MT system show that it achieves a score of 16% on the NIST measure, while evaluation of the WSD component on a corpus containing 860 aligned sentences shows that it disambiguates 81.4% of ambiguous words correctly.
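The decision-tree WSD idea described above can be sketched with a tiny hand-rolled ID3-style learner: from labeled context features it induces a tree that selects a target-language sense. The context features, the word "bank", and the romanized sense labels are invented for illustration and are not the paper's actual training data.

```python
from collections import Counter
import math

# Toy training set: context features of an ambiguous English word mapped
# to a hypothetical translation sense (feature names are assumptions).
DATA = [
    ({"prev_pos": "DET", "topic": "finance"}, "bank_fin"),
    ({"prev_pos": "DET", "topic": "nature"},  "bank_river"),
    ({"prev_pos": "ADJ", "topic": "finance"}, "bank_fin"),
    ({"prev_pos": "ADJ", "topic": "nature"},  "bank_river"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_feature(data, features):
    # Pick the feature whose split yields the highest information gain.
    base = entropy([y for _, y in data])
    def gain(f):
        vals = {x[f] for x, _ in data}
        rem = sum(
            len(sub) / len(data) * entropy([y for _, y in sub])
            for v in vals
            for sub in [[(x, y) for x, y in data if x[f] == v]]
        )
        return base - rem
    return max(features, key=gain)

def learn(data, features):
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority sense
    f = best_feature(data, features)
    rest = [g for g in features if g != f]
    return (f, {
        v: learn([(x, y) for x, y in data if x[f] == v], rest)
        for v in {x[f] for x, _ in data}
    })

def classify(tree, x):
    while isinstance(tree, tuple):
        f, branches = tree
        tree = branches[x[f]]
    return tree

tree = learn(DATA, ["prev_pos", "topic"])
print(classify(tree, {"prev_pos": "DET", "topic": "nature"}))  # bank_river
```

On this toy set the learner splits on `topic` (information gain 1 bit vs. 0 for `prev_pos`), which is exactly the attribute-selection behavior a decision-tree disambiguator relies on.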
... Several kinds of MT systems designed in this work are a baseline SMT, a factored SMT, an SMT augmented by a rule-based compound-verb detection module called Verb-aware SMT, and Statistical Post Editing (SPE) MT on an existing rule-based MT [4]. These systems are compared to two available English-to-Persian translators: Google Translator 1 as a pure SMT system and a rule-based MT based on the TAG formalism [5]. ...
... For creating the training data of this hybrid system, a Persian-Persian parallel corpus is needed, so the source side of the parallel corpus must be translated by a human. But because of the high cost of human translation, the English side of the parallel corpus was translated with the RBMT system created by [5]. ...
Conference Paper
Full-text available
Comparison of several kinds of English-Persian Statistical Machine Translation systems is reported in this paper. A large parallel corpus containing about 6 million tokens on each side has been developed for training the proposed SMT system. In developing the parallel corpus, a noise-filtering system based on a MaxEnt classifier was devised to distinguish between correct and incorrect sentence pairs. Using the generated parallel corpus, a variety of English-to-Persian SMT systems have been developed. Several variations on SMT, such as hybrid MT and statistical post-editing MT, are proposed in this paper. All the systems were tested on two different types of test set, one extracted randomly from the parallel corpus and the other containing formal English sentences extracted from an English learning book. The results show that a hybrid SMT system augmented by rule-based detection of English phrasal verbs and Persian compound verbs improves the baseline significantly. Also, state-of-the-art results on English-Persian translation are obtained by the Verb-aware SMT with respect to the BLEU measure.
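The MaxEnt sentence-pair filter mentioned above can be approximated by a binary logistic classifier. This is a minimal sketch: the two features (token-length ratio and punctuation agreement) and the hand-made training pairs are assumptions for illustration, not the paper's actual feature set.

```python
import math

def features(src_len, tgt_len, punct_match):
    # Bias term plus two illustrative features of a sentence pair.
    ratio = min(src_len, tgt_len) / max(src_len, tgt_len)
    return [1.0, ratio, float(punct_match)]

def predict(w, x):
    # Logistic (MaxEnt, binary case) probability that the pair is correct.
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=500, lr=0.5):
    # Plain stochastic gradient ascent on the log-likelihood.
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            p = predict(w, x)
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

# Correct pairs have similar lengths; noisy pairs do not.
X = [features(10, 11, 1), features(8, 9, 1), features(10, 3, 0), features(12, 4, 0)]
y = [1, 1, 0, 0]
w = train(X, y)
print(predict(w, features(9, 10, 1)) > 0.5)  # a well-matched pair is accepted
```

A real filter would use many more features (alignment scores, dictionary coverage), but the decision rule, thresholding a log-linear score, is the same.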
... This is due to the fact that computer-based texts are more available than ever before, and easier to use for various data tasks. The success of part-of-speech tagging using the Hidden Markov Model (HMM) (Charniak, 1997b; Church, 1988) also attracted the attention of computational linguists to lexical analysis, language modeling, and machine translation using various statistical methods (Feili and Ghassem-Sani, 2004; Charniak, 1996). ...
... In other words, Persian has only borrowed a large number of words from Arabic. In spite of this influence, Arabic has not affected the syntactic and morphological structure of Persian (Feili and Ghassem-Sani, 2004; Bateni, 1995). ...
Article
Grammar induction, also known as grammar inference, is one of the most important research areas in natural language processing. The availability of large corpora has encouraged many researchers to use statistical methods for grammar induction. This problem can be divided into three categories, supervised, semi-supervised, and unsupervised, based on the type of data set required for the training phase. Most current inductive methods are supervised and need a bracketed data set for their training phase; the lack of this kind of data set in many languages encouraged us to focus on unsupervised approaches. Here, we introduce a novel approach, which we call history-based inside-outside (HIO), for unsupervised grammar inference, using part-of-speech tag sequences as the only source of lexical information. HIO is an extension of the inside-outside algorithm enriched with some notions of history-based approaches. Our experiments on English and Persian show that by adding some conditions to the rule assumptions of the induced grammar, one can achieve an acceptable improvement in the quality of the output grammar.
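The inside-outside family of algorithms that HIO extends rests on inside probabilities computed by a CKY-style dynamic program. A minimal sketch over a toy CNF PCFG, with POS tags as the only lexical information (as in HIO), might look like this; the grammar and tag sequence are invented.

```python
RULES = {  # (lhs, rhs) -> probability; grammar in Chomsky normal form
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DET", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
}
LEX = {  # (preterminal, tag) -> probability; tags are the only lexical info
    ("DET", "det"): 1.0, ("N", "n"): 1.0, ("V", "v"): 1.0,
}

def inside(tags):
    n = len(tags)
    beta = {}  # (i, j, A) -> inside probability of A spanning tags[i:j]
    for i, t in enumerate(tags):
        for (A, tag), p in LEX.items():
            if tag == t:
                beta[(i, i + 1, A)] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (A, (B, C)), p in RULES.items():
                    b = beta.get((i, k, B), 0.0) * beta.get((k, j, C), 0.0)
                    if b > 0:
                        beta[(i, j, A)] = beta.get((i, j, A), 0.0) + p * b
    return beta

beta = inside(["det", "n", "v", "det", "n"])
print(beta[(0, 5, "S")])  # probability that the grammar derives the tag sequence
```

The full inside-outside algorithm pairs these inside probabilities with outside probabilities to re-estimate rule probabilities via EM; the history-based extension conditions the rule probabilities on derivation context.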
... The Shiraz machine translation system is mainly targeted at translating news material. Faili et al. (2004; 2005) and Faili (2009) propose a rule-based English-to-Persian machine translation system based on a rich formalism named tree-adjoining grammar (TAG). Later, they introduce an enhancement of the system with trained decision trees as a word-sense disambiguation module and also benefit from a statistical parser to generate intermediate syntactic structure during the transfer phase. ...
Article
Full-text available
In this paper, an attempt to develop a phrase-based statistical machine translation system between English and Persian (PersianSMT) is described. Part of this work is the creation of the largest English-Persian parallel corpus presented to date, built from movie subtitles. Two major goals are pursued here: the first is to show the main problems observed in the output of the PersianSMT system and set a baseline for further experiments; the second is to check whether movie subtitles can provide a corpus of sufficient quality for developing a general-purpose translator. Finally, translations made by the PersianSMT system equipped with different language models are evaluated on test sets from different domains, and the results are compared to the Google statistical machine translator. According to the obtained BLEU scores, the proposed SMT system strongly outperforms the Google translator in translating both in-domain (movie subtitle) and out-of-domain sentences.
... In recent years, various NLP tools and resources have been developed for Persian. These are essentially morphosyntactic taggers (QasemiZadeh & Rahimi, 2006; Tasharofi et al., 2007; Shamsfard & Fadaee, 2008), syntactic parsers (Hafezi, 2004; Dehdari & Lonsdale, 2008) and machine translation systems (Feili & Ghassem-Sani, 2004; Saedi et al., 2009). ...
Conference Paper
Full-text available
In this paper, we present a new version of PerLex, the morphological lexicon for the Persian language, a corrected and partially re-annotated version of the BijanKhan corpus (BijanKhan, 2004), and MEltfa, a new freely available POS-tagger for Persian. After PerLex's first version (Sagot & Walther, 2010), we propose an improved version of our morphological lexicon. Apart from a partial manual validation, PerLex 2 now relies on a set of linguistically motivated POS categories. Based on these categories, we have also developed a new version of the BijanKhan corpus: it contains significant corrections of the tokenisation and has been re-tagged according to the new POS set. The new version of the BijanKhan corpus has been used to develop MEltfa, our new freely available POS-tagger for Persian, based on the new POS set, PerLex 2, and the MElt tagging system (Denis & Sagot, 2009).
... It enables them to share their capabilities. For instance, the model enables XTAG-based applications such as machine translation (Faili and Ghassem-Sani, 2004) to benefit from the available advantages of the MICA grammar and parser. Moreover, it provides a way of using the XTAG semantic representations for MICA's grammar. ...
Article
LTAG is a rich formalism for performing NLP tasks such as semantic interpretation, parsing, machine translation, and information retrieval. Depending on the specific NLP task, different kinds of LTAGs may be developed for a language. Each of these LTAGs is enriched with specific features, such as semantic representations or statistical information, that make it suitable for that task. The distribution of these capabilities among the LTAGs makes it difficult to benefit from all of them in NLP applications. This paper discusses a statistical model to bridge between two kinds of LTAGs for a natural language in order to benefit from the capabilities of both. To do so, an HMM was trained that links an elementary tree sequence of a source LTAG onto an elementary tree sequence of a target LTAG. Training was performed using the standard HMM training algorithm, Baum–Welch. To lead the training algorithm to a better solution, the initial state of the HMM was also trained by a novel EM-based semi-supervised bootstrapping algorithm. The model was tested on two English LTAGs, XTAG (XTAG-Group, 2001) and MICA's grammar (Bangalore et al., 2009), as the target and source LTAGs, respectively. The empirical results confirm that the model provides a satisfactory way of linking these LTAGs so that they can share their capabilities.
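Once such an HMM is trained, decoding a source elementary-tree sequence into the most likely target sequence is a standard Viterbi pass. The sketch below shows only that decoding step, with hand-set parameters standing in for the Baum-Welch-trained ones; the tree names are hypothetical, not actual XTAG or MICA tree labels.

```python
# Hidden states: target (XTAG-style) trees; observations: source (MICA-style) trees.
STATES = ["xtag_tV", "xtag_tN"]
START = {"xtag_tV": 0.5, "xtag_tN": 0.5}
TRANS = {  # P(next hidden tree | current hidden tree)
    ("xtag_tV", "xtag_tN"): 0.9, ("xtag_tV", "xtag_tV"): 0.1,
    ("xtag_tN", "xtag_tV"): 0.9, ("xtag_tN", "xtag_tN"): 0.1,
}
EMIT = {  # P(observed source tree | hidden target tree)
    ("xtag_tV", "mica_tV"): 0.9, ("xtag_tV", "mica_tN"): 0.1,
    ("xtag_tN", "mica_tN"): 0.9, ("xtag_tN", "mica_tV"): 0.1,
}

def viterbi(obs):
    v = [{s: START[s] * EMIT[(s, obs[0])] for s in STATES}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: v[-1][p] * TRANS[(p, s)])
            col[s] = v[-1][best] * TRANS[(best, s)] * EMIT[(s, o)]
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):  # follow backpointers to recover the sequence
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["mica_tN", "mica_tV", "mica_tN"]))
```

In the paper's setting the state and observation alphabets are the two grammars' elementary-tree inventories, and the transition/emission tables come from Baum-Welch with the EM-based bootstrapped initialization.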
... Machine translation is one of the most interesting fields in NLP, with uses ranging from the translation of simple sentences to web pages and domain-specific documents. Although many translation systems have been developed to translate from/to English, there have been very limited efforts in developing translators from/to Persian [1][2]. The existing Persian translators have many problems, especially when facing complex phrases and sentences and word sense disambiguation. ...
Article
Full-text available
Machine translation is one of the most interesting fields in NLP, with uses ranging from the translation of simple sentences to web pages and domain-specific documents. Although many translation systems have been developed to translate from/to English, there have been very limited efforts in developing translators from/to Persian [1-2]. The existing Persian translators have many problems, especially when facing complex phrases and sentences and word sense disambiguation. This paper introduces PENTRANS, the bidirectional Persian-English machine translation project. PENTRANS consists of three main components: (1) source sentence analysis, (2) lexical (and structural) transfer, and (3) target sentence synthesis. The first and last components are almost the same on the two sides (English-to-Persian and Persian-to-English translation). The first component is responsible for preprocessing input sentences and building the structures essential for translation. Tokenization, morphological analysis, chunking, and syntactic parsing of the input sentence are the main tasks done in the first component. The extracted syntactic and morphological structure of the input is saved for further use in the last component. The second component translates words and phrases and solves the WSD problems. To solve the ambiguity problem in the English-to-Persian translator, an extension of the Lesk algorithm [3] which covers semantic aspects is introduced. In our approach, we use the WordNet lexical ontology instead of a simple dictionary. For each word sense, its synset and gloss and its ancestors up to two levels in the hyponymy hierarchy are used [4]. To improve accuracy, POS and WSD tags (extracted from eXtended WordNet) are included. Therefore, a new scoring method is proposed which takes these important points into consideration. Our contribution is to compare the grammatically related words indicated by the grammatical parser in order to limit the search space and speed up the process.

To assign a correct Persian word, we developed a bilingual dictionary containing the translations of WordNet senses. On the Persian-English side, we use a combination of knowledge-based and rule-based approaches to the WSD problem, providing a prototype of the essential knowledge base, including a rule base and a Persian named-entity recognizer. In our proposed approach, some factors must be considered to resolve the ambiguity of each ambiguous word, such as its own POS tag and its neighbors'. Facing an ambiguity, the system searches the knowledge base to find the exact applicable rule. If no rule is found, the procedure resolves the ambiguity by using co-occurrences and collocations (mostly extracted from Wikipedia sentences and a children's dictionary [5]). The key feature of this component is therefore the use of grammatical roles and predefined knowledge to improve the WSD results. In the last component, the extracted morphological structure of the input words is first applied to the translated stems to produce the target words. Then, using a bottom-up approach, it transfers the syntactic structure of the sentence and builds a correct sentence in the target language. In this approach, the structure of each target constituent is built upon the structure of its smaller constituents.
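The core of any Lesk-style method, including the extended variant described above, is scoring each sense by the overlap between its gloss/synset bag of words and the context. A minimal sketch follows; the two senses and their word bags are invented miniatures, not actual WordNet entries.

```python
# Each sense is represented by the bag of words from its gloss, synset,
# and (in the extended variant) nearby hypernyms/hyponyms.
SENSES = {
    "bank#1": {"financial", "institution", "money", "deposit"},
    "bank#2": {"sloping", "land", "river", "water"},
}

def lesk(context_words, senses=SENSES):
    # Score every sense by word overlap with the context; pick the best.
    scores = {s: len(words & set(context_words)) for s, words in senses.items()}
    return max(scores, key=scores.get)

print(lesk(["he", "sat", "by", "the", "river", "water"]))  # bank#2
```

The paper's extension additionally weights overlaps by POS/WSD tags and restricts scoring to grammatically related context words; this sketch keeps only the overlap backbone.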
... The English-Persian RBMT system we use in this work makes use of Synchronous Tree Adjoining Grammars (S-TAG) to better connect the two languages with divergent properties [4]. This system is a classical transfer system consisting of three main components: (i) analysis of the source language into a tree structure, (ii) transfer from the source-language tree to a target-language structure, and (iii) generation of the output translation from the target structure. ...
Article
We aim at obtaining a system that benefits from both the available English-Persian RBMT and Statistical Machine Translation (SMT) systems to better translate movie subtitles. The RBMT system used is basically designed to produce formal translations of formal English sentences, whereas the SMT system is trained on a bilingual corpus in the domain of conversational language. We have prepared a 4-million-token English-Persian parallel corpus by aligning movie subtitles to train the translation model of the statistical machine translation. In our evaluation, although the statistical post-editing module highly improves the performance of the RBMT system in the domain of movie subtitles, it cannot reach the performance of the standalone SMT system, which contradicts previous reports. These results indicate that statistical post-editing can be effectively used to adapt an RBMT system to a new domain. However, the idea that such a combination of the two systems would result in an improved system was not borne out for spoken Persian.
... In many applications, such as information retrieval and rule-based machine translation systems, an accurate deep parse structure of a sentence is required; hence a lot of research is being done on methods to produce the deep hierarchical syntactic structure of a given natural-language sentence [6]. Over the last decade, there has been a great increase in the performance of parsers. ...
Article
Full-text available
Full-parsing systems able to analyze sentences robustly and completely at an appropriate accuracy can be useful in many computer applications, such as information retrieval and machine translation systems. Increasing the domain of locality by using tree-adjoining grammars (TAG) has led some researchers to use it as a modeling formalism in their language applications. But parsing with a rich grammar like TAG faces two main obstacles: low parsing speed and many ambiguous syntactic parses. In order to decrease the parse time and these ambiguities, we combine a statistical chunker based on the TAG formalism with a heuristic rule-based search method to achieve full parses. The partial parses induced by the statistical chunker are produced by a system named a supertagger and are followed by two phases, error detection and error correction, in each of which different completion heuristics are applied to the partial parses. The experiments on the Penn Treebank show that by using a trained probability model, a considerable improvement in the full-parsing rate is achieved.
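The supertagging-plus-light-parsing pipeline mentioned here can be illustrated with a toy pass: each supertag records which categories it expects to its left or right, and a lightweight dependency analyzer links each slot to the nearest word carrying that category. The supertag names and slot encoding below are an illustrative assumption, not the actual XTAG/supertagger inventory.

```python
# Each token: (word, supertag, slots); a slot is (direction, required supertag).
SENT = [
    ("the", "B_Dnx", []),                        # determiner tree, no slots
    ("cat", "A_NXN", []),                        # noun anchor
    ("sleeps", "A_nx0V", [("left", "A_NXN")]),   # verb tree wanting a subject NP
]

def lda(sent):
    # Link every slot to the nearest matching token in the slot's direction.
    links = []
    for i, (_, _, slots) in enumerate(sent):
        for direction, cat in slots:
            rng = range(i - 1, -1, -1) if direction == "left" else range(i + 1, len(sent))
            for j in rng:
                if sent[j][1] == cat:
                    links.append((sent[j][0], sent[i][0]))  # (dependent, head)
                    break
    return links

print(lda(SENT))  # [('cat', 'sleeps')]
```

Real supertags encode full elementary trees with substitution/adjunction sites, and the paper's error-driven phases repair the spans this greedy linking leaves unattached; the sketch shows only the happy path.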
... A lot of ambiguity in the output of this parser and the lack of an effective disambiguation method caused many problems in the applications that use the XTAG parser [10]. Moreover, the low speed of this parser is another problem when using it in practical applications such as transfer-based MTs [9]. The low speed, together with the ambiguities of the XTAG parser, encouraged some researchers to improve the parsing time and develop a disambiguation process. ...
Conference Paper
Full-text available
MICA is a fast and accurate dependency parser for English that uses an LTAG automatically derived from the Penn Treebank (PTB) using Chen's approach. However, there is no semantic representation associated with its grammar. On the other hand, the XTAG grammar is a hand-crafted LTAG whose elementary trees were enriched with semantic representations by experts. The linguistic knowledge embedded in the XTAG grammar has caused it to be used in a wide variety of natural-language applications. However, the current XTAG parser is not as fast and accurate as the MICA parser. Generating an XTAG derivation tree from a MICA dependency structure could build a bridge between these two notions and obtain the benefits of both models. With this conversion, the applications that use the XTAG parser may also get help from the MICA parser. In addition, it can enrich MICA's grammar with the semantic representation of the XTAG grammar. In this paper, an unsupervised sequence tagger that maps any sequence of MICA elementary trees onto an XTAG elementary-tree sequence is presented. The proposed sequence tagger is based on a Hidden Markov Model (HMM) preceded by an EM-based algorithm for setting its initial parameter values. The trained model was tested on a part of the PTB, and about 82% accuracy on the detected sequences was achieved.
... Parsing, in its classic definition, is a method of assigning structural descriptions to a given sentence or clause with the help of a set of elementary structures in the form of rules, defined in a given grammar, and a set of operations to combine these structures. In many applications, such as rule-based machine translation systems, an accurate deep parse structure of a sentence is required; hence a lot of research is being done on methods to produce the deep hierarchical syntactic structure of a given natural-language sentence [8]. However, due to the inherent ambiguity of natural languages, this has proved to be challenging. ...
Conference Paper
Full-text available
Increasing the domain of locality by using tree-adjoining grammars (TAG) encourages some researchers to use it as a modeling formalism in their language applications. But parsing with a rich grammar like TAG faces two main obstacles: low parsing speed and many ambiguous syntactic parses. We use an idea of shallow parsing based on a statistical approach in the TAG formalism, named supertagging, which enhances standard POS tags with syntactic information about the sentence. In this paper, an error-driven method for approaching a full parse from partial parses based on the TAG formalism is presented. These partial parses are produced by a supertagger followed by a simple heuristic light parser named the lightweight dependency analyzer (LDA). Like other error-driven methods, the process of generating deep parses can be divided into two phases, error detection and error correction, in each of which different completion heuristics are applied to the partial parses. The experiments on the Penn Treebank show considerable improvements in parsing time and the disambiguation process.
... Other recent work in the development of NLP tools and resources for Persian processing is mostly focused on designing part-of-speech taggers (QasemiZadeh and Rahimi, 2006; Tasharofi et al., 2007; Shamsfard and Fadaee, 2008), parsers (Hafezi, 2004; Dehdari and Lonsdale, 2008) or automatic translation systems (Feili and Ghassem-Sani, 2004; Saedi et al., 2009). ...
Article
Full-text available
We introduce PerLex, a large-coverage and freely-available morphological lexicon for the Persian language. We describe the main features of the Persian morphology, and the way we have represented it within the Alexina formalism, on which PerLex is based. We focus on the methodology we used for constructing lexical entries from various sources, as well as the problems related to typographic normalisation. The resulting lexicon shows a satisfying coverage on a reference corpus and should therefore be a good starting point for developing a syntactic lexicon for the Persian language.
... Besides the work carried out within the Shiraz project, other morphological-analysis and lemmatization tools have been developed, but they have not led to the construction of a large-coverage lexicon. One can cite the work of Dehdari & Lonsdale (2008) ... essentially morphosyntactic taggers (QasemiZadeh & Rahimi, 2006; Tasharofi et al., 2007; Shamsfard & Fadaee, 2008), syntactic parsers (Hafezi, 2004; Dehdari & Lonsdale, 2008) and machine translation systems (Feili & Ghassem-Sani, 2004; Saedi et al., 2009). ...
Article
Full-text available
We present PerLex, a large-coverage and freely available morphological lexicon for Persian, together with a surface-processing chain for the language. We describe some characteristics of Persian morphology and the way we have represented it in the Alexina lexical formalism, on which PerLex is based. We emphasize the methodology we used to construct lexical entries from various sources, as well as the problems related to typographic normalisation. The resulting lexicon shows satisfying coverage on a reference corpus and should therefore be a good starting point for developing a syntactic lexicon for Persian.
Article
Most efficient computational approaches to NLP tasks are supervised methods, which need annotated corpora. The lack of supervised data in Persian has encouraged researchers to increase their interest and efforts in unsupervised and semi-supervised approaches. This paper presents a novel semi-supervised approach, called genetic-based inside-outside (GIO), for Persian grammar inference, inducing a grammar model in the PCFG formalism. GIO is an extension of the inside-outside algorithm enriched with some notions of genetic algorithms. In a pure genetic algorithm for grammar induction, the randomly generated initial population makes it computationally expensive, so we used the inside-outside algorithm to generate the initial population. Our experiments show that our approach's results are better than those of other methods applied to Persian grammar induction.
Conference Paper
Though the lack of semantic representation in automatically extracted LTAGs is an obstacle to using these formalisms, the advent of some powerful statistical parsers trained on them has brought these grammars more attention than before. In contrast to this grammatical class, there are some widely used manually crafted LTAGs that are enriched with semantic representations but suffer from the lack of efficient parsers. The available representations of the latter grammars, beside the statistical capabilities of the former, encouraged us to construct a link between them. Here, by focusing on the automatically extracted LTAG used by MICA [4] and the manually crafted English LTAG known as the XTAG grammar [32], a statistical approach based on an HMM is proposed that maps each sequence of the former's elementary trees onto a sequence of the latter's elementary trees. To prevent the HMM training algorithm from converging to a local optimum, an EM-based learning process for initializing the HMM parameters is also proposed. Experimental results show that the mapping method can provide a satisfactory way to cover the deficiencies arising in one grammar with the available capabilities of the other.
Conference Paper
PEnT1 is an automatic English-to-Persian text translator. It translates simple English sentences into Persian, exploiting a combination of rule-based and semantic approaches. It covers all twelve English tenses, in both active and passive voice, for indicative, negative, and interrogative sentences. In this paper, introducing PEnT1, we propose a new WSD method by presenting a hybrid measure to score the different senses of a word. We also discuss prototyping some linguistic resources to test our methods.
Conference Paper
Grammar induction is one of the most important research areas in natural language processing. The lack of a large treebank, which is required for supervised grammar induction, in some natural languages such as Persian encouraged us to focus on unsupervised methods. We found the Inside-Outside (IO) algorithm, introduced by Lari and Young, a suitable platform to work on, and augmented it with a history notion. The result is an improved unsupervised grammar induction method called history-based IO (HIO). Applying HIO to two very divergent natural languages (i.e., English and Persian) indicates that inducing more conditioned grammars improves the quality of the resulting grammar. Besides, our experiments on ATIS and WSJ show that HIO outperforms most current unsupervised grammar induction methods.
Article
In this paper, we present our attempts to design and implement a large-coverage computational grammar for the Persian language based on the Generalized Phrase Structure Grammar (GPSG) model. This grammatical model was developed for continuous speech recognition (CSR) applications but is suitable for other applications that need syntactic analysis of Persian. In this work, we investigate various syntactic structures relevant to the modern Persian language and then describe these structures according to a phrase-structure model. Noun (N), Verb (V), Adjective (ADJ), Adverb (ADV), and Preposition (P) are considered basic syntactic categories, and X-bar theory is used to define noun phrases, verb phrases, adjective phrases, adverbial phrases, and prepositional phrases. However, we had to extend the noun-phrase levels in X-bar theory to four levels due to certain complexities in the structure of noun phrases in Persian. A set of 120 grammatical rules for describing different phrase structures of Persian is extracted, and a few instances of the rules are presented in this paper. These rules cover the major syntactic structures of the modern Persian language. For evaluation, the obtained grammatical model was used in a bottom-up chart parser to parse 100 Persian sentences; it accounts for 89 of them. Incorporating this grammar in a Persian CSR system leads to a 31% reduction in word error rate.
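A bottom-up chart parser over a rule set like the one described can be sketched as a CKY recognizer. The handful of X-bar-flavored binary rules and the POS input below are invented for the sketch and are not the paper's actual 120-rule grammar.

```python
RULES = [  # lhs <- (rhs1, rhs2); binary rules only (CNF-style)
    ("NP", ("DET", "N1")),   # NP -> DET N'
    ("N1", ("ADJ", "N")),    # N' -> ADJ N
    ("VP", ("V", "NP")),
    ("S", ("NP", "VP")),
]

def recognize(tags, goal="S"):
    n = len(tags)
    # chart[i][j] holds every category that spans tags[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, t in enumerate(tags):
        chart[i][i + 1].add(t)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, (b, c) in RULES:
                    if b in chart[i][k] and c in chart[k][j]:
                        chart[i][j].add(lhs)
    return goal in chart[0][n]

print(recognize(["DET", "ADJ", "N", "V", "DET", "ADJ", "N"]))  # True
```

A full parser would also store backpointers to recover trees and would need unary rules (or their elimination) to handle bare nouns; the recognizer above shows only the chart-filling core.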
Article
Full-text available
The paper presents a tabular interpretation for a kind of 2-Stack Automata. These automata may be used to describe various parsing strategies, ranging from purely top-down to purely bottom-up, for LIGs and TAGs. The tabular interpretation ensures, for all strategies, a time complexity of O(n^6) and a space complexity of O(n^5), where n is the length of the input string. Introduction: 2-Stack Automata [2SA] have been identified as possible operational devices to describe parsing strategies for Linear Indexed Grammars [LIG] or Tree Adjoining Grammars [TAG] (mirroring the traditional use of Push-Down Automata [PDA] for Context-Free Grammars [CFG]). Different variants of 2SA (or the not-so-distant Embedded Push-Down Automata) have been proposed, some to describe top-down strategies (Vijay-Shanker, 1988; Becker, 1994), some to describe bottom-up strategies (Rambow, 1994; Nederhof, 1998; Alonso Pardo et al., 1997), but none (that we know of) able to describe both kinds of strategies. ...
Article
This paper demonstrates that a systematic solution to the divergence problem can be derived from the formalization of two types of information: (1) the linguistically grounded classes upon which lexical-semantic divergences are based; and (2) the techniques by which lexical-semantic divergences are resolved. This formalization is advantageous in that it facilitates the design and implementation of the system, allows one to evaluate the status of the system, and provides a basis for proving certain important properties of the system.
Article
This document describes a sizable grammar of English written in the TAG formalism and implemented for use with the XTAG system. This report and the grammar described herein supersede the TAG grammar described in an earlier 1995 XTAG technical report. The English grammar described in this report is based on the TAG formalism, which has been extended to include lexicalization and unification-based feature structures. The range of syntactic phenomena that can be handled is large and includes auxiliaries (including inversion), copula, raising and small-clause constructions, topicalization, relative clauses, infinitives, gerunds, passives, adjuncts, it-clefts, wh-clefts, PRO constructions, noun-noun modifications, extraposition, determiner sequences, genitives, negation, noun-verb contractions, sentential adjuncts, and imperatives. This technical report corresponds to the XTAG Release 8/31/98. The XTAG grammar is continuously updated with the addition of new analyses and modification of old ones, and an online version of this report can be found at the XTAG web page at http://www.cis.upenn.edu/~xtag/
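The unification-based feature structures mentioned above are the mechanism by which FB-LTAG derivations enforce agreement. A small recursive unifier conveys the idea: two structures unify when their shared attributes agree, and the result merges both. This is a sketch of the general operation, not the XTAG system's actual implementation, and the feature names are illustrative.

```python
FAIL = object()  # sentinel for unification failure

def unify(a, b):
    # Nested dicts are feature structures; leaves are atomic values.
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for k, v in b.items():
            if k in out:
                u = unify(out[k], v)
                if u is FAIL:
                    return FAIL
                out[k] = u
            else:
                out[k] = v
        return out
    return a if a == b else FAIL

np = {"agr": {"num": "sing"}, "case": "nom"}
verb_subj = {"agr": {"num": "sing", "pers": "3"}}
print(unify(np, verb_subj))  # merged: agreement features combine

clash = unify({"agr": {"num": "sing"}}, {"agr": {"num": "plur"}})
print(clash is FAIL)  # True: number mismatch blocks the derivation
```

In FB-LTAG, such unification is performed between the top and bottom feature structures of tree nodes during substitution and adjunction, which is how ungrammatical combinations are filtered out.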