Article

Automated generalization of phrasal paraphrases from the web


Abstract

Rather than creating and storing thousands of paraphrase examples, one can use paraphrase templates, which have strong representation capacity and can generate many paraphrase examples. This paper describes a new template representation and generalization method. Combined with a semantic dictionary, it uses multiple semantic codes to represent a paraphrase template, and it uses an existing search engine to extend the word clusters and generalize the examples. We also design three metrics to measure our generalized templates. The experimental results show that the representation method is reasonable and that the generalized templates have higher precision and coverage.
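As a rough illustration of the idea in the abstract, a template slot can be constrained by a set of semantic codes rather than by a single word. The sketch below is not the authors' representation: the dictionary `SEMANTIC_CODES`, the code labels, the `Template` class, and `apply_template` are all hypothetical names invented here, and the matching rule (a word fills a slot if it shares at least one code with it) is an assumption.

```python
from dataclasses import dataclass

# Hypothetical toy semantic dictionary: word -> set of semantic codes.
# The codes are invented for this sketch and come from no real resource.
SEMANTIC_CODES = {
    "truck": {"V01"},
    "car": {"V01"},
    "apple": {"F07"},
}

@dataclass
class Template:
    source: tuple     # literal words (str) or slot indexes (int)
    target: tuple     # same convention, describing the paraphrase side
    slot_codes: dict  # slot index -> set of semantic codes the slot accepts

def fits(word, codes):
    """Assumed rule: a word fills a slot if it shares at least one code."""
    return bool(SEMANTIC_CODES.get(word, set()) & codes)

def apply_template(template, tokens):
    """Return the paraphrased tokens, or None if the template does not match."""
    if len(tokens) != len(template.source):
        return None
    bindings = {}
    for item, token in zip(template.source, tokens):
        if isinstance(item, int):
            if not fits(token, template.slot_codes[item]):
                return None
            bindings[item] = token
        elif item != token:
            return None
    return [bindings[i] if isinstance(i, int) else i for i in template.target]

# One generalized template covers every word that carries code "V01".
BUY = Template(
    source=("buy", "a", 0),
    target=("purchase", "a", 0),
    slot_codes={0: {"V01"}},
)

print(apply_template(BUY, ["buy", "a", "truck"]))
print(apply_template(BUY, ["buy", "a", "apple"]))
```

The first call matches because "truck" carries the slot's code; the second returns `None` because "apple" does not, which is how a code-constrained slot keeps a template from over-generating.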


... These differences leave sentence categorization without a clear unified standard [1]. In this paper, the classification of complex sentences is based on Jiaoyan Jia [2], which divides sentences into three categories: joint complex sentences, subordinate complex sentences, and multiple complex sentences. The joint sentence and the compound sentence each contain five subclasses. ...
Article
Building on the paraphrasing of Chinese simple sentences, this work studies template-based paraphrasing of complex sentences. Through classification of complex sentences, syntactic analysis, and structural analysis, the proposed method constructs complex-sentence paraphrasing templates with associated words at their core. Part-of-speech tagging is used in calculating the similarity between a sentence to be paraphrased and a paraphrasing template. Joint complex sentences are divided into those expressing parallel, sequence, selection, progressive, and interpretive relationships. Subordinate complex sentences are divided into those expressing transition, conditional, hypothesis, causal, and objective relationships. Joint and subordinate complex sentences are divided according to their associated words. Using preprocessed sentences, a preliminary experiment is carried out to determine the similarity threshold between a sentence and a template. A small-scale paraphrase experiment shows that the method is feasible, achieving a template coverage rate of 40.20% and a paraphrase accuracy of 62.61%.
... According to Li et al. (2005), paraphrase patterns are more useful than paraphrase instances due to the fact that a single paraphrase pattern can be used to generate many paraphrase instances. In other words, paraphrase patterns are more general than paraphrase instances. ...
Article
Recent advances in natural language processing have increased the popularity of paraphrase extraction. Most of the attention, however, has focused on extraction methods alone, without taking the resource factor into consideration. Yet the two are strongly related, and the resource factor plays an equally important role in paraphrase extraction. In addition, almost all previous studies have focused on corpus-based methods that extract paraphrases from corpora based solely on syntactic similarity. Despite the popularity of corpus-based methods, a considerable amount of research has consistently shown that these methods are vulnerable to several types of erroneous paraphrases. For these reasons, it is necessary to evaluate whether the trend is moving in a positive direction. This paper reviews the major research on paraphrase extraction methods in detail. It begins by exploring the definition of paraphrase from different perspectives to provide a better understanding of the concept of paraphrase extraction. It then studies the characteristics and potential uses of different types of paraphrase resources. After that, it divides paraphrase extraction methods into four main categories (heuristic-based, knowledge-based, corpus-based, and hybrid-based) and summarizes their strengths and weaknesses. The paper concludes with some open research issues for future work.
... Weigang Li et al. (Li, 2005) describe a new template representation of paraphrases and a generalization method. They take as input a set of paraphrases, which are analyzed with a semantic dictionary that disambiguates word senses and marks the blanks with a code, that is, the places where another word may go provided that, according to that code, it keeps a certain similarity to the word that was there. ...
Article
A method for paraphrase detection has been developed, based on a combination of natural language processing techniques and tools, which makes it possible to determine when two sentences are semantically equivalent. This is of great importance for systems that attempt to make the computer "think", so its applications are many: information extraction, question answering, information retrieval, machine translation, summarization, plagiarism detection, etc. In the specific case of education, work has been carried out on assessment and fraud detection. A study of the state of the art is presented, and the theoretical elements that serve as the basis for the proposal are set out. To analyze the results, the indicators of accuracy, precision, and recall (coverage) are used, calculating the F-measure and comparing its values with those obtained by international systems tested on the same corpus.
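The precision, recall, and F-measure indicators named in this abstract can be computed as in the minimal sketch below. It assumes the balanced F-measure (F1) and treats system output and gold standard as sets of paraphrase pairs; the function name and the example pairs are hypothetical, not data from the cited work.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted paraphrase pairs against gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 pairs proposed, 3 pairs in the gold set, 2 in common.
predicted = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
gold = {("a", "b"), ("c", "d"), ("x", "y")}
p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

With two of four proposed pairs correct and two of three gold pairs found, precision is 1/2, recall is 2/3, and F1 is 4/7.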
Chapter
Because processing complex sentences more easily produces ambiguity, a complex sentence is harder to handle than a simple sentence in natural language processing. A method is proposed that makes use of the structural characteristics of a sentence. The method paraphrases sentences containing associated words by extracting paraphrase templates and paraphrase rules, and it matches a sentence to a paraphrase template by calculating the similarity between the original sentence and the template using keywords and restriction words. To evaluate the paraphrasing performance of the method, paraphrase experiments were carried out and the results are discussed.
Article
By applying part-of-speech information to the similarity calculation of paraphrase sentences, the similarity calculation method is improved. The improved method incorporates multiple kinds of information from the paraphrase text, such as matching-word information and part-of-speech information. A small-scale experiment on paraphrasing double-negation sentences compared the two similarity calculation methods, and the results indicate the effect of the improved method.
Conference Paper
In this work, we present a scenario where contextual targeted paraphrasing of sub-sentential phrases is performed automatically to support the task of text revision. Candidate paraphrases are obtained from a preexisting repertoire and validated in the context of the original sentence using information derived from the Web. We report on experiments on French, where the original sentences to be rewritten are taken from a rewriting memory automatically extracted from the edit history of Wikipedia.
Article
By analyzing the structure of a large number of Chinese sentences, paraphrase templates are extracted based on key items; these templates can be used to paraphrase sentences with particular structures. A paraphrase template is matched with a sentence by calculating the similarity between the sentence and the template. The templates fix the keywords and structural auxiliary words that reflect sentence structure, and they combine the qualifiers and replace them with variables; this achieves exact structural matching at the sentence level and enhances the coverage of the templates. To evaluate the method, experiments were carried out, yielding a template coverage rate of 65.4% and a paraphrase precision of 75.82%.
Conference Paper
Paraphrase recognition is the basis of paraphrase research. However, most existing research focuses on acquiring paraphrases from a particular text corpus, or uses methods restricted to certain conditions. There is no method that can decide in general whether two sentences are paraphrases. This paper presents a method that combines rules and supervised learning to recognize paraphrases. The method makes use of a classification of paraphrases and adopts different recognition approaches according to the type a paraphrase belongs to. The key point is how to use a variety of strategies to obtain the semantic similarity of two sentences. As the system is mainly intended for question answering (QA), evaluations are conducted on a corpus of sentence pairs mainly collected from a QA system, Baidu zhidao. Results show that the precision exceeds 75% on simple sentences whose syntactic analyses are correct, which is significantly higher than most existing methods.
Conference Paper
In most applications of paraphrasing, contextual information should be considered since a word may have different paraphrases in different contexts. This paper presents a method that automatically acquires lexical context-specific paraphrases from the web. The method includes two main stages, candidate paraphrase extraction and paraphrase validation. Evaluations were conducted on a news title corpus whereby the context-specific paraphrasing method was compared with the Chinese synonymous thesaurus. Results show that the precision of our method is above 60% and the recall is above 55%, which outperforms the thesaurus significantly.
Conference Paper
Lexical paraphrasing aims at acquiring word-level paraphrases. It is critical for many Natural Language Processing (NLP) applications, such as Question Answering (QA), Information Extraction (IE), and Machine Translation (MT). Since the meaning and usage of a word can vary in distinct contexts, different paraphrases should be acquired according to the contexts. However, most existing research focuses on constructing paraphrase corpora, in which few contextual constraints on paraphrase application are imposed. This paper presents a method that automatically acquires context-specific lexical paraphrases. In this method, the obtained paraphrases of a word depend on the specific sentence the word occurs in. Two stages are included, i.e. candidate paraphrase extraction and paraphrase validation, both of which are mainly based on web mining. Evaluations are conducted on a news title corpus and the presented method is compared with a paraphrasing method that exploits a Chinese thesaurus of synonyms -- Tongyi Cilin (Extended) (CilinE for short). Results show that the f-measure of our method (0.4852) is significantly higher than that using CilinE (0.1127). In addition, over 85% of the correct paraphrases derived by our method cannot be found in CilinE, which suggests that our method is effective in acquiring out-of-thesaurus paraphrases.
Conference Paper
One of the key issues in spoken language translation is how to deal with unrestricted expressions in spontaneous utterances. This research is centered on the development of a Chinese paraphraser that automatically paraphrases utterances prior to transfer in Chinese-Japanese spoken language translation. In this paper, a pattern-based approach to paraphrasing is proposed for which only morphological analysis is required. In addition, a pattern construction method is described through which paraphrasing patterns can be efficiently learned from a paraphrase corpus and human experience. Using the implemented paraphraser and the obtained patterns, a paraphrasing experiment was conducted and the results were evaluated.
Conference Paper
We apply statistical machine translation (SMT) tools to generate novel paraphrases of input sentences in the same language. The system is trained on large volumes of sentence pairs automatically extracted from clustered news articles available on the World Wide Web. Alignment Error Rate (AER) is measured to gauge the quality of the resulting corpus. A monotone phrasal decoder generates contextual replacements. Human evaluation shows that this system outperforms baseline paraphrase generation techniques and, in a departure from previous work, offers better coverage and scalability than the current best-of-breed paraphrasing approaches.
Article
We are trying to find paraphrases from Japanese news articles which can be used for Information Extraction. We focused on the fact that a single event can be reported in more than one article in different ways. However, certain kinds of noun phrases such as names, dates and numbers behave as "anchors" which are unlikely to change across articles. Our key idea is to identify these anchors among comparable articles and extract portions of expressions which share the anchors. This way we can extract expressions which convey the same information. Obtained paraphrases are generalized as templates and stored for future use.
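The anchor idea in this abstract can be sketched in a simplified form. The code below is an assumption-laden toy, not the cited system: it takes an already-known set of anchor strings (a real system would obtain them from named-entity tagging), works on whole tokenized sentences rather than sub-sentential portions, and the function name `paired_templates` and the `min_shared` threshold are hypothetical.

```python
def paired_templates(sent_a, sent_b, anchors, min_shared=2):
    """If two sentences share enough anchors, replace the shared anchors
    with numbered slots and return the two generalized expressions."""
    # Shared anchors, deduplicated, in order of first appearance in sent_a.
    shared = list(dict.fromkeys(
        tok for tok in sent_a if tok in anchors and tok in sent_b
    ))
    if len(shared) < min_shared:
        return None
    slot = {tok: f"<{i}>" for i, tok in enumerate(shared)}

    def generalize(tokens):
        return " ".join(slot.get(tok, tok) for tok in tokens)

    return generalize(sent_a), generalize(sent_b)

# Hypothetical sentences from two comparable news articles.
sent_a = "Acme acquired Widgets on Monday".split()
sent_b = "Widgets was bought by Acme on Monday".split()
anchors = {"Acme", "Widgets", "Monday"}

print(paired_templates(sent_a, sent_b, anchors))
```

Because the same anchor maps to the same slot number in both sentences, the resulting pair records how the arguments line up across the two phrasings.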
Article
In the 2002 State of the Union address and at the West Point graduation ceremony, President Bush articulated a new strategy for the United States in dealing with the threat of terrorism with weapons of mass destruction: preventive war. The president argued pessimistically that "time is not on our side" and that we could not afford to "wait on events, while dangers gather." Americans must be ready for "pre-emptive action" to defend our lives.
Article
We introduce a method for learning query transformations that improves the ability to retrieve answers to questions from an information retrieval system. During the training stage the method involves automatically learning phrase features for classifying questions into different types, automatically generating candidate query transformations from a training set of question/answer pairs, and automatically evaluating the candidate transforms on target information retrieval systems such as real-world general purpose search engines. At run time, questions are transformed into a set of queries, and re-ranking is performed on the documents retrieved. We present a prototype search engine, Tritus, that applies the method to web search engines. Blind evaluation on a set of real queries from a web search engine log shows that the method significantly outperforms the underlying web search engines as well as a commercial search engine specializing in question answering. Keywords: Web search, query expansion, question answering, information retrieval.
Article
This paper studies the potential of identifying lexical paraphrases within a single corpus, focusing on the extraction of verb paraphrases. Most previous approaches detect individual paraphrase instances within a pair (or set) of "comparable" corpora, each of them containing roughly the same information, and rely on the substantial level of correspondence of such corpora. We present a novel method that successfully detects isolated paraphrase instances within a single corpus without relying on any a-priori structure and information. A comparison suggests that an instance-based approach may be combined with a vector-based approach in order to assess better the paraphrase likelihood for many verb pairs.
Article
Automatically acquiring synonymous words (synonyms) from corpora is a challenging task. For this task, methods that use only one kind of resource are inadequate because of low precision or low recall. To improve the performance of synonym extraction, we propose a method to extract synonyms with multiple resources including a monolingual dictionary, a bilingual corpus, and a large monolingual corpus. This approach uses an ensemble to combine the synonyms extracted by individual extractors which use the three resources. Experimental results prove that the three resources are complementary to each other on synonym extraction, and that the ensemble method we used is very effective in improving both the precision and the recall of extracted synonyms.
Article
One of the main challenges in question-answering is the potential mismatch between the expressions in questions and the expressions in texts. While humans appear to use inference rules such as "X writes Y" implies "X is the author of Y" in answering questions, such rules are generally unavailable to question-answering systems due to the inherent difficulty in constructing them. In this paper, we present an unsupervised algorithm for discovering inference rules from text. Our algorithm is based on an extended version of Harris' Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus. Essentially, if two paths tend to link the same set of words, we hypothesize that their meanings are similar. We use examples to show that our system discovers many inference rules easily missed by humans.
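The core intuition, that two paths are similar if they tend to link the same words, can be sketched as follows. This is a deliberately simplified stand-in and not the algorithm of the cited paper: it uses plain Jaccard overlap of slot fillers in place of any corpus-weighted similarity, it assumes the dependency paths have already been extracted as (X filler, path, Y filler) triples, and all names and data are hypothetical.

```python
from collections import defaultdict

def slot_fillers(triples):
    """Collect, for every path, the sets of words seen in its X and Y slots."""
    fillers = defaultdict(lambda: {"X": set(), "Y": set()})
    for x, path, y in triples:
        fillers[path]["X"].add(x)
        fillers[path]["Y"].add(y)
    return fillers

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def path_similarity(fillers, p1, p2):
    """Geometric mean of the X-slot and Y-slot overlaps (a choice of this sketch)."""
    sim_x = jaccard(fillers[p1]["X"], fillers[p2]["X"])
    sim_y = jaccard(fillers[p1]["Y"], fillers[p2]["Y"])
    return (sim_x * sim_y) ** 0.5

# Hypothetical triples standing in for paths from a parsed corpus.
triples = [
    ("Tolstoy", "X writes Y", "War and Peace"),
    ("Austen", "X writes Y", "Emma"),
    ("Tolstoy", "X is the author of Y", "War and Peace"),
    ("Austen", "X is the author of Y", "Emma"),
    ("Tolstoy", "X reads Y", "Emma"),
]
fillers = slot_fillers(triples)
print(path_similarity(fillers, "X writes Y", "X is the author of Y"))
print(path_similarity(fillers, "X writes Y", "X reads Y"))
```

On this toy data the two authorship paths share all their fillers and score higher than the pair involving "X reads Y", which shares only some of them.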
Conference Paper
Abundant Chinese paraphrasing resources can be obtained on the Internet from different Chinese translations of the same foreign masterpiece. A paraphrase corpus is a corpus of sentence pairs that convey the same information. The irregular characteristics of real monolingual parallel texts, especially the lack of strictly aligned paragraph boundaries between two translations, pose a challenge to alignment technology, and traditional alignment methods for bilingual texts have difficulty handling them. This paper describes a new method for aligning real monolingual parallel texts using sentence-pair length and location information. The model was motivated by the observation that the location of a sentence pair of a given length is distributed similarly throughout the whole text. So far, a paraphrase corpus of about fifty thousand sentence pairs has been constructed.
Conference Paper
Automatically acquiring synonymous collocation pairs such as <turn on, OBJ, light> and <switch on, OBJ, light> from corpora is a challenging task. For this task, we can, in general, have a large monolingual corpus and/or a very limited bilingual corpus. Methods that use monolingual corpora alone or bilingual corpora alone are inadequate because of low precision or low coverage. In this paper, we propose a method that uses both these resources to get an optimal compromise of precision and coverage. This method first gets candidates of synonymous collocation pairs based on a monolingual corpus and a word thesaurus, and then selects the appropriate pairs from the candidates using their translations in a second language. The translations of the candidates are obtained with a statistical translation model which is trained with a small bilingual corpus and a large monolingual corpus. The translation information proves effective for selecting synonymous collocation pairs. Experimental results indicate that the average precision and recall of our approach are 74% and 64% respectively, which outperform those of methods that use only monolingual corpora and those that use only bilingual corpora.
Article
We address the text-to-text generation problem of sentence-level paraphrasing --- a phenomenon distinct from and more difficult than word- or phrase-level paraphrasing. Our approach applies multiple-sequence alignment to sentences gathered from unannotated comparable corpora: it learns a set of paraphrasing patterns represented by word lattice pairs and automatically determines how to apply these patterns to rewrite new sentences. The results of our evaluation experiments show that the system derives accurate paraphrases, outperforming baseline systems.
Article
While paraphrasing is critical both for interpretation and generation of natural language, current systems use manual or semi-automatic methods to collect paraphrases. We present an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text. Our approach yields phrasal and single word lexical paraphrases as well as syntactic paraphrases.
Graeme Hirst. Paraphrasing Paraphrased. In Proceedings of the Second International Workshop on Paraphrasing, 2003.
Jinshan Ma, Yu Zhang, Ting Liu, and Sheng Li. A Statistical Dependency Parser of Chinese under Small Training Data. Workshop: Beyond shallow analyses - Formalisms and statistical modeling for deep analyses, IJCNLP-04, 4 2004.
Hal Daumé III and Daniel Marcu. Acquiring paraphrase templates from document/abstract pairs. In NL Seminar in ISI, 2003.
Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. The Second International Workshop on Paraphrasing: Paraphrase Acquisition and Applications, 2003.