Conference Paper

Combining Sentence Length with Location Information to Align Monolingual Parallel Texts


Abstract

Abundant Chinese paraphrase resources can be obtained on the Internet from the different Chinese translations of a single foreign masterpiece. A paraphrase corpus is a corpus of sentence pairs that convey the same information. The irregular characteristics of real monolingual parallel texts, especially the absence of strictly aligned paragraph boundaries between two translations, pose a challenge for alignment technology, and traditional alignment methods designed for bilingual texts have difficulty handling them. This paper describes a new method for aligning real monolingual parallel texts using sentence-pair length and location information. The model is motivated by the observation that a sentence pair of a given length tends to be located at a similar position within each of the two texts. Using this method, a paraphrase corpus of about fifty thousand sentence pairs has been constructed.
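As an illustration only (not taken from the paper), the core intuition can be sketched as a similarity score that rewards candidate sentence pairs whose lengths agree and whose relative positions in the two translations are close. All names, weights, and the scoring form below are hypothetical; Python is used for the sketch.

    # Hypothetical sketch of a length-plus-location score for a candidate
    # sentence pair drawn from two translations of the same source text.
    def pair_score(len_a, idx_a, total_a, len_b, idx_b, total_b,
                   w_len=0.5, w_loc=0.5):
        # Length similarity: 1.0 when the sentences have equal length.
        length_sim = min(len_a, len_b) / max(len_a, len_b)
        # Location similarity: 1.0 when the sentences sit at the same
        # relative position in their respective texts.
        loc_sim = 1.0 - abs(idx_a / total_a - idx_b / total_b)
        return w_len * length_sim + w_loc * loc_sim

    # Example: sentence 120 of 1000 in translation A vs. sentence 131
    # of 1100 in translation B, with lengths 42 and 45 characters.
    print(pair_score(42, 120, 1000, 45, 131, 1100))

Pairs scoring above some threshold would then be candidates for alignment; the actual model in the paper combines these cues statistically rather than with fixed weights.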


Article
Rather than creating and storing thousands of paraphrase examples, paraphrase templates have strong representation capacity and can be used to generate many paraphrase examples. This paper describes a new template representation and generalization method. Combining a semantic dictionary, it uses multiple semantic codes to represent a paraphrase template, and an existing search engine is used to extend the word clusters and generalize the examples. We also design three metrics to measure our generalized templates. The experimental results show that the representation method is reasonable and that the generalized templates have higher precision and coverage.
Article
Full-text available
We describe a set of paraphrase patterns for questions which we derived from a corpus of questions, and report the result of using them in the automatic recognition of question paraphrases. The aim of our paraphrase patterns is to factor out different syntactic variations of interrogative words, since the interrogative part of a question adds a syntactic superstructure on the sentence part (i.e., the rest of the question), thereby making it difficult for an automatic system to analyze the question. The patterns we derived are rules which map surface syntactic structures to semantic case frames, which serve as the canonical representation of questions. We also describe the process in which we acquired question paraphrases, which we used as the test data. The results obtained by using the patterns in paraphrase recognition were quite promising.
Article
Full-text available
Automatic evaluation of translation quality has proved to be useful when the target language is English. In this paper the evaluation of translation into Japanese is studied. An existing method based on n-gram similarity between translations and reference sentences is difficult to apply to the evaluation of Japanese because of the agglutinativeness and variation of semantically similar expressions in Japanese. The proposed method applies a set of paraphrasing rules to the reference sentences in order to increase the similarity score for the expressions that differ only in their writing styles. Experimental results show the paraphrasing rules improved the correlation between automatic evaluation and human evaluation from 0.80 to 0.93.
Conference Paper
Full-text available
In this paper we present a noun phrase coreference resolution system which aims to enhance the identification of the coreference realized by string matching. For this purpose, we make two extensions to the standard learning-based resolution framework. First, to improve the recall rate, we introduce an additional set of features to capture the different matching patterns between noun phrases. Second, to improve the precision, we modify the instance selection strategy to allow non-anaphors to be included during training instance generation. The evaluation done on the MEDLINE data set shows that the combination of the two extensions provides significant gains in the F-measure.
Conference Paper
Full-text available
Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts, based on a simple statistical model of character lengths. The method was developed and tested on a small trilingual sample of Swiss economic reports. A much larger sample of 90 million words of Canadian Hansards has been aligned and donated to the ACL/DCI.
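For readers unfamiliar with the technique, the following is a minimal Python sketch of a Gale-and-Church-style length-based alignment: a dynamic program over 1-1, 1-0, 0-1, 2-1, and 1-2 "beads", scored by how far the paired character lengths deviate from the expected ratio. The cost form and the bead penalties here are illustrative placeholders, not the paper's actual parameters.

    import math

    def length_cost(x, y, c=1.0, s2=6.8):
        # Gale & Church model a normalized length difference as roughly
        # standard normal; the squared value stands in for -log density.
        if x == 0 and y == 0:
            return 0.0
        mean = (x + y / c) / 2.0
        delta = (y - x * c) / math.sqrt(max(mean * s2, 1e-9))
        return delta * delta

    def align(src_lens, tgt_lens):
        # Bead types with placeholder match penalties (0-0 is excluded).
        PEN = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}
        n, m = len(src_lens), len(tgt_lens)
        INF = float("inf")
        best = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if best[i][j] == INF:
                    continue
                for (di, dj), pen in PEN.items():
                    ni, nj = i + di, j + dj
                    if ni > n or nj > m:
                        continue
                    cost = best[i][j] + pen + length_cost(
                        sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                    if cost < best[ni][nj]:
                        best[ni][nj] = cost
                        back[ni][nj] = (i, j)
        # Recover the bead sequence by walking the backpointers.
        beads, i, j = [], n, m
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            beads.append(((pi, i), (pj, j)))
            i, j = pi, pj
        return beads[::-1]

    # Example: character lengths of sentences in two texts.
    print(align([20, 51, 30], [22, 48, 14, 15]))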
Article
Full-text available
We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique.
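As a hedged illustration of the dynamic programming this abstract alludes to, below is a compact Python version of the standard O(k|s||t|) recursion for the gap-weighted subsequence kernel (following the usual Lodhi et al. formulation); the variable names are mine and the decay value is an arbitrary choice.

    def ssk(s, t, k, lam=0.5):
        # Kp[i][x][y] = K'_i(s[:x], t[:y]) in the usual formulation:
        # a weighted count of length-i subsequences, with decay counted
        # to the end of each prefix.
        n, m = len(s), len(t)
        Kp = [[[0.0] * (m + 1) for _ in range(n + 1)] for _ in range(k)]
        for x in range(n + 1):
            for y in range(m + 1):
                Kp[0][x][y] = 1.0
        for i in range(1, k):
            for x in range(1, n + 1):
                kpp = 0.0  # running K''_i(s[:x], t[:y]) over y
                for y in range(1, m + 1):
                    kpp = lam * kpp + (
                        lam * lam * Kp[i - 1][x - 1][y - 1]
                        if s[x - 1] == t[y - 1] else 0.0)
                    Kp[i][x][y] = lam * Kp[i][x - 1][y] + kpp
        # Final kernel: close off every matching character pair.
        total = 0.0
        for x in range(1, n + 1):
            for y in range(1, m + 1):
                if s[x - 1] == t[y - 1]:
                    total += lam * lam * Kp[k - 1][x - 1][y - 1]
        return total

    # "car" vs "cat" share the length-2 subsequence "ca".
    print(ssk("car", "cat", 2))

In practice the kernel is usually normalized by sqrt(ssk(s, s, k) * ssk(t, t, k)) so that document length does not dominate the similarity.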
Article
Full-text available
We are trying to find paraphrases from Japanese news articles which can be used for Information Extraction. We focused on the fact that a single event can be reported in more than one article in different ways. However, certain kinds of noun phrases such as names, dates and numbers behave as "anchors" which are unlikely to change across articles. Our key idea is to identify these anchors among comparable articles and extract portions of expressions which share the anchors. This way we can extract expressions which convey the same information. Obtained paraphrases are generalized as templates and stored for future use.
Article
Full-text available
We present a Question Answering system for technical domains which makes an intelligent use of paraphrases to increase the likelihood of finding the answer to the user's question. The system implements a simple and efficient logic representation of questions and answers that maps paraphrases to the same underlying semantic representation.
Article
Full-text available
We used the memory-based learner Timbl (Daelemans et al., 2002) to find names in English and German newspaper text. A first system used only the training data, and a number of gazetteers. The results show that gazetteers are not beneficial in the English case, while they are for the German data. Type-token generalization was applied, but also reduced performance.
Article
Full-text available
We introduce and analyze a new algorithm for linear classification which combines Rosenblatt's perceptron algorithm with Helmbold and Warmuth's leave-one-out method. Like Vapnik's maximal-margin classifier, our algorithm takes advantage of data that are linearly separable with large margins. Compared to Vapnik's algorithm, however, ours is much simpler to implement, and much more efficient in terms of computation time. We also show that our algorithm can be efficiently used in very high dimensional spaces using kernel functions. We performed some experiments using our algorithm, and some variants of it, for classifying images of handwritten digits. The performance of our algorithm is close to, but not as good as, the performance of maximal-margin classifiers on the same problem, while saving significantly on computation time and programming effort.
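To make the flavor of the algorithm concrete, here is a small Python/NumPy sketch of a voted perceptron in this Freund-and-Schapire style: each intermediate weight vector casts a vote weighted by the number of examples it survived. This is a generic illustration, not the authors' code, and the kernelized variant is omitted.

    import numpy as np

    def voted_perceptron(X, y, epochs=10):
        # X: (n_samples, n_features); y: labels in {-1, +1}.
        w, c, voters = np.zeros(X.shape[1]), 1, []
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:      # mistake: retire w with its vote count
                    voters.append((w.copy(), c))
                    w, c = w + yi * xi, 1
                else:
                    c += 1                  # w survived one more example
        voters.append((w.copy(), c))
        return voters

    def predict(voters, x):
        # Each stored vector votes sign(w.x), weighted by survival count.
        s = sum(c * np.sign(w @ x) for w, c in voters)
        return 1.0 if s >= 0 else -1.0

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    print(predict(voted_perceptron(X, y), np.array([1.5, 1.5])))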
Article
Full-text available
We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment. While much work has already been done on the automatic alignment of parallel corpora (Brown et al. 1991; Kay & Roscheisen 1993; Gale & Church 1993; Church 1993; Chen 1993; Wu 1994), there are several problems which have not been fully addressed by many of these alignment algorithms. First, many corpora are noisy; segments from the source language can be totally missing from the target language or can be substituted with a target language segment which is not a translation. Similarly, the target language may include segments whose translation is totally missing from the source corpus. ...
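Dynamic Time Warping itself is standard; the sketch below shows the textbook O(nm) recurrence in Python over two numeric signals. For DK-vec one would feed in something like the position or recency profiles of a candidate word pair; that feature extraction is not shown here, and the distance function is my own choice.

    def dtw(a, b, dist=lambda x, y: abs(x - y)):
        # Classic dynamic programming: D[i][j] is the cheapest warped
        # match of a[:i] against b[:j]; moves are match/skip/skip.
        n, m = len(a), len(b)
        INF = float("inf")
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = dist(a[i - 1], b[j - 1])
                D[i][j] = d + min(D[i - 1][j - 1],  # match both
                                  D[i - 1][j],      # skip in a
                                  D[i][j - 1])      # skip in b
        return D[n][m]

    # Two similar "recency" profiles sampled at different rates.
    print(dtw([0, 2, 5, 9, 14], [0, 3, 5, 8, 10, 14]))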
Article
Full-text available
This paper describes RESOLVE, a system that uses decision trees to learn how to classify coreferent phrases in the domain of business joint ventures. An experiment is presented in which the performance of RESOLVE is compared to the performance of a manually engineered set of rules for the same task. The results show that decision trees achieve higher performance than the rules in two of three evaluation metrics developed for the coreference task. In addition to achieving better performance than the rules, RESOLVE provides a framework that facilitates the exploration of the types of knowledge that are useful for solving the coreference problem. The goal of an Information Extraction (IE) system is to identify information of interest from a collection of texts. Within a particular text, objects of interest are often referenced in different places and in different ways. One of the many challenges facing an IE system is to determine which references refer to which ...
Article
Full-text available
In the 2002 State of the Union address and at the West Point graduation ceremony, President Bush articulated a new strategy for the United States in dealing with the threat of terrorism with weapons of mass destruction: preventive war. The president argued pessimistically that "time is not on our side" and that we could not afford to "wait on events, while dangers gather." Americans must be ready for "pre-emptive action" to defend our lives.
Article
Full-text available
The problem of organizing information for multidocument summarization so that the generated summary is coherent has received relatively little attention. While sentence ordering for single document summarization can be determined from the ordering of sentences in the input article, this is not the case for multidocument summarization where summary sentences may be drawn from different input articles. In this paper, we propose a methodology for studying the properties of ordering information in the news genre and describe experiments done on a corpus of multiple acceptable orderings we developed for the task. Based on these experiments, we implemented a strategy for ordering information that combines constraints from chronological order of events and topical relatedness. Evaluation of our augmented algorithm shows a significant improvement of the ordering over two baseline strategies.
Article
Full-text available
In this paper, we present a learning approach to coreference resolution of noun phrases in unrestricted text. The approach learns from a small, annotated corpus and the task includes resolving not just a certain type of noun phrase (e.g., pronouns) but rather general noun phrases. It also does not restrict the entity types of the noun phrases; that is, coreference is assigned whether they are of "organization," "person," or other types. We evaluate our approach on common data sets (namely, the MUC-6 and MUC-7 coreference corpora) and obtain encouraging results, indicating that on the general noun phrase coreference task, the learning approach holds promise and achieves accuracy comparable to that of nonlearning approaches. Our system is the first learning-based system that offers performance comparable to that of state-of-the-art nonlearning systems on these data sets.
Article
Full-text available
In a recent paper, Gale and Church describe an inexpensive method for aligning bitext, based exclusively on sentence lengths [Gale and Church, 1991]. While this method produces surprisingly good results (a success rate around 96%), even better results are required to perform such tasks as the computer-assisted revision of translations. In this paper, we examine some of the weaknesses of Gale and Church's program, and explain how just a small amount of linguistic knowledge would help to overcome these weaknesses. We discuss how cognates provide for a cheap and reasonably reliable source of linguistic knowledge. To illustrate this, we describe a modification to the program in which the criterion is cognates rather than sentence lengths. Finally, we show how better and more efficient results may be obtained by combining the two criteria --- length and "cognateness". Our method can be generalized to accommodate other sources of linguistic knowledge, and experimentation shows that it produc...
Article
Full-text available
There have been a number of recent papers on aligning parallel texts at the sentence level, e.g., Brown et al. (1991), Gale and Church (to appear), Isabelle (1992), Kay and Roscheisen (to appear), Simard et al. (1992), Warwick-Armstrong and Russell (1990). On clean inputs, such as the Canadian Hansards, these methods have been very successful (at least 96% correct by sentence). Unfortunately, if the input is noisy (due to OCR and/or unknown markup conventions), then these methods tend to break down because the noise can make it difficult to find paragraph boundaries, let alone sentences. This paper describes a new program, char_align, that aligns texts at the character level rather than at the sentence/paragraph level, based on the cognate approach proposed by Simard et al. Parallel texts have recently received considerable attention in machine translation (e.g., Brown et al., 1990), bilingual lexicography (e.g., Klavans and Tzoukermann, 1990), and terminology resea...
Article
Full-text available
A two-tier model for the description of morphological, syntactic and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the medical and agricultural domains. Five different sources of morphological and semantic knowledge are exploited (MULTEXT, CELEX, AGROVOC, WordNet 1.6, and the Microsoft Word97 thesaurus). In Proceedings, 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), pages 341-348, University of Maryland, June 20-26, 1999.
Article
In this paper, we propose an answer seeking algorithm for question answering that integrates structural matching and paraphrasing, and report the results of our empirical evaluation conducted with the aim of examining effects of incorporating those two components. According to the results, the contribution of structural matching and paraphrasing was not so large as expected. Based on error analysis, we conclude that structural matching-based approaches to answer seeking require technologies for (a) coreference resolution, (b) processing of parse forests instead of parse trees, and
Article
One of the main challenges in question-answering is the potential mismatch between the expressions in questions and the expressions in texts. While humans appear to use inference rules such as "X writes Y" implies "X is the author of Y" in answering questions, such rules are generally unavailable to question-answering systems due to the inherent difficulty in constructing them. In this paper, we present an unsupervised algorithm for discovering inference rules from text. Our algorithm is based on an extended version of Harris' Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus. Essentially, if two paths tend to link the same set of words, we hypothesize that their meanings are similar. We use examples to show that our system discovers many inference rules easily missed by humans.
Article
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
Article
This book is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation learning system based on recent advances in statistical learning theory. The book also introduces Bayesian analysis of learning and relates SVMs to Gaussian Processes and other kernel based learning methods. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, biosequence analysis, etc. Their first introduction in the early 1990s led to a recent explosion of applications and deepening theoretical analysis, which has now established Support Vector Machines along with neural networks as one of the standard tools for machine learning and data mining. Students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and application of these techniques. The concepts are introduced gradually in accessible and self-contained stages, though in each stage the presentation is rigorous and thorough. Pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. Equally the book will equip the practitioner to apply the techniques and an associated web site will provide pointers to updated literature, new applications, and on-line software.
Conference Paper
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and are classified into categories such as person, organization, and date. It is a key technology of Information Extraction and Open-Domain Question Answering. First, we show that an NE recognizer based on Support Vector Machines (SVMs) gives better scores than conventional systems. However, off-the-shelf SVM classifiers are too inefficient for this task. Therefore, we present a method that makes the system substantially faster. This approach can also be applied to other similar tasks such as chunking and part-of-speech tagging. We also present an SVM-based feature selection method and an efficient training method.
Conference Paper
We extend previous work on tree kernels to estimate the similarity between the dependency trees of sentences. Using this kernel within a Support Vector Machine, we detect and classify relations between entities in the Automatic Content Extraction (ACE) corpus of news articles. We examine the utility of different features such as Wordnet hypernyms, parts of speech, and entity types, and find that the dependency tree kernel achieves a 20% F1 improvement over a "bag-of-words" kernel.
Article
We present an application of kernel methods to extracting relations from unstructured natural language sources. We introduce kernels defined over shallow parse representations of text, and design efficient algorithms for computing the kernels. We use the devised kernels in conjunction with Support Vector Machine and Voted Perceptron learning algorithms for the task of extracting person-affiliation and organization-location relations from text. We experimentally evaluate the proposed methods and compare them with feature-based learning algorithms, with promising results.
Article
In this paper, we describe a new corpus-based approach to prepositional phrase attachment disambiguation, and present results comparing performance of this algorithm with other corpus-based approaches to this problem.
Article
In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.
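For contrast with the Perceptron's additive step, here is a minimal Python sketch of the basic (unregularized) Winnow update on Boolean features; the promotion factor and threshold are conventional defaults, and the regularized variants this paper derives are not reproduced here.

    import numpy as np

    def winnow(X, y, eta=2.0):
        # X: (n_samples, n_features) with entries in {0, 1}; y in {0, 1}.
        n_features = X.shape[1]
        w = np.ones(n_features)
        theta = float(n_features)      # standard threshold choice
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= theta else 0
            if pred == 0 and yi == 1:
                w[xi == 1] *= eta      # promotion: multiply active weights
            elif pred == 1 and yi == 0:
                w[xi == 1] /= eta      # demotion: divide active weights
        return w, theta

    # Toy data: the label is just the first feature.
    X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
    y = np.array([1, 0, 1, 0])
    w, theta = winnow(X, y)
    print(w, theta)

Because mistakes multiply rather than add, irrelevant features decay quickly, which is the source of Winnow's advantage when most attributes are irrelevant.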
Article
This paper reports on a large-scale, end-to-end relation and event extraction system. At present, the system extracts a total of 100 types of relations and events, which represents a much wider coverage than is typical of extraction systems. The system consists of three specialized pattern-based tagging modules, a high-precision coreference resolution module, and a configurable template generation module. We report quantitative evaluation results, analyze the results in detail, and discuss future directions.
Article
While paraphrasing is critical both for interpretation and generation of natural language, current systems use manual or semi-automatic methods to collect paraphrases. We present an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text. Our approach yields phrasal and single word lexical paraphrases as well as syntactic paraphrases.
Article
In this paper, we present a nearly unsupervised learning methodology for automatically extracting paraphrases from the Web. Starting with one single linguistic expression of a semantic relationship, our learning algorithm repeatedly samples the Web, in order to build a corpus of potential new examples of the same relationship. Sampling steps alternate with validation steps, during which implausible paraphrases are filtered out using an EM-based unsupervised clustering procedure. This learning machinery is built on top of an existing question-answering (QA) system and the learnt paraphrases will eventually be used to improve its recall. We focus here on the learning aspect of this system and report preliminary results.
Article
Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. In this paper, we present a simple rule-based part of speech tagger which automatically acquires its rules and tags with accuracy comparable to stochastic taggers. The rule-based tagger has many advantages over these taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules, ease of finding and implementing improvements to the tagger, and better portability from one tag set, corpus genre or language to another. Perhaps the biggest contribution of this work is in demonstrating that the stochastic method is not the only viable method for part of speech tagging. The fact that a simple rule-based tagger that automatically learns its rules can perform so well should offer encouragement for researchers to further explore rule-based tagging, searching for a better and more expressive set of rule templates and other variations on the simple but effective theme described below.
Article
In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ignore word identities and only consider sentence length (Brown et al., 1991b; Gale and Church, 1991). Our algorithm constructs a simple statistical word-to-word translation model on the fly during alignment. We find the alignment that maximizes the probability of generating the corpus with this translation model. We have achieved an error rate of approximately 0.4% on Canadian Hansard data, which is a significant improvement over previous results. The algorithm is language independent.
Article
We describe our experience with automatic alignment of sentences in parallel English-Chinese texts. Our report concerns three related topics: (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus; (2) experiments addressing the applicability of Gale & Church's (1991) length-based statistical method to the task of alignment involving a non-Indo-European language; and (3) an improved statistical method that also incorporates domain-specific lexical cues. Recently, a number of automatic techniques for aligning sentences in parallel bilingual corpora have been proposed (Kay & Roscheisen 1988; Catizone et al. 1989; Gale & Church 1991; Brown et al. 1991). Such corpora contain the same material that has been translated by human experts into two languages. The goal of alignment is to identify matching sentences between the languages. Alignment is the first...
Article
In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentences, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
Article
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore, they are fully automatic, eliminating the need for manual parameter tuning.
Article
The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance. First, we introduce MULDER, which we believe to be the first general-purpose, fully-automated question-answering system available on the web. Second, we describe MULDER's architecture, which relies on multiple search-engine queries, natural-language parsing, and a novel voting procedure to yield reliable answers coupled with high recall. Finally, we compare MULDER's performance to that of Google and AskJeeves on questions drawn from the TREC-8 question answering track. We find that MULDER's recall is more than a factor of three higher than that of AskJeeves. In addition, we find that Google requires 6.6 times as much user effort to achieve the same level of recall as MULDER.
Automatic Chinese-English paragraph segmentation and alignment
Wang Bin, Liu Qin, Zhang Xiang. 2000. Automatic Chinese-English Paragraph Segmentation and Alignment. Journal of Software, 11(11):1547-1553 (in Chinese)