Article

Bootstrapping for extracting relations from large corpora

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

A new approach of relation extraction is described in this paper. It adopts a bootstrapping model with a novel iteration strategy, which generates more precise examples of specific relation. Compared with previous methods, the proposed method has three main advantages: first, it needs less manual intervention; second, more abundant and reasonable information are introduced to represent a relation pattern; third, it reduces the risk of circular dependency occurrence in bootstrapping. Scalable evaluation methodology and metrics are developed for our task with comparable techniques over TianWang 100G corpus. The experimental results show that it can get 90% precision and have excellent expansibility.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

Conference Paper
Full-text available
The entity and relation recognition, i.e. (1) assigning semantic classes to entities in a sentence, and (2) determining the relations held between entities, is an important task in areas such as information extraction. Subtasks (1) and (2) are typically carried out sequentially, but this approach is problematic: the errors made in subtask (1) are propagated to subtask (2) with an accumulative effect; and, the information available only in subtask (2) cannot be used in subtask (1). To address this problem, we propose a method that allows subtasks (1) and (2) to be associated more closely with each other. The process is performed in three stages: firstly, employing two classifiers to do subtasks (1) and (2) independently; secondly, recognizing an entity by taking all the entities and relations into account, using a model called the Entity Relation Propagation Diagram; thirdly, recognizing a relation based on the results of the preceding stage. The experiments show that the proposed method can improve the entity and relation recognition in some degree.
Conference Paper
Full-text available
We propose a tree-similarity-based unsupervised learning method to extract relations between Named Entities from a large raw corpus. Our method regards relation extraction as a clustering problem on shallow parse trees. First, we modify previous tree kernels on relation extraction to estimate the similarity between parse trees more efficiently. Then, the similarity between parse trees is used in a hierarchical clustering algorithm to group entity pairs into different clusters. Finally, each cluster is labeled by an indicative word and unreliable clusters are pruned out. Evaluation on the New York Times (1995) corpus shows that our method outperforms the only previous work by 5 in F-measure. It also shows that our method performs well on both high-frequent and less-frequent entity pairs. To the best of our knowledge, this is the first work to use a tree similarity metric in relation clustering.
Conference Paper
Full-text available
Paraphrase recognition is a critical step for nat- ural language interpretation. Accordingly, many NLP applications would benefit from high coverage knowledge bases of paraphrases. However, the scal- ability of state-of-the-art paraphrase acquisition ap- proaches is still limited. We present a fully unsuper- vised learning algorithm for Web-based extraction of entailment relations, an extended model of para- phrases. We focus on increased scalability and gen- erality with respect to prior work, eventually aiming at a full scale knowledge base. Our current imple- mentation of the algorithm takes as its input a verb lexicon and for each verb searches the Web for re- lated syntactic entailment templates. Experiments show promising results with respect to the ultimate goal, achieving much better scalability than prior Web-based methods.
Article
Full-text available
In the 2002 State of the Union address and at the West Point graduation ceremony, President Bush articulated a new strategy for the United States in dealing with the threat of terrorism with weapons of mass destruction: preventive war. The president argued pessimistically that "time is not on our side" and that we could not afford to "wait on events, while dangers gather." Americans must be ready for "pre-emptive action" to defend our lives.
Article
Full-text available
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, that in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
Article
One of the main challenges in question-answering is the potential mismatch between the expressions in questions and the expressions in texts. While humans appear to use infer-ence rules such as "X writes Y" implies "X is the author of Y" in answering questions, such rules are generally unavailable to question-answering systems due to the inherent difficulty in constructing them. In this paper, we present an unsupervised algorithm for discovering inference rules from text. Our algorithm is based on an extended version of Harris' Distributional Hypothesis, which states that words that occurred in the same con-texts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus. Essentially, if two paths tend to link the same set of words, we hypothesize that their meanings are similar. We use examples to show that our system discovers many inference rules easily missed by humans.
Article
Automatic paraphrase discovery is an important but challenging task. We propose an unsupervised method to discover paraphrases from a large untagged corpus, without requiring any seed phrase or other cue. We focus on phrases which connect two Named En-tities (NEs), and proceed in two stages. The first stage identifies a keyword in each phrase and joins phrases with the same keyword into sets. The second stage links sets which involve the same pairs of individual NEs. A total of 13,976 phrases were grouped. The ac-curacy of the sets in representing para-phrase ranged from 73% to 99%, depending on the NE categories and set sizes; the accuracy of the links for two evaluated domains was 73% and 86%.
Article
Discovering the significant relations embedded in documents would be very useful not only for information retrieval but also for question answering and summarization. Prior methods for relation discovery, however, needed large annotated corpora which cost a great deal of time and effort. We propose an unsupervised method for relation discovery from large corpora. The key idea is clustering pairs of named entities according to the similarity of context words intervening between the named entities. Our experiments using one year of newspapers reveals not only that the relations among named entities could be detected with high recall and precision, but also that appropriate labels could be automatically provided for the relations.
Article
This paper develops a method for recognizing relations and entities in sentences, while taking mutual dependencies among them into account. E.g., the kill (Johns, Oswald) relation in: "J. V. Oswald was murdered at JFK after his assassin, K. F. Johns..." depends on identifying Oswald and Johns as people, JFK being identified as a location, and the kill relation between Oswald and Johns; this, in turn, enforces that Oswald and Johns are people.
Conference Paper
. The World Wide Web is a vast resource for information. At the same time it is extremely distributed. A particular type of data such as restaurant lists may be scattered across thousands of independent information sources in many different formats. In this paper, we consider the problem of extracting a relation for such a data type from all of these sources automatically. We present a technique which exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample. To test our technique we use it to extract a relation of (author,title) pairs from the World Wide Web. 1 Introduction The World Wide Web provides a vast source of information of almost all types, ranging from DNA databases to resumes to lists of favorite restaurants. However, this information is often scattered among many web servers and hosts, using many different formats. If these chunks of information could be extracted from the World Wide Web and integrated into a str...
Article
In this paper we explore the power of surface text patterns for open-domain question answering systems. In order to obtain an optimal set of patterns, we have developed a method for learning such patterns automatically. A tagged corpus is built from the Internet in a bootstrapping process by providing a few hand-crafted examples of each question type to Altavista. Patterns are then automatically extracted from the returned documents and standardized. We calculate the precision of each pattern, and the average precision for each question type. These patterns are then applied to find answers to new questions.
Article
When applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large quantity of labeled data, this paper advocates using a small amount of seed information and a large collection of easily-obtained unlabeled data. Bootstrapping initializes a learner with the seed information; it then iterates, applying the learner to calculate labels for the unlabeled data, and incorporating some of these labels into the training input for the learner. Two case studies of this approach are presented. Bootstrapping for information extraction provides 76% precision for a 250-word dictionary for extracting locations from web pages, when starting with just a few seed locations. Bootstrapping a text classifier from a few keywords per class and a class hierarchy provides accuracy of 66%, a...
Domain portable pattern acquisition with dynamic training corpus
  • Xingjie Zeng
  • Fang Li
  • Dongmo Zhang
  • X. Zeng
Xingjie Zeng, Fang Li, and Dongmo Zhang. Domain portable pattern acquisition with dynamic training corpus. Computer Simulation, 22(2005)4, 259–263 (in Chinese).
A bootstrapping method for acquisition of bi-relations and bi-relational patterns
  • Jifa Jiang
  • Shuxi Wang
  • J. Jiang