Conference Paper

Identifying Relations for Open Information Extraction

Authors: Anthony Fader, Stephen Soderland, Oren Etzioni

Abstract

Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the ReVerb Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TextRunner and WOE^pos. More than 30% of ReVerb's extractions are at precision 0.8 or higher---compared to virtually none for earlier systems. The paper concludes with a detailed analysis of ReVerb's errors, suggesting directions for future work.
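For readers who want a concrete picture of the syntactic constraint, the sketch below applies a simplified verb-centered POS-tag filter in the spirit of the pattern described in the paper (roughly V | V P | V W* P); the tag classes, regular expression, and helper function are illustrative assumptions rather than the authors' implementation, and the lexical constraint (requiring a relation phrase to occur with many distinct argument pairs) is omitted.

```python
import re

# Simplified POS-tag classes for a verb-centered relation-phrase filter,
# loosely following the pattern V | V P | V W* P described in the paper.
# Illustrative sketch only, not the authors' implementation.
V = r"(?:VB[DGNPZ]?)"                           # verb tags
W = r"(?:NN[SP]*|JJ[RS]?|RB[RS]?|PRP\$?|DT)"    # noun/adj/adv/pron/det
P = r"(?:IN|RP|TO)"                             # prep/particle/inf. marker
REL_PATTERN = re.compile(rf"{V}(?:\s+{W})*(?:\s+{P})?$")

def is_relation_phrase(pos_tags):
    """Return True if a POS-tag sequence looks like a verb-based relation phrase."""
    return bool(REL_PATTERN.match(" ".join(pos_tags)))

# "has atomic weight of" -> VBZ JJ NN IN matches V W* P
print(is_relation_phrase(["VBZ", "JJ", "NN", "IN"]))  # True
print(is_relation_phrase(["VBZ"]))                    # True (bare verb)
print(is_relation_phrase(["NN", "IN"]))               # False (no verb)
```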


... This system does not presuppose a predefined set of relations and is targeted at all relations that can be extracted. Open IE is currently being developed in its second generation in systems such as ReVerb [12], OLLIE [12], and ClausIE [8], which extend previous Open IE systems such as TextRunner [2], StatSnowBall [26], and WOE [23]. Figure 1 summarizes the differences between traditional IE systems and the new information extraction systems, which are called Open IE [9,10]. ...
... Three grammatical structures in ReVerb [12]. ...
Preprint
Open Information Extraction (Open IE) systems aim to extract relation tuples in a highly scalable manner that is portable across domains, by identifying a variety of relation phrases and their arguments in arbitrary sentences. The first generation of Open IE learns linear chain models based on unlexicalized features such as Part-of-Speech (POS) or shallow tags to label the intermediate words between pairs of potential arguments and thereby identify extractable relations. Open IE is currently in its second generation, which is able to extract instances of the most frequently observed relation types, such as Verb, Noun and Prep, Verb and Prep, and Infinitive, using deep linguistic analysis. These systems expose simple yet principled ways in which verbs express relationships, such as verb phrase-based extraction or clause-based extraction, and obtain significantly higher performance than the first-generation systems. In this paper, we describe an overview of the two Open IE generations, including their strengths, weaknesses and application areas.
... f (·) returns the document frequency for a token. (5) Morph Normalization: We make use of multiple morphological normalization operations like tense removal, pluralization, capitalization and others as used in [12] for finding out equivalent NPs. We show in Section 8.2 that this information helps in improving performance. ...
... ReVerb45K is a significantly extended version of the Ambiguous dataset, containing more than 20x as many NPs. ReVerb45K is constructed by intersecting information from the following three sources: the ReVerb Open KB [12], Freebase entity linking information from [13], and the Clueweb09 corpus [7]. Firstly, for every triple in ReVerb, we extracted from the Clueweb09 corpus the source text from which the triple was generated. ...
... • Morphological Normalization: As used in [12], this involves applying simple normalization operations like removing tense, pluralization, capitalization etc. over NPs and relation phrases. • Paraphrase Database (PPDB): Using PPDB 2.0 [29], we clustered two NPs together if they happened to share a common paraphrase. ...
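For illustration only, the morphological normalization mentioned in these snippets (dropping case, plurals, and similar variation so that different surface forms of an NP collapse to one form) could be approximated with a few crude rules; the function below is a hypothetical stand-in, not the normalizer used by the cited systems.

```python
import re

def normalize_np(np_text):
    """Roughly normalize a noun phrase: lower-case, strip punctuation,
    and apply crude de-pluralization rules. Illustrative only."""
    tokens = re.findall(r"[a-z0-9]+", np_text.lower())
    normalized = []
    for tok in tokens:
        if tok.endswith("ies") and len(tok) > 4:
            tok = tok[:-3] + "y"      # "countries" -> "country"
        elif tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
            tok = tok[:-1]            # "presidents" -> "president"
        normalized.append(tok)
    return " ".join(normalized)

# Two surface forms that should map to the same canonical NP.
print(normalize_np("The Presidents of the US"))  # "the president of the us"
print(normalize_np("the president of the US"))   # "the president of the us"
```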
Preprint
Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manually defined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI) - a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.
... A verb has a unique role in a sentence because it maintains dependency relations with its syntactic arguments such as the subject and the object. Therefore, it is possible to use the distribution of the immediate arguments of a verb to represent its meaning, as in ReVerb (Fader, Soderland, and Etzioni 2011). Such an approach is a form of the "bag-of-words" (BoW) approach. ...
... In this section, we first show how we prepare the data for argument conceptualization. Then, we use some example concepts generated by our algorithm to show the advantage of our algorithm (AC) against selectional preference (SP), FrameNet (Baker, Fillmore, and Lowe 1998) and ReVerb (Fader, Soderland, and Etzioni 2011), as well as our baseline approach (BL) which considers equal weight for each argument (see Section 3). We also quantitatively evaluate the accuracies of AC, BL and SP on Probase. ...
Preprint
Verbs play an important role in the understanding of natural language text. This paper studies the problem of abstracting the subject and object arguments of a verb into a set of noun concepts, known as the "argument concepts". This set of concepts, whose size is parameterized, represents the fine-grained semantics of a verb. For example, the object of "enjoy" can be abstracted into time, hobby and event, etc. We present a novel framework to automatically infer human readable and machine computable action concepts with high accuracy.
... Fortunately, with the proliferation of general-purpose knowledge bases (or knowledge graphs), e.g., the Cyc project [Lenat and Guha 1989], Wikipedia, Freebase [Bollacker et al. 2008], KnowItAll [Etzioni et al. 2004], TextRunner [Banko et al. 2007], ReVerb [Fader et al. 2011], Ollie [Mausam et al. 2012], WikiTaxonomy [Ponzetto and Strube 2007], Probase [Wu et al. 2012], DBpedia [Auer et al. 2007], YAGO [Suchanek et al. 2007], NELL [Mitchell et al. 2015] and Knowledge Vault [Dong et al. 2014], we have an abundance of available world knowledge. We call these knowledge bases world knowledge [Gabrilovich and Markovitch 2005], because they contain universal knowledge that is either collaboratively annotated by human labelers or automatically extracted from big data. ...
... There exist some world knowledge bases. They are either collaboratively constructed by humans (such as Cyc project [Lenat and Guha 1989], Wikipedia, Freebase [Bollacker et al. 2008]) or automatically extracted from big data (such as KnowItAll [Etzioni et al. 2004], TextRunner [Banko et al. 2007], ReVerb [Fader et al. 2011], Ollie [Mausam et al. 2012], WikiTaxonomy [Ponzetto and Strube 2007], Probase [Wu et al. 2012], DBpedia [Auer et al. 2007], YAGO [Suchanek et al. 2007], NELL [Mitchell et al. 2015] and Knowledge Vault [Dong et al. 2014]). Since we assume the world knowledge is given, we skip this step in this study. ...
Preprint
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider the framework to use the world knowledge as indirect supervision. World knowledge is general-purpose knowledge, which is not designed for any specific domain. Then the key challenges are how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base, WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform the state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
... Uninformative Extraction. Following Fader et al. (2011), uninformative extractions are extractions that omit critical information. This type of error is caused by improper handling of relation phrases that are expressed by a combination of a verb with a noun, such as light verb constructions (LVCs). ...
... This class describes relations which were not found by a particular system. According to Fader et al. (2011), missing extractions are often caused by argument-finding heuristics choosing the wrong arguments, or failing to extract all possible arguments. One example is the case of coordinating conjunctions. ...
Preprint
Full-text available
We report results on benchmarking Open Information Extraction (OIE) systems using RelVis, a toolkit for benchmarking OIE systems. Our comprehensive benchmark contains three data sets from the news domain and one data set from Wikipedia, with a total of 4522 labeled sentences and 11243 binary or n-ary OIE relations. In our analysis of these data sets we compared the performance of four popular OIE systems: ClausIE, OpenIE 4.2, Stanford OpenIE and PredPatt. In addition, we evaluated the impact of five common error classes on a subset of 749 n-ary tuples. From our deep analysis we reveal important research directions for a next generation of OIE systems.
... Here we describe some important details of the data construction process. There exist several open IE algorithms in the literature, including TextRunner (Yates et al. 2007), ReVerb (Fader, Soderland, and Etzioni 2011), and OLLIE (Schmitz et al. 2012). The result of an open IE algorithm has the same format as an assertion. ...
... TextRunner (Yates et al. 2007) is a pioneering Open IE work which aims at constructing a general model that expresses a relation based on Part-of-Speech and Chunking features. ReVerb (Fader, Soderland, and Etzioni 2011) restricts the predicates to verbal phrases and extracts them based on grammatical structures. ClausIE (Del Corro and Gemulla 2013) employs hand-crafted grammatical patterns based on the result of dependency parse trees to detect and extract clause-based assertions. ...
Preprint
We present assertion based question answering (ABQA), an open domain question answering task that takes a question and a passage as inputs, and outputs a semi-structured assertion consisting of a subject, a predicate and a list of arguments. An assertion conveys more evidence than a short answer span in reading comprehension, and it is more concise than a tedious passage in passage-based QA. These advantages make ABQA more suitable for human-computer interaction scenarios such as voice-controlled speakers. Further progress towards improving ABQA requires a richer supervised dataset and powerful models of text understanding. To remedy this, we introduce a new dataset called WebAssertions, which includes hand-annotated QA labels for 358,427 assertions in 55,960 web passages. To address ABQA, we develop both generative and extractive approaches. The backbone of our generative approach is sequence to sequence learning. In order to capture the structure of the output assertion, we introduce a hierarchical decoder that first generates the structure of the assertion and then generates the words of each field. The extractive approach is based on learning to rank. Features at different levels of granularity are designed to measure the semantic relevance between a question and an assertion. Experimental results show that our approaches have the ability to infer question-aware assertions from a passage. We further evaluate our approaches by incorporating the ABQA results as additional features in passage-based QA. Results on two datasets show that ABQA features significantly improve the accuracy on passage-based QA.
... Typically it takes as input morpho-syntactically annotated text and produces a set of triples (E 1 , R, E 2 ), where E 1 and E 2 are entities and R is a relation in which E 1 and E 2 participate as a pair. In case of ontology induction or information extraction in open domain (as described, e.g., in [1], [2], [3], [4]) no restrictions are imposed on R. There are many types of relations that can be extracted this way, such as quality, part or behavior [5]. ...
... This method is additionally extended with a technique that we call pseudo-subclass boosting which increases the number of extracted relations. It is worth noting that automatic detection of IS-A patterns is possible. Experiments described in [18] show that hand-crafted ontologies like WordNet can be used successfully as a training set for such a pattern discovery task. ...
Preprint
Pattern-based methods of IS-A relation extraction rely heavily on so called Hearst patterns. These are ways of expressing instance enumerations of a class in natural language. While these lexico-syntactic patterns prove quite useful, they may not capture all taxonomical relations expressed in text. Therefore in this paper we describe a novel method of IS-A relation extraction from patterns, which uses morpho-syntactical annotations along with grammatical case of noun phrases that constitute entities participating in IS-A relation. We also describe a method for increasing the number of extracted relations that we call pseudo-subclass boosting which has potential application in any pattern-based relation extraction method. Experiments were conducted on a corpus of about 0.5 billion web documents in Polish language.
... This is straightforward from dependency parses, which are available for many languages. It is also possible without a parser, with some language-specific work (Fader et al., 2011). We describe our approach in Section 3. ...
... Our method relies on extracting binary predicates between entities from sentences. Various representations have been suggested for binary predicates, such as Reverb patterns (Fader et al., 2011), dependency paths (Lin and Pantel, 2001;Yao et al., 2011), and binarized predicate-argument relations derived from a CCG-parse (Lewis and Steedman, 2013). Our approach is formalism-independent, and is compatible with any method of expressing binary predicates. ...
... Based on this observation, this paper introduces a novel mechanism to paraphrase relations as summarized in Figure 1. NEWSSPIKE first applies the ReVerb open information extraction (IE) system (Fader et al., 2011) on the news streams to obtain a set of (a1, r, a2, t) tuples, where the ai are the arguments, r is a relation phrase, and t is the time-stamp of the corresponding news article. When (a1, a2, t) suggests a real-world event, the relation r of (a1, r, a2, t) is likely to describe that event (e.g. ...
... • Strip HTML tags from the news articles using Boilerpipe (Kohlschütter et al., 2010); keep only the title and first paragraph of each article. • Extract shallow relation tuples using the OpenIE system (Fader et al., 2011). ...
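A minimal sketch of the grouping step implied by these snippets, where time-stamped ReVerb tuples (a1, r, a2, t) that share the same argument pair and time-stamp are collected as paraphrase candidates for one event, might look like the following; the tuples themselves are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (arg1, relation phrase, arg2, time-stamp) tuples, e.g. as
# produced by running an Open IE extractor such as ReVerb over dated news.
tuples = [
    ("alice", "was appointed ceo of", "acme", "2011-05-02"),
    ("alice", "took over as chief of", "acme", "2011-05-02"),
    ("alice", "visited", "paris", "2011-06-10"),
]

# Group relation phrases by (arg1, arg2, time-stamp); phrases that co-occur
# for the same argument pair on the same day are paraphrase candidates.
events = defaultdict(list)
for a1, r, a2, t in tuples:
    events[(a1, a2, t)].append(r)

for key, phrases in events.items():
    if len(phrases) > 1:
        print(key, "->", phrases)
# ('alice', 'acme', '2011-05-02') -> ['was appointed ceo of', 'took over as chief of']
```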
... were extracted from ClueWeb09 using the ReVerb open IE system (Fader et al., 2011). Lin et al. (2012) released a subset of these triples where they were able to substitute the subject arguments with KB entities. ...
... Finally, although Freebase has thousands of properties, open information extraction (Banko et al., 2007; Fader et al., 2011; Mausam et al., 2012) and associated question answering systems (Fader et al., 2013) work over an even larger open-ended set of properties. The drawback of this regime is that the noise and the difficulty in canonicalization make it hard to perform reliable composition, thereby nullifying one of the key benefits of semantic parsing. ...
... Even the largest VQA datasets cannot contain all real-world concepts, so VQA models should know how to acquire external knowledge. Examples of such EKBs are large-scale KBs constructed by human annotation, e.g., DBpedia [11], Freebase [16], Wikidata [127], and KBs built by automatic extraction from unstructured/semi-structured data, e.g., YAGO [48], [80], OpenIE [12], [34], [35], NELL [22], NEIL [25], WebChild [118], ConceptNet [76]. ...
... Construction, organization, and querying of these knowledge bases are problems that have been studied for years. This has led to the development of large-scale KBs constructed by human annotation (e.g., DBpedia [11], Freebase [16], Wikidata [127]) and KBs constructed by automatic extraction from unstructured or semi-structured data (e.g., YAGO [48], [80], OpenIE [12], [34], [35], NELL [22], NEIL [25], WebChild [118], ConceptNet [76]). In structured KBs, knowledge is typically represented by a large number of triples of the form (arg1, rel, arg2). ...
Preprint
Full-text available
Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both the visual understanding of the image and the semantic understanding of the question, demonstrating reasoning capability. Since the inception of this field, a plethora of VQA datasets and models have been published. In this article, we meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category. We divide VQA datasets into four categories: (1) available datasets that contain a rich collection of authentic images, (2) synthetic datasets that contain only synthetic images produced through artificial means, (3) diagnostic datasets that are specially designed to test model performance in a particular area, e.g., understanding the scene text, and (4) KB (Knowledge-Based) datasets that are designed to measure a model's ability to utilize outside knowledge. Concurrently, we explore six main paradigms of VQA models: fusion, where we discuss different methods of fusing information between visual and textual modalities; attention, the technique of using information from one modality to filter information from another; external knowledge base, where we discuss different models utilizing outside information; composition or reasoning, where we analyze techniques to answer advanced questions that require complex reasoning steps; explanation, which is the process of generating visual and textual descriptions to verify sound reasoning; and graph models, which encode and manipulate relationships through nodes in a graph. We also discuss some miscellaneous topics, such as scene text understanding, counting, and bias reduction.
... One paradigm of OpenRE is open information extraction (OpenIE) (Etzioni et al., 2008; Fader et al., 2011), which directly extracts relation-related phrases from sentences based on syntactic analysis, but often produces redundant or ambiguous results. Therefore, another paradigm, namely unsupervised relation discovery (Yao et al., 2011; ElSahar et al., 2017), has been proposed to bridge the gap between closed-set RE and OpenIE by inducing new relations automatically. ...
... Methods for open relation extraction (OpenRE) can be categorized as tagging-based (Yates et al., 2007) and clustering-based (Hu et al., 2020; Simon et al., 2019; ElSahar et al., 2017). Tagging-based methods directly extract relation-related phrases from sentences based on syntactic analysis, but often produce redundant or ambiguous results (Fader et al., 2011). Therefore, clustering-based methods have drawn more attention. ...
... As an alternative, open information extraction (OIE) has been put forward [6], which extracts triples of the form (head noun phrase, relational phrase, tail noun phrase) from unstructured text; in this scenario, there is no need to designate an ontology in advance. The extracted triples together constitute a large open knowledge base (OKB), such as ReVerb [7]. While OKBs come with advantages over CKBs in terms of coverage and diversity, there is a prominent issue associated with OKBs, i.e., noun phrases in the triples are not canonicalized, which may lead to redundancy and ambiguity of knowledge facts [8]. ...
... Datasets: Previous works use the ReVerb45K [8] dataset for evaluation. The OIE triples of ReVerb45K are extracted by ReVerb [7] from the source text of Clueweb09 [39], and the noun phrases are annotated with the corresponding entities in Freebase. The number of triples in ReVerb45K is 45K, each of which is associated with a Freebase entity. ...
Article
Full-text available
The construction of large open knowledge bases (OKBs) is integral to many knowledge-driven applications on the world wide web such as web search. However, noun phrases in OKBs often suffer from redundancy and ambiguity, which calls for investigation of OKB canonicalization. Current solutions address OKB canonicalization by devising advanced clustering algorithms and using knowledge graph embedding (KGE) to further facilitate the canonicalization process. Nevertheless, these works fail to fully exploit the synergy between clustering and KGE learning, and the methods designed for these sub-tasks are sub-optimal. To this end, we put forward a multi-task learning framework, namely MulCanon, to tackle OKB canonicalization. Specifically, a diffusion model is used in the soft clustering process to improve the noun phrase representations with neighboring information, which can lead to more accurate representations. MulCanon unifies the learning objectives of the diffusion model, the KGE model, side information and cluster assignment, and adopts a two-stage multi-task learning paradigm for training. A thorough experimental study on popular OKB canonicalization benchmarks validates that MulCanon can achieve competitive canonicalization results.
... For example, Freebase [26], WordNet [27], and Wikidata [28] are well-known large KGs developed by human labor. In contrast, OpenIE [29][30][31] and YAGO [32] are developed to reduce human effort. The work most similar to ours, in addition to BertNet, is "Prompting as Probing" [33], where the authors use a prompt engineering approach to obtain knowledge from GPT-3 [2]. ...
Article
Full-text available
Pre-trained language models have become popular in natural language processing tasks, but their inner workings and knowledge acquisition processes remain unclear. To address this issue, we introduce K-Bloom—a refined search-and-score mechanism tailored for seed-guided exploration in pre-trained language models, ensuring both accuracy and efficiency in extracting relevant entity pairs and relationships. Specifically, our crawling procedure is divided into two sub-tasks. Using a few seed entity pairs to minimize the need for extensive manual effort or predefined knowledge, we expand the knowledge graph with new entity pairs around these seeds. To evaluate the effectiveness of our proposed model, we conducted experiments on two datasets that cover the general domain. Our resulting knowledge graphs serve as symbolic representations of the source pre-trained language models, providing valuable insights into their knowledge capacities. Additionally, they enhance our understanding of the pre-trained language models’ capabilities when automatically evaluated on large language models. The experimental results demonstrate that our method outperforms the baseline approach by up to 5.62% in terms of accuracy in various settings of the two benchmarks. We believe that our approach offers a scalable and flexible solution for knowledge graph construction and can be applied to different domains and novel contexts.
... Answering such questions requires models which can parse complex natural language questions, retrieve relevant subgraphs of the KG and then perform some logical, comparative and/or quantitative operations on this subgraph. Also, the knowledge graph used in our work is orders of magnitude larger than those used in some existing works (Bordes et al. 2015; Dodge et al. 2015; Fader, Soderland, and Etzioni 2011) which lie at the intersection of QA and dialog. ...
Preprint
While conversing with chatbots, humans typically tend to ask many questions, a significant portion of which can be answered by referring to large-scale knowledge graphs (KG). While Question Answering (QA) and dialog systems have been studied independently, there is a need to study them closely to evaluate such real-world scenarios faced by bots involving both these tasks. Towards this end, we introduce the task of Complex Sequential QA which combines the two tasks of (i) answering factual questions through complex inferencing over a realistic-sized KG of millions of entities, and (ii) learning to converse through a series of coherently linked QA pairs. Through a labor intensive semi-automatic process, involving in-house and crowdsourced workers, we created a dataset containing around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in our dialogs require a larger subgraph of the KG. Specifically, our dataset has questions which require logical, quantitative, and comparative reasoning as well as their combinations. This calls for models which can: (i) parse complex natural language questions, (ii) use conversation context to resolve coreferences and ellipsis in utterances, (iii) ask for clarifications for ambiguous queries, and finally (iv) retrieve relevant subgraphs of the KG to answer such questions. However, our experiments with a combination of state of the art dialog and QA models show that they clearly do not achieve the above objectives and are inadequate for dealing with such complex real world settings. We believe that this new dataset coupled with the limitations of existing models as reported in this paper should encourage further research in Complex Sequential QA.
... A substantial amount of research has been devoted to structured representations of knowledge. This led to the development of large-scale Knowledge Bases (KB) such as DBpedia [5], Freebase [7], YAGO [27,50], OpenIE [6,17,18], NELL [10], WebChild [73,72], and ConceptNet [47]. These databases store common sense and factual knowledge in a machine readable fashion. ...
Preprint
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
... On the other hand, we add the semantic equivalence between arguments and predicates. The method we used to get the extended paraphrase dictionary ED is not discussed, as it is the same method used in the related research studies [41,42,43,44] to get the dictionary D. ...
Preprint
Natural language question-answering over RDF data has received widespread attention. Although there have been several studies that have dealt with a small number of aggregate queries, they have many restrictions (i.e., interactive information, controlled question or query template). Thus far, there has been no natural language querying mechanism that can process general aggregate queries over RDF data. Therefore, we propose a framework called NLAQ (Natural Language Aggregate Query). First, we propose a novel algorithm to automatically understand a user's query intention, which mainly contains semantic relations and aggregations. Second, to build a better bridge between the query intention and RDF data, we propose an extended paraphrase dictionary ED to obtain more candidate mappings for semantic relations, and we introduce a predicate-type adjacent set PT to filter out inappropriate candidate mapping combinations in semantic relations and basic graph patterns. Third, we design a suitable translation plan for each aggregate category and effectively distinguish whether an aggregate item is numeric or not, which will greatly affect the aggregate result. Finally, we conduct extensive experiments over real datasets (QALD benchmark and DBpedia), and the experimental results demonstrate that our solution is effective.
... Large-scale structured KBs are constructed either by manual annotation (e.g., DBpedia [22], Freebase [24] and Wikidata [28]), or by automatic extraction from unstructured/semi-structured data (e.g., YAGO [27], [32], OpenIE [23], [33], [34], NELL [25], NEIL [26], WebChild [35], ConceptNet [36]). The KB that we use here is the combination of DBpedia, WebChild and ConceptNet, which contains structured information extracted from Wikipedia and unstructured online articles. ...
Preprint
Visual Question Answering (VQA) has attracted a lot of attention in both Computer Vision and Natural Language Processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. It excludes questions which require common sense, or basic factual knowledge to answer, for example. Here we introduce FVQA, a VQA dataset which requires, and supports, much deeper reasoning. FVQA only contains questions which require external information to answer. We thus extend a conventional visual question answering dataset, which contains image-question-answer triplets, through additional image-question-answer-supporting fact tuples. The supporting fact is represented as a structural triplet, such as <Cat,CapableOf,ClimbingTrees>. We evaluate several baseline models on the FVQA dataset, and describe a novel model which is capable of reasoning about an image on the basis of supporting facts.
... First, we have used a small knowledge base, which limits the coverage of perspectives we can generate. Using Freebase (Bollacker et al., 2008) or even open information extraction (Fader et al., 2011) would dramatically increase the number of facts and therefore the scope of possible perspectives. ...
Preprint
Full-text available
How much is 131 million US dollars? To help readers put such numbers in context, we propose a new task of automatically generating short descriptions known as perspectives, e.g. "$131 million is about the cost to employ everyone in Texas over a lunch period". First, we collect a dataset of numeric mentions in news articles, where each mention is labeled with a set of rated perspectives. We then propose a system to generate these descriptions consisting of two steps: formula construction and description generation. In construction, we compose formulae from numeric facts in a knowledge base and rank the resulting formulas based on familiarity, numeric proximity and semantic compatibility. In generation, we convert a formula into natural language using a sequence-to-sequence recurrent neural network. Our system obtains a 15.2% F1 improvement over a non-compositional baseline at formula construction and a 12.5 BLEU point improvement over a baseline description generation.
... Moreover, we have to consider more general models to complete a knowledge graph, which can retrieve information from materials other than triples, because sometimes the information needed to predict a required triple is not included in the training triples. There are models extracting triples from text, such as OpenIE models (Fader, Soderland, and Etzioni 2011; Mausam et al. 2012; Angeli, Premkumar, and Manning 2015) and a word embedding-based model (Ebisu and Ichise 2017). We think we can develop a more general model by combining with these methods. ...
Preprint
Knowledge graphs are useful for many artificial intelligence (AI) tasks. However, knowledge graphs often have missing facts. To populate the graphs, knowledge graph embedding models have been developed. Knowledge graph embedding models map entities and relations in a knowledge graph to a vector space and predict unknown triples by scoring candidate triples. TransE is the first translation-based method and it is well known because of its simplicity and efficiency for knowledge graph completion. It employs the principle that the differences between entity embeddings represent their relations. The principle seems very simple, but it can effectively capture the rules of a knowledge graph. However, TransE has a problem with its regularization. TransE forces entity embeddings to be on a sphere in the embedding vector space. This regularization warps the embeddings and makes it difficult for them to fulfill the abovementioned principle. The regularization also affects adversely the accuracies of the link predictions. On the other hand, regularization is important because entity embeddings diverge by negative sampling without it. This paper proposes a novel embedding model, TorusE, to solve the regularization problem. The principle of TransE can be defined on any Lie group. A torus, which is one of the compact Lie groups, can be chosen for the embedding space to avoid regularization. To the best of our knowledge, TorusE is the first model that embeds objects on other than a real or complex vector space, and this paper is the first to formally discuss the problem of regularization of TransE. Our approach outperforms other state-of-the-art approaches such as TransE, DistMult and ComplEx on a standard link prediction task. We show that TorusE is scalable to large-size knowledge graphs and is faster than the original TransE.
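Since the abstract builds on TransE's principle that a relation acts as a translation between entity embeddings (h + r ≈ t), a minimal sketch of that scoring idea, with made-up embeddings, is shown below; TorusE itself replaces the real vector space with a torus, which this sketch does not attempt.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility score: smaller ||h + r - t|| means more plausible."""
    return np.linalg.norm(h + r - t)

# Toy 4-dimensional embeddings (illustrative values only).
h = np.array([0.1, 0.3, -0.2, 0.5])   # head entity
r = np.array([0.2, -0.1, 0.4, 0.0])   # relation as a translation vector
t_good = h + r + 0.01                  # a tail that nearly satisfies h + r = t
t_bad = np.array([1.0, 1.0, 1.0, 1.0])

print(transe_score(h, r, t_good))  # small distance -> plausible triple
print(transe_score(h, r, t_bad))   # large distance -> implausible triple
```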
... We draw on work in natural language processing, information extraction, and computer vision to distill human activities from fiction. Prior work discusses how to extract patterns from text by parsing sentences [5,8,4,6]. We adapt and extend these approaches in our text mining domain-specific language, producing an alternative that is more declarative and potentially easier to inspect and reason about. ...
Preprint
From smart homes that prepare coffee when we wake, to phones that know not to interrupt us during important conversations, our collective visions of HCI imagine a future in which computers understand a broad range of human behaviors. Today our systems fall short of these visions, however, because this range of behaviors is too large for designers or programmers to capture manually. In this paper, we instead demonstrate it is possible to mine a broad knowledge base of human behavior by analyzing more than one billion words of modern fiction. Our resulting knowledge base, Augur, trains vector models that can predict many thousands of user activities from surrounding objects in modern contexts: for example, whether a user may be eating food, meeting with a friend, or taking a selfie. Augur uses these predictions to identify actions that people commonly take on objects in the world and estimate a user's future activities given their current situation. We demonstrate Augur-powered, activity-based systems such as a phone that silences itself when the odds of you answering it are low, and a dynamic music player that adjusts to your present activity. A field deployment of an Augur-powered wearable camera resulted in 96% recall and 71% precision on its unsupervised predictions of common daily activities. A second evaluation where human judges rated the system's predictions over a broad set of input images found that 94% were rated sensible.
... In the past decade, large efforts have been undertaken to research the automatic acquisition of machine-readable knowledge on a large scale by mining large repositories of textual data (Banko et al. 2007; Carlson et al. 2010; Fader, Soderland, and Etzioni 2011). At this, collaboratively constructed resources have been exploited, used either in isolation (Bizer et al. 2009; Ponzetto and Strube 2011; Nastase and Strube 2012), or complemented with manually assembled knowledge sources (Suchanek, Kasneci, and Weikum 2008; Navigli and Ponzetto 2012a; Gurevych et al. 2012; Hoffart et al. 2013). ...
Preprint
Full-text available
We present an approach to combining distributional semantic representations induced from text corpora with manually constructed lexical-semantic networks. While both kinds of semantic resources are available with high lexical coverage, our aligned resource combines the domain specificity and availability of contextual information from distributional models with the conciseness and high quality of manually crafted lexical networks. We start with a distributional representation of induced senses of vocabulary terms, which are accompanied with rich context information given by related lexical items. We then automatically disambiguate such representations to obtain a full-fledged proto-conceptualization, i.e. a typed graph of induced word senses. In a final step, this proto-conceptualization is aligned to a lexical ontology, resulting in a hybrid aligned resource. Moreover, unmapped induced senses are associated with a semantic type in order to connect them to the core resource. Manual evaluations against ground-truth judgments for different stages of our method as well as an extrinsic evaluation on a knowledge-based Word Sense Disambiguation benchmark all indicate the high quality of the new hybrid resource. Additionally, we show the benefits of enriching top-down lexical knowledge resources with bottom-up distributional information from text for addressing high-end knowledge acquisition tasks such as cleaning hypernym graphs and learning taxonomies from scratch.
... Since traditional supervised relation extraction methods [6,7,8] require manual annotations and are often domain-specific, nowadays many efforts focus on open information extraction, which can extract hundreds of thousands of relations from large-scale web texts using semi-supervised or unsupervised methods [9,10,11,12,13,14]. However, these relations are often not canonicalized and are therefore difficult to map to an existing KB. ...
Preprint
Relation extraction is the task of identifying predefined relationships between entities, and plays an essential role in information extraction, knowledge base construction, question answering and so on. Most existing relation extractors make predictions for each entity pair locally and individually, while ignoring implicit global clues available across different entity pairs and in the knowledge base, which often leads to conflicts among local predictions from different entity pairs. This paper proposes a joint inference framework that employs such global clues to resolve disagreements among local predictions. We exploit two kinds of clues to generate constraints which can capture the implicit type and cardinality requirements of a relation. Those constraints can be examined in either hard style or soft style, both of which can be effectively explored in an integer linear program formulation. Experimental results on both English and Chinese datasets show that our proposed framework can effectively utilize those two categories of global clues and resolve the disagreements among local predictions, thus improving various relation extractors when such clues are applicable to the datasets. Our experiments also indicate that the clues learnt automatically from existing knowledge bases perform comparably to or better than those refined by humans.
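The abstract describes enforcing type and cardinality requirements over local predictions via an integer linear program. As a hedged sketch of how such a constraint might look (the variable names, scores, and the use of the PuLP solver are assumptions, not the paper's formulation), one could select at most one relation per entity pair while maximizing local confidence:

```python
import pulp

# Local confidence scores for candidate relations of one entity pair (made up).
scores = {"born_in": 0.7, "works_in": 0.6, "capital_of": 0.1}

prob = pulp.LpProblem("joint_relation_inference", pulp.LpMaximize)
x = {r: pulp.LpVariable(f"x_{r}", cat="Binary") for r in scores}

# Objective: total confidence of the selected relations.
prob += pulp.lpSum(scores[r] * x[r] for r in scores)

# Cardinality-style constraint: the entity pair takes at most one relation.
prob += pulp.lpSum(x[r] for r in scores) <= 1

# Type-style constraint example: "capital_of" is disallowed for a person subject.
prob += x["capital_of"] == 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({r: int(pulp.value(x[r])) for r in scores})  # -> {'born_in': 1, 'works_in': 0, 'capital_of': 0}
```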
... [fragment of Algorithm 2: cg ← cg + cp; end for; R ← selectTopk(G, c)] ... patterns generated by ReVerb from the ClueWeb09 corpus [15]. Algorithm 2 details our procedure to generate a list of open relation phrases from this output. ...
Preprint
Filtering relevant documents with respect to entities is an essential task in the context of knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity in order to select only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on and are also trained on the specifics of differentiating features for each specific entity. Moreover, these approaches tend to use so-called extrinsic information such as Wikipedia page views and related entities which is typically available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities. In this paper we propose a document filtering method for long-tail entities that is entity-independent and thus also generalizes to unseen or rarely seen entities. It is based on intrinsic features, i.e., features that are derived from the documents in which the entities are mentioned. We propose a set of features that capture informativeness, entity-saliency, and timeliness. In particular, we introduce features based on entity aspect similarities, relation patterns, and temporal expressions and combine these with standard features for document filtering. Experiments following the TREC KBA 2014 setup on a publicly available dataset show that our model is able to improve the filtering performance for long-tail entities over several baselines. Results of applying the model to unseen entities are promising, indicating that the model is able to learn the general characteristics of a vital document. The overall performance across all entities---i.e., not just long-tail entities---improves upon the state-of-the-art without depending on any entity-specific training data.
... However, our world undergoes open-ended growth of relations, and it is not possible to handle all these emerging relations. Open IE, illustrated in Figure 8 with the example sentence "Jeff Bezos, an American entrepreneur, graduated from Princeton in 1986.", extracts relation phrases and arguments (entities) from text (Banko et al., 2007; Fader et al., 2011; Mausam et al., 2012; Del Corro and Gemulla, 2013; Angeli et al., 2015; Stanovsky and Dagan, 2016; Mausam, 2016; Cui et al., 2018). Open IE does not rely on specific relation types and thus can handle all kinds of relational facts. ...
... Open RE Traditional open RE work can be categorized primarily into sequence labeling and clustering-based methods. Sequence labeling methods use syntactic or semantic features to extract relational phrases from the text as relations (Banko et al., 2007;Fader et al., 2011;Stanovsky et al., 2018;Cui et al., 2018). The relations can be limited in expressiveness, and this line of work can hardly capture global context. ...
... We define an event as a Subject-Predicate-Object (SPO) triple. To extract events from raw text, an open information extraction tool, ReVerb (Fader et al., 2011), is used. ReVerb is a program that automatically identifies and extracts relationships from English sentences. ...
... This line of methods does not rely on any training data but only on previously defined rules. ReVerb [15] relies on manually selected syntactic and lexical features. KRAKEN [1] is designed for capturing complete facts. ...
Preprint
Full-text available
Large Language Models (LLMs) have received considerable interest in wide applications lately. During pre-training on massive datasets, such a model implicitly memorizes the factual knowledge of the training datasets in its hidden parameters. However, knowledge held implicitly in parameters often makes its use by downstream applications ineffective due to the lack of common-sense reasoning. In this article, we introduce a general framework that permits building knowledge bases with the aid of LLMs, tailored for processing Web news. The framework applies a rule-based News Information Extractor (NewsIE) to news items to extract their relational tuples, referred to as knowledge bases, which are then graph-convoluted with the implicit knowledge facts of news items obtained by LLMs, for their classification. It involves two lightweight components: 1) NewsIE: for extracting the structural information of every news item, in the form of relational tuples; 2) BERTGraph: for graph-convoluting the implicit knowledge facts with relational tuples extracted by NewsIE. We have evaluated our framework on different news-related datasets for news category classification, with promising experimental results.
... When it comes to Open Information Extraction, the current common approaches can be divided into two larger groups: a more traditional approach using rules [6,7] or clauses [8] or, more recently, a neural (machine learning) approach to the problem. ...
... Commercial projects like the Google KG (Singhal, 2012) or Amazon's KG (Dong et al., 2020) have usually followed these approaches. By comparison, text-based KB construction, e.g., in NELL (Mitchell et al., 2018) or ReVerb (Fader et al., 2011), has achieved less adoption. Our approach is more related to the latter approaches, as LLMs are distillations of large text corpora. ...
Preprint
Full-text available
General-domain knowledge bases (KB), in particular the "big three" -- Wikidata, Yago and DBpedia -- are the backbone of many intelligent applications. While these three have seen steady development, comprehensive KB construction at large has seen few fresh attempts. In this work, we propose to build a large general-domain KB entirely from a large language model (LLM). We demonstrate the feasibility of large-scale KB construction from LLMs, while highlighting specific challenges arising around entity recognition, entity and property canonicalization, and taxonomy construction. As a prototype, we use GPT-4o-mini to construct GPTKB, which contains 105 million triples for more than 2.9 million entities, at a cost 100x less than previous KBC projects. Our work is a landmark for two fields: For NLP, for the first time, it provides constructive insights into the knowledge (or beliefs) of LLMs. For the Semantic Web, it shows novel ways forward for the long-standing challenge of general-domain KB construction. GPTKB is accessible at https://gptkb.org.
... Additionally, two knowledge bases, Freebase and Reverb (Wikipedia Extractions 1.1), were utilized as the curated and extracted knowledge bases, respectively, for the designed system [64]. Moreover, many Freebase extraction methods rank relations and determine confidence levels for the extracted relations. ...
Article
Full-text available
In this paper, we present a novel approach to improving Open-Domain Knowledge Base Question Answering by leveraging both curated (Freebase) and extracted (Reverb) knowledge bases. Our hybrid system combines Span Detection using BERT (SD-BERT) for precise entity and relation span detection with a Term Frequency–Inverse Document Frequency (TF-IDF) retrieval model enhanced by function scoring, achieving a balance between efficiency and accuracy. Additionally, we explore using Contextual Late Interaction over BERT (ColBERTv2), a scalable retrieval model optimized for token-level interaction, to handle more complex queries involving multiple entities and relations. Our system demonstrates significant improvements in handling large-scale, hybrid datasets such as ReverbSimpleQuestions and SimpleQuestions, providing both high accuracy and scalability. Through extensive evaluation, we achieved a Hit@1 of 67.87% using SD-BERT + TF-IDF with confidence scoring, outperforming several benchmark systems in fact-based query retrieval while maintaining real-time performance; the best performance, 83.63%, was achieved on the extracted knowledge base using the ColBERTv2 method.
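As a rough illustration of the TF-IDF retrieval component mentioned above (the actual system combines it with SD-BERT and custom function scoring, which this sketch omits; the example facts and question are invented), candidate KB facts can be ranked against a question by cosine similarity over TF-IDF vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical textual forms of extracted KB facts (subject + relation phrase + object).
facts = [
    "barack obama was born in honolulu",
    "honolulu is the capital of hawaii",
    "barack obama served as president of the united states",
]
question = "where was barack obama born"

vectorizer = TfidfVectorizer()
fact_vecs = vectorizer.fit_transform(facts)
q_vec = vectorizer.transform([question])

# Rank facts by cosine similarity to the question.
scores = cosine_similarity(q_vec, fact_vecs)[0]
for score, fact in sorted(zip(scores, facts), reverse=True):
    print(f"{score:.3f}  {fact}")
```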
... The main challenge in IE is that computers often struggle to comprehend and process textual data. Fader et al. (2011) considered lexical and syntactic characteristics of text to develop rules for extracting information. Their objective was to extract web statements to support common-sense knowledge and question answering. ...
Conference Paper
Full-text available
The incorporation of sustainability objectives in green building (GB) projects adds complexity to their design, construction, and management. Current developments in the area of artificial intelligence, specifically natural language processing (NLP) techniques, have provided great potential in analysing voluminous regulatory documents to draw insightful information relating to the standards, requirements, and codes to enhance the efficiency and accuracy of compliance checking. However, there is a dearth of attempts to tap the potential of NLP to facilitate automated compliance checking, especially within the context of green buildings. This paper, therefore, aims to assess the benefits and limitations of the current advancements in NLP-based methodologies for automated compliance checking of regulatory documents in green buildings. This paper conducts a systematic review of literature to achieve its aim. The challenges and benefits, as well as the areas of the application of NLP in automated compliance checking of regulatory documents in green buildings, are highlighted. The research offers a guide for future investigations aimed at broadening the utilisation of NLP in automating the compliance verification process for regulatory documents in green buildings and the construction sector as a whole.
... The experiments were carried out using, as the document corpus, the well-known dataset built in [Fader et al. 2011] [7], consisting of five hundred and three sentences of various types and complexity extracted from web documents, which can be downloaded from the ReVerb software website. ...
Thesis
Full-text available
The idea behind the thesis was to realise a software module capable of automatically extracting information from one or more text documents (web pages in HTML format stored remotely, or files in TXT format stored locally). The extracted information takes the form of RDF triples of the type Subject-Predicate-Object, augmented with the Form, since different text strings can return the same triple even if they have opposite meanings. The software module also returns the subject of the discourse and the relations present in the documents.
Preprint
Full-text available
This paper presents an exploratory study that harnesses the capabilities of large language models (LLMs) to mine key ecological entities from invasion biology literature. Specifically, we focus on extracting species names, their locations, associated habitats, and ecosystems, information that is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Traditional text mining approaches often struggle with the complexity of ecological terminology and the subtle linguistic patterns found in these texts. By applying general-purpose LLMs without domain-specific fine-tuning, we uncover both the promise and limitations of using these models for ecological entity extraction. In doing so, this study lays the groundwork for more advanced, automated knowledge extraction tools that can aid researchers and practitioners in understanding and managing biological invasions.
Article
Open information extraction (OIE) methods extract plenty of OIE triples from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. In order to leverage these two views of knowledge jointly, we propose CMVC+, a novel unsupervised framework for canonicalizing OKBs without the need for manually annotated labels. Specifically, we propose a multi-view CHF K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering the clustering quality in a fine-grained manner. Furthermore, we propose a novel contrastive learning module to refine the learned view-specific embeddings and further enhance the canonicalization performance. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.
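The canonicalization step described above ultimately assigns synonymous noun phrases to the same cluster. A bare-bones sketch of that final clustering step, using generic scikit-learn K-Means over made-up NP embeddings rather than the paper's multi-view CHF K-Means, could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 3-d embeddings for noun phrases; in practice these would be the
# learned view-specific embeddings described in the paper.
nps = ["barack obama", "obama", "president obama", "new york", "nyc"]
vecs = np.array([
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.02],
    [0.91, 0.09, 0.01],
    [0.10, 0.90, 0.30],
    [0.12, 0.88, 0.28],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs)
for np_text, label in zip(nps, labels):
    print(label, np_text)   # synonymous NPs should share a cluster id
```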
Article
The article addresses the pressing scientific problem of improving the efficiency of processing and analyzing textual information when solving knowledge search and acquisition tasks. The relevance of this task stems from the need to create effective tools for processing the enormous volume of accumulated weakly structured data, which contains important, sometimes hidden, knowledge needed for building effective control systems for complex objects of various kinds. The algorithm proposed by the author for knowledge search and acquisition during the processing and analysis of textual information is distinguished by its use of low-level deterministic rules that allow high-quality text simplification by excluding words that are invariant to the meaning from the textual information. The algorithm relies on domain analysis, which makes it possible to form lists of domain-specific words and thereby ensure high-quality text simplification. In this task, the input data are streams of textual information (profile descriptions) extracted from online recruiting platforms; the output is represented by sentences formed as "subject-verb-object" triples that reflect the granules of knowledge obtained during text processing. This ordering of sentence constituents is used because it is the most common order in Russian, although other orderings are possible in the texts themselves without loss of general meaning. The main idea of the algorithm is to split a large text corpus into sentences and then filter the resulting sentences based on keywords entered by the user. The sentences are then divided into components and simplified depending on the type of component (verbal or nominal). As an example, this work used the field of marketing, with "social networks" as the keywords. The author developed a knowledge search and acquisition algorithm based on natural language text processing and analysis technologies, and implemented the proposed algorithm in software. A number of metrics were used to evaluate effectiveness: the Flesch-Kincaid index, the Coleman-Liau index, and the Automated Readability Index. The computational experiments carried out confirmed the effectiveness of the proposed algorithm in comparison with analogues that use neural networks to solve similar tasks.
Preprint
Full-text available
Background: Alzheimer’s disease (AD), a progressive neurodegenerative disorder, continues to increase in prevalence without any effective treatments to date. In this context, knowledge graphs (KGs) have emerged as a pivotal tool in biomedical research, offering new perspectives on drug repurposing and biomarker discovery by analyzing intricate network structures. Our study seeks to build an AD-specific knowledge graph, highlighting interactions among AD, genes, variants, chemicals, drugs, and other diseases. The goal is to shed light on existing treatments, potential targets, and diagnostic methods for AD, thereby aiding in drug repurposing and the identification of biomarkers.
Results: We annotated 800 PubMed abstracts and leveraged GPT-4 for text augmentation to enrich our training data for named entity recognition (NER) and relation classification. A comprehensive data mining model, integrating NER and relationship classification, was trained on the annotated corpus. This model was subsequently applied to extract relation triplets from unannotated abstracts. To enhance entity linking, we utilized a suite of reference biomedical databases and refined the linking accuracy through abbreviation resolution. As a result, we successfully identified 3,199,276 entity mentions and 633,733 triplets, elucidating connections between 5,000 unique entities. These connections were pivotal in constructing a comprehensive Alzheimer’s Disease Knowledge Graph (ADKG). We also integrated the ADKG constructed after entity linking with other biomedical databases. The ADKG served as a training ground for Knowledge Graph Embedding models, with the high-ranking predicted triplets supported by evidence, underscoring the utility of ADKG in generating testable scientific hypotheses. Further application of ADKG in predictive modeling using the UK Biobank data revealed models based on ADKG outperforming others, as evidenced by higher values in the areas under the receiver operating characteristic (ROC) curves.
Conclusion: The ADKG is a valuable resource for generating hypotheses and enhancing predictive models, highlighting its potential to advance AD research and treatment strategies.
Chapter
This chapter introduces methods for extracting the relations between entities mentioned in electronic health records, scientific literature, reports, and similar biomedical text. The contents of this chapter will be familiar to readers with expertise in general domain Natural Language Processing (NLP). The emphasis here is on discussing the challenges and relevancy of the relation extraction task to biomedical text. The chapter provides references to the existing studies, approaches used, and available datasets for understanding and further developing new relation extraction systems.
Article
Full-text available
We propose a statistical measure for the degree of acceptability of light verb constructions, such as take a walk, based on their linguistic properties. Our measure shows good correlations with human ratings on unseen test data. Moreover, we find that our measure correlates more strongly when the potential complements of the construction (such as walk, stroll, or run) are separated into semantically similar classes. Our analysis demonstrates the systematic nature of the semi-productivity of these constructions.
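The abstract does not spell out the measure, so the sketch below is only a simplified stand-in: a pointwise mutual information score between a light verb and a candidate complement, estimated from invented toy counts.

```python
# Simplified stand-in for an acceptability score (not the paper's measure):
# PMI between a light verb and a candidate complement from toy corpus counts.
import math

pair_count = {("take", "walk"): 50, ("take", "stroll"): 8, ("take", "run"): 2}
verb_count = {"take": 10_000}
noun_count = {"walk": 400, "stroll": 60, "run": 5_000}
total = 1_000_000  # total verb-object pairs in the toy corpus

def pmi(verb, noun):
    p_pair = pair_count.get((verb, noun), 0) / total
    p_v = verb_count[verb] / total
    p_n = noun_count[noun] / total
    return float("-inf") if p_pair == 0 else math.log2(p_pair / (p_v * p_n))

for noun in ("walk", "stroll", "run"):
    print(noun, round(pmi("take", noun), 2))
```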
Article
Full-text available
We present an approach for extracting relations from texts that exploits linguistic and empirical strategies, by means of a pipeline method involving a parser, part-of-speech tagger, named entity recognition system, pattern-based classification and word sense disambiguation models, and resources such as an ontology, a knowledge base, and lexical databases. The relations extracted can be used for various tasks, including semantic web annotation and ontology learning. We suggest that the use of knowledge-intensive strategies to process the input text and corpus-based techniques to deal with unpredicted cases and ambiguity problems allows us to accurately discover the relevant relations between pairs of entities in that text.
Conference Paper
Full-text available
Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples, but they usually apply heuristic-based methods to combine a set of strict hard rules, which limit the ability to generalize and thus generate a low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specifications. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), which is a bootstrapping system and can perform both traditional relation extraction and Open IE. StatSnowball uses the discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimate sense. MLN is a general model, and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation, which enjoys well-founded theories and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that StatSnowball can achieve a significantly higher recall without sacrificing the high precision during iterations with a small number of seeds, and the joint inference of MLN can improve the performance. Finally, StatSnowball is efficient and we have developed a working entity relation search engine called Renlifang based on it.
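A rough analogue of the l1-penalized pattern selection mentioned above (StatSnowball itself learns weights in a Markov logic network; the sketch below substitutes an l1-regularized logistic regression over binary pattern-match features, with invented data, to show how the l1 penalty drives most pattern weights to exactly zero).

```python
# Rough analogue of l1-penalized pattern selection: an l1-regularized logistic
# regression over binary pattern features keeps only the useful patterns.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_patterns = 200, 30
X = rng.integers(0, 2, size=(n_examples, n_patterns))   # pattern fired / not fired
true_w = np.zeros(n_patterns)
true_w[:3] = [2.0, -1.5, 1.0]                            # only 3 patterns actually matter
y = (X @ true_w + rng.normal(scale=0.5, size=n_examples) > 0.5).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])                  # patterns surviving the penalty
print("patterns kept:", selected)
```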
Conference Paper
Full-text available
Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web. However, the primary approach (supervised learning of relation-specific extractors) requires manually-labeled training data for each relation and doesn’t scale to the thousands of relations encoded in Web text. This paper presents LUCHS, a self-supervised, relation-specific IE system which learns 5025 relations — more than an order of magnitude greater than any previous approach — with an average F1 score of 61%. Crucial to LUCHS’s performance is an automated system for dynamic lexicon learning, which allows it to learn accurately from heuristically-generated training data, which is often noisy and sparse.
Conference Paper
Full-text available
Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors -- using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.
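A toy sketch of the self-supervision heuristic described above: a sentence that contains both the article subject and an infobox attribute value becomes a (noisy) positive training example for that attribute. The infobox and sentences below are illustrative, not drawn from Wikipedia.

```python
# Toy sketch of WOE-style heuristic matching between infobox values and sentences.
infobox = {"subject": "Ada Lovelace",
           "born": "1815",
           "field": "mathematics"}
sentences = [
    "Ada Lovelace was born in 1815 in London.",
    "Ada Lovelace is celebrated for her work in mathematics.",
    "Ada Lovelace corresponded with Charles Babbage.",
]

positives = []
for attr, value in infobox.items():
    if attr == "subject":
        continue
    for sent in sentences:
        # Heuristic match: the subject and the attribute value co-occur in the sentence.
        if infobox["subject"] in sent and value in sent:
            positives.append((infobox["subject"], attr, value, sent))

for example in positives:
    print(example)
```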
Conference Paper
Full-text available
The IDEX system is a prototype of an interactive dynamic Information Extraction (IE) system. A user of the system expresses an information request in the form of a topic description, which is used for an initial search in order to retrieve a relevant set of documents. On the basis of this set of documents, the system performs unsupervised relation extraction and clustering. The results of these operations can then be interactively inspected by the user. In this paper we describe the relation extraction and clustering components of the IDEX system. Preliminary evaluation results of these components are presented and an overview is given of possible enhancements to improve the relation extraction and clustering components.
Article
Full-text available
Nominalization is a highly productive phenomenon in most languages. The process of nominalization ejects a verb from its syntactic role into a nominal position. The original verb is often replaced by a semantically emptied support verb (e.g., make a proposal). The choice of a support verb for a given nominalization is unpredictable, causing a problem for language learners as well as for natural language processing systems. We present here a method of discovering support verbs from an untagged corpus via low-level syntactic processing and comparison of arguments attached to verbal forms and potential nominalized forms. The result of the process is a list of potential support verbs for the nominalized form of a given predicate. 1 Introduction Nominalization, the transformation of a verbal phrase into a nominal form, is possible in most languages (Comrie and Thompson, 1990). Nominalizations are used for a variety of stylistic reasons: to avoid repetitions of a verb, to avoid awkward intr...
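A hedged sketch of the comparison step described above: if the arguments observed with a verb largely overlap with the arguments observed with a candidate support verb plus the nominalization, the candidate is a plausible support verb. The argument sets and Jaccard overlap below are invented simplifications of the paper's corpus comparison.

```python
# Toy comparison of argument sets for the verb "propose" vs. "<support verb> + proposal".
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

args_of_verb = {"committee", "government", "she", "senator"}      # subjects of "propose"
args_with_support = {
    "make": {"committee", "government", "he", "senator"},          # subjects of "make a proposal"
    "take": {"tourist", "dog", "runner"},                          # subjects of "take a proposal" (rare)
}

for verb, args in args_with_support.items():
    print(verb, round(jaccard(args_of_verb, args), 2))   # higher overlap -> better support verb
```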
Article
Full-text available
We present a general framework for semantic role labeling. The framework combines a machine-learning technique with an integer linear programming-based inference procedure, which incorporates linguistic and structural constraints into a global decision process. Within this framework, we study the role of syntactic parsing information in semantic role labeling. We show that full syntactic parsing information is, by far, most relevant in identifying the argument, especially, in the very first stage—the pruning stage. Surprisingly, the quality of the pruning stage cannot be solely determined based on its recall and precision. Instead, it depends on the characteristics of the output candidates that determine the difficulty of the downstream problems. Motivated by this observation, we propose an effective and simple approach of combining different semantic role labeling systems through joint inference, which significantly improves its performance. Our system has been evaluated in the CoNLL-2005 shared task on semantic role labeling, and achieves the highest F1 score among 19 participants.
Article
Full-text available
We present a model for semantic role labeling that effectively captures the linguistic intuition that a semantic argument frame is a joint structure, with strong dependencies among the arguments. We show how to incorporate these strong dependencies in a statistical joint model with a rich set of features over multiple argument phrases. The proposed model substantially outperforms a similar state-of-the-art local model that does not include dependencies among different arguments. We evaluate the gains from incorporating this joint information on the Propbank corpus, when using correct syntactic parse trees as input, and when using automatically derived parse trees. The gains amount to 24.1% error reduction on all arguments and 36.8% on core arguments for gold-standard parse trees on Propbank. For automatic parse trees, the error reductions are 8.3% and 10.3% on all and core arguments, respectively. We also present results on the CoNLL 2005 shared task data set. Additionally, we explore considering multiple syntactic analyses to cope with parser noise and uncertainty.
Article
Open Information Extraction is a recent paradigm for machine reading from arbitrary text. In contrast to existing techniques, which have used only shallow syntactic features, we investigate the use of semantic features (semantic roles) for the task of Open IE. We compare TextRunner (Banko et al., 2007), a state of the art open extractor, with our novel extractor SRL-IE, which is based on UIUC's SRL system (Punyakanok et al., 2008). We find that SRL-IE is robust to noisy heterogeneous Web data and outperforms TextRunner on extraction quality. On the other hand, TextRunner performs over 2 orders of magnitude faster and achieves good precision in high locality and high redundancy extractions. These observations enable the construction of hybrid extractors that output higher quality results than TextRunner and similar quality as SRL-IE in much less time.
Article
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
Conference Paper
In this paper, we propose an unsupervised method for discovering inference rules from text, such as "X is author of Y ≈ X wrote Y", "X solved Y ≈ X found a solution to Y", and "X caused Y ≈ Y is triggered by X". Inference rules are extremely important in many fields such as natural language processing, information retrieval, and artificial intelligence in general. Our algorithm is based on an extended version of Harris' Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus.
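A simplified illustration of this idea (the paper uses a mutual-information-weighted similarity; plain Jaccard overlap over slot fillers is used here instead): two dependency paths are considered similar when their X and Y slots are filled by similar word sets. The paths and fillers are toy data.

```python
# Distributional Hypothesis over dependency paths, simplified to Jaccard overlap.
fillers = {
    "X wrote Y":        {"X": {"tolkien", "austen", "orwell"}, "Y": {"novel", "book", "essay"}},
    "X is author of Y": {"X": {"tolkien", "austen", "king"},   "Y": {"novel", "book", "series"}},
    "X ate Y":          {"X": {"dog", "child"},                "Y": {"apple", "soup"}},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def path_sim(p, q):
    # Geometric mean of the slot-wise overlaps.
    sx = jaccard(fillers[p]["X"], fillers[q]["X"])
    sy = jaccard(fillers[p]["Y"], fillers[q]["Y"])
    return (sx * sy) ** 0.5

print(round(path_sim("X wrote Y", "X is author of Y"), 2))  # high similarity
print(round(path_sim("X wrote Y", "X ate Y"), 2))           # zero similarity
```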
Conference Paper
Inference in Conditional Random Fields and Hidden Markov Models is done using the Viterbi algorithm, an efficient dynamic programming algorithm. In many cases, general (non-local and non-sequential) constraints may exist over the output sequence, but cannot be incorporated and exploited in a natural way by this inference procedure. This paper proposes a novel inference procedure based on integer linear programming (ILP) and extends CRF models to naturally and efficiently support general constraint structures. For sequential constraints, this procedure reduces to simple linear programming as the inference process. Experimental evidence is supplied in the context of an important NLP problem, semantic role labeling.
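A minimal sketch of constrained decoding as an integer linear program, in the spirit of the procedure described above but not the paper's formulation: binary indicator variables pick one label per token, the objective sums toy emission scores, and a non-local constraint forbids an inside tag that does not follow a begin or inside tag. The scores, label set, and PuLP solver choice are assumptions.

```python
# pip install pulp  -- assumed available
import pulp

labels = ["B", "I", "O"]
scores = [  # toy per-token label scores for a 3-token sequence
    {"B": 2.0, "I": 0.1, "O": 1.0},
    {"B": 0.2, "I": 1.5, "O": 0.8},
    {"B": 0.1, "I": 0.3, "O": 2.0},
]

prob = pulp.LpProblem("constrained_decoding", pulp.LpMaximize)
x = {(t, y): pulp.LpVariable(f"x_{t}_{y}", cat="Binary")
     for t in range(len(scores)) for y in labels}

# Objective: total score of the chosen label sequence.
prob += pulp.lpSum(scores[t][y] * x[t, y] for t in range(len(scores)) for y in labels)

# Each token receives exactly one label.
for t in range(len(scores)):
    prob += pulp.lpSum(x[t, y] for y in labels) == 1

# Non-local constraint: "I" may only follow "B" or "I".
for t in range(1, len(scores)):
    prob += x[t, "I"] <= x[t - 1, "B"] + x[t - 1, "I"]

prob.solve(pulp.PULP_CBC_CMD(msg=0))
decoded = [y for t in range(len(scores)) for y in labels if pulp.value(x[t, y]) == 1]
print(decoded)
```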
Conference Paper
Determining whether a textual phrase denotes a functional relation (i.e., a relation that maps each domain element to a unique range element) is useful for numerous NLP tasks such as synonym resolution and contradiction detection. Previous work on this problem has relied on either counting methods or lexico-syntactic patterns. However, determining whether a relation is functional, by analyzing mentions of the relation in a corpus, is challenging due to ambiguity, synonymy, anaphora, and other linguistic phenomena. We present the LEIBNIZ system that overcomes these challenges by exploiting the synergy between the Web corpus and freely-available knowledge resources such as Freebase. It first computes multiple typed functionality scores, representing functionality of the relation phrase when its arguments are constrained to specific types. It then aggregates these scores to predict the global functionality for the phrase. LEIBNIZ outperforms previous work, increasing area under the precision-recall curve from 0.61 to 0.88. We utilize LEIBNIZ to generate the first public repository of automatically-identified functional relations.
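A toy illustration of a functionality score (a simplification of the idea above; LEIBNIZ additionally types the arguments and aggregates per-type scores): a relation phrase looks functional when each domain argument concentrates its extraction mass on a single range value. The extraction counts are invented.

```python
# Toy functionality score from invented (x, relation, y) extraction counts.
from collections import defaultdict

extractions = {
    ("obama", "was born in", "honolulu"): 40,
    ("obama", "was born in", "hawaii"): 5,
    ("einstein", "was born in", "ulm"): 20,
    ("obama", "visited", "paris"): 10,
    ("obama", "visited", "berlin"): 9,
    ("obama", "visited", "cairo"): 8,
}

def functionality(relation):
    by_x = defaultdict(lambda: defaultdict(int))
    for (x, r, y), c in extractions.items():
        if r == relation:
            by_x[x][y] += c
    scores = []
    for ys in by_x.values():
        scores.append(max(ys.values()) / sum(ys.values()))  # mass on the most frequent y
    return sum(scores) / len(scores)

print(round(functionality("was born in"), 2))  # close to 1 -> looks functional
print(round(functionality("visited"), 2))      # low -> not functional
```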
Conference Paper
Even the entire Web corpus does not explicitly answer all questions, yet inference can uncover many implicit answers. But where do inference rules come from? This paper investigates the problem of learning inference rules from Web text in an unsupervised, domain-independent manner. The SHERLOCK system, described herein, is a first-order learner that acquires over 30,000 Horn clauses from Web text. SHERLOCK embodies several innovations, including a novel rule scoring function based on Statistical Relevance (Salmon et al., 1971), which is effective on ambiguous, noisy and incomplete Web extractions. Our experiments show that inference over the learned rules discovers three times as many facts (at precision 0.8) as the TEXTRUNNER system, which merely extracts facts explicitly stated in Web text.
Conference Paper
Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.
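A minimal sketch of the distant supervision idea summarized above: every sentence containing both entities of a knowledge-base pair becomes a (noisy) positive training example for that pair's relation. The tiny knowledge base, corpus, and the words-between feature are illustrative stand-ins for Freebase and the paper's richer feature set.

```python
# Minimal distant supervision sketch: align KB pairs with sentences to
# generate noisy training examples.
kb = {("Barack Obama", "Hawaii"): "born_in",
      ("Steve Jobs", "Apple"): "founder_of"}
corpus = [
    "Barack Obama was born in Hawaii in 1961.",
    "Steve Jobs co-founded Apple in a garage.",
    "Barack Obama visited Hawaii last year.",   # noisy: same pair, wrong context
]

examples = []
for (e1, e2), rel in kb.items():
    for sent in corpus:
        if e1 in sent and e2 in sent:
            between = sent.split(e1)[1].split(e2)[0].strip()  # words between the entities
            examples.append({"relation": rel, "feature_between": between})

for ex in examples:
    print(ex)
```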
Conference Paper
The computation of selectional preferences, the admissible argument values for a relation, is a well-known NLP task with broad applicability. We present LDA-SP, which utilizes LinkLDA (Erosheva et al., 2004) to model selectional preferences. By simultaneously inferring latent topics and topic distributions over relations, LDA-SP combines the benefits of previous approaches: like traditional class-based approaches, it produces human-interpretable classes describing each relation’s preferences, but it is competitive with non-class-based methods in predictive power. We compare LDA-SP to several state-of-the-art methods achieving an 85% increase in recall at 0.9 precision over mutual information (Erk, 2007). We also evaluate LDA-SP’s effectiveness at filtering improper applications of inference rules, where we show substantial improvement over Pantel et al.’s system (Pantel et al., 2007).
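As a stand-in sketch only: plain LDA (not LinkLDA, which models arguments jointly) run over per-relation "documents" made of observed argument words yields interpretable argument classes and a topic mixture per relation. The relations, fillers, and topic count below are toy assumptions.

```python
# Stand-in for topic-model selectional preferences: plain LDA over per-relation
# argument pseudo-documents (toy data, not the LDA-SP model itself).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

relation_args = {
    "eats":     ["dog", "cat", "child", "apple", "soup", "bread", "dog", "apple"],
    "acquired": ["google", "company", "startup", "firm", "google", "startup"],
    "drinks":   ["child", "dog", "coffee", "water", "juice", "cat"],
}

docs = list(relation_args.values())
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=50, random_state=0)

for rel, bow in zip(relation_args, corpus):
    # The topic mixture of each relation approximates its argument-class preferences.
    print(rel, lda.get_document_topics(bow))
```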
Conference Paper
Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.
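A toy illustration of the transitivity constraint exploited above: if "X acquire Y" entails "X own Y" and "X own Y" entails "X control Y", a globally consistent rule set should also contain the rule from "acquire" to "control". The sketch simply closes a small edge set under transitivity; the actual algorithm learns a globally optimal graph rather than taking a closure.

```python
# Transitive closure over a toy entailment-rule graph.
from itertools import product

edges = {("acquire", "own"), ("own", "control")}   # locally learned rules

def transitive_closure(edges):
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

print(sorted(transitive_closure(edges)))
```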
Conference Paper
At present, adapting an Information Extraction system to new topics is an expensive and slow process, requiring some knowledge engineering for each new topic. We propose a new paradigm of Information Extraction which operates 'on demand' in response to a user's query. On-demand Information Extraction (ODIE) aims to completely eliminate the customization effort. Given a user's query, the system will automatically create patterns to extract salient relations in the text of the topic, and build tables from the extracted information using paraphrase discovery technology. It relies on recent advances in pattern discovery, paraphrase discovery, and extended named entity tagging. We report on experimental results in which the system created useful tables for many topics, demonstrating the feasibility of this approach.
Conference Paper
Traditional Information Extraction (IE) takes a relation name and hand-tagged examples of that relation as input. Open IE is a relation-independent extraction paradigm that is tailored to massive and heterogeneous corpora such as the Web. An Open IE system extracts a diverse set of relational tuples from text without any relation-specific input. How is Open IE possible? We analyze a sample of English sentences to demonstrate that numerous relationships are expressed using a compact set of relation-independent lexico-syntactic patterns, which can be learned by an Open IE system. What are the tradeoffs between Open IE and traditional IE? We consider this question in the context of two tasks. First, when the number of relations is massive, and the relations themselves are not pre-specified, we argue that Open IE is necessary. We then present a new model for Open IE called O-CRF and show that it achieves increased precision and nearly double the recall of the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system. Second, when the number of target relations is small, and their names are known in advance, we show that O-CRF is able to match the precision of a traditional extraction system, though at substantially lower recall. Finally, we show how to combine the two types of systems into a hybrid that achieves higher precision than a traditional extractor, with comparable recall.
Conference Paper
We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relation Discovery that discovers all possible relations from texts and presents them as tables. We present a preliminary system that obtains reasonably good results.
Conference Paper
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
Conference Paper
Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without using hand-tagged training examples. A fundamental problem for both UIE and supervised IE is assessing the probability that extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? This paper introduces a combinatorial "balls-and-urns" model that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating the model's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by Pointwise Mutual Information (PMI) and the noisy-or model used in previous work. For supervised IE, the model's performance is comparable to that of Support Vector Machines and Logistic Regression. Because the documents that "support" an extraction are, by and large, independently authored, our confidence in an extraction increases dramatically with the number of supporting documents. But by how much? How do we precisely quantify our confidence in an extraction given the available textual evidence? This paper introduces a combinatorial model that enables us to determine the probability that an observed extraction is correct. We validate the performance of the model empirically on the task of extracting information from the Web using KNOWITALL. Our contributions are as follows: 1. A formal model that, unlike previous work, explicitly models the impact of sample size, redundancy, and different extraction rules on the probability that an extraction is correct. We analyze the conditions under which the model is applicable, and provide intuitions about its behavior in practice.
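A toy computation of the noisy-or baseline that the abstract compares against (not the paper's balls-and-urns model): if an extraction has k independent supporting mentions, each produced by a rule of precision p, the noisy-or confidence is 1 - (1 - p)^k, so confidence grows quickly with redundancy. The precision value 0.6 is an arbitrary assumption.

```python
# Noisy-or baseline: confidence as a function of the number of supporting extractions.
def noisy_or(precisions):
    q = 1.0
    for p in precisions:
        q *= (1.0 - p)
    return 1.0 - q

for k in (1, 2, 5, 10):
    print(k, round(noisy_or([0.6] * k), 4))
```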
Article
Information extraction (IE) can identify a set of relations from free text to support question answering (QA). Until recently, IE systems were domain specific and needed a combination of manual engineering and supervised learning to adapt to each target domain. A new paradigm, Open IE, operates on large text corpora without any manual tagging of relations, and indeed without any prespecified relations. Due to its open-domain and open-relation nature, Open IE is purely textual and is unable to relate the surface forms to an ontology, if known in advance. We explore the steps needed to adapt Open IE to a domain-specific ontology and demonstrate our approach of mapping domain-independent tuples to an ontology using domains from the DARPA Machine Reading Project. Our system achieves precision over 0.90 from as few as eight training examples for an NFL-scoring domain.
Conference Paper
A knowledge acquisition tool to extract semantic patterns for a memory-based information retrieval system is presented. The major goal of this tool is to facilitate the construction of a large knowledge base of semantic patterns. The system acquires semantic patterns from texts with a small amount of user interaction. It acquires new phrasal patterns from the input text, maps each element of the pattern to a meaning frame, generalizes the acquired pattern, and merges it into the current knowledge base. Interaction with the user is introduced at some decision points, where the ambiguity cannot be resolved automatically without other pieces of predefined knowledge. The acquisition process is described in detail, and a preliminary experimental result is discussed
Article
This paper describes our approach to the development of a Proposition Bank, which involves the addition of semantic information to the Penn English Treebank. Our primary goal is the labeling of syntactic nodes with specific argument labels that preserve the similarity of roles such as the window in John broke the window and the window broke. After motivating the need for explicit predicate argument structure labels, we briefly discuss the theoretical considerations of predicate argument structure and the need to maintain consistency across syntactic alternations. The issues of consistency of argument structure across both polysemous and synonymous verbs are also discussed and we present our actual guidelines for these types of phenomena, along with numerous examples of tagged sentences and verb frames. Metaframes are introduced as a technique for handling similar frames among near-synonymous verbs. We conclude with a summary of the current status of annotation process.
Article
In this paper, we propose an unsupervised method for discovering inference rules from text, such as "X is author of Y ≈ X wrote Y", "X solved Y ≈ X found a solution to Y", and "X caused Y ≈ Y is triggered by X". Inference rules are extremely important in many fields such as natural language processing, information retrieval, and artificial intelligence in general. Our algorithm is based on an extended version of Harris's Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus.
Article
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic portion makes use of frame semantics. The resulting database will contain (a) descriptions of the semantic frames underlying the meanings of the words described, and (b) the valence representation (semantic and syntactic) of several thousand words and phrases, each accompanied by (c) a representative collection of annotated corpus attestations, which jointly exemplify the observed linkings between "frame elements" and their syntactic realizations (e.g. grammatical function, phrase type, and other syntactic traits). This report will present the project's goals and workflow, and information about the computational tools that have been adapted or created in-house for this work.
Open information extraction using Wikipedia
  • Fei Wu
  • Daniel S. Weld
Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 118–127, Morristown, NJ, USA. Association for Computational Linguistics.
StatSnowball: a statistical approach to extracting entity relationships
  • Jun Zhu
  • Zaiqing Nie
  • Xiaojiang Liu
  • Bo Zhang
  • Ji-Rong Wen
Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: a statistical approach to extracting entity relationships. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 101–110, New York, NY, USA. ACM.
Stretched Verb Constructions in English
  • David J. Allerton
David J. Allerton. 2002. Stretched Verb Constructions in English. Routledge Studies in Germanic Linguistics. Routledge (Taylor and Francis), New York.