Conference Paper

Open Information Extraction: The Second Generation.

Authors:
Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam

Abstract

How do we scale information extraction to the massive size and unprecedented heterogeneity of the Web corpus? Beginning in 2003, our KnowItAll project has sought to extract high-quality knowledge from the Web. In 2007, we introduced the Open Information Extraction (Open IE) paradigm which eschews hand-labeled training examples, and avoids domain-specific verbs and nouns, to develop unlexicalized, domain-independent extractors that scale to the Web corpus. Open IE systems have extracted billions of assertions as the basis for both common-sense knowledge and novel question-answering systems. This paper describes the second generation of Open IE systems, which rely on a novel model of how relations and their arguments are expressed in English sentences to double precision/recall compared with previous systems such as TEXTRUNNER and WOE.
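For illustration, the sketch below approximates the kind of verb-phrase-based extraction described above: a candidate relation phrase is matched against a coarse POS template in the spirit of ReVerb's verb-phrase constraint, and the nearest noun phrases on either side are taken as arguments. It is only a rough sketch that assumes spaCy with its en_core_web_sm model, and it omits the lexical constraints and confidence function of the actual systems.

# Illustrative sketch only: a simplified, ReVerb-style verb-phrase extractor.
# Assumes spaCy and its "en_core_web_sm" model are installed; the real systems
# add lexical constraints and confidence scoring that are not shown here.
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Coarse template approximating the verb-phrase constraint:
# one or more verbs, optionally followed by nouns/adjectives/adverbs/determiners,
# optionally ending in a preposition or particle.
REL_PATTERN = re.compile(r"V+(?:N|A|R|D)*P?")

POS_MAP = {"VERB": "V", "AUX": "V", "NOUN": "N", "PROPN": "N",
           "ADJ": "A", "ADV": "R", "DET": "D", "ADP": "P", "PART": "P"}

def extract(sentence):
    doc = nlp(sentence)
    tags = "".join(POS_MAP.get(tok.pos_, "O") for tok in doc)
    triples = []
    for m in REL_PATTERN.finditer(tags):
        rel = doc[m.start():m.end()]
        # Take the nearest noun chunks to the left and right as arguments.
        left = [c for c in doc.noun_chunks if c.end <= rel.start]
        right = [c for c in doc.noun_chunks if c.start >= rel.end]
        if left and right:
            triples.append((left[-1].text, rel.text, right[0].text))
    return triples

print(extract("Hudson was born in Hampstead, which is a suburb of London."))
# Typical output (may vary with the model):
# [('Hudson', 'was born in', 'Hampstead'), ('Hampstead', 'is a suburb of', 'London')]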


... An important factor for incorporating such additional linguistic sources is the vast amount of text corpora that are available on the web. Information extraction systems like TextRunner [1] or Reverb [7] allow for analyzing these text sources and uncovering information, like nouns and the relations between them. ...
... For example, a car can hardly be observed in the living room, or a dining table in the garage. Such knowledge can easily be obtained from additional textual sources using methods like TextRunner [1] or Reverb [7]. As a result, the tremendous annotation effort that is required for annotating objects in images is no longer necessary. ...
... including possible scene or object categories and their relations will be of further interest. In the following, extractions based on Reverb [7] are used. Reverb extracts relations and their arguments from a given sentence. ...
Preprint
This work focuses on the semantic relations between scenes and objects for visual object recognition. Semantic knowledge can be a powerful source of information especially in scenarios with few or no annotated training samples. These scenarios are referred to as zero-shot or few-shot recognition and often build on visual attributes. Here, instead of relying on various visual attributes, a more direct way is pursued: after recognizing the scene that is depicted in an image, semantic relations between scenes and objects are used for predicting the presence of objects in an unsupervised manner. Most importantly, relations between scenes and objects can easily be obtained from external sources such as large scale text corpora from the web and, therefore, do not require tremendous manual labeling efforts. It will be shown that in cluttered scenes, where visual recognition is difficult, scene knowledge is an important cue for predicting objects.
... While early studies in the field of IE were limited in scope, today's information extraction research is aimed at Web-scale, high-accuracy, open IE [1,3,8,28,38,41,43,44]. ...
... OIE systems are used to detect entities in heterogeneous data sources on the web, resolve the ambiguities of those entities, and detect relationships in triple format; they aim to identify and extract the relationships between entities, such as the subject, predicate, and object of a sentence, and to present this information in structured form as a knowledge base or knowledge graph [40][41][42][43][44][45]. ...
... An OIE system extracts from each sentence in a text a set of triples (arg1, rel, arg2), usually in RDF format, that represent key propositions or claims [44][45][46][47][48]. ...
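As a rough illustration of that triple format, the snippet below serializes (arg1, rel, arg2) tuples as N-Triples-style RDF lines; the namespace URI and the slug-based naming are assumptions made for this example, not the convention of any particular OIE system.

# Illustrative sketch: serializing Open IE tuples as N-Triples-style RDF lines.
# The namespace URI and the slug-based minting of resource names are assumed
# here for illustration only.
def to_ntriple(arg1, rel, arg2, ns="http://example.org/openie/"):
    def slug(text):
        return text.strip().lower().replace(" ", "_")
    return '<{0}{1}> <{0}{2}> "{3}" .'.format(ns, slug(arg1), slug(rel), arg2)

triples = [("Barack Obama", "was born in", "Hawaii"),
           ("ReVerb", "is", "an Open IE system")]
for t in triples:
    print(to_ntriple(*t))
# <http://example.org/openie/barack_obama> <http://example.org/openie/was_born_in> "Hawaii" .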
Article
Over the last 20 years, the transfer of information over the Web through e-mail, instant messaging, documents, blogs, news, and other text communication has led to a dramatic increase in the amount of data in digital environments, and with it the importance of studies in the field of knowledge extraction from unstructured data. Since the 2000s, one of the primary goals of researchers in the field of artificial intelligence has been to extract knowledge from heterogeneous data sources on the World Wide Web, including real-life entities and the semantic relationships between them, and to display this knowledge in machine-readable format. Advances in natural language processing and information extraction have increased the importance of large-scale knowledge bases in complex applications. Scalable information extraction from semi-structured and unstructured heterogeneous data sources on the Web, together with the detection of entities and relationships, has enabled the automatic creation of prominent knowledge bases such as DBpedia, YAGO, NELL, Freebase, Probase, Google Knowledge Vault, and IBM Watson, which contain millions of semantic relationships between hundreds of thousands of entities, and the display of the created information in machine-readable format. Within the scope of this article, Web-scale (end-to-end) knowledge extraction from heterogeneous data sources is surveyed, covering methods, challenges, and opportunities.
... An alternative semi-supervised approach is called Open Information Extraction (OIE): a paradigm that aims to move away from both (1) manual coding of extraction rules for each specific relation, and (2) manual annotation of training examples for each specific relation [249]. To overcome the limitations of working with a predefined set of relations, OIE aims to find unlexicalised extractors [253]. A small set of extraction rules is then used to bootstrap the collection of training data, regardless of the relation or argument types [254]. ...
... Issues with early OIE systems, such as TextRunner and WOE [252,284], included the incoherence of both relation predicate selection and detecting the correct argument boundaries [253]. ReVerb and the R2A2 system introduce improvements on these issues [37,253]. However, the recall of the inflexible Rule-Based extractions used during bootstrapping remains limited [278]. ...
Thesis
Full-text available
Engineering inspired by biology – recently termed biom* – has led to various groundbreaking technological developments. Example areas of application include aerospace engineering and robotics. However, biom* is not always successful and only sporadically applied in industry. The reason is that a systematic approach to biom* remains at large, despite the existence of a plethora of methods and design tools. In recent years computational tools have been proposed as well, which can potentially support a systematic integration of relevant biological knowledge during biom*. However, these so-called Computer-Aided Biom* (CAB) tools have not been able to fill all the gaps in the biom* process. This thesis investigates why existing CAB tools fail, proposes a novel approach – based on Information Extraction – and develops a proof-of-concept for a CAB tool that does enable a systematic approach to biom*. Key contributions include: 1) a disquisition of existing tools guides the selection of a strategy for systematic CAB, 2) a dataset of 1,500 manually-annotated sentences, 3) a novel Information Extraction approach that combines the outputs from a supervised Relation Extraction system and an existing Open Information Extraction system. The implemented exploratory approach indicates that it is possible to extract a focused selection of relations from scientific texts with reasonable accuracy, without imposing limitations on the types of information extracted. Furthermore, the tool developed in this thesis is shown to i) speed up a trade-off analysis by domain-experts, and ii) also improve the access to biology information for non-experts.
... Typically it takes as input morpho-syntactically annotated text and produces a set of triples (E 1 , R, E 2 ), where E 1 and E 2 are entities and R is a relation in which E 1 and E 2 participate as a pair. In case of ontology induction or information extraction in open domain (as described, e.g., in [1], [2], [3], [4]) no restrictions are imposed on R. There are many types of relations that can be extracted this way, such as quality, part or behavior [5]. ...
... Precision is computed as P = TP / (TP + FP) (4), where TP is the number of relations scored as correct and FP is the number of relations scored as erroneous. ...
Preprint
Pattern-based methods of IS-A relation extraction rely heavily on so-called Hearst patterns. These are ways of expressing instance enumerations of a class in natural language. While these lexico-syntactic patterns prove quite useful, they may not capture all taxonomical relations expressed in text. Therefore, in this paper we describe a novel method of IS-A relation extraction from patterns, which uses morpho-syntactic annotations along with the grammatical case of the noun phrases that constitute entities participating in an IS-A relation. We also describe a method for increasing the number of extracted relations, which we call pseudo-subclass boosting and which has potential application in any pattern-based relation extraction method. Experiments were conducted on a corpus of about 0.5 billion web documents in the Polish language.
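As a simplified illustration of what a single Hearst pattern does, the sketch below extracts IS-A pairs from the English pattern "X such as Y1, Y2 and Y3" with a plain regular expression (single-word hyponyms only). The method described above instead works on morpho-syntactic annotations and grammatical case for Polish, so this is only a rough analogue.

# Illustrative sketch of one Hearst pattern ("X such as Y1, Y2 and Y3")
# implemented with a plain regular expression over English text; hyponyms are
# restricted to single words for simplicity.
import re

PATTERN = re.compile(
    r"(?P<hyper>\w+(?:\s\w+)?)\s+such as\s+"
    r"(?P<hypos>\w+(?:\s*,\s*\w+)*(?:\s+(?:and|or)\s+\w+)?)"
)

def isa_pairs(text):
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group("hyper")
        for hyponym in re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group("hypos")):
            pairs.append((hyponym.strip(), "IS-A", hypernym))
    return pairs

print(isa_pairs("Popular pets such as cats, dogs and parrots need daily care."))
# [('cats', 'IS-A', 'Popular pets'), ('dogs', 'IS-A', 'Popular pets'), ('parrots', 'IS-A', 'Popular pets')]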
... However, both graphs have similar average degrees (6.2112 and 5.7706), suggesting comparable overall connectivity per node. The number of self-loops is slightly higher in Graph 1 (70 vs. 33), though this does not significantly impact global structure. The clustering coefficients (0.1363 and 0.1434) indicate moderate levels of local connectivity, with Graph 2 exhibiting slightly more pronounced local clustering. ...
Preprint
Full-text available
We present an agentic, autonomous graph expansion framework that iteratively structures and refines knowledge in situ. Unlike conventional knowledge graph construction methods relying on static extraction or single-pass learning, our approach couples a reasoning-native large language model with a continually updated graph representation. At each step, the system actively generates new concepts and relationships, merges them into a global graph, and formulates subsequent prompts based on its evolving structure. Through this feedback-driven loop, the model organizes information into a scale-free network characterized by hub formation, stable modularity, and bridging nodes that link disparate knowledge clusters. Over hundreds of iterations, new nodes and edges continue to appear without saturating, while centrality measures and shortest path distributions evolve to yield increasingly distributed connectivity. Our analysis reveals emergent patterns, such as the rise of highly connected 'hub' concepts and the shifting influence of 'bridge' nodes, indicating that agentic, self-reinforcing graph construction can yield open-ended, coherent knowledge structures. Applied to materials design problems, we present compositional reasoning experiments by extracting node-specific and synergy-level principles to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that transcend rote summarization and strengthen the framework's potential for open-ended scientific discovery. We discuss other applications in scientific discovery and outline future directions for enhancing scalability and interpretability.
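A minimal sketch of the expand-merge-reprompt loop described above is given below, using networkx for the global graph. Here propose_relations() is a hypothetical placeholder for the reasoning LLM call and simply returns hand-written triples; the real framework's prompting, merging, and analysis are far richer.

# Minimal sketch of an iterative "expand, merge, re-prompt" loop over a growing
# graph, in the spirit described above. propose_relations() is a hypothetical
# stand-in for an LLM call and returns hand-written triples here.
import networkx as nx

def propose_relations(prompt):
    # Placeholder: a real system would query an LLM with `prompt` and parse
    # (source, relation, target) triples out of its response.
    return [("graphene", "enhances", "composite stiffness"),
            ("composite stiffness", "relates to", "fiber orientation")]

def expand(graph, prompt, steps=3):
    for _ in range(steps):
        for src, rel, dst in propose_relations(prompt):
            graph.add_edge(src, dst, relation=rel)   # merge into the global graph
        # Formulate the next prompt from the evolving structure, e.g. around
        # the current highest-degree ("hub") node.
        hub = max(graph.degree, key=lambda kv: kv[1])[0]
        prompt = f"Propose new concepts and relations connected to: {hub}"
    return graph

G = expand(nx.DiGraph(), "Propose concepts related to impact-resistant materials")
print(G.number_of_nodes(), G.number_of_edges())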
... The DBpedia extractor is used to generate triples using annotated field information in Wikipedia. OpenIE [3] uses POS and chunker data, while ClausIE [2] uses a parser to output a set of word triples. ...
Preprint
Full-text available
The web contains vast repositories of unstructured text. We investigate the opportunity for building a knowledge graph from these text sources. We generate a set of triples which can be used in knowledge gathering and integration. We define the architecture of a language compiler for processing subject-predicate-object triples using the OpenNLP parser. We implement a depth-first search traversal on the POS-tagged syntactic tree, appending predicate and object information. A parser enables higher-precision and higher-recall extraction of syntactic relationships across conjunction boundaries. We are able to extract 2-2.5 times as many correct extractions as ReVerb. The extractions are used in a variety of semantic web applications and question answering. We verify extraction of 50,000 triples on the ClueWeb dataset.
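The sketch below illustrates the general idea of a depth-first traversal over a constituency tree that collects a subject-predicate-object triple. It uses nltk with a hand-written bracketed parse; the work described above obtains parses from OpenNLP and handles conjunctions and richer structure not shown here.

# Rough sketch of a depth-first walk over a constituency tree collecting a
# subject-predicate-object triple. The bracketed parse is hand-written for the
# example; the cited pipeline obtains it from the OpenNLP parser instead.
from nltk.tree import Tree

parse = Tree.fromstring(
    "(S (NP (NNP Faust)) (VP (VBD made) (NP (DT a) (NN deal)) "
    "(PP (IN with) (NP (DT the) (NN devil)))))")

def phrase(subtree):
    return " ".join(subtree.leaves())

def extract_triple(tree):
    subj = pred = obj = None
    for node in tree.subtrees():                 # depth-first (preorder) walk
        if node.label() == "NP" and subj is None:
            subj = phrase(node)                  # first NP: subject
        elif node.label() == "VP" and pred is None:
            pred = node[0].leaves()[0]           # head verb of the first VP
        elif node.label() == "NP" and pred is not None and obj is None:
            obj = phrase(node)                   # first NP after the verb: object
    return (subj, pred, obj)

print(extract_triple(parse))   # ('Faust', 'made', 'a deal')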
... A substantial amount of research has been devoted to structured representations of knowledge. This led to the development of large-scale Knowledge Bases (KB) such as DBpedia [5], Freebase [7], YAGO [27,50], OpenIE [6,17,18], NELL [10], WebChild [73,72], and ConceptNet [47]. These databases store common sense and factual knowledge in a machine readable fashion. ...
Preprint
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
... Large-scale structured KBs are constructed either by manual annotation (e.g., DBpedia [22], Freebase [24] and Wikidata [28]), or by automatic extraction from unstructured/semistructured data (e.g., YAGO [27], [32], OpenIE [23], [33], [34], NELL [25], NEIL [26], WebChild [35], ConceptNet [36]). The KB that we use here is the combination of DBpedia, WebChild and ConceptNet, which contains structured information extracted from Wikipedia and unstructured online articles. ...
Preprint
Visual Question Answering (VQA) has attracted a lot of attention in both Computer Vision and Natural Language Processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. It excludes questions which require common sense, or basic factual knowledge to answer, for example. Here we introduce FVQA, a VQA dataset which requires, and supports, much deeper reasoning. FVQA only contains questions which require external information to answer. We thus extend a conventional visual question answering dataset, which contains image-question-answer triplets, through additional image-question-answer-supporting fact tuples. The supporting fact is represented as a structural triplet, such as <Cat,CapableOf,ClimbingTrees>. We evaluate several baseline models on the FVQA dataset, and describe a novel model which is capable of reasoning about an image on the basis of supporting facts.
... First generation Open IE systems can suffer from problems such as extracting incoherent and uninformative relations. Incoherent extractions are cases in which the system extracts relation phrases that present a meaningless interpretation of the content [2,11]. For example, TextRunner and WOE would extract a triple such as (Peter, thought, his career as a scientist) from the sentence "Peter thought that John began his career as a scientist", which is clearly incoherent because "Peter" could not be taken as the first argument for the relation "began" with the second argument "his career as a scientist". ...
Preprint
Open Information Extraction (Open IE) systems aim to obtain relation tuples through highly scalable extraction that is portable across domains, by identifying a variety of relation phrases and their arguments in arbitrary sentences. The first generation of Open IE learns linear chain models based on unlexicalized features such as Part-of-Speech (POS) or shallow tags to label the intermediate words between pairs of potential arguments and thereby identify extractable relations. The second generation of Open IE is able to extract instances of the most frequently observed relation types, such as Verb, Noun and Prep, Verb and Prep, and Infinitive, with deep linguistic analysis. These systems exploit simple yet principled ways in which verbs express relationships, such as verb phrase-based extraction or clause-based extraction, and obtain significantly higher performance than systems of the first generation. In this paper, we describe an overview of the two Open IE generations, including their strengths, weaknesses, and application areas.
... The TruthfulQA benchmark has been instrumental in highlighting the issues of truthfulness in AIgenerated content (Clark et al., 2018b). The ARC challenge further emphasizes the complexity of reasoning required from AI systems beyond simple fact retrieval (Etzioni et al., 2011). Our work is aligned with these challenges, as we seek to understand and improve the truthfulness and reasoning capacity of LLMs when they are trained to replicate student behaviors. ...
... OpenIE systems such as Reverb (Etzioni et al., 2011) also extract verb-anchored dependency triples from large text corpora. In contrast to such approaches, we focus on how latent embeddings of verbs in such triples can be combined with explicit background knowledge to improve the coverage of existing KBs. ...
... Even the largest VQA datasets cannot contain all real-world concepts, so VQA models should know how to acquire relevant knowledge from external knowledge bases (EKBs). Examples of such EKBs are large-scale KBs constructed by human annotation, e.g., DBpedia [11], Freebase [16], Wikidata [127], and those built by automatic extraction from unstructured/semistructured data, e.g., YAGO [48], [80], OpenIE [12], [34], [35], NELL [22], NEIL [25], WebChild [118], ConceptNet [76]. ...
Preprint
Full-text available
Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both the visual understanding of the image and the semantic understanding of the question, demonstrating reasoning capability. Since the inception of this field, a plethora of VQA datasets and models have been published. In this article, we meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category. We divide VQA datasets into four categories: (1) available datasets that contain a rich collection of authentic images, (2) synthetic datasets that contain only synthetic images produced through artificial means, (3) diagnostic datasets that are specially designed to test model performance in a particular area, e.g., understanding the scene text, and (4) KB (Knowledge-Based) datasets that are designed to measure a model's ability to utilize outside knowledge. Concurrently, we explore six main paradigms of VQA models: fusion, where we discuss different methods of fusing information between visual and textual modalities; attention, the technique of using information from one modality to filter information from another; external knowledge base, where we discuss different models utilizing outside information; composition or reasoning, where we analyze techniques to answer advanced questions that require complex reasoning steps; explanation, which is the process of generating visual and textual descriptions to verify sound reasoning; and graph models, which encode and manipulate relationships through nodes in a graph. We also discuss some miscellaneous topics, such as scene text understanding, counting, and bias reduction.
... The benefit of this method is that it may be used with texts from any domain. Some OpenIE systems that can extract information from the free text include: KnowItAll [52], TEXTRUNNER [53], REVERB [54], SRL-IE [55], OLLIE [56], and RELNOUN [57]. ...
Thesis
Full-text available
Information Extraction (IE) refers to the process of automatically extracting structured data from unstructured sources to enable the utilization of such data by other applications. Extracting relations from textual sources, which seeks to detect the semantic relation represented between entities referenced in the texts, is a common sub-problem. The objective of the RE task is to develop automatic extractors that can identify and extract structured, relational information from unstructured sources like natural language text. Assigning a relationship label to a pair of entities may be considered a classification problem. As a result, supervised machine learning methodologies can be employed. It is essential to pre-process the data using methods from natural language processing to organize the textual contents into meaningful data structures before extracting relations from the unprocessed text. In addition, as relations are represented between entities, it is necessary to locate the entities using an entity extraction technique, which is another information extraction sub-problem. Relation extraction methods that use entity-annotated text are called pipeline approaches. Relations can be represented between two or more than two entities, which are known as binary relations and N-ary relations, respectively. This thesis limits our research to binary relations using pipeline approaches.
... The adoption of SRL in machine understanding is not a novelty in the Open IE panorama; in fact, its potential has already been recognized and exploited in several Open IE system implementations - see, for example, Exemplar (de Sá Mesquita et al., 2013) and SRL-IE (Christensen et al., 2010). This constitutes an important legacy that guarantees some important benefits to the Cnosso method devised in the next section, and addresses some key points to improve the results of Open IE techniques - as pointed out in Etzioni et al. (2011). ...
... Second-generation Open IE systems aim to fine-tune prior paradigms in order to overcome their aforementioned limits: incoherent extractions and uninformative extractions (Etzioni et al., 2011;Vo & Bagheri, 2016). What differentiates the first and second-generation Open IE systems is that the latter focus deeply on a thorough linguistic analysis of sentences, obtaining significantly higher performance. ...
Article
Full-text available
Tenders are powerful means of investment of public funds and represent a strategic development resource. Despite the efforts made so far by governments at national and international levels to digitalise documents related to the Public Administration sector, most of the information is still available in an unstructured format only. With the aim of bridging this gap, we present OIE4PA, our latest study on extracting and classifying relations from tenders of the Public Administration. Our work focuses on the Italian language, where the availability of linguistic resources to perform Natural Language Processing tasks is considerably limited. Nevertheless, OIE4PA adopts a multilingual approach so it can be applied to several languages by providing appropriate training data. Rather than purely training a classifier on a portion of the extracted relations, the backbone idea of our learning strategy is to put a supervised method based on self-training to the proof and to assess whether or not it improves the performance of the classifier. For evaluation purposes, we built a dataset composed of 2,000 triples which have been manually annotated by two human experts. The in-vitro evaluation shows that OIE4PA achieves a Macro-F1 equal to 0.89 and a 91% accuracy. In addition, OIE4PA was used as the pillar of a prototype search engine, which has been evaluated through an in-vivo experiment with positive feedback from 32 final users, obtaining a SUS score equal to 83.98.
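To illustrate the general self-training strategy mentioned above (not OIE4PA's actual pipeline), the sketch below trains a TF-IDF plus logistic-regression classifier on a few labeled relation phrases and iteratively promotes its most confident predictions on unlabeled ones. The example texts, the number of rounds, and the confidence threshold are all assumptions made for the illustration.

# Generic self-training sketch with scikit-learn: train on labeled relation
# phrases, then repeatedly add the classifier's most confident prediction on
# the unlabeled pool to the training set. Illustration only, not OIE4PA.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["contract awarded to supplier", "tender published by municipality",
           "works procured from contractor", "call for tenders announced by agency"]
labels = ["award", "publication", "award", "publication"]
unlabeled = ["contract assigned to company", "notice issued by city council"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled + unlabeled).toarray()
X_lab, y = list(X[:len(labeled)]), list(labels)
pool = list(zip(X[len(labeled):], unlabeled))

for _ in range(3):                                    # a few self-training rounds
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y)
    if not pool:
        break
    proba = clf.predict_proba([x for x, _ in pool])
    best = int(np.argmax(proba.max(axis=1)))          # most confident unlabeled example
    if proba[best].max() < 0.55:                      # confidence threshold (assumed)
        break
    x, _ = pool.pop(best)
    X_lab.append(x)                                   # promote it to the labeled set
    y.append(clf.classes_[int(np.argmax(proba[best]))])

print(dict(zip(unlabeled, clf.predict(X[len(labeled):]))))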
... For example, the number of relation types in Wikidata grows to over 10,000 in 6 years from 2017 to 2022 [202]. Extracting Open Relation Phrases: Open information extraction (OpenIE) aims to extract semi-structured relation phrases [39,40]. Since the relations are treated as free-form text from the sentences, OpenIE can deal with relations that are not pre-defined. ...
Chapter
Full-text available
Knowledge is an important characteristic of human intelligence and reflects the complexity of human languages. To this end, many efforts have been devoted to organizing various human knowledge to improve the ability of machines in language understanding, such as world knowledge, linguistic knowledge, commonsense knowledge, and domain knowledge. Starting from this chapter, our view turns to representing rich human knowledge and using knowledge representations to improve NLP models. In this chapter, taking world knowledge as an example, we present a general framework of organizing and utilizing knowledge, including knowledge representation learning, knowledge-guided NLP, and knowledge acquisition. For linguistic knowledge, commonsense knowledge, and domain knowledge, we will introduce them in detail in subsequent chapters considering their unique knowledge properties.
... HowToKB represents task-related knowledge along with attributes for the parent task, the preceding and the succeeding subtask. This knowledge is extracted from WikiHow articles by means of OpenIE [24]. HowToKB also contains information about tools and objects required to perform a task, if they are explicitly listed in a separate section of the original article. ...
Chapter
Full-text available
We introduce Semantic Scene Builder (SeSB), a VR-based text-to-3D scene framework using SemAF (Semantic Annotation Framework) as a scheme for annotating discourse structures. SeSB integrates a variety of tools and resources by using SemAF and UIMA as a unified data structure to generate 3D scenes from textual descriptions. Based on VR, SeSB allows its users to change annotations through body movements instead of symbolic manipulations: from annotations in texts to corrections in editing steps to adjustments in generated scenes, all this is done by grabbing and moving objects. We evaluate SeSB in comparison with a state-of-the-art open source text-to-scene method (the only one which is publicly available) and find that our approach not only performs better, but also allows for modeling a greater variety of scenes. Keywords: Text-to-3D Scene Generation, Semantic Annotation Framework, Virtual Reality
... Because the building of knowledge graphs begins with a systematic description of concepts, entities, and their relationships in the objective world, the correctness of information extraction of concepts, entities, and relationships is critical to the construction process. Information loss, redundancy, and overlap are often the most significant challenges to the construction of knowledge graphs. Information extraction, as the first step in knowledge graph construction, is critical to obtaining candidate knowledge units [1][2]. The completeness and accuracy of information extraction directly and explicitly affect the quality and efficiency of the subsequent knowledge graph construction steps and the quality of the final knowledge graph. ...
Article
Full-text available
Information extraction is an important part of natural language processing and is an important basis for building question and answer systems and knowledge graphs. A growing number of new technologies are being applied to information extraction with the development of deep learning techniques. As a first step, this paper introduces information extraction techniques and their main tasks, then describes the development history of information extraction techniques, and introduces the practice and application of different types of information extraction techniques in knowledge graph construction, including entity-extraction, relationship extraction and attribute extraction. Finally, some problems and research directions faced by information extraction techniques are discussed.
... As such, it is recommended to use semi-automatic techniques that involve a combination of manual input and automated processes to develop and update the ontology. These techniques, which are constantly being improved, generally involve the use of deep learning and text capture from the web [462,463,464,465]. ...
Preprint
Full-text available
Human-algorithm interaction is a crucial issue for humanity in light of the impacts of the recent release of ChatGPT 3 and 4, among others. These advanced chatbots provoked a worldwide debate in March 2023, when a manifesto signed by several stakeholders was published and widely discussed in the media and academia. This work assumes that human-algorithm interactions are influenced by a context of diverse interests and perspectives, which adds high complexity to the problem. Therefore, this work proposes a solution to enable the effective participation of stakeholders from different domains and society in a constructive dialogue, using digital platforms as a medium. Inspired by the successful governance of the Internet infrastructure ecosystem, the proposal involves the creation of a Decentralized Autonomous Organization (DAO) implemented in the blockchain environment of the Ethereum network. However, before implementing the DAO, it is necessary to build a knowledge base, that is, an ontology, which guides its development in a safe and adequate way. A preliminary version of this knowledge base was manually built using Protégé with over 4,000 axioms.
... The combination model is to decompose VQA tasks into independent models that solve different types of tasks, among which neural module networks (NMN) [13] and dynamic memory networks (DMN) [14] are typical representatives. Furthermore, scholars have proposed introducing an external knowledge base [8,15]; common external knowledge bases include DBpedia [16], Freebase [17], YAGO [18], and OpenIE [19]. ...
Article
Full-text available
The Visual Question Answering (VQA) system is the process of finding useful information from images related to the question in order to answer the question correctly. It can be widely used in the fields of visual assistance, automated security surveillance, and intelligent interaction between robots and humans. However, the accuracy of VQA has not been ideal; the main difficulty is that the image features cannot adequately represent the scene and object information, and the text information cannot be fully represented. This paper used multi-scale feature extraction and fusion methods in the image feature characterization and text information representation parts of the VQA system, respectively, to improve its accuracy. Firstly, for the image feature representation problem, a multi-scale feature extraction and fusion method was adopted: the image features output by different network layers were extracted by a pre-trained deep neural network, and the optimal feature fusion scheme was found through experiments. Secondly, for the representation of sentences, a multi-scale feature method was introduced to characterize and fuse the word-level, phrase-level, and sentence-level features of sentences. Finally, the VQA model was improved using the multi-scale feature extraction and fusion method. The results show that the addition of multi-scale feature extraction and fusion improves the accuracy of the VQA model.
... Extending the multilingual coverage therefore implies reproducing a pipeline with language-dependent resources and processing tools. But this approach is not suitable for all languages [34], even with the help of machine learning; prior work [15,30] pointed out the need to process poorly endowed languages or dialects without training data. This approach did not seem suitable for building a disease surveillance system for a language like Arabic. ...
... To take into account complex predicates, light verb constructions that include nouns are also recognized, for example, the phrase 'gained access to' [43]. Additionally, Hearst patterns [44] are identified from a pre-compiled list, and using these patterns, nouns are linked by means of the dependency tree structure parsed by spaCy. ...
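A simplified analogue of that light-verb-construction handling is sketched below with spaCy's dependency parse: a verb, its noun object, and a following preposition are joined into one predicate span such as "gained access to". It assumes the en_core_web_sm model and leaves out the pre-compiled pattern list and the rest of the cited pipeline.

# Sketch: recognize a verb + noun object + preposition span (e.g. "gained
# access to") from a spaCy dependency parse. Simplified analogue only; assumes
# the "en_core_web_sm" model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def complex_predicates(sentence):
    doc = nlp(sentence)
    preds = []
    for tok in doc:
        if tok.pos_ != "VERB":
            continue
        dobj = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
        if not dobj:
            continue
        # The preposition may attach to the object noun or to the verb itself.
        prep = [c for c in dobj[0].children if c.dep_ == "prep"] or \
               [c for c in tok.children if c.dep_ == "prep"]
        if prep:
            preds.append(" ".join([tok.text, dobj[0].text, prep[0].text]))
    return preds

print(complex_predicates("The attackers gained access to the internal network."))
# Expected (roughly): ['gained access to']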
Article
Full-text available
Information on cyber-related crimes, incidents, and conflicts is abundantly available in numerous open online sources. However, processing large volumes and streams of data is a challenging task for the analysts and experts, and entails the need for newer methods and techniques. In this article we present and implement a novel knowledge graph and knowledge mining framework for extracting the relevant information from free-form text about incidents in the cyber domain. The computational framework includes a machine learning-based pipeline for generating graphs of organizations, countries, industries, products and attackers with a non-technical cyber-ontology. The extracted knowledge graph is utilized to estimate the incidence of cyberattacks within a given graph configuration. We use publicly available collections of real cyber-incident reports to test the efficacy of our methods. The knowledge extraction is found to be sufficiently accurate, and the graph-based threat estimation demonstrates a level of correlation with the actual records of attacks. In practical use, an analyst utilizing the presented framework can infer additional information from the current cyber-landscape in terms of the risk to various entities and its propagation between industries and countries.
... We then compare the set of propositions returned by Proposition Extractor against the annotated ground truth spans, looking for missed ground truth spans. These types of errors are called uninformative extractions [19], where Open IE systems can omit critical information from the sentences in the propositions they identify. The Proposition Extractor extracts 264 propositions from 115 tweets containing claims, and 159 propositions from 85 tweets without claims. ...
Preprint
Full-text available
To curb the problem of false information, social media platforms like Twitter started adding warning labels to content discussing debunked narratives, with the goal of providing more context to their audiences. Unfortunately, these labels are not applied uniformly and leave large amounts of false content unmoderated. This paper presents LAMBRETTA, a system that automatically identifies tweets that are candidates for soft moderation using Learning To Rank (LTR). We run LAMBRETTA on Twitter data to moderate false claims related to the 2020 US Election and find that it flags over 20 times more tweets than Twitter, with only 3.93% false positives and 18.81% false negatives, outperforming alternative state-of-the-art methods based on keyword extraction and semantic search. Overall, LAMBRETTA assists human moderators in identifying and flagging false information on social media.
Preprint
Full-text available
This paper presents an exploratory study that harnesses the capabilities of large language models (LLMs) to mine key ecological entities from invasion biology literature. Specifically, we focus on extracting species names, their locations, associated habitats, and ecosystems, information that is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Traditional text mining approaches often struggle with the complexity of ecological terminology and the subtle linguistic patterns found in these texts. By applying general-purpose LLMs without domain-specific fine-tuning, we uncover both the promise and limitations of using these models for ecological entity extraction. In doing so, this study lays the groundwork for more advanced, automated knowledge extraction tools that can aid researchers and practitioners in understanding and managing biological invasions.
Article
The article is devoted to the timely scientific problem of improving the efficiency of processing and analyzing textual information when solving knowledge search and acquisition tasks. The relevance of this task stems from the need to create effective tools for processing the enormous accumulated volume of weakly structured data containing important, sometimes hidden, knowledge that is needed for building effective control systems for complex objects of various kinds. The algorithm proposed by the author for knowledge search and acquisition during the processing and analysis of textual information is distinguished by the use of low-level deterministic rules that allow a qualitative simplification of the text by excluding words that are invariant to its meaning. The algorithm relies on domain analysis to form lists of domain-specific words, which ensures a high quality of text simplification. In this task, the input data are streams of textual information (profile descriptions) extracted from online recruiting platforms, and the output is represented by sentences formed as subject-verb-object triples reflecting the granules of knowledge obtained during text processing. This ordering of sentence constituents is used because it is the most common order in Russian, although other orderings are possible in the source texts without loss of overall meaning. The main idea of the algorithm is to split a large text corpus into sentences and then filter the resulting sentences based on user-supplied keywords. The sentences are subsequently divided into components and simplified depending on the type of each component (verbal or nominal). As an example, this work used the marketing domain, with "social networks" as the keywords. The author developed an algorithm for knowledge search and acquisition based on natural language text processing and analysis technologies and implemented the proposed algorithm in software. A number of metrics were used to evaluate its effectiveness: the Flesch-Kincaid index, the Coleman-Liau index, and the Automated Readability Index. The computational experiments carried out confirmed the effectiveness of the proposed algorithm in comparison with analogues that use neural networks to solve similar tasks.
Article
Topic segmentation is the task of dividing unstructured text into topically coherent segments (segments that talk about the same thing). A knowledge graph is a graph structure whose vertices are various objects and whose edges are the relations between them. Neither topic segmentation nor automatic knowledge graph construction is a new task, so there are many algorithms for solving them. However, methods that solve topic segmentation with the help of knowledge graphs have so far received little study. Moreover, topic segmentation cannot yet be considered solved in the general case; rather, there exist algorithms that, with proper tuning, can solve the task with the required quality on a specific dataset. A new method for topic segmentation based on knowledge graphs is proposed. Using knowledge graphs during segmentation makes it possible to use more information about the words in the text: in addition to relying on co-occurrence and semantic distances (as classical algorithms do), knowledge-graph-based methods can use the distance between words in the graph, thereby incorporating factual information from the knowledge graph into the decisions about splitting the text into segments.
Article
Research on innovative content within academic articles plays a vital role in exploring the frontiers of scientific and technological innovation while facilitating the integration of scientific and technological evaluation into academic discourse. To efficiently gather the latest innovative concepts, it is essential to accurately recognize innovative sentences within academic articles. Although several supervised methods for classifying article sentences exist, such as citation function sentences, future work sentences, and formal citation sentences, most of these methods rely on manual annotations or rule-based matching to construct datasets, often neglecting an in-depth exploration of model performance enhancement. To address the limitations of existing research in this domain, this study introduces a semi-automatic annotation method for innovative sentences (IS) with the assistance of expert comments information and proposes a data augmentation method by SAO reconstruction to augment the training dataset. Within this paper, we compared and analyzed the effectiveness of multiple algorithms for recognizing IS within academic articles. This study utilized the full text of academic articles as the research subject and employed the semi-automatic method to annotate IS for creating the training dataset. Then, this study validated the effectiveness of the semi-automatic annotation method through manual inspection and compared it with rule-based annotation methods. Additionally, the impacts of different augmentation ratios on model performance were also explored. The empirical results reveal the following: (1) The semi-automatic annotation method proposed in this study achieves an accuracy rate of 0.87239, ensuring the validity of annotated data while reducing the manual annotation cost. (2) The SAO reconstruction for data augmentation method significantly improved the accuracy of machine learning and deep learning algorithms in the recognition of IS. (3) When the augmentation ratio in the training set was set to 50%, the trained GPT-2 model was superior to other algorithms, achieving an ACC of 0.97883 in the test set and an F1 score of 0.95505 in practical application.
Article
Full-text available
With proliferation of Big Data, organizational decision making has also become more complex. Business Intelligence (BI) is no longer restricted to querying about marketing and sales data only. It is more about linking data from disparate applications and also churning through large volumes of unstructured data like emails, call logs, social media, News, and so on in an attempt to derive insights that can also provide actionable intelligence and better inputs for future strategy making. Semantic technologies like knowledge graphs have proved to be useful tools that help in linking disparate data sources intelligently and also enable reasoning through complex networks that are created as a result of this linking. Over the last decade the process of creation, storage, and maintenance of knowledge graphs have sufficiently matured, and they are now making inroads into business decision making also. Very recently, these graphs are also seen as a potential way to reduce hallucinations of large language models, by including these during pre-training as well as generation of output. There are a number of challenges also. These include building and maintaining the graphs, reasoning with missing links, and so on. While these remain as open research problems, we present in this article a survey of how knowledge graphs are currently used for deriving business intelligence with use-cases from various domains. This article is categorized under: Algorithmic Development > Text Mining; Application Areas > Business and Industry
Chapter
Full-text available
The law guarantees the regular functioning of the nation and society. In recent years, legal artificial intelligence (legal AI), which aims to apply artificial intelligence techniques to perform legal tasks, has received significant attention. Legal AI can provide a handy reference and convenient legal services for legal professionals and non-specialists, thus benefiting real-world legal practice. Different from general open-domain tasks, legal tasks have a high demand for understanding and applying expert knowledge. Therefore, enhancing models with various legal knowledge is a key issue of legal AI. In this chapter, we summarize the existing knowledge-intensive legal AI approaches regarding knowledge representation, acquisition, and application. Besides, future directions and ethical considerations are also discussed to promote the development of legal AI.
Chapter
Full-text available
Natural language processing (NLP) aims to build linguistic-specific programs for machines to understand and use human languages. Conventional NLP methods heavily rely on feature engineering to constitute semantic representations of text, requiring careful design and considerable expertise. Meanwhile, representation learning aims to automatically build informative representations of raw data for further application and achieves significant success in recent years. This chapter presents a brief introduction to representation learning, including its motivation, history, intellectual origins, and recent advances in both machine learning and NLP.
Chapter
Full-text available
Big pre-trained models (PTMs) have received increasing attention in recent years from academia and industry for their excellent performance on downstream tasks. However, huge computing power and sophisticated technical expertise are required to develop big models, discouraging many institutes and researchers. In order to facilitate the popularization of big models, we introduce OpenBMB, an open-source suite of big models, to break the barriers of computation and expertise of big model applications. In this chapter, we will introduce the core toolkits in OpenBMB, including BMTrain for efficient training, OpenPrompt and OpenDelta for efficient tuning, BMCook for efficient compression, and BMInf for efficient inference.
Chapter
Full-text available
Sentence and document are high-level linguistic units of natural languages. Representation learning of sentences and documents remains a core and challenging task because many important applications of natural language processing (NLP) lie in understanding sentences and documents. This chapter first introduces symbolic methods to sentence and document representation learning. Then we extensively introduce neural network-based methods for the far-reaching language modeling task, including feed-forward neural networks, convolutional neural networks, recurrent neural networks, and Transformers. Regarding the characteristics of a document consisting of multiple sentences, we particularly introduce memory-based and hierarchical approaches to document representation learning. Finally, we present representative applications of sentence and document representation, including text classification, sequence labeling, reading comprehension, question answering, information retrieval, and sequence-to-sequence generation.
Chapter
Full-text available
As a subject closely related to our life and understanding of the world, biomedicine keeps drawing much attention from researchers in recent years. To help improve the efficiency of people and accelerate the progress of this subject, AI techniques especially NLP methods are widely adopted in biomedical research. In this chapter, with biomedical knowledge as the core, we launch a discussion on knowledge representation and acquisition as well as biomedical knowledge-guided NLP tasks and explain them in detail with practical scenarios. We also discuss current research progress and several future directions.
Chapter
Full-text available
Linguistic and commonsense knowledge bases describe knowledge in formal and structural languages. Such knowledge can be easily leveraged in modern natural language processing systems. In this chapter, we introduce one typical kind of linguistic knowledge (sememe knowledge) and a sememe knowledge base named HowNet. In linguistics, sememes are defined as the minimum indivisible units of meaning. We first briefly introduce the basic concepts of sememe and HowNet. Next, we introduce how to model the sememe knowledge using neural networks. Taking a step further, we introduce sememe-guided knowledge applications, including incorporating sememe knowledge into compositionality modeling, language modeling, and recurrent neural networks. Finally, we discuss sememe knowledge acquisition for automatically constructing sememe knowledge bases and representative real-world applications of HowNet.
Chapter
Full-text available
Words are the building blocks of phrases, sentences, and documents. Word representation is thus critical for natural language processing (NLP). In this chapter, we introduce the approaches for word representation learning to show the paradigm shift from symbolic representation to distributed representation. We also describe the valuable efforts in making word representations more informative and interpretable. Finally, we present applications of word representation learning to NLP and interdisciplinary fields, including psychology, social sciences, history, and linguistics.
Chapter
The World Wide Web allows users and organizations to publish information and documents, which are instantly available for all other users of the Web. The data published to the Web continuously increases, providing the users with a vast amount of information on any topic imaginable. However, navigating the Web and identifying the relevant pieces of information in the abundance of data is not trivial. To cope with this problem, Web mining approaches are being used. Web mining includes the application of information retrieval, data mining, and machine learning approaches on Web data and the Web structure. This chapter provides a brief summary of Web mining approaches, including Web content mining, Web structure mining, Web usage mining, and Semantic Web mining.
Chapter
Understanding customers demands and needs is one of the keys to success for large enterprises. Customers come to a large enterprise with a set of requirements and finding a mapping between the needs they are expressing and the scale of available products and services within the enterprise is a complex task. Formalizing the two sides of interaction - the requests and the offerings - is a way to achieve the matching. Enterprise Knowledge Graphs (EKG) are an effective method to represent enterprise information in ways that can be more easily interpreted by both humans and machines. In this work, we propose a solution to identify customer requirements from free text to represent them in terms of an EKG. We demonstrate the validity of the approach by matching customer requirements to their appropriate business units, using a dataset of historical requirement-offering records in IBM spanning over 10 years.
Article
The main task of knowledge acquisition (also named knowledge extraction) from natural language texts is to extract knowledge from natural language texts into a fragment of the knowledge base of an intelligent system. Through a review of the related literature on knowledge acquisition at home and abroad, this paper analyses the strengths and weaknesses of the classical approaches. After focusing on rule-based knowledge extraction technology and methods for building ontologies of linguistics, this article proposes a solution for implementing knowledge acquisition based on the OSTIS technology. The main feature of this solution is to construct a unified semantic model that is able to utilize ontologies of linguistics (mainly the syntactic and semantic aspects) and integrate various problem-solving models (e.g., rule-based models, neural network models) for the knowledge extraction process from natural language texts.
Conference Paper
Distant Supervision is a relation extraction approach that allows automatic labeling of a dataset. However, this labeling introduces noise in the labels (e.g., when two entities in a sentence are automatically labeled with an invalid relation). Noise in the labels makes the relation extraction task difficult, and this noise is precisely one of the main challenges of the task. Until now, the methods that incorporate a previous noise reduction step have not evaluated the performance of this step. This paper evaluates the noise reduction using a new representation obtained with autoencoders. In addition, more information was incorporated into the input of the autoencoder proposed in the state of the art to improve the representation over which the noise is reduced. Also, three methods were proposed to select the instances considered as real. As a result, the highest values of the area under the ROC curve were obtained using the improved input combined with state-of-the-art anomaly detection methods. Moreover, the three proposed selection methods significantly improve on the existing method in the literature.
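The instance-selection idea can be illustrated roughly as below: an anomaly detector is run over vector representations of distantly supervised instances and only the inliers are kept as presumably correct. The random vectors stand in for learned autoencoder codes, and IsolationForest is just one anomaly detection method one might plug in; this is a generic sketch, not the exact model evaluated in the paper.

# Sketch of instance selection via anomaly detection over instance vectors.
# Random vectors stand in for learned autoencoder codes; generic illustration only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(95, 32))      # stand-ins for correctly labeled instances
noisy = rng.normal(4.0, 1.0, size=(5, 32))       # stand-ins for mislabeled instances
codes = np.vstack([clean, noisy])

detector = IsolationForest(contamination=0.05, random_state=0).fit(codes)
flags = detector.predict(codes)                   # +1 = inlier, -1 = outlier

kept = codes[flags == 1]
print(f"kept {kept.shape[0]} of {codes.shape[0]} instances as presumed clean")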
Chapter
Many organizations, including state administrations, define rules involving spatial knowledge. However, these rules are mainly specified and managed by hand using natural language. Advances in Geographic Information Systems (GIS) and computer science in knowledge management may improve such rule handling. Knowledge has to be built on recent, accurate, consistent and complete information—when possible—to allow decision-makers to take relevant actions. However, in many contexts, such as urban planning or emergency response support, information and rules are available from multiple stakeholders, which belong to different decision levels such as national, regional and local, thus highlighting several challenges. A first issue deals with modelling the rules by considering the hierarchy of decision levels, timeline of rules, etc. Next, acquiring knowledge from these stakeholders may lead to errors (e.g., partial rule extracted from a textual document, misinterpretation), inconsistencies or incompleteness (e.g., cases not covered by rules). Thus, a crucial step deals with the detection of relationships between rules (e.g., equivalence, causality) to facilitate the application of all rules. Lastly, integrating heterogeneous data, on which rules can be applied, is a well-studied problem which becomes more complex due to data provided at different levels of details. In this article, we study how GIS cope with these challenges to manage rule-based knowledge at different levels, and we illustrate them on the COVID-19 pandemic. Keywords: Geographic information systems, Knowledge management, Rule extraction, Data integration, Geographic rules
Article
The sparseness and incompleteness of knowledge graphs (KGs) trigger considerable interest in enhancing the representation learning with external corpora. However, the difficulty of aligning entities and relations with external corpora leads to inferior performance improvement. Open knowledge graphs (OKGs) consist of entity-mentions and relation-mentions that are represented by noncanonicalized freeform phrases, which generally do not rely on the specification of ontology schema. The roughness of the nonontological construction method leads to a specific characteristic of OKGs: diversity, where multiple entity-mentions (or relation-mentions) have the same meaning but different expressions. The diversity of OKGs can provide potential textual and structural features for the representation learning of KGs. We speculate that leveraging OKGs to enhance the representation learning of KGs can be more effective than using pure text or pure structure corpora. In this paper, we propose a new OERL, Open knowledge graph Enhanced Representation Learning of KGs. OERL automatically extracts textual and structural connections between KGs and OKGs, models and transfers refined profitable features to enhance the representation learning of KGs. The strong performance improvement and exhaustive experimental analysis prove the superiority of OERL over state-of-the-art baselines.
Article
The prediction of missing links of open knowledge graphs (OpenKGs) poses unique challenges compared with well-studied curated knowledge graphs (CuratedKGs). Unlike CuratedKGs whose entities are fully disambiguated against a fixed vocabulary, OpenKGs consist of entities represented by non-canonicalized free-form noun phrases and do not require an ontology specification, which drives the synonymity (multiple entities with different surface forms have the same meaning) and sparsity (a large portion of entities with few links). How to capture synonymous features in such sparse situations and how to evaluate the multiple answers pose challenges to existing models and evaluation protocols. In this paper, we propose VGAT, a variational autoencoder densified graph attention model to automatically mine synonymity features, and propose CR, a cluster ranking protocol to evaluate multiple answers in OpenKGs. For the model, VGAT investigates the following key ideas: (1) phrasal synonymity encoder attempts to capture phrasal features, which can automatically make entities with synonymous texts have closer representations; (2) neighbor synonymity encoder mines structural features with a graph attention network, which can recursively make entities with synonymous neighbors closer in representations. (3) densification attempts to densify the OpenKGs by generating similar embeddings and negative samples. For the protocol, CR is designed from the significance and compactness perspectives to comprehensively evaluate multiple answers. Extensive experiments and analysis show the effectiveness of the VGAT model and rationality of the CR protocol.
Article
Full-text available
Knowledge extraction refers to acquiring relevant information from unstructured documents in natural language and representing it in a structured form. An enormous amount of information in various domains, including agriculture, is available in natural language from several resources. The knowledge needs to be represented in a structured format so that it can be understood and processed by a machine for automating various applications. This paper reviews different computational approaches, such as rule-based and learning-based methods, and explores the various techniques, features, tools, datasets, and evaluation metrics adopted for knowledge extraction in the most relevant literature.
Article
Full-text available
We propose a statistical measure for the degree of acceptability of light verb constructions, such as take a walk, based on their linguistic properties. Our measure shows good correlations with human ratings on unseen test data. Moreover, we find that our measure correlates more strongly when the potential complements of the construction (such as walk, stroll, or run) are separated into semantically similar classes. Our analysis demonstrates the systematic nature of the semi-productivity of these constructions.
Article
Full-text available
In this paper we describe the CoNLL-2005 shared task on Semantic Role Labeling. We introduce the specification and goals of the task, describe the data sets and evaluation methods, and present a general overview of the 19 systems that have contributed to the task, providing a comparative description and results.
Conference Paper
Full-text available
We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.
Conference Paper
Full-text available
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled examples is typically unreliable because the learning task is underconstrained. This paper pursues the thesis that much greater accuracy can be achieved by further constraining the learning task, by coupling the semi-supervised training of many extractors for different categories and relations. We characterize several ways in which the training of category and relation extractors can be coupled, and present experimental results demonstrating significantly improved accuracy as a result.
Conference Paper
Full-text available
Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples, but they usually apply heuristic-based methods to combine a set of strict hard rules, which limit the ability to generalize and thus generate a low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specifications. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), which is a bootstrapping system and can perform both traditional relation extraction and Open IE. StatSnowball uses discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimate sense. MLN is a general model, and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation, which enjoys well-founded theories and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that StatSnowball can achieve a significantly higher recall without sacrificing the high precision during iterations with a small number of seeds, and the joint inference of MLN can improve the performance. Finally, StatSnowball is efficient and we have developed a working entity relation search engine called Renlifang based on it.
Conference Paper
Full-text available
Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web. However, the primary approach (supervised learning of relation-specific extractors) requires manually-labeled training data for each relation and doesn't scale to the thousands of relations encoded in Web text. This paper presents LUCHS, a self-supervised, relation-specific IE system which learns 5025 relations — more than an order of magnitude greater than any previous approach — with an average F1 score of 61%. Crucial to LUCHS's performance is an automated system for dynamic lexicon learning, which allows it to learn accurately from heuristically-generated training data, which is often noisy and sparse.
Conference Paper
Full-text available
Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors -- using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.
Conference Paper
Full-text available
Information extraction (IE) holds the promise of generating a large-scale knowledge base from the Web's natural language text. Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors. Recently, researchers have developed multi-instance learning algorithms to combat the noisy training data that can come from heuristic labeling, but their models assume relations are disjoint — for example they cannot extract the pair Founded(Jobs, Apple) and CEO-of(Jobs, Apple). This paper presents a novel approach for multi-instance learning with overlapping relations that combines a sentence-level extraction model with a simple, corpus-level component for aggregating the individual facts. We apply our model to learn extractors for NY Times text using weak supervision from Freebase. Experiments show that the approach runs quickly and yields surprising gains in accuracy, at both the aggregate and sentence level.
Article
Full-text available
Nominalization is a highly productive phenomenon in most languages. The process of nominalization ejects a verb from its syntactic role into a nominal position. The original verb is often replaced by a semantically emptied support verb (e.g., make a proposal). The choice of a support verb for a given nominalization is unpredictable, causing a problem for language learners as well as for natural language processing systems. We present here a method of discovering support verbs from an untagged corpus via low-level syntactic processing and comparison of arguments attached to verbal forms and potential nominalized forms. The result of the process is a list of potential support verbs for the nominalized form of a given predicate.
Article
Full-text available
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Article
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
Conference Paper
We propose two new online methods for estimating the size of a backtracking search tree. The first method is based on a weighted sample of the branches visited by chronological backtracking. The second is a recursive method based on assuming that the ...
Conference Paper
Determining whether a textual phrase denotes a functional relation (i.e., a relation that maps each domain element to a unique range element) is useful for numerous NLP tasks such as synonym resolution and contradiction detection. Previous work on this problem has relied on either counting methods or lexico-syntactic patterns. However, determining whether a relation is functional, by analyzing mentions of the relation in a corpus, is challenging due to ambiguity, synonymy, anaphora, and other linguistic phenomena. We present the LEIBNIZ system that overcomes these challenges by exploiting the synergy between the Web corpus and freely available knowledge resources such as Freebase. It first computes multiple typed functionality scores, representing functionality of the relation phrase when its arguments are constrained to specific types. It then aggregates these scores to predict the global functionality for the phrase. LEIBNIZ outperforms previous work, increasing area under the precision-recall curve from 0.61 to 0.88. We utilize LEIBNIZ to generate the first public repository of automatically-identified functional relations.
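The counting intuition behind functionality can be sketched in a few lines: a relation phrase looks functional if, for each subject, the extracted objects concentrate on a single value. The toy function below only illustrates the baseline counting idea the abstract contrasts with; it is not LEIBNIZ, which additionally combines typed scores with resources such as Freebase.

```python
# Toy functionality score for one relation phrase: average, over subjects,
# of the fraction of mentions that agree with that subject's most frequent
# object. Values near 1.0 suggest a functional relation.
from collections import Counter, defaultdict

def functionality_score(triples):
    """triples: iterable of (arg1, relation, arg2) for a single relation."""
    objects_by_subject = defaultdict(Counter)
    for arg1, _, arg2 in triples:
        objects_by_subject[arg1][arg2] += 1
    ratios = [
        counts.most_common(1)[0][1] / sum(counts.values())
        for counts in objects_by_subject.values()
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0

print(functionality_score([
    ("Obama", "was born in", "Honolulu"),
    ("Obama", "was born in", "Hawaii"),
    ("Einstein", "was born in", "Ulm"),
]))
```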
Conference Paper
Even the entire Web corpus does not explicitly answer all questions, yet inference can uncover many implicit answers. But where do inference rules come from? This paper investigates the problem of learning inference rules from Web text in an unsupervised, domain-independent manner. The SHERLOCK system, described herein, is a first-order learner that acquires over 30,000 Horn clauses from Web text. SHERLOCK embodies several innovations, including a novel rule scoring function based on Statistical Relevance (Salmon et al., 1971) which is effective on ambiguous, noisy and incomplete Web extractions. Our experiments show that inference over the learned rules discovers three times as many facts (at precision 0.8) as the TEXTRUNNER system, which merely extracts facts explicitly stated in Web text.
Conference Paper
Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the ReVerb Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TextRunner and WOEpos. More than 30% of ReVerb's extractions are at precision 0.8 or higher, compared to virtually none for earlier systems. The paper concludes with a detailed analysis of ReVerb's errors, suggesting directions for future work.
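The syntactic constraint mentioned above restricts a verbal relation phrase to a verb, optionally followed by nouns, adjectives, adverbs, pronouns, or determiners, and ending in a preposition or particle. A rough Python sketch of such a POS-pattern check follows; the regular expression is a simplification for illustration, not ReVerb's actual implementation.

```python
import re

# Penn Treebank POS tags are assumed as input. The pattern is a rough
# approximation of the verb-based relation-phrase constraint described in
# the abstract, not ReVerb's code.
VERB = r"(?:VB[DGNPZ]?\s)(?:RP\s)?(?:RB\S*\s)?"
WORD = r"(?:NN\S*\s|JJ\S*\s|RB\S*\s|PRP\S*\s|DT\s)"
PREP = r"(?:IN\s|RP\s|TO\s)"
RELATION_PATTERN = re.compile(rf"^(?:{VERB}{WORD}*{PREP}|{VERB})+$")

def is_valid_relation(pos_tags):
    """Check whether a relation phrase's POS tag sequence obeys the constraint."""
    return bool(RELATION_PATTERN.match(" ".join(pos_tags) + " "))

print(is_valid_relation(["VBD"]))              # e.g. "invented" -> True
print(is_valid_relation(["VBZ", "NN", "IN"]))  # e.g. "is author of" -> True
print(is_valid_relation(["IN", "NN"]))         # e.g. "of fame" -> False
```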
Conference Paper
People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types.
Conference Paper
Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.
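The labeling step this abstract describes can be illustrated with a toy sketch: for every entity pair in a knowledge base, any sentence mentioning both entities is collected as a (noisy) training example for that pair's relation. The knowledge base and sentences below are placeholders, not Freebase or the corpus used in the paper.

```python
# Toy distant-supervision labeling: match known entity pairs against
# sentences and emit heuristically labeled training examples.
kb = {
    ("Barack Obama", "Honolulu"): "place_of_birth",
    ("Steve Jobs", "Apple"): "founded",
}

sentences = [
    "Barack Obama was born in Honolulu , Hawaii .",
    "Steve Jobs started Apple in a garage .",
    "Barack Obama visited Apple headquarters .",
]

def label_sentences(kb, sentences):
    examples = []
    for sent in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in sent and e2 in sent:
                examples.append((e1, relation, e2, sent))
    return examples

for example in label_sentences(kb, sentences):
    print(example)
```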
Conference Paper
The computation of selectional preferences, the admissible argument values for a relation, is a well-known NLP task with broad applicability. We present LDA-SP, which utilizes LinkLDA (Erosheva et al., 2004) to model selectional preferences. By simultaneously inferring latent topics and topic distributions over relations, LDA-SP combines the benefits of previous approaches: like traditional class-based approaches, it produces human-interpretable classes describing each relation's preferences, but it is competitive with non-class-based methods in predictive power. We compare LDA-SP to several state-of-the-art methods, achieving an 85% increase in recall at 0.9 precision over mutual information (Erk, 2007). We also evaluate LDA-SP's effectiveness at filtering improper applications of inference rules, where we show substantial improvement over Pantel et al.'s system (Pantel et al., 2007).
Conference Paper
Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.
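The role of transitivity constraints can be illustrated with a simple example: if "X acquires Y" entails "X owns Y" and "X owns Y" entails "X has Y", a globally consistent rule set should also contain "X acquires Y" entails "X has Y". The sketch below only computes a transitive closure over a toy rule set; the paper instead learns a globally optimal edge set under such constraints, which is a much harder optimization problem.

```python
# Toy transitive closure over entailment edges (predicate -> predicate).
def transitive_closure(edges):
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

rules = {("acquire", "own"), ("own", "have")}
print(transitive_closure(rules))  # adds ("acquire", "have")
```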
Conference Paper
Traditional Information Extraction (IE) takes a relation name and hand-tagged examples of that relation as input. Open IE is a relation-independent extraction paradigm that is tailored to massive and heterogeneous corpora such as the Web. An Open IE system extracts a diverse set of relational tuples from text without any relation-specific input. How is Open IE possible? We analyze a sample of English sentences to demonstrate that numerous relationships are expressed using a compact set of relation-independent lexico-syntactic patterns, which can be learned by an Open IE system. What are the tradeoffs between Open IE and traditional IE? We consider this question in the context of two tasks. First, when the number of relations is massive, and the relations themselves are not pre-specified, we argue that Open IE is necessary. We then present a new model for Open IE called O-CRF and show that it achieves increased precision and nearly double the recall relative to the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system. Second, when the number of target relations is small, and their names are known in advance, we show that O-CRF is able to match the precision of a traditional extraction system, though at substantially lower recall. Finally, we show how to combine the two types of systems into a hybrid that achieves higher precision than a traditional extractor, with comparable recall.
Conference Paper
We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relation Discovery that discovers all possible relations from texts and presents them as tables. We present a preliminary system that obtains reasonably good results.
Conference Paper
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
Article
Information extraction (IE) can identify a set of relations from free text to support question answering (QA). Until recently, IE systems were domain specific and needed a combination of manual engineering and supervised learning to adapt to each target domain. A new paradigm, Open IE, operates on large text corpora without any manual tagging of relations, and indeed without any prespecified relations. Due to its open-domain and open-relation nature, Open IE is purely textual and is unable to relate the surface forms to an ontology, if known in advance. We explore the steps needed to adapt Open IE to a domain-specific ontology and demonstrate our approach of mapping domain-independent tuples to an ontology using domains from the DARPA Machine Reading Project. Our system achieves precision over 0.90 from as few as eight training examples for an NFL-scoring domain.
Article
The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver, introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of Resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.
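The core signal Resolver relies on, the similarity of the assertions in which two strings occur, can be illustrated with a toy overlap measure. This only shows the intuition; Resolver's actual probabilistic relational model and its O(KN log N) clustering are not reproduced here, and the similarity function below is a placeholder.

```python
# Toy sketch: two strings look synonymous if they appear in many of the
# same assertion contexts (relation plus other argument).
def shared_assertion_similarity(assertions, a, b):
    """assertions: set of (arg1, relation, arg2) string tuples."""
    def contexts(x):
        return {(r, arg2) for arg1, r, arg2 in assertions if arg1 == x} | \
               {(arg1, r) for arg1, r, arg2 in assertions if arg2 == x}
    ca, cb = contexts(a), contexts(b)
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)   # Jaccard overlap of shared contexts

assertions = {
    ("Obama", "was born in", "Honolulu"),
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "is president of", "the US"),
}
print(shared_assertion_similarity(assertions, "Obama", "Barack Obama"))
```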
Conference Paper
A knowledge acquisition tool to extract semantic patterns for a memory-based information retrieval system is presented. The major goal of this tool is to facilitate the construction of a large knowledge base of semantic patterns. The system acquires semantic patterns from texts with a small amount of user interaction. It acquires new phrasal patterns from the input text, maps each element of the pattern to a meaning frame, generalizes the acquired pattern, and merges it into the current knowledge base. Interaction with the user is introduced at some decision points, where the ambiguity cannot be resolved automatically without other pieces of predefined knowledge. The acquisition process is described in detail, and a preliminary experimental result is discussed
Learning Arguments for Open Information Extraction
  • Christensen
[Christensen et al., 2011a] Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. Learning Arguments for Open Information Extraction. Submitted, 2011.
Distant supervision for relation extraction without labeled data
  • Mintz
[Mintz et al., 2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP'09, 2009.
Automatically constructing extraction patterns from untagged text
  • Riloff
[Riloff, 1996] E. Riloff. Automatically constructing extraction patterns from untagged text. In AAAI'96, 1996.
Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling
  • Carreras
  • Marquez
[Carreras and Marquez, 2005] Xavier Carreras and Lluis Marquez. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling, 2005.
Stretched Verb Constructions in English
  • Allerton
[Allerton, 2002] David J. Allerton. Stretched Verb Constructions in English. Routledge Studies in Germanic Linguistics. Routledge (Taylor and Francis), New York, 2002.
Identifying Functional Relations in Web Text
  • Lin
[Lin et al., 2010] Thomas Lin, Mausam, and Oren Etzioni. Identifying Functional Relations in Web Text. In EMNLP'10, 2010.
Mallet: A machine learning for language toolkit
  • McCallum
[McCallum, 2002] Andrew McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
The tradeoffs between syntactic features and semantic roles for open information extraction
  • Christensen
[Christensen et al., 2011b] Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. The tradeoffs between syntactic features and semantic roles for open information extraction. In Knowledge Capture (KCAP), 2011.
Machine reading
  • Etzioni
[Etzioni et al., 2006] Oren Etzioni, Michele Banko, and Michael J. Cafarella. Machine reading. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.
Identifying Relations for Open Information Extraction
  • Fader
[Fader et al., 2011] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying Relations for Open Information Extraction. Submitted, 2011.
Statistical measures of the semi-productivity of light verb constructions
  • Stevenson
[Stevenson et al., 2004] Suzanne Stevenson, Afsaneh Fazly, and Ryan North. Statistical measures of the semi-productivity of light verb constructions. In 2nd ACL Workshop on Multiword Expressions, pages 1-8, 2004.