Conference Paper

Applying Universal Schemas for Domain Specific Ontology Expansion

... Some other influential examples of such knowledge graph embeddings (KGEs), which is an active area of research, include (but are not limited to) [7], [22], [1], [5]. An important aspect of this research is automatic knowledge base construction and completion (AKBC), to which this work is related [21], [4]. A major difference is that, because of an additional layer of semantic abstraction (types vs. entities), we can afford to infer types without incrementally training the model as in [6], and without any other details of how the entity embeddings were derived. ...
... l.bz2 3 Downloaded from http://downloads.dbpedia.org/2015-10/dbpedia 2015-10.nt 4 Accessed at http://data.dws.informatik.uni-mannheim.de/rdf2vec/models/DBpedia 5 The authors also released Wikidata embeddings, which did not do as well on node classification and were noisier (and much larger) than the DBpedia embeddings. ...
Article
Full-text available
We propose a supervised algorithm for generating type embeddings in the same semantic vector space as a given set of entity embeddings. The algorithm is agnostic to the derivation of the underlying entity embeddings. It does not require any manual feature engineering, generalizes well to hundreds of types and achieves near-linear scaling on Big Graphs containing many millions of triples and instances by virtue of an incremental execution. We demonstrate the utility of the embeddings on a type recommendation task, outperforming a non-parametric feature-agnostic baseline while achieving 15x speedup and near-constant memory usage on a full partition of DBpedia. Using state-of-the-art visualization, we illustrate the agreement of our extensionally derived DBpedia type embeddings with the manually curated domain ontology. Finally, we use the embeddings to probabilistically cluster about 4 million DBpedia instances into 415 types in the DBpedia ontology.
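The supervised algorithm itself is not reproduced in the abstract; as a rough illustration of the extensional idea it describes (a type vector derived from the entity vectors of its instances, with types then recommended by cosine similarity), assuming hypothetical 2-dimensional embeddings:

```python
import numpy as np

def type_embedding(entity_vectors):
    """Derive a type vector extensionally: average the vectors of the
    entities known to have that type (a simplification of the paper's
    supervised algorithm)."""
    return np.mean(entity_vectors, axis=0)

def recommend_types(entity_vec, type_vecs):
    """Rank types by cosine similarity to an entity vector."""
    scores = {}
    for t, v in type_vecs.items():
        scores[t] = float(np.dot(entity_vec, v) /
                          (np.linalg.norm(entity_vec) * np.linalg.norm(v)))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 2-d entity embeddings, purely for illustration
cities = np.array([[0.9, 0.1], [0.8, 0.2]])
people = np.array([[0.1, 0.9], [0.2, 0.8]])
types = {"City": type_embedding(cities), "Person": type_embedding(people)}
print(recommend_types(np.array([0.85, 0.15]), types))  # "City" ranks first
```

The actual system adds incremental execution over millions of triples; this sketch only conveys why type vectors land near their instances in the entity space.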
... We encourage researchers to refine the model, adapt it to other subfields of genomics and other types of biological objects (metabolites, cell types, organs), and propose alternatives. One can envision, for instance, exploiting the power of computational natural language processing to generate field-specific models by automatic analysis of large bodies of literature (Friedman et al., 2001;Groth et al., 2016), perhaps using our modest manual analysis as a training set. More generally, we hope that interdisciplinary conversations about philosophy (Laplane et al., 2019), rhetoric and scientific concepts will accompany the emergence of new scientific fields in the future. ...
Article
Full-text available
The word function has many different meanings in molecular biology. Here we explore the use of this word (and derivatives like functional) in research papers about de novo gene birth. Based on an analysis of 20 abstracts we propose a simple lexicon that, we believe, will help scientists and philosophers discuss the meaning of function more clearly.
... Results of numerical experiments show good agreement between manual tags and tags generated via the abstractive summarization method presented. Results can be improved by including more refined ontologies, retraining wiki2vec on more complete versions of DBpedia (potentially augmented as in (Wang et al., 2015; Groth et al., 2016)), and more sophisticated multi-word phrase handling. ...
Conference Paper
Full-text available
This paper describes an abstractive summarization method 1 for tabular data which employs a knowledge base semantic embedding to generate the summary. Assuming the dataset contains descriptive text in headers, columns and/or some augmenting metadata, the system employs the embedding to recommend a subject/type for each text segment. Recommendations are aggregated into a small collection of super types considered to be descriptive of the dataset by exploiting the hierarchy of types in a prespecified ontology. We present experimental results on open data taken from several sources-OpenML, CKAN and data.world-to illustrate the effectiveness of the approach.
... This can allow for semantic querying over the dataset collection to extract all available data pertinent to some specific task subject at scale. 1 Our code is available for download at https://github.com/NewKnowledge/duke Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. ...
Article
Full-text available
This paper describes an abstractive summarization method for tabular data which employs a knowledge base semantic embedding to generate the summary. Assuming the dataset contains descriptive text in headers, columns and/or some augmenting metadata, the system employs the embedding to recommend a subject/type for each text segment. Recommendations are aggregated into a small collection of super types considered to be descriptive of the dataset by exploiting the hierarchy of types in a pre-specified ontology. Using February 2015 Wikipedia as the knowledge base, and a corresponding DBpedia ontology as types, we present experimental results on open data taken from several sources--OpenML, CKAN and data.world--to illustrate the effectiveness of the approach.
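The aggregation step described above, rolling per-segment type recommendations up a type hierarchy into a small set of super types, can be sketched with a hypothetical parent map standing in for the DBpedia ontology:

```python
from collections import Counter

# Hypothetical parent map for a tiny DBpedia-style ontology
PARENT = {"City": "Place", "Country": "Place", "Place": "Thing",
          "Person": "Agent", "Agent": "Thing"}

def ancestors(t):
    """All supertypes of t (inclusive) under the hypothetical PARENT map."""
    out = [t]
    while t in PARENT:
        t = PARENT[t]
        out.append(t)
    return out

def summarize(column_types):
    """Aggregate per-column type recommendations into a super type by
    counting how often each ancestor covers a column, ignoring the
    uninformative root."""
    counts = Counter(a for t in column_types for a in ancestors(t)
                     if a != "Thing")
    return counts.most_common(1)[0][0]

print(summarize(["City", "Country", "City"]))  # -> "Place"
```

The paper's method works over embedding-based recommendations rather than gold types; this only illustrates how the ontology hierarchy lets several specific types collapse into one descriptive super type.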
... While OIE has been applied to the scientific literature before (Groth et al., 2016), we have not found a systematic evaluation of OIE as applied to scientific publications. The most recent evaluations of OIE extraction tools (Gashteovski et al., 2017;Schneider et al., 2017) have instead looked at the performance of these tools on traditional NLP information sources (i.e. ...
Article
Full-text available
Open Information Extraction (OIE) is the task of the unsupervised creation of structured information from text. OIE is often used as a starting point for a number of downstream tasks including knowledge base construction, relation extraction, and question answering. While OIE methods are targeted at being domain independent, they have been evaluated primarily on newspaper, encyclopedic or general web text. In this article, we evaluate the performance of OIE on scientific texts originating from 10 different disciplines. To do so, we use two state-of-the-art OIE systems applying a crowd-sourcing approach. We find that OIE systems perform significantly worse on scientific text than encyclopedic text. We also provide an error analysis and suggest areas of work to reduce errors. Our corpus of sentences and judgments is made available.
... A fourth member combined those open relations with the medical ontology using a method known as Universal Schemas [9] . This produced a medical knowledge base which could be used to find new extensions to the ontology based on items found in the literature [10]. ...
Article
Full-text available
Annotation Query (AQ) is a program that provides the ability to query many different types of NLP annotations on a text, as well as the original content and structure of the text. The query results may provide new annotations, or they may select subsets of the content and annotations for deeper processing. Like GATE's Mimir, AQ is based on region algebras. Our AQ is implemented to run on a Spark cluster. In this paper we look at how AQ's runtimes are affected by the size of the collection, the number of nodes in the cluster, the type of node, and the characteristics of the queries. Cluster size, of course, makes a large difference in performance so long as skew can be avoided. We find that there is minimal difference in performance when persisting annotations serialized to local SSD drives as opposed to deserialized into local memory. We also find that if the number of nodes is kept constant, then AWS' storage-optimized instance performs the best. But if we factor in total cost, the compute-optimized nodes provide the best performance relative to cost.
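The region-algebra queries AQ supports can be illustrated minimally with a containment operator; the spans and annotation types below are hypothetical, and the real system runs over Spark rather than in-memory lists:

```python
# Hypothetical annotation spans (start, end, type); a region-algebra
# containment query selects regions of one type that fully contain a
# region of another type.
ANNS = [(0, 50, "Sentence"), (51, 120, "Sentence"),
        (10, 15, "Person"), (60, 70, "Gene")]

def contains(outer_type, inner_type, anns):
    """Return outer regions that fully contain an inner region."""
    inner = [(s, e) for s, e, t in anns if t == inner_type]
    return [(s, e) for s, e, t in anns if t == outer_type
            and any(s <= i_s and i_e <= e for i_s, i_e in inner)]

print(contains("Sentence", "Person", ANNS))  # [(0, 50)]
```

Real region algebras also include overlap, ordering and combination operators; containment is the simplest case and enough to show the idea of querying annotations by position rather than by token.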
Article
The rapid proliferation of text data has led to an increase in the use of Information Extraction (IE) techniques to automatically extract key information in a fast and effective manner. Relation Extraction (RE), a sub-task of IE, focuses on extracting semantic relations from free natural language text and is crucial for further applications including Question Answering, Information Retrieval, Knowledge Base construction, Text Summarization, etc. Literature shows that supervised learning approaches were widely used in RE. However, the performance of supervised methodologies depends on the availability of domain-specific annotated datasets, which is not viable for many domains including legal, financial, and insurance. In recent times, Open Information Extraction (OIE) techniques address this issue by facilitating domain-independent extraction of relations from large text corpora with no demand for domain-specific tagged data and predefined relation classes. Even though OIE systems are fast and simple to implement, they are less effective in handling complex sentences and often produce redundant extractions. This paper proposes an efficient RE system to extract domain-specific relations from natural language text, consisting of knowledge-based and semi-supervised learning systems integrated with a domain ontology. We evaluated the performance of the proposed work on the judicial domain as a use case and found that it overcomes the flaws and limitations of existing RE approaches by achieving better results in terms of precision and recall. On further analysis, we found that the proposed system outperforms existing cutting-edge OIE systems on varying sentence length and complexity.
Conference Paper
Full-text available
In this paper, we propose a novel framework using the word2vec model, a deep learning method, integrated with a book ontology in order to enhance semantic book search. The idea starts from constructing a book ontology for reasoning over book information efficiently. A deep learning method, namely the word2vec model, is then utilized to represent vectors of words occurring in book descriptions. These vectors help find the most relevant books given a query string. The integration of the word2vec model and the book ontology is able to achieve high performance in searching books. A database of Amazon books is used to examine the proposed method, compared with an advanced keyword matching method. The experimental results show that the proposed method can produce more accurate search results.
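The retrieval idea, averaging word vectors over a description and ranking by cosine similarity to the query, can be sketched as follows, with hand-set 2-d vectors standing in for a trained word2vec model (titles and descriptions are invented):

```python
import numpy as np

# Hypothetical word vectors standing in for a trained word2vec model
WORD_VECS = {"wizard": np.array([1.0, 0.0]), "magic": np.array([0.9, 0.1]),
             "cooking": np.array([0.0, 1.0]), "recipes": np.array([0.1, 0.9])}

def doc_vector(text):
    """Average the word vectors of known words in a description."""
    vecs = [WORD_VECS[w] for w in text.lower().split() if w in WORD_VECS]
    return np.mean(vecs, axis=0)

def search(query, books):
    """Rank book titles by cosine similarity between the query vector
    and each description vector."""
    q = doc_vector(query)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(books, key=lambda b: cos(q, doc_vector(books[b])),
                  reverse=True)

books = {"Fantasy Tales": "wizard magic",
         "Kitchen Basics": "cooking recipes"}
print(search("magic wizard", books))  # "Fantasy Tales" ranks first
```

The paper's framework additionally filters and reasons over candidates via the book ontology; this sketch covers only the embedding-similarity component.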
Article
Full-text available
Systems that extract structured information from natural language passages have been highly successful in specialized domains. The time is opportune for developing analogous applications for molecular biology and genomics. We present a system, GENIES, that extracts and structures information about cellular pathways from the biological literature in accordance with a knowledge model that we developed earlier. We implemented GENIES by modifying an existing medical natural language processing system, MedLEE, and performed a preliminary evaluation study. Our results demonstrate the value of the underlying techniques for the purpose of acquiring valuable knowledge from biological journals. Contact: friedman.carol@dmi.columbia.edu
Article
What capabilities are required for an AI system to pass standard 4th Grade Science Tests? Previous work has examined the use of Markov Logic Networks (MLNs) to represent the requisite background knowledge and interpret test questions, but did not improve upon an information retrieval (IR) baseline. In this paper, we describe an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results. We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system’s score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. We conclude with a detailed analysis, illustrating the complementary strengths of each method in the ensemble. Our datasets are being released to enable further research.
Conference Paper
In this paper we describe a new release of a Web scale entity graph that serves as the backbone of Microsoft Academic Service (MAS), a major production effort with a broadened scope to the namesake vertical search engine that has been publicly available since 2008 as a research prototype. At the core of MAS is a heterogeneous entity graph comprised of six types of entities that model the scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, we in this version include data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine. As a result of the Bing integration, the new MAS graph sees significant increase in size, with fresh information streaming in automatically following their discoveries by the search engine. In addition, the rich entity relations included in the knowledge base provide additional signals to disambiguate and enrich the entities within and beyond the academic domain. The number of papers indexed by MAS, for instance, has grown from low tens of millions to 83 million while maintaining an above 95% accuracy based on test data sets derived from academic activities at Microsoft Research. Based on the data set, we demonstrate two scenarios in this work: a knowledge driven, highly interactive dialog that seamlessly combines reactive search and proactive suggestion experience, and a proactive heterogeneous entity recommendation.
Article
When building a knowledge base (KB) of entities and relations from multiple structured KBs and text, universal schema represents the union of all input schema, by jointly embedding all relation types from input KBs as well as textual patterns expressing relations. In previous work, textual patterns are parametrized as a single embedding, preventing generalization to unseen textual patterns. In this paper we employ an LSTM to compositionally capture the semantics of relational text. We dramatically demonstrate the flexibility of our approach by evaluating in a multilingual setting, in which the English training data entities overlap with the seed KB, but the Spanish text does not. Additional improvements are obtained by tying word embeddings across languages. In extensive experiments on the English and Spanish TAC KBP benchmark, our techniques provide substantial accuracy improvements. Furthermore we find that training with the additional non-overlapping Spanish also improves English relation extraction accuracy. Our approach is thus suited to broad-coverage automated knowledge base construction in low-resource domains and languages.
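The core universal-schema model scores an (entity-pair, relation) cell as the dot product of their embeddings in one shared space, so KB relations and textual patterns become interchangeable evidence. A toy sketch with hand-set (not learned) embeddings and invented entities:

```python
import numpy as np

# Hypothetical low-dimensional embeddings; in the actual model these
# are learned by factorizing the observed (pair, relation) matrix.
PAIRS = {("Ada", "London"): np.array([1.0, 0.1]),
         ("Alan", "IBM"):   np.array([0.1, 1.0])}
RELS = {"bornIn":      np.array([0.9, 0.0]),    # structured KB relation
        "was born in": np.array([0.95, 0.05]),  # textual pattern
        "works at":    np.array([0.0, 0.9])}

def score(pair, rel):
    """Universal-schema cell score: sigmoid of pair . relation."""
    s = float(np.dot(PAIRS[pair], RELS[rel]))
    return 1.0 / (1.0 + np.exp(-s))

# Because "was born in" and bornIn lie close together in the shared
# space, a pair seen only with the textual pattern also scores highly
# for the KB relation -- the basis for ontology/KB expansion.
print(score(("Ada", "London"), "bornIn") >
      score(("Alan", "IBM"), "bornIn"))  # True
```

The cited paper replaces the per-pattern embedding with an LSTM encoding of the pattern's words, which is what enables generalization to unseen textual patterns and across languages.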
Article
Motivation: Advances in sequencing technology have led to an exponential growth of genomics data, yet it remains a formidable challenge to interpret such data for identifying disease genes and drug targets. There has been increasing interest in adopting a systems approach that incorporates prior knowledge such as gene networks and genotype-phenotype associations. The majority of such knowledge resides in text such as journal publications, which has been undergoing its own exponential growth. It has thus become a significant bottleneck to identify relevant knowledge for genomic interpretation as well as to keep up with new genomics findings. Results: In the Literome project, we have developed an automatic curation system to extract genomic knowledge from PubMed articles and made this knowledge available in the cloud with a Web site to facilitate browsing, searching and reasoning. Currently, Literome focuses on two types of knowledge most pertinent to genomic medicine: directed genic interactions such as pathways and genotype-phenotype associations. Users can search for interacting genes and the nature of the interactions, as well as diseases and drugs associated with a single nucleotide polymorphism or gene. Users can also search for indirect connections between two entities, e.g. a gene and a disease might be linked because an interacting gene is associated with a related disease. Availability and implementation: Literome is freely available at literome.azurewebsites.net. Download for non-commercial use is available via Web services.
Article
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Conference Paper
Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the ReVerb Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TextRunner and woepos. More than 30% of ReVerb's extractions are at precision 0.8 or higher---compared to virtually none for earlier systems. The paper concludes with a detailed analysis of ReVerb's errors, suggesting directions for future work.
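ReVerb's key syntactic constraint is that a relation phrase must match a POS-tag pattern, roughly V | V P | V W* P (a verb, optionally followed by intervening words and ending in a preposition). A simplified sketch, with a toy tag map in place of a real POS tagger:

```python
import re

# Toy POS tags standing in for a real tagger: V = verb, N = noun,
# P = preposition, anything unknown is treated as W (other word).
TOY_POS = {"obtained": "V", "degree": "N", "in": "P",
           "was": "V", "born": "V", "is": "V", "of": "P"}

def is_reverb_relation(phrase):
    """Simplified version of ReVerb's syntactic constraint: the phrase
    must be a verb, or a verb plus intervening words ending in a
    preposition (pattern V+ | V+ [WN]* P over the toy tags)."""
    tags = "".join(TOY_POS.get(w, "W") for w in phrase.split())
    return re.fullmatch(r"V+|V+[WN]*P", tags) is not None

print(is_reverb_relation("obtained a degree in"))  # True
print(is_reverb_relation("obtained a"))            # False
```

The real system adds a lexical constraint (the phrase must occur with many distinct argument pairs in a large corpus), which this sketch omits.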
CiteSeerX: Intelligent information extraction and knowledge creation from web-based data
Alexander G. Ororbia II, Jian Wu, and Lee C. Giles. 2014. CiteSeerX: Intelligent information extraction and knowledge creation from web-based data. In The 4th Workshop on Automated Knowledge Base Construction, May.