Conference Paper

Automatic Construction of a Semantic Knowledge Base from CEUR Workshop Proceedings

Authors: Bahar Sateli, René Witte

Abstract

We present an automatic workflow that performs text segmentation and entity extraction from scientific literature to primarily address Task 2 of the Semantic Publishing Challenge 2015. The goal of Task 2 is to extract various types of information from full-text papers to represent the context in which a document is written, such as the affiliation of its authors and the corresponding funding bodies. Our proposed solution is composed of two subsystems: (i) a text mining pipeline, developed based on the GATE framework, which extracts structural and semantic entities, such as authors’ information and references, and produces semantic (typed) annotations; and (ii) a flexible exporting module, the LODeXporter, which translates the document annotations into RDF triples according to custom mapping rules. Additionally, we leverage existing Named Entity Recognition (NER) tools to extract named entities from text and ground them to their corresponding resources on the Linked Open Data cloud, thus briefly covering the objectives of Task 3, which involve linking the detected entities to resources in existing open datasets. The output of our system is an RDF graph stored in a scalable TDB-based storage with a public SPARQL endpoint for the task’s queries.
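For illustration, the following Python sketch shows the general annotation-to-triple mapping idea behind such an exporting step, using rdflib. It is not the actual LODeXporter implementation; the annotation records, base URIs and mapping rules are hypothetical placeholders.

    # Illustrative sketch only (not the actual LODeXporter): map typed
    # annotations from a text mining pipeline to RDF triples with rdflib.
    # The annotation records, base URIs and mapping rules are hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/paper/")     # hypothetical document namespace
    VOCAB = Namespace("http://example.org/vocab#")  # hypothetical target vocabulary

    # Typed annotations as a pipeline might emit them
    annotations = [
        {"id": "author1", "type": "Author", "text": "Jane Doe"},
        {"id": "aff1", "type": "Affiliation", "text": "Example University"},
    ]

    # Custom mapping rules: annotation type -> RDF class
    mapping = {"Author": VOCAB.Author, "Affiliation": VOCAB.Affiliation}

    g = Graph()
    g.bind("vocab", VOCAB)
    for ann in annotations:
        subject = EX[ann["id"]]
        g.add((subject, RDF.type, mapping[ann["type"]]))
        g.add((subject, RDFS.label, Literal(ann["text"])))

    print(g.serialize(format="turtle"))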


... Sateli/Witte [15] proposed a rule-based approach. They composed two logical components in a pipeline: a syntactic processor to identify the basic layout units and to cluster them into logical units, and a semantic processor to identify entities in text by pattern search. ...
... Table 5 summarises the results of the performance evaluation:
[6] 0.388 / 0.285 / 0.292
Nuzzolese et al. [12] 0.274 / 0.251 / 0.257
Sateli/Witte [15] 0.3 / 0.252 / 0.247
Kovriguina et al. [9] 0.289 / 0.3 / 0.265
Ronzano et al. [14] 0.316 / 0.401 / 0.332 ...
Article
Full-text available
The Semantic Publishing Challenge series aims at investigating novel approaches for improving scholarly publishing using Linked Data technology. In 2014 we had bootstrapped this effort with a focus on extracting information from non-semantic publications - computer science workshop proceedings volumes and their papers - to assess their quality. The objective of this second edition was to improve information extraction but also to interlink the 2014 dataset with related ones in the LOD Cloud, thus paving the way for sophisticated end-user services.
... Several challenges were faced while fulfilling these tasks. They are addressed by a proposed solution composed of a text mining pipeline, LODeXporter, and Named Entity Recognition (NER) for extracting named entities from text and linking them to resources on the LOD cloud (Sateli & Witte, 2015). ...
Article
Full-text available
Bibliographic classification is among the core activities of Library & Information Science, bringing order and proper management to the holdings of a library. Compared to printed media, digital collections present numerous challenges regarding their preservation, curation, organization and resource discovery & access. Therefore, a truly native perspective needs to be adopted for bibliographic classification in digital environments. In this research article, we investigate and report different approaches to the bibliographic classification of digital collections. The article also contributes two frameworks for evaluating existing classification schemes and systems, and presents a bird's-eye view to help researchers reach a generalized and holistic approach to bibliographic classification research, identifying new research avenues.
... It has a loosely coupled architecture and a modular workflow based on supervised and unsupervised machine-learning techniques, which simplifies the system's adaptation to new document layouts and styles. It employs an enhanced Docstrum algorithm for page segmentation to obtain the document's hierarchical structure, Support Vector Machines (SVM) to classify its zones, and heuristics and regular expressions. ... Sateli & Witte (2015) and Sateli & Witte (2016), relying on LODeXporter (http://www.semanticsoftware.info/lodexporter), proposed an iterative rule-based pattern matching approach. ...
Article
Full-text available
While most challenges organized so far in the Semantic Web domain are focused on comparing tools with respect to different criteria such as their features and competencies, or exploiting semantically enriched data, the Semantic Web Evaluation Challenges series, co-located with the ESWC Semantic Web Conference, aims to compare them based on their output, namely the produced dataset. The Semantic Publishing Challenge is one of these challenges. Its goal is to involve participants in extracting data from heterogeneous sources on scholarly publications, and producing Linked Data that can be exploited by the community itself. This paper reviews lessons learned from both (i) the overall organization of the Semantic Publishing Challenge, regarding the definition of the tasks, building the input dataset and forming the evaluation, and (ii) the results produced by the participants, regarding the proposed approaches, the used tools, the preferred vocabularies and the results produced in the three editions of 2014, 2015 and 2016. We compared these lessons to other Semantic Web Evaluation Challenges. In this paper, we (i) distill best practices for organizing such challenges that could be applied to similar events, and (ii) report observations on Linked Data publishing derived from the submitted solutions. We conclude that higher quality may be achieved when Linked Data is produced as a result of a challenge, because the competition becomes an incentive, while solutions become better with respect to Linked Data publishing best practices when they are evaluated against the rules of the challenge.
... The GATE framework then tokenises and lemmatises the text, detects sentence boundaries, and performs gazetteering, while the DBpedia Spotlight service is used for entity tagging. They also relied on their solution for the 2015 edition of the challenge [13]. In 2016, the PDF extraction tool was changed and a number of additional or new conditional heuristics were added. ...
Conference Paper
The Semantic Publishing Challenge aims to involve participants in extracting data from heterogeneous sources on scholarly publications, and produce Linked Data which can be exploited by the community itself. The 2014 edition was the first attempt to organize a challenge to enable the assessment of the quality of scientific output. The 2015 edition was more explicit regarding the potential techniques, i.e., information extraction and interlinking. The current 2016 edition focuses on the multiple dimensions of scientific quality and the great potential impact of producing Linked Data for this purpose. In this paper, we discuss the overall structure of the Semantic Publishing Challenge, as it is for the 2016 edition, as well as the submitted solutions and their evaluation.
... We received six submissions for Task 2: Sateli/Witte [17] proposed a rule-based approach. They composed two logical components in a pipeline: a syntactic processor to identify the basic layout units and to cluster them into logical units, and a semantic processor to identify entities in text by pattern search. ...
Conference Paper
Full-text available
The Semantic Publishing Challenge series aims at investigating novel approaches for improving scholarly publishing using Linked Data technology. In 2014 we had bootstrapped this effort with a focus on extracting information from non-semantic publications – computer science workshop proceedings volumes and their papers – to assess their quality. The objective of this second edition was to improve information extraction but also to interlink the 2014 dataset with related ones in the LOD Cloud, thus paving the way for sophisticated end-user services.
... As a prerequisite to the semantic analysis step, we pre-process a document's text to convert it into a well-defined sequence of linguistically meaningful units: words, numbers, symbols and sentences, which are passed along to the subsequent processing stages (Sateli & Witte, 2015). As part of this process, the document's text is broken into tokens and lemmatized. ...
Article
Full-text available
Motivation. Finding relevant scientific literature is one of the essential tasks researchers are facing on a daily basis. Digital libraries and web information retrieval techniques provide rapid access to a vast amount of scientific literature. However, no further automated support is available that would enable fine-grained access to the knowledge ‘stored’ in these documents. The emerging domain of Semantic Publishing aims at making scientific knowledge accessible to both humans and machines, by adding semantic annotations to content, such as a publication’s contributions, methods, or application domains. However, despite the promises of better knowledge access, the manual annotation of existing research literature is prohibitively expensive for wide-spread adoption. We argue that a novel combination of three distinct methods can significantly advance this vision in a fully-automated way: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic knowledge base construction for both NEs and REs using semantic web ontologies that interconnect entities in documents with the machine-readable LOD cloud.

Results. We present a complete workflow to transform scientific literature into a semantic knowledge base, based on the W3C standards RDF and RDFS. A text mining pipeline, implemented based on the GATE framework, automatically extracts rhetorical entities of type Claims and Contributions from full-text scientific literature. These REs are further enriched with named entities, represented as URIs to the linked open data cloud, by integrating the DBpedia Spotlight tool into our workflow. Text mining results are stored in a knowledge base through a flexible export process that provides for a dynamic mapping of semantic annotations to LOD vocabularies through rules stored in the knowledge base. We created a gold standard corpus from computer science conference proceedings and journal articles, where Claim and Contribution sentences are manually annotated with their respective types using LOD URIs. The performance of the RE detection phase is evaluated against this corpus, where it achieves an average F-measure of 0.73. We further demonstrate a number of semantic queries that show how the generated knowledge base can provide support for numerous use cases in managing scientific literature.

Availability. All software presented in this paper is available under open source licenses at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements. Development releases of individual components are additionally available on our GitHub page at https://github.com/SemanticSoftwareLab.
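The named-entity enrichment step described above relies on DBpedia Spotlight. A minimal Python sketch of such a grounding call is given below; the endpoint URL, parameters and response fields are assumptions based on Spotlight's public REST web service and may differ for a local deployment.

    # Sketch of grounding named entities with DBpedia Spotlight's public
    # REST web service. Endpoint URL, parameters and response fields are
    # assumptions based on the public service; adjust for local deployments.
    import requests

    def ground_entities(text, confidence=0.5):
        resp = requests.get(
            "https://api.dbpedia-spotlight.org/en/annotate",
            params={"text": text, "confidence": confidence},
            headers={"Accept": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()
        # Each detected resource links a surface form in the text to a DBpedia URI
        return [(r["@surfaceForm"], r["@URI"])
                for r in resp.json().get("Resources", [])]

    print(ground_entities("The workshop proceedings were published by CEUR-WS."))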
Conference Paper
In this demonstration paper, we describe an open source workflow for supporting experiments in semantic publishing research. Based on a flexible, component-based approach, natural language papers can be converted into a Linked Open Data (LOD) compliant knowledge base. We discuss, by way of example, how to plan and execute experiments based on an integrated suite of tools, thereby significantly lowering the barrier of entry in this field while also encouraging the exchange of tools when building novel contributions.
Conference Paper
The objective of the Semantic Publishing (SemPub) challenge series is to bootstrap a value chain for scientific data to enable services, such as assessing the quality of scientific output with respect to novel metrics. The key idea was to involve participants in extracting data from heterogeneous resources and producing datasets on scholarly publications, which can be exploited by the community itself. Differently from other challenges in the semantic publishing domain, whose focus is on exploiting semantically enriched data, SemPub focuses on producing Linked Open Datasets. The goal of this paper is to review both (i) the overall organization of the Challenge, and (ii) the results that the participants have produced in the first two challenges of 2014 and 2015 – in terms of data, ontological models and tools – in order to better shape future editions of the challenge, and to better serve the needs of the semantic publishing community.
Article
Full-text available
While most challenges organized so far in the Semantic Web domain are focused on comparing tools with respect to different criteria such as their features and competencies, or exploiting semantically enriched data, the Semantic Web Evaluation Challenges series, co-located with the ESWC Semantic Web Conference, aims to compare them based on their output, namely the produced dataset. The Semantic Publishing Challenge is one of these challenges. Its goal is to involve participants in extracting data from heterogeneous sources on scholarly publications, and producing Linked Data that can be exploited by the community itself. This paper reviews lessons learned from both (i) the overall organization of the Semantic Publishing Challenge, regarding the definition of the tasks, building the input dataset and forming the evaluation, and (ii) the results produced by the participants, regarding the proposed approaches, the used tools, the preferred vocabularies and the results produced in the three editions of 2014, 2015 and 2016. We compared these lessons to other Semantic Web Evaluation challenges. In this paper, we (i) distill best practices for organizing such challenges that could be applied to similar events, and (ii) report observations on Linked Data publishing derived from the submitted solutions. We conclude that higher quality may be achieved when Linked Data is produced as a result of a challenge, because the competition becomes an incentive, while solutions become better with respect to Linked Data publishing best practices when they are evaluated against the rules of the challenge.
Conference Paper
We present a workflow for the automatic transformation of scholarly literature to a Linked Open Data (LOD) compliant knowledge base to address Task 2 of the Semantic Publishing Challenge 2016. In this year’s task, we aim to extract various contextual information from full-text papers using a text mining pipeline that integrates LOD-based Named Entity Recognition (NER) and triplification of the detected entities. In our proposed approach, we leverage an existing NER tool to ground named entities, such as geographical locations, to their LOD resources. Combined with a rule-based approach, we demonstrate how we can extract both the structural (e.g., floats and sections) and semantic elements (e.g., authors and their respective affiliations) of the provided dataset’s documents. Finally, we integrate the LODeXporter, our flexible exporting module, to represent the results as semantic triples in RDF format. As a result, we generate a scalable, TDB-based knowledge base that is interlinked with the LOD cloud, and a public SPARQL endpoint for the task’s queries. Our submission won second place at the SemPub2016 challenge Task 2 with an average F-score of 0.63.
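To illustrate the kind of competency query such a SPARQL endpoint is meant to answer (e.g., authors and their affiliations), the Python sketch below uses SPARQLWrapper; the endpoint URL and vocabulary terms are hypothetical placeholders, not those of the actual submission.

    # Illustration of a competency query against a SPARQL endpoint holding
    # the extracted data. The endpoint URL and vocabulary terms below are
    # hypothetical placeholders, not those of the actual submission.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:3030/sempub/sparql")  # placeholder endpoint
    sparql.setQuery("""
        PREFIX vocab: <http://example.org/vocab#>
        SELECT ?author ?affiliation WHERE {
            ?paper  vocab:hasAuthor      ?author .
            ?author vocab:hasAffiliation ?affiliation .
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["author"]["value"], "-", row["affiliation"]["value"])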
Conference Paper
Full-text available
Machine-understandable data constitutes the foundation for the Semantic Web. This paper presents a viable way for authoring and annotating Semantic Documents on the desktop. In our approach, the PDF file format is the container for document semantics, being able to store both the content and the related metadata in a single file. To achieve this, we provide a framework (SALT – Semantically Annotated LaTeX) that extends the LaTeX writing environment and supports the creation of metadata for scientific publications. SALT allows the author to create metadata concurrently, i.e. while in the process of writing a document. We discuss some of the requirements that have to be met when developing such support for creating semantic documents. In addition, we describe a usage scenario to show the feasibility and benefit of our approach.
Conference Paper
Finding research literature pertaining to a task at hand is one of the essential tasks that scientists face on a daily basis. Standard information retrieval techniques make it possible to quickly obtain a vast number of potentially relevant documents. Unfortunately, the search results then require significant effort for manual inspection, where we would rather select relevant publications based on more fine-grained, semantically rich queries involving a publication's contributions, methods, or application domains. We argue that a novel combination of three distinct methods can significantly advance this vision: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic generation of RDF triples for both NEs and REs using semantic web ontologies to interconnect them. Combined in a single workflow, these techniques allow us to automatically construct a knowledge base that facilitates numerous advanced use cases for managing scientific documents.
Article
The availability in machine-readable form of descriptions of the structure of documents, as well as of the document discourse (e.g. the scientific discourse within scholarly articles), is crucial for facilitating semantic publishing and the overall comprehension of documents by both users and machines. In this paper we introduce DoCO, the Document Components Ontology, an OWL 2 DL ontology that provides a general-purpose structured vocabulary of document elements to describe both structural and rhetorical document components in RDF. In addition to the formal description of the ontology, this paper showcases its utility in practice in a variety of our own applications and other activities of the Semantic Publishing community that rely on DoCO to annotate and retrieve document components of scholarly articles.
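A minimal Python sketch of describing document structure with DoCO terms in RDF is shown below; the class and property IRIs follow the published DoCO and Pattern Ontology vocabularies as documented and should be verified against the ontology, while the document URIs are hypothetical.

    # Sketch of describing document structure with DoCO in RDF using rdflib.
    # Class/property IRIs follow the published DoCO and Pattern Ontology
    # vocabularies as documented; the document URIs are hypothetical.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    DOCO = Namespace("http://purl.org/spar/doco/")
    PO = Namespace("http://www.essepuntato.it/2008/12/pattern#")
    DOC = Namespace("http://example.org/paper1#")  # hypothetical document namespace

    g = Graph()
    g.bind("doco", DOCO)
    g.bind("po", PO)

    g.add((DOC.section1, RDF.type, DOCO.Section))
    g.add((DOC.paragraph1, RDF.type, DOCO.Paragraph))
    g.add((DOC.section1, PO.contains, DOC.paragraph1))  # structural containment

    print(g.serialize(format="turtle"))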
Article
Nowadays, research group members are often overwhelmed by the large amount of relevant literature for any given topic. The abundance of publications leads to bottlenecks in curating and organizing literature, and important knowledge can be easily missed. While a number of tools exist for managing publications, they generally only deal with bibliographical metadata, and their support for further content analysis is limited to simple manual annotation or tagging. Here, we investigate how we can go beyond these approaches by combining semantic technologies, including natural language processing, within a user-friendly wiki system to create an easy-to-use, collaborative space that facilitates the semantic analysis and management of literature in a research group. We present the Zeeva system as a first prototype that demonstrates how we can turn existing papers into a queryable knowledge base.
A. Constantin, S. Peroni, S. Pettifer, S. David, F. Vitali: The Document Components Ontology (DoCO). The Semantic Web Journal (2015), in press.

T. Groza, S. Handschuh, K. Möller, S. Decker: SALT – Semantically Annotated LaTeX for Scientific Publications.