Utopia documents: linking scholarly literature with research data

School of Computer Science, Faculty of Life Sciences, University of Manchester, Manchester, UK.
Bioinformatics (Impact Factor: 4.62). 09/2010; 26(18):i568-74. DOI: 10.1093/bioinformatics/btq383
Source: PubMed

ABSTRACT In recent years, the gulf between the mass of accumulating research data and the massive literature describing and analyzing those data has widened. The need for intelligent tools to bridge this gap, to rescue the knowledge being systematically isolated in literature and data silos, is now widely acknowledged.
To this end, we have developed Utopia Documents, a novel PDF reader that semantically integrates visualization and data-analysis tools with published research articles. In a successful pilot with editors of the Biochemical Journal (BJ), the system has been used to transform static document features into objects that can be linked, annotated, visualized and analyzed interactively. Utopia Documents is now used routinely by BJ editors to mark up article content prior to publication. Recent additions include the integration of various text-mining and biodatabase plugins, demonstrating the system's ability to seamlessly integrate online content with PDF articles.
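The plugin mechanism described above — biodatabase plugins that recognize terms in article text and link them to online content — might be sketched roughly as follows. All names here (TermPlugin, PluginRegistry, the accession pattern) are illustrative assumptions, not part of the actual Utopia Documents API:

```python
import re

# Hypothetical sketch of a term-recognition plugin registry, loosely
# modelled on the behaviour described above; none of these names come
# from the real Utopia Documents codebase.
class TermPlugin:
    def __init__(self, name, pattern, resolver):
        self.name = name
        self.pattern = re.compile(pattern)
        self.resolver = resolver  # maps a matched term to online content

class PluginRegistry:
    def __init__(self):
        self.plugins = []

    def register(self, plugin):
        self.plugins.append(plugin)

    def annotate(self, text):
        """Return (term, plugin name, resolved link) for every match."""
        hits = []
        for plugin in self.plugins:
            for match in plugin.pattern.finditer(text):
                term = match.group(0)
                hits.append((term, plugin.name, plugin.resolver(term)))
        return hits

registry = PluginRegistry()
# A toy "biodatabase" plugin linking UniProt-style accession numbers.
registry.register(TermPlugin(
    "uniprot",
    r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b",
    lambda acc: f"https://www.uniprot.org/uniprot/{acc}",
))

hits = registry.annotate("The kinase P12345 was assayed in vitro.")
print(hits)
```

In this shape, each plugin owns both the recognition step and the resolution step, so new databases can be added without touching the reader itself — consistent with how the abstract characterizes the system's extensibility.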

Available from: Teresa K Attwood, Jun 21, 2015
  • Source
    ABSTRACT: Enhanced publications (EPs) can be generally conceived as digital publications "enriched with" or "linking to" related research results, such as research data, workflows, software, and possibly connections among them. Enhanced Publication Information Systems (EPISs) are information systems devised for the management of EPs in specific application domains. Currently, no framework supporting the realization of EPISs is known, and EPISs are typically realized "from scratch" by integrating general-purpose technologies (e.g. relational databases, file stores, triple stores) and Digital Library-oriented software (e.g. repositories, cataloguing systems). Such an approach entails non-negligible realization and maintenance costs that could be decreased by adopting a more systematic approach. The framework proposed in this work addresses this task by providing EPIS developers with EP management tools that facilitate their efforts by hiding the complexity of the underlying technologies.
    D-Lib Magazine 01/2015; 21(1/2). DOI:10.1045/january2015-bardi
  • Source
    ABSTRACT: Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers.
    01/2014; 2:e524. DOI:10.7717/peerj.524
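    The hybrid curation loop described above — a model proposes ontology annotations for assay text, the curator approves them in one click, and out-of-domain text falls back to search — could be caricatured with a similarity-based suggester. The training snippets and BAO-style term IDs below are invented for illustration and do not come from the paper:

```python
# Illustrative sketch of model-assisted annotation: propose the term of
# the most similar annotated training snippet, or None when the text
# looks out-of-domain. The data and term IDs are hypothetical.
def tokenize(text):
    return set(text.lower().split())

# Tiny "training set": assay snippets already annotated with BAO-style terms.
TRAINING = [
    ("luciferase reporter gene assay in HEK293 cells", "BAO:0000363"),
    ("fluorescence polarization binding assay", "BAO:0000562"),
]

def suggest(text, threshold=0.3):
    """Return the annotation of the most similar training snippet,
    or None when Jaccard similarity falls below the threshold
    (i.e. the text is out-of-domain and should go to search)."""
    query = tokenize(text)
    best_term, best_score = None, 0.0
    for snippet, term in TRAINING:
        tokens = tokenize(snippet)
        score = len(query & tokens) / len(query | tokens)
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None

print(suggest("reporter gene assay with luciferase in HEK293 cells"))
print(suggest("mass spectrometry proteomics workflow"))
```

    The real system uses natural-language-processing models rather than token overlap, but the control flow is the same: high-confidence suggestions need only single-click approval, while low-confidence text routes the curator to the search interface, and their corrections feed back into the models.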
  • Source
    ABSTRACT: This thesis aims to introduce theories, formalisms and applications for opening up Semantic Publishing to an effective interaction between scholarly documents and their related semantic and formal descriptions. Namely, I investigate and propose solutions for three of the main issues that semantic publishing promises to address: the need for tools for linking document text to a formal representation of its meaning, the lack of complete metadata schemas for describing documents according to the publishing vocabulary, and the absence of effective user interfaces for easily acting on semantic publishing models and theories. The first part of my work is about markup theory and technology. A better comprehension of a document derives from its structural organisation and from the formal semantics defined within it. In digital documents, the way we say something about a text is through markup. Markup has been used for years for decorating documents at all levels of granularity, from the digital document as a whole to its sub-components. However, the most commonly used document formats in publishing (i.e., XML and PDF) were not developed to enable semantic enhancement, although it may be possible in principle to use them for this purpose. Trying to fill the gap between document markup and semantic markup, I have developed an OWL-based markup metalanguage called EARMARK. The basic idea is that EARMARK documents are collections of addressable text fragments, and such fragments are associated with OWL assertions that describe structural features as well as semantic properties of (parts of) that content. Of course, (digital) documents and their content represent the core aspect of Semantic Publishing, since it promotes their discovery and connection to document-related resources and contexts, such as other articles and raw scientific data. 
Unfortunately, existing Semantic Web vocabularies are too abstract and incomplete to cover all the needs of the actors involved in the publishing process (publishers, editors, authors, etc.). Thus, there is an acute need for new standards (ontologies) that comprehensively cover all the different aspects of the publishing domain. Trying to address these issues, in the central part of my work I propose a suite of orthogonal and complementary OWL 2 DL ontology modules, called Semantic Publishing And Referencing (SPAR) ontologies, for describing all the aspects of bibliographic publications as comprehensive machine-readable RDF metadata. Finally, in the last part of my thesis I deal with the issue of enabling users to use and interact with semantic technologies and semantic data. This aspect is particularly crucial for Semantic Publishing, since its end-users are publishers, researchers, readers, librarians and the like rather than experts in semantic technologies. The use of semantic models and data should be supported by proper interfaces that simplify the work of those involved in Semantic Publishing. In this thesis I illustrate my personal contribution in this direction. I introduce four different tools that I developed to support users when understanding an ontology (LODE, KC-Viz), when formalising/presenting it (Graffoo), and when defining semantic data according to it (Gaffe).
    01/2012, Degree: Ph.D., Supervisor: Fabio Vitali
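    EARMARK's core idea — markup that lives outside the text as addressable fragments carrying assertions about structure and semantics — can be illustrated with a minimal standoff-annotation sketch. EARMARK itself expresses these as OWL assertions; the Python tuples and property names below are a loose analogy, not its actual vocabulary:

```python
# Minimal illustration of standoff markup in the EARMARK style: the
# text is untouched, and annotations are addressable ranges paired
# with assertions about them. Property names here are invented.
DOC = "Utopia Documents links articles with research data."

class Range:
    """An addressable fragment of the document: [start, end)."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def text(self, doc):
        return doc[self.start:self.end]

# Standoff annotations: (range, property, value) statements over DOC.
annotations = [
    (Range(0, 16), "structure", "title-phrase"),
    (Range(37, 50), "semantics", "research-data-mention"),
]

for rng, prop, value in annotations:
    print(f"{rng.text(DOC)!r} -> {prop}: {value}")
```

    Because the ranges address the text rather than wrapping it, overlapping annotations are unproblematic — one advantage standoff approaches hold over inline XML markup of the kind the abstract contrasts with.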