Utopia documents: Linking scholarly literature with research data

School of Computer Science, Faculty of Life Sciences, University of Manchester, Manchester, UK.
Bioinformatics (Impact Factor: 4.98). 09/2010; 26(18):i568-74. DOI: 10.1093/bioinformatics/btq383
Source: PubMed


In recent years, the gulf between the mass of accumulating research data and the massive literature describing and analyzing those data has widened. The need for intelligent tools to bridge this gap, to rescue the knowledge being systematically isolated in literature and data silos, is now widely acknowledged.
To this end, we have developed Utopia Documents, a novel PDF reader that semantically integrates visualization and data-analysis tools with published research articles. In a successful pilot with editors of the Biochemical Journal (BJ), the system has been used to transform static document features into objects that can be linked, annotated, visualized and analyzed interactively. Utopia Documents is now used routinely by BJ editors to mark up article content prior to publication. Recent additions include the integration of various text-mining and biodatabase plugins, demonstrating the system's ability to seamlessly integrate online content with PDF articles.
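The core idea above, turning static mentions in an article into linked, database-backed objects, can be illustrated with a small sketch. This is a toy illustration only: the class names, lexicon, and matching strategy below are assumptions for the example, not the actual Utopia Documents plugin API.

```python
import re
from dataclasses import dataclass

@dataclass
class Annotation:
    term: str        # the text span recognised in the article
    start: int       # character offset where the span begins
    url: str         # link to the online database record

# Toy lexicon mapping known identifiers to database URLs (illustrative).
LEXICON = {
    "P12345": "https://www.uniprot.org/uniprot/P12345",
    "1TIM":   "https://www.rcsb.org/structure/1TIM",
}

def annotate(text: str) -> list[Annotation]:
    """Turn static mentions of known identifiers into linkable objects."""
    pattern = re.compile("|".join(re.escape(k) for k in LEXICON))
    return [
        Annotation(m.group(), m.start(), LEXICON[m.group()])
        for m in pattern.finditer(text)
    ]

anns = annotate("The structure 1TIM corresponds to UniProt entry P12345.")
```

A real system would of course draw on text-mining services and curated biodatabases rather than a hard-coded lexicon; the point is only that each recognised span becomes an addressable object carrying its own link.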

Available from: Teresa K Attwood, Oct 04, 2015
  • Source
    • "Although an increasing number of journals today require the data used to derive the results as a prerequisite for publication (e.g., f1000), the steps on how these data have been assembled from primary data and how the data have been processed during the analysis are often hidden. Losing the link between primary data, derived data products, and knowledge results in a "gulf" between primary data repositories and knowledge repositories (Shotton 2009; Attwood et al. 2010). "
    ABSTRACT: We are witnessing a growing gap separating primary research data from the derived data products presented as knowledge in publications. Although journals today increasingly require the data products used to derive the results as a prerequisite for publication, the important link to the primary data is lost. However, documenting the post-processing steps that link primary data with derived data products has the potential to increase the accuracy and reproducibility of scientific findings significantly. Here, we introduce the rBEFdata R package as a companion to the collaborative data-management platform BEFdata. The R package provides programmatic access to the platform's features. It allows users to search for data, integrating the search with external thesauri to improve data discovery, and to download and import data and metadata into R for analysis. Batched downloads are also available, built on a paper-proposal mechanism implemented by BEFdata. This feature groups primary data and metadata and streamlines the discussions and collaborations revolving around a particular research idea. The upload functionality of the R package, combined with the portal's paper-proposal mechanism, allows derived data products and scripts to be attached directly from R, thus addressing major aspects of documenting data post-processing. We present the core features of the rBEFdata R package through an example ecological analysis, and discuss the potential of post-processing documentation for linking primary data with derived data products and knowledge.
    Ecology and Evolution 07/2015; 5(14). DOI:10.1002/ece3.1547 · 2.32 Impact Factor
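The paper-proposal workflow that the abstract above describes (group primary datasets around a research idea, download them as a batch, then upload derived products together with the analysis script) can be sketched roughly as follows. The sketch is in Python for consistency with the other examples here; rBEFdata itself is an R package, and all class and method names below are illustrative assumptions, not the rBEFdata/BEFdata API.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    dataset_id: int
    title: str
    rows: list

@dataclass
class PaperProposal:
    """Groups primary datasets around one research idea."""
    title: str
    datasets: list = field(default_factory=list)
    derived: list = field(default_factory=list)

    def attach(self, ds: Dataset) -> None:
        self.datasets.append(ds)

    def download_all(self) -> dict:
        # Batched download: all attached primary data in one call.
        return {ds.dataset_id: ds.rows for ds in self.datasets}

    def upload_derived(self, name: str, rows: list, script: str) -> None:
        # The derived product is stored together with the script that
        # produced it, preserving the primary-to-derived link.
        self.derived.append({"name": name, "rows": rows, "script": script})

proposal = PaperProposal("Biodiversity effects on decomposition")
proposal.attach(Dataset(1, "Litter mass loss", [0.4, 0.6]))
proposal.attach(Dataset(2, "Species richness", [3, 8]))
primary = proposal.download_all()
means = {k: sum(v) / len(v) for k, v in primary.items()}
proposal.upload_derived("site means", list(means.values()), "analysis.py")
```

The design point the abstract argues for is visible in `upload_derived`: because the derived product and its generating script travel together under the proposal, the chain from primary data to published result stays reconstructible.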
  • Source
    • "As shown by the survey conducted in [2], Enhanced Publications (EPs) can be generally conceived as digital publications "enriched with" or "linking to" related research results, such as research data, workflows, software, and possibly connections among them. Enhanced Publication Information Systems (EPISs) are systems devised for the management of EPs [3] [4] [5] [6] [7] [8] [19]. The majority of those systems are tailored to their specific communities and realized "from scratch" so that functionalities that are shared across disciplines and user communities are re-implemented every time. "
    ABSTRACT: Enhanced publications (EPs) can be generally conceived as digital publications "enriched with" or "linking to" related research results, such as research data, workflows, software, and possibly connections among them. Enhanced Publication Information Systems (EPISs) are information systems devised for the management of EPs in specific application domains. Currently, no framework supporting the realization of EPISs is known, and EPISs are typically built "from scratch" by integrating general-purpose technologies (e.g. relational databases, file stores, triple stores) and Digital Library-oriented software (e.g. repositories, cataloguing systems). Such an approach entails non-negligible realization and maintenance costs that could be reduced by a more systemic approach. The framework proposed in this work addresses this task by providing EPIS developers with EP-management tools that ease their efforts by hiding the complexity of the underlying technologies.
    D-Lib Magazine 01/2015; 21(1/2). DOI:10.1045/january2015-bardi
  • Source
    • "Expecting scientists to alter their documentation habits to use computer-friendly ontologies rather than human-friendly natural language is unrealistic, especially given that the benefits do not start to accrue until a critical mass is achieved within the community. On the other hand, there has been a considerable amount of research toward designing software to perform fully automated parsing of otherwise intractable text and diagrams (Attwood et al., 2010), and to add annotations in a machine-friendly format. Many of these efforts have been found to be valuable for certain scenarios where the high error rate is tolerable. "
    ABSTRACT: Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers.
    PeerJ 08/2014; 2(Suppl 1):e524. DOI:10.7717/peerj.524 · 2.11 Impact Factor