Article

Special Issue: Combined Special Issues on eScience 2010 and Euro-Par 2011

Abstract

Scientific discovery is increasingly driven by the collection, analysis, and comprehension of digital data. Collaborations between domain scientists and computer scientists can accelerate both the investigation and applications processes. The Microsoft eScience Workshop is a recognized venue for showcasing such collaborations and serves as a forum for exchanging both domain and computational research. This editorial provides an overview of the papers that resulted from selected research collaborations presented at the 2010 Microsoft eScience Workshop. Copyright © 2012 John Wiley & Sons, Ltd.

Article
Science, especially experimental science, has always depended on the careful capture of plans, actions, raw and processed data and conclusions. With scientific research now so inextricably dependent on computers, the use of an electronic laboratory notebook (ELN) is almost essential. The meticulous notebooks of Michael Faraday and other scientists of his era remain as role models for the recording that is necessary, but they cannot provide the essential support for discussion, sharing, collaboration and formal verification. A blog (a contraction of Web log) can form the basis for implementing an electronic notebook but does not suffice to meet all the needs of an ELN. This paper describes the LabTrove ELN, which is blog based but provides numerous additional features, such as version control, security policies and a flexible metadata scheme, and facilities for interchanging objects with other systems. The MyExperimentalScience project links LabTrove with myExperiment, a repository for workflows in a collaborative environment, thereby making LabTrove templates available for discovery and reuse. Collaboration, sharing and reuse are essential for scientific progress, which depends on individual scientists building on the results already produced by others. Open‐source ELNs such as LabTrove are ideal vehicles to support the growth of Open Notebook Science. Copyright © 2012 John Wiley & Sons, Ltd.
Article
This paper describes a k‐mer approach to analysing DNA data and quickly answering certain types of ad hoc biological questions. These k‐mers (short DNA strings) are stored in a conventional relational database and indexed to support efficient exact match operations. We show that k‐mers around 20–25 bases long have interesting and useful uniqueness properties that can be used to compute a ‘relatedness’ metric and also allow k‐mers to be used as ‘unique enough’ tags to identify organisms and genes. This relatedness metric is used in SQL queries that can directly answer questions such as how two related species differ, and what genes are unique to an organism. The k‐mer tags have proven useful in applications, largely metagenomic ones that can quickly process large volumes of sequencing data to say something about what organisms and genes might be present in an environmental sample. All of this work is based on simple and fast exact matches of k‐mer strings using a database, rather than conventional alignment based on inexact matches of much longer strings. These k‐mer tools provide ways of rapidly exploring large genome spaces and handling large volumes of sequence data, and complement rather than replace existing alignment and assembly tools. Copyright © 2012 John Wiley & Sons, Ltd.
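The exact-match idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the table layout, the function names (`build_db`, `shared_fraction`, `unique_kmers`), the use of SQLite, and the "fraction of shared k-mers" score are all assumptions made for illustration, and the abstract does not define its relatedness metric. A toy k of 4 is used so the data stays readable, whereas the abstract discusses k-mers of around 20–25 bases.

```python
import sqlite3


def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_db(genomes, k):
    """Load the distinct k-mers of each organism into an indexed table.

    The schema (organism, seq) is a hypothetical layout, not the paper's.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kmer (organism TEXT, seq TEXT)")
    for name, dna in genomes.items():
        rows = [(name, s) for s in set(kmers(dna, k))]
        conn.executemany("INSERT INTO kmer VALUES (?, ?)", rows)
    # The index is what makes exact-match lookups on k-mer strings cheap.
    conn.execute("CREATE INDEX idx_kmer_seq ON kmer (seq)")
    return conn


def shared_fraction(conn, a, b):
    """Fraction of a's distinct k-mers that also occur exactly in b.

    A stand-in similarity score, assumed here for illustration only.
    """
    total = conn.execute(
        "SELECT COUNT(*) FROM kmer WHERE organism = ?", (a,)
    ).fetchone()[0]
    shared = conn.execute(
        "SELECT COUNT(*) FROM kmer x JOIN kmer y ON x.seq = y.seq "
        "WHERE x.organism = ? AND y.organism = ?",
        (a, b),
    ).fetchone()[0]
    return shared / total if total else 0.0


def unique_kmers(conn, a):
    """Return the k-mers found in organism a and in no other organism."""
    rows = conn.execute(
        "SELECT seq FROM kmer WHERE organism = ? AND seq NOT IN "
        "(SELECT seq FROM kmer WHERE organism != ?) ORDER BY seq",
        (a, a),
    )
    return [r[0] for r in rows]
```

With two toy sequences, `build_db({"a": "ACGTAC", "b": "CGTACG"}, 4)` stores three distinct 4-mers per organism; two of them are shared, and `ACGT` occurs only in `a`, so it could serve as a tag for that organism in this tiny example.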
Article
The increase in volume and complexity of biological data has led to increased requirements to reuse that data. Consistent and accurate metadata is essential for this task, creating new challenges in semantic data annotation and in the construction of terminologies and ontologies used for annotation. The BioSharing community are developing standards and terminologies for annotation, which have been adopted across bioinformatics, but the real challenge is to make these standards accessible to laboratory scientists. Widespread adoption requires the provision of tools to assist scientists whilst reducing the complexities of working with semantics. This paper describes unobtrusive ‘stealthy’ methods for collecting standards compliant, semantically annotated data and for contributing to ontologies used for those annotations. Spreadsheets are ubiquitous in laboratory data management. Our spreadsheet‐based RightField tool enables scientists to structure information and select ontology terms for annotation within spreadsheets, producing high quality, consistent data without changing common working practices. Furthermore, our Populous spreadsheet tool proves effective for gathering domain knowledge in the form of Web Ontology Language (OWL) ontologies. Such a corpus of structured and semantically enriched knowledge can be extracted in Resource Description Framework (RDF), providing further means for searching across the content and contributing to Open Linked Data (http://linkeddata.org/). Copyright © 2012 John Wiley & Sons, Ltd.
Article
The Internet, Web 2.0 and Social Networking technologies are enabling citizens to actively participate in ‘citizen science’ projects by contributing data to scientific programmes via the Web. However, the limited training, knowledge and expertise of contributors can lead to poor quality, misleading or even malicious data being submitted. Consequently, the scientific community often perceive citizen science data as not worthy of being used in serious scientific research, which in turn leads to poor retention rates for volunteers. In this paper, we describe a technological framework that combines data quality improvements and trust metrics to enhance the reliability of citizen science data. We describe how online social trust models can provide a simple and effective mechanism for measuring the trustworthiness of community‐generated data. We also describe filtering services that remove unreliable or untrusted data and enable scientists to confidently reuse citizen science data. The resulting software services are evaluated in the context of the CoralWatch project, a citizen science project that uses volunteers to collect comprehensive data on coral reef health. Copyright © 2012 John Wiley & Sons, Ltd.
Article
We are well into the era of data-intensive digital scientific discovery, an era defined by Jim Gray as the Fourth Paradigm. From my own perspective of the life sciences, much has been accomplished, but there is much to do if we are to maximize our understanding of biological systems given the data we have today, let alone what is coming. In my 2010 Jim Gray eScience Award Lecture, I gave my own thoughts on what needs to be accomplished, and with an additional year of hindsight, I expand on that here. Copyright © 2012 John Wiley & Sons, Ltd.
P. Bourne. The reaming of life. Concurrency and Computation: Practice and Experience, 2013.
P. Greenfield. Answering biological questions by querying k-mer databases. Concurrency and Computation: Practice and Experience, 2013.