ABSTRACT: An increasing portion of biomedical research relies on the use of biobanks and databases. Sharing of such resources is essential for optimizing knowledge production. A major obstacle for sharing bioresources is the lack of recognition for the efforts involved in establishing, maintaining and sharing them, due to, in particular, the absence of adequate tools. Increasing demands on biobanks and databases to improve access should be complemented with efforts of end-users to recognize and acknowledge these resources. An appropriate set of tools must be developed and implemented to measure this impact. To address this issue we propose to measure the use in research of such bioresources as a value of their impact, leading to the creation of an indicator: the Bioresource Research Impact Factor (BRIF). Key elements to be assessed are: defining obstacles to sharing samples and data, choosing an adequate identifier for bioresources, identifying and weighing parameters to be considered in the metrics, analyzing the role of journal guidelines and policies for resource citing and referencing, and assessing policies for resource access and sharing and their influence on bioresource use. This work allows us to propose a framework and foundations for the operational development of BRIF that still requires input from stakeholders within the biomedical community.
ABSTRACT: BACKGROUND: The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central -- a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data. RESULTS: A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications. CONCLUSIONS: We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. 
However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
ABSTRACT: BACKGROUND: Sharing of data about variation and the associated phenotypes is a critical need, yet variant information can be arbitrarily complex, making a single standard vocabulary elusive and re-formatting difficult. Complex standards have proven too time-consuming to implement. RESULTS: The GEN2PHEN project addressed these difficulties by developing a comprehensive data model for capturing biomedical observations, Observ-OM, and building the VarioML format around it. VarioML pairs a simplified open specification for describing variants with a toolkit for adapting the specification into one's own research workflow. Straightforward variant data can be captured, federated, and exchanged with no overhead; more complex data can be described without loss of compatibility. The open specification enables push-button submission to gene variant databases (LSDBs), e.g., the Leiden Open Variation Database, using the Cafe Variome data publishing service, while VarioML bidirectionally transforms data between XML and web-application code formats, opening up new possibilities for open source web applications building on shared data. A Java implementation toolkit makes VarioML easily integrated into biomedical applications. VarioML is designed primarily for LSDB data submission and transfer scenarios, but can also be used as a standard variation data format for JSON and XML document databases and user interface components. CONCLUSIONS: VarioML is a set of tools and practices improving the availability, quality, and comprehensibility of human variation information. It enables researchers, diagnostic laboratories, and clinics to share that information with ease, clarity, and without ambiguity.
ABSTRACT: Genetic and epidemiological research increasingly employs large collections of phenotypic and molecular observation data from high quality human and model organism samples. Standardization efforts have produced a few simple formats for exchange of these various data, but a lightweight and convenient data representation scheme for all data modalities does not exist, hindering successful data integration, such as assignment of mouse models to orphan diseases and phenotypic clustering for pathways. We report a unified system to integrate and compare observation data across experimental projects, disease databases, and clinical biobanks. The core object model (Observ-OM) comprises only four basic concepts to represent any kind of observation: Targets, Features, Protocols (and their Applications), and Values. An easy-to-use file format (Observ-TAB) employs Excel to represent individual and aggregate data in straightforward spreadsheets. The systems have been tested successfully on human biobank, genome-wide association studies, quantitative trait loci, model organism, and patient registry data using the MOLGENIS platform to quickly set up custom data portals. Our system will dramatically lower the barrier for future data sharing and facilitate integrated search across panels and species. All models, formats, documentation, and software are available for free and open source (LGPLv3) at http://www.observ-om.org.
Human Mutation 03/2012; 33(5):867-73. · 5.21 Impact Factor
ABSTRACT: Motivation for the IRISC workshop came from the observation that identity and digital identification are increasingly important factors in modern scientific research, especially with the now near-ubiquitous use of the Internet as a global medium for dissemination and debate of scientific knowledge and data, and as a platform for scientific collaborations and large-scale e-science activities. The 1 1/2 day IRISC2011 workshop sought to explore a series of interrelated topics under two main themes: i) unambiguously identifying authors/creators & attributing their scholarly works, and ii) individual identification and access management in the context of identity federations. Specific aims of the workshop included: • Raising overall awareness of key technical and non-technical challenges, opportunities and developments. • Facilitating a dialogue, cross-pollination of ideas, collaboration and coordination between diverse – and largely unconnected – communities. • Identifying & discussing existing/emerging technologies, best practices and requirements for researcher identification. This report provides background information on key identification-related concepts & projects, describes workshop proceedings and summarizes key workshop findings.
ABSTRACT: Background / Purpose:
Contributor identification is a core challenge in data publication. As in scholarly communication more generally, non-unique person names and the current lack of a global identification infrastructure for producers of scholarly content make it difficult to establish the identity of authors and other contributors. This in turn makes it difficult to accurately attribute datasets published via online digital repositories to their creators – one of several key requirements for including these important outputs in the scholarly record. In the GEN2PHEN project (http://www.gen2phen.org) we are developing a series of novel web-based systems and processes for online dissemination of genetic variation and other research data. The core aim is that of ensuring that data creators are recognized and rewarded for publishing data. This work builds on and integrates with recently launched international initiatives to extend and adapt the existing DOI infrastructure for identifying, locating and citing online datasets (DataCite), and to create a global registry of unique identifiers for authors and other contributors (ORCID).
The technical approach we are exploring in this pilot project utilizes this emerging global data citation and contributor identification framework, in order to allow published datasets to be discovered, cited in a scholarly context and unambiguously attributed. We argue that, along with other measures, such an incentive-based approach is key to motivating the sharing of data and other types of digital research outputs in the life sciences. This document can also be viewed on SlideShare.
ABSTRACT: This editorial discusses bioresources (for example, biobanks, databases and bioinformatics tools) and why these need to be easily accessible to facilitate advancement of research. The authors comment on the proposition of a Bioresource Research Impact Factor (BRIF), which would promote the sharing of bioresources by creating a link between their initiators or implementers and the impact of the scientific research using them.
ABSTRACT: Explosive growth in the generation of genotype-to-phenotype (G2P) data necessitates a concerted effort to tackle the logistical and informatics challenges this presents. The GEN2PHEN Project represents one such effort, with a broad strategy of uniting disparate G2P resources into a hybrid centralized-federated network. This is achieved through a holistic strategy focussed on three overlapping areas: data input standards and pipelines through which to submit and collect data (data in); federated, independent, extendable, yet interoperable database platforms on which to store and curate widely diverse datasets (data storage); and data formats and mechanisms with which to exchange, combine, and extract data (data exchange and output). To fully leverage this data network, we have constructed the "G2P Knowledge Centre" (http://www.gen2phen.org). This central platform provides holistic searching of the G2P data domain allied with facilities for data annotation and user feedback, access to extensive G2P and informatics resources, and tools for constructing online working communities centered on the G2P domain. Through the efforts of GEN2PHEN, and through combining data with broader community-derived knowledge, the Knowledge Centre opens up exciting possibilities for organizing, integrating, sharing, and interpreting new waves of G2P data in a collaborative fashion.
Human Mutation 02/2011; 32(5):543-50. · 5.21 Impact Factor
ABSTRACT: Thesis submitted for the degree of Doctor of Philosophy at the University of Leicester, November 2010. Awarded January 2011. Modern research into the genetic basis of human health and disease is increasingly dominated by high-throughput experimentation and routine generation of large volumes of complex genotype to phenotype (G2P) information. Efforts to effectively manage, integrate, analyse and interpret this wealth of data face substantial challenges. This thesis discusses informatics approaches to addressing some of these challenges, primarily in the context of disease genetics. The genome-wide association study (GWAS) is widely used in the field, but translation of findings into scientific knowledge is hampered by heterogeneous and incomplete reporting, restrictions on sharing of primary data, publication bias and other factors. The central focus of the work was design and implementation of a core informatics infrastructure for centralised gathering and presentation of GWAS results. The resulting open-access HGVbaseG2P genetic association database and web-based tools for search, retrieval and graphical genome viewing increase overall usefulness of published GWAS findings. HGVbaseG2P conceptual modelling activities were also merged into a collaborative standardisation effort with international partners. A key outcome of this joint work is a minimal model for phenotype data which, together with ontologies and other standards, lays the foundation for a federated network of semantically and syntactically interoperable, distributed G2P databases. Attempts to gather complete aggregate representations of primary GWAS data into HGVbaseG2P were largely unsuccessful, chiefly due to concerns over re-identification of study participants. 
This led to a separate line of inquiry which explored - via in-depth field analysis, workshop organisation and other community outreach activities – potential applications of federated identity technologies for unambiguously identifying researchers online. Results suggest two broad use cases for user-centric researcher identities - i) practical, streamlined data access management and ii) tracking digital contributions for the purpose of attribution - which are critical to facilitating and incentivising sharing of GWAS (and other) research data.
ABSTRACT: The genome-wide association study (GWAS) database - GWAS Central (http://www.gwascentral.org) - allows the sophisticated interrogation and comparison of summary-level GWAS data. Here we present the application of ontologies within GWAS Central for the description and standardisation of phenotypic observations and their use in inferring disease phenotypes. For orthologous genes, our cross-species phenotype comparison pipeline allows for comparison of phenotypes defined using alternative mammalian phenotype ontologies. Building on the existing rich semantic phenotype annotation layer, we are currently involved in an effort to publish a core subset of the data as RDF nanopublications.
ABSTRACT: The recent explosion of biological data and the concomitant proliferation of distributed databases make it challenging for biologists and bioinformaticians to discover the best data resources for their needs, and the most efficient way to access and use them. Despite a rapid acceleration in uptake of syntactic and semantic standards for interoperability, it is still difficult for users to find which databases support the standards and interfaces that they need. To solve these problems, several groups are developing registries of databases that capture key metadata describing the biological scope, utility, accessibility, ease-of-use and existence of web services allowing interoperability between resources. Here, we describe some of these initiatives including a novel formalism, the Database Description Framework, for describing database operations and functionality and encouraging good database practice. We expect such approaches will result in improved discovery, uptake and utilization of data resources. Database URL: http://www.casimir.org.uk/casimir_ddf.
Database The Journal of Biological Databases and Curation 01/2010; 2010:baq014. · 4.20 Impact Factor
ABSTRACT: There is a huge demand on bioinformaticians to provide their biologists with user friendly and scalable software infrastructures to capture, exchange, and exploit the unprecedented amounts of new *omics data. We here present MOLGENIS, a generic, open source, software toolkit to quickly produce the bespoke MOLecular GENetics Information Systems needed.
The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological data structures and user interfaces. At the push of a button, MOLGENIS' generator suite automatically translates these models into a feature-rich, ready-to-use web application including database, user interfaces, exchange formats, and scriptable interfaces. Each generator is a template of SQL, JAVA, R, or HTML code that would require much effort to write by hand. This 'model-driven' method ensures reuse of best practices and improves quality because the modeling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by a re-generation. A plug-in mechanism ensures that both the generator suite and generated product can be customized just as much as hand-written software.
In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of many types of biomedical applications, including next-generation sequencing, GWAS, QTL, proteomics and biobanking. Writing 500 lines of model XML typically replaces 15,000 lines of hand-written programming code, which allows for quick adaptation if the information system is not yet to the biologist's satisfaction. Each application generated with MOLGENIS comes with an optimized database back-end, user interfaces for biologists to manage and exploit their data, programming interfaces for bioinformaticians to script analysis tools in R, Java, SOAP, REST/JSON and RDF, a tab-delimited file format to ease upload and exchange of data, and detailed technical documentation. Existing databases can be quickly enhanced with MOLGENIS generated interfaces using the 'ExtractModel' procedure.
The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexible web platforms for all possible genomic, molecular and phenotypic experiments with a richness of interfaces not provided by other tools. All the software and manuals are available free as LGPLv3 open source at http://www.molgenis.org.
ABSTRACT: Torrents of genotype-phenotype data are being generated, all of which must be captured, processed, integrated, and exploited. To do this optimally requires the use of standard and interoperable "object models," providing a description of how to partition the total spectrum of information being dealt with into elemental "objects" (such as "alleles," "genotypes," "phenotype values," "methods") with precisely stated logical interrelationships (such as "A objects are made up from one or more B objects"). We herein propose the Phenotype and Genotype Experiment Object Model (PaGE-OM; www.pageom.org), which has been tested and implemented in conjunction with several major databases, and approved as a standard by the Object Management Group (OMG). PaGE-OM is open-source, ready for use by the wider community, and can be further developed as needs arise. It will help to improve information management, assist data integration, and simplify the task of informatics resource design and construction for genotype and phenotype data projects.
Human Mutation 07/2009; 30(6):968-77. · 5.21 Impact Factor
ABSTRACT: Biologists need to perform complex queries, often across a variety of databases. Typically, each data resource provides an advanced query interface, each of which must be learnt by the biologist before they can begin to query them. Frequently, more than one data source is required and for high-throughput analysis, cutting and pasting results between websites is certainly very time consuming. Therefore, many groups rely on local bioinformatics support to process queries by accessing the resource's programmatic interfaces if they exist. This is not an efficient solution in terms of cost and time. Instead, it would be better if the biologist only had to learn one generic interface. BioMart provides such a solution.
BioMart enables scientists to perform advanced querying of biological data sources through a single web interface. The power of the system comes from integrated querying of data sources regardless of their geographical locations. Once these queries have been defined, they may be automated with its "scripting at the click of a button" functionality. BioMart's capabilities are extended by integration with several widely used software packages such as BioConductor, DAS, Galaxy, Cytoscape, Taverna. In this paper, we describe all aspects of BioMart from a user's perspective and demonstrate how it can be used to solve real biological use cases such as SNP selection for candidate gene screening or annotation of microarray results.
BioMart is an easy to use, generic and scalable system and therefore, has become an integral part of large data resources including Ensembl, UniProt, HapMap, Wormbase, Gramene, Dictybase, PRIDE, MSD and Reactome. BioMart is freely accessible to use at http://www.biomart.org.
ABSTRACT: The flow of research data concerning the genetic basis of health and disease is rapidly increasing in speed and complexity. In response, many projects are seeking to ensure that there are appropriate informatics tools, systems and databases available to manage and exploit this flood of information. Previous solutions, such as central databases, journal-based publication and manually intensive data curation, are now being enhanced with new systems for federated databases, database publication, and more automated management of data flows and quality control. Along with emerging technologies that enhance connectivity and data retrieval, these advances should help to create a powerful knowledge environment for genotype-phenotype information.
ABSTRACT: The Human Genome Variation database of Genotype to Phenotype information (HGVbaseG2P) is a new central database for summary-level findings produced by human genetic association studies, both large and small. Such a database is needed so that researchers have an easy way to access all the available association study data relevant to their genes, genome regions or diseases of interest. Such a depository will allow true positive signals to be more readily distinguished from false positives (type I error) that fail to consistently replicate. In this paper we describe how HGVbaseG2P has been constructed, and how its data are gathered and organized. We present a range of user-friendly but powerful website tools for searching, browsing and visualizing G2P study findings. HGVbaseG2P is available at http://www.hgvbaseg2p.org.
Nucleic Acids Research 11/2008; 37(Database issue):D797-802. · 8.81 Impact Factor
ABSTRACT: We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
ABSTRACT: With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.