Data integration and genomic medicine

Department of Medical Education and Biomedical Informatics, University of Washington, Seattle, USA.
Journal of Biomedical Informatics (Impact Factor: 2.19). 03/2007; 40(1):5-16. DOI: 10.1016/j.jbi.2006.02.007
Source: PubMed

ABSTRACT: Genomic medicine aims to revolutionize health care by applying our growing understanding of the molecular basis of disease. Research in this arena is data intensive: data sets are large and highly heterogeneous. To create knowledge from data, researchers must integrate these large and diverse data sets, which presents daunting informatics challenges such as representing data in a form suitable for computational inference (knowledge representation) and linking heterogeneous data sets (data integration). Fortunately, many of these challenges can be framed as data integration problems, and existing data integration technologies can be applied to them. In this paper, we discuss the opportunities of genomic medicine and identify the informatics challenges in this domain. We also review concepts and methodologies in the field of data integration, align them with the informatics challenges in genomic medicine, and present them as potential solutions. We conclude with challenges in genomic medicine that remain unaddressed and with gaps in data integration research that must be filled to facilitate genomic medicine.
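
To make the integration challenge concrete, the following is a minimal, hypothetical sketch of the mediator/wrapper pattern common in the data integration literature: two sources expose the same concepts under different field names, and per-source wrappers map their records onto one mediated schema so a single query can span both. All names here (VariantRecord, clinical_db, research_rows) are invented for illustration and are not taken from the paper.

```python
# Hypothetical mediator/wrapper sketch: integrate two heterogeneous
# "sources" under one mediated schema. Illustrative only; all names
# and record layouts are invented for this example.
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class VariantRecord:            # the mediated (common) schema
    gene: str
    variant: str
    phenotype: str
    source: str

# Source 1: a clinical system with verbose, source-specific field names.
clinical_db = [
    {"hgnc_symbol": "BRCA1", "hgvs": "c.68_69delAG", "dx": "breast cancer"},
]

# Source 2: a research file using different names for the same concepts.
research_rows = [
    {"gene_name": "BRCA1", "mutation": "c.5266dupC", "trait": "breast cancer"},
]

def wrap_clinical(rows: Iterable[dict]) -> Iterator[VariantRecord]:
    """Wrapper: map the clinical source's fields onto the mediated schema."""
    for r in rows:
        yield VariantRecord(r["hgnc_symbol"], r["hgvs"], r["dx"], "clinical")

def wrap_research(rows: Iterable[dict]) -> Iterator[VariantRecord]:
    """Wrapper: map the research source's fields onto the mediated schema."""
    for r in rows:
        yield VariantRecord(r["gene_name"], r["mutation"], r["trait"], "research")

def query_by_gene(gene: str) -> list[VariantRecord]:
    """A 'mediated' query: one question, answered across both sources."""
    merged = list(wrap_clinical(clinical_db)) + list(wrap_research(research_rows))
    return [rec for rec in merged if rec.gene == gene]

if __name__ == "__main__":
    for rec in query_by_gene("BRCA1"):
        print(rec)
```

The point of the pattern is that adding a new source only requires a new wrapper; the mediated query logic is untouched.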

Cited by:
    • "The schema allows for storing information in the context of a flexible and extensible experiment hierarchy, accommodating arbitrary configurations centered around Project, Subject, Visit, Study, Episode, and Acquisition objects, as well as limited information about data provenance. By representing many of the common types of information found in neuroimaging databases, XCEDE can facilitate data integration and act as a common data model, or mediated schema (Louie et al., 2007), that captures information from heterogeneous systems in a common XML syntax. In this way, available database resources can be described using a common language that simplifies data integration and sharing efforts, much in the same way NIfTI simplified imaging data exchange across analysis platforms. "
    ABSTRACT: Data sharing efforts increasingly contribute to the acceleration of scientific discovery. Neuroimaging data are accumulating in distributed, domain-specific databases, and there is currently neither an integrated access mechanism nor an accepted format for the critically important meta-data needed to make use of the combined, available neuroimaging data. In this manuscript, we present work from the Derived Data Working Group, an open-access group sponsored by the Biomedical Informatics Research Network (BIRN) and the International Neuroinformatics Coordinating Facility (INCF), focused on practical tools for distributed access to neuroimaging data. The working group develops models and tools facilitating the structured interchange of neuroimaging meta-data and is making progress towards a unified set of tools for such data and meta-data exchange. We report on the key components required for integrated access to raw and derived neuroimaging data, as well as associated meta-data and provenance, across neuroimaging resources. The components include (1) a structured terminology that provides semantic context to data, (2) a formal data model for neuroimaging with robust tracking of data provenance, (3) a web service-based application programming interface (API) that provides a consistent mechanism to access and query the data model, and (4) a provenance library that can be used for the extraction of provenance data by image analysts and imaging software developers. We believe that the framework and set of tools outlined in this manuscript have great potential for solving many of the issues the neuroimaging community faces when sharing raw and derived neuroimaging data across the various existing database systems for the purpose of accelerating scientific discovery.
    NeuroImage 05/2013; 82. DOI: 10.1016/j.neuroimage.2013.05.094 (Impact Factor: 6.36)
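
As a concrete illustration of the experiment hierarchy quoted above, the sketch below builds an XCEDE-like XML document in Python. The element nesting follows the quoted object list (Project, Subject, Visit, Study, Episode, Acquisition), but the attribute and provenance details are simplified assumptions, not the actual XCEDE schema.

```python
# Sketch of an XCEDE-style experiment hierarchy serialized as XML, per the
# Project > Subject > Visit > Study > Episode > Acquisition nesting quoted
# above. Tag and attribute names are simplified assumptions, not the real
# XCEDE schema.
import xml.etree.ElementTree as ET

project = ET.Element("Project", id="proj01", name="Example fMRI study")
subject = ET.SubElement(project, "Subject", id="subj001")
visit = ET.SubElement(subject, "Visit", id="visit1")
study = ET.SubElement(visit, "Study", id="study1")
episode = ET.SubElement(study, "Episode", id="ep1")
acq = ET.SubElement(episode, "Acquisition", id="acq1", modality="MR")

# Minimal provenance annotation, mirroring the working group's emphasis on
# tracking how derived data were produced (names again illustrative).
prov = ET.SubElement(acq, "Provenance")
ET.SubElement(prov, "Step", tool="motion-correction", version="1.0")

ET.indent(project)                      # pretty-print (Python 3.9+)
print(ET.tostring(project, encoding="unicode"))
```

Because every database described this way shares one syntax, tools can merge or query such documents without per-site parsing code, which is the integration benefit the excerpt attributes to a common data model.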
    • "One of the daunting tasks in bioinformatics is managing the vast amount of genomic data generated from large scale experiments (Barrett et al., 2007; Kann, 2009). The challenges enormous biological data presents are in different levels of variations and complexities (Louie et al., 2007); this constitutes some of the challenges of our generation. The problems range from difficulty associated with understanding the human genome, the detailed functions of gene encoding proteins, and sourcing for useful information for drug design among others. "
    ABSTRACT: Bioinformatics has entered its post-genomic era, in which research has advanced from data collection to data analysis using advanced computational and analytical tools. Given the current high demand for bioinformatics data, the various shortcomings of the computing infrastructure for handling and processing such biological data constitute a great challenge. In this paper, an effort is made to develop and describe a prototype of a new-generation computing framework, the hybrid-grid-based computing framework for bioinformatics (HGCFB), with the aim of maintaining, sharing, discovering, and expanding bioinformatics knowledge in geographically distributed environments. The paper proposes the system architecture of the HGCFB prototype, describes its corresponding functionalities, and implements some aspects of the framework in an event-driven programming language. This framework will be very useful in facilitating the effective and efficient use and management of bioinformatics databases and resources.
    Scientific Research and Essays 03/2012; 7:730-739. (Impact Factor: 0.45)
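
The abstract gives no implementation detail, but the core hybrid-grid idea of routing work to different resource pools can be sketched briefly. Everything below (the toy task, the two pools, the size-based dispatch rule) is a hypothetical stand-in and does not reflect the actual HGCFB design; thread pools substitute for real grid middleware.

```python
# Hypothetical sketch of hybrid-grid-style dispatch: small jobs run on a
# "local" pool, large jobs go to a "remote" pool. Thread pools stand in
# for real grid middleware; nothing here reflects the HGCFB prototype.
from concurrent.futures import ThreadPoolExecutor, as_completed

def gc_content(seq: str) -> float:
    """A toy bioinformatics task: fraction of G/C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

local_pool = ThreadPoolExecutor(max_workers=2)    # stand-in: local cluster
remote_pool = ThreadPoolExecutor(max_workers=8)   # stand-in: remote grid

def dispatch(seq: str):
    """Route short sequences locally; ship long ones to the 'grid'."""
    pool = local_pool if len(seq) < 1000 else remote_pool
    return pool.submit(gc_content, seq)

if __name__ == "__main__":
    jobs = [dispatch(s) for s in ["ACGTGGCC", "AT" * 2000, "GGGCCCAT"]]
    for fut in as_completed(jobs):
        print(f"GC content: {fut.result():.2f}")
    local_pool.shutdown()
    remote_pool.shutdown()
```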
    • "Data integration is a constant challenge in translational science1, 2. In the past decade, several data integration regimes, including federated database strategies3, workflow approaches4, semantic web5–7, and warehousing methods 8–11, have been tested in the biomedical informatics community. The strengths and limitations of these approaches have been carefully reviewed 12–14, and a data warehousing approach is considered most suitable because of its desired data integrity and its standalone architecture that is less affected by inadequate infrastructure environments. To date, the approach has been widely adopted in the translational informatics community: in a recent clinical translational science award (CTSA) annual meeting, 23 out of 67 abstracts were related to warehousing strategies 15. "
    ABSTRACT: While data warehousing approaches have been increasingly adopted in the biomedical informatics community for individualized data integration, effectively dealing with data integration, access, and application remains a challenging issue. In this report, focusing on ontology data, we describe how to use an established data warehouse system, named TRAM, to provide a data mart layer that addresses this issue. Our effort has resulted in a twofold achievement: (1) a model data mart tailored to facilitate oncology data integration and application (ONCOD), and (2) a flexible system architecture that has the potential to be customized to support other data marts for various major medical fields.
    03/2012; 2012:105.
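
The data mart layer described above can be pictured as a narrow, domain-specific slice exposed over a broad warehouse table. Below is a minimal SQLite sketch of that layering; the schema, codes, and view are invented for illustration and do not reflect the TRAM or ONCOD schemas.

```python
# Sketch of the warehouse -> data mart layering described above: a broad
# warehouse table plus a narrow, domain-specific "mart" view over it.
# Schema and names are illustrative; they do not reflect TRAM or ONCOD.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE observation (          -- warehouse: everything, uniformly
        patient_id TEXT, domain TEXT, code TEXT, value TEXT
    );
    INSERT INTO observation VALUES
        ('p1', 'oncology', 'ICD-O:C50',   'breast, malignant'),
        ('p1', 'lab',      'LOINC:718-7', '13.2'),
        ('p2', 'oncology', 'ICD-O:C34',   'lung, malignant');

    -- Data mart: a curated oncology-only slice exposed as a view.
    CREATE VIEW oncology_mart AS
        SELECT patient_id, code, value
        FROM observation WHERE domain = 'oncology';
""")

for row in con.execute("SELECT * FROM oncology_mart"):
    print(row)
```

Keeping the mart as a view over the warehouse means it inherits the warehouse's data integrity while presenting only the oncology-relevant records to applications.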