Article

Ontology-based meta-analysis of global collections of high-throughput public data.

NextBio, Cupertino, California, United States of America.
PLoS ONE (impact factor: 4.09). 01/2010; 5(9). DOI:10.1371/journal.pone.0013066
Source: PubMed

ABSTRACT The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the increasingly huge quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today.
We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets.
Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and provide evidence to support existing hypotheses.
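
The abstract names rank-based enrichment statistics and meta-analysis as the ingredients for scoring connectivity across datasets, without specifying the formulas. As a rough illustration only, the sketch below scores a gene set against a ranked gene list with a Wilcoxon rank-sum z-score (normal approximation, no tie handling) and then pools the per-dataset scores with Stouffer's method. Both choices, and the function names `rank_sum_z`, `stouffer` and `one_sided_p`, are assumptions made for this example and are not taken from the paper.

```python
import math


def rank_sum_z(ranked_genes, gene_set):
    """Wilcoxon rank-sum z-score (normal approximation, ties ignored).

    ranked_genes: gene identifiers ordered from most to least significant.
    gene_set: the query genes of interest.
    A positive z means the query genes sit nearer the top than chance predicts.
    """
    n = len(ranked_genes)
    # 1-based ranks of the query genes that occur in this dataset
    ranks = [i + 1 for i, gene in enumerate(ranked_genes) if gene in gene_set]
    m = len(ranks)
    if m == 0 or m == n:
        return 0.0  # no contrast between query and background genes
    expected = m * (n + 1) / 2            # mean rank sum under the null
    variance = m * (n - m) * (n + 1) / 12  # variance of the rank sum under the null
    return (expected - sum(ranks)) / math.sqrt(variance)


def stouffer(z_scores):
    """Pool independent z-scores from several datasets (unweighted Stouffer)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def one_sided_p(z):
    """Upper-tail p-value of a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Two hypothetical datasets, each reduced to a ranked gene list.
    dataset_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
    dataset_b = ["g2", "g1", "g5", "g3", "g4", "g6", "g7", "g8", "g9", "g10"]
    query = {"g1", "g2", "g3"}

    z_scores = [rank_sum_z(d, query) for d in (dataset_a, dataset_b)]
    pooled = stouffer(z_scores)
    print(z_scores, pooled, one_sided_p(pooled))
```

Working on ranks rather than raw expression values is what lets heterogeneous platforms be compared on a common scale. Requiring a strong pooled score across several datasets is one simple way to ask for multiple lines of evidence.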

  • Source
    Article: A comparison of microarray databases.
    ABSTRACT: Microarray technology has become one of the most important functional genomics technologies. A proliferation of microarray databases has resulted. It can be difficult for researchers exploring this technology to know which bioinformatics systems best meet their requirements. In order to obtain a better understanding of the available systems, a survey and comparative analysis of microarray databases was undertaken. The survey included databases that are currently available, as well as databases that should become available in early 2001. Databases fall into three categories: (i) those that can be installed locally, (ii) those available for public data submission and (iii) those available for public query. Developers of microarray gene-expression databases were asked questions regarding the scope and availability of their database, its system requirements, its future compliance with MGED (Microarray Gene Expression Database) standards, and its associated analytical tools. Participants included AMAD (Stanford/Berkeley/UCSF), ArrayExpress (EBI), ChipDB (MIT/Whitehead), GeneX (NCGR), GeNet (Silicon Genetics), GeneDirector (BioDiscovery), GEO (NCBI), GXD (Jackson Laboratory), mAdb (NCI), maxdSQL (University of Manchester), NOMAD (UCSF), RAD (University of Pennsylvania) and SMD (Stanford University). Other database developers were contacted but data was not available at the time of manuscript preparation. Each database fulfils a different role, reflecting the widely varying needs of microarray users.
    Briefings in Bioinformatics 06/2001; 2(2):143-58. · 5.20 Impact Factor
  • Source
    Article: Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer.
    ABSTRACT: With the proliferation of related microarray studies by independent groups, a natural step in the analysis of these gene expression data is to combine the results across these studies. However, this raises a variety of issues in the analysis of such data. In this article, we discuss the statistical issues of combining data from multiple gene expression studies. This leads to more complications than those in standard meta-analyses, including different experimental platforms, duplicate spots and complex data structures. We illustrate these ideas using data from four prostate cancer profiling studies. In addition, we develop a simple approach for assessing differential expression using the LASSO method. A combination of the results and the pathway databases are then used to generate candidate biological pathways for cancer.
    Functional and Integrative Genomics 01/2004; 3(4):180-8. · 2.84 Impact Factor
  • Source
    Article: Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes.
    ABSTRACT: Due to the high cost and low reproducibility of many microarray experiments, it is not surprising to find a limited number of patient samples in each study, and very few common identified marker genes among different studies involving patients with the same disease. Therefore, it is of great interest and challenge to merge data sets from multiple studies to increase the sample size, which may in turn increase the power of statistical inferences. In this study, we combined two lung cancer studies using microarray GeneChip, employed two gene shaving methods and a two-step survival test to identify genes with expression patterns that can distinguish diseased from normal samples, and to indicate patient survival, respectively. In addition to common data transformation and normalization procedures, we applied a distribution transformation method to integrate the two data sets. Gene shaving (GS) methods based on Random Forests (RF) and Fisher's Linear Discrimination (FLD) were then applied separately to the joint data set for cancer gene selection. The two methods discovered 13 and 10 marker genes (5 in common), respectively, with expression patterns differentiating diseased from normal samples. Among these marker genes, 8 and 7 were found to be cancer-related in other published reports. Furthermore, based on these marker genes, the classifiers we built from one data set predicted the other data set with more than 98% accuracy. Using the univariate Cox proportional hazard regression model, the expression patterns of 36 genes were found to be significantly correlated with patient survival (p < 0.05). Twenty-six of these 36 genes were reported as survival-related genes from the literature, including 7 known tumor-suppressor genes and 9 oncogenes. Additional principal component regression analysis further reduced the gene list from 36 to 16. 
This study provided a valuable method of integrating microarray data sets with different origins, and new methods of selecting a minimum number of marker genes to aid in cancer diagnosis. After careful data integration, the classification method developed from one data set can be applied to the other with high prediction accuracy.
    BMC Bioinformatics 06/2004; 5:81. · 2.75 Impact Factor


Keywords

biological connections
data mining approach
dataset replication
disease biology
enables researchers
genetic events
genomic data
govern biological systems
heterogeneous datasets
heterogeneous genomic datasets
huge quantities
key challenges
large quantities
large-scale genomic data
multiple lines
next-generation sequencing technologies
novel data mining framework
public high-throughput data
rank-based enrichment statistics
work presents