Automated multidimensional phenotypic profiling using large public microarray repositories

Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
Proceedings of the National Academy of Sciences (Impact Factor: 9.67). 07/2009; 106(30):12323-8. DOI: 10.1073/pnas.0900883106
Source: PubMed


Phenotypes are complex and difficult to quantify in a high-throughput fashion. The lack of comprehensive phenotype data can prevent or distort genotype-phenotype mapping. Here, we describe "PhenoProfiler," a computational method that enables in silico phenotype profiling. Drawing on the principle that similar gene expression patterns are likely to be associated with similar phenotype patterns, PhenoProfiler supplements the missing quantitative phenotype information for a given microarray dataset based on other well-characterized microarray datasets. We applied our method to 587 human microarray datasets covering >14,000 samples, and confirmed that the predicted phenotype profiles are highly consistent with true phenotype descriptions. PhenoProfiler offers several unique capabilities: (i) automated, multidimensional phenotype profiling, facilitating the analysis and treatment design of complex diseases; (ii) the extrapolation of phenotype profiles beyond provided classes; and (iii) the detection of confounding phenotype factors that could otherwise bias biological inferences. Finally, because no direct comparisons are made between gene expression values from different datasets, the method can use the entire body of cross-platform microarray data. This work has produced a compendium of phenotype profiles for the National Center for Biotechnology Information GEO datasets, which can facilitate an unbiased understanding of the transcriptome-phenome mapping. The continued accumulation of microarray data will further increase the power of PhenoProfiler by increasing the variety and the quality of phenotypes to be profiled.



Available from: Wenyuan Li
    • "NMF has been used in several biological applications (Brunet et al. 2004; Kim and Park 2007; Xu et al. 2009; Pu et al. 2011) because its nonnegativity constraint (see Methods) provides an intuitive and biologically interpretable decomposition of a multivariate data set and a natural way to cluster biological data (Brunet et al. 2004). This is unlike principal components analysis, where eigenvectors with negative sign loadings can be hard to interpret in the context of positively valued variables such as ChIP-seq read counts. "
    ABSTRACT: Genome-wide binding assays can determine where individual transcription factors bind in the genome. However, these factors rarely bind chromatin alone, but instead frequently bind to cis-regulatory elements (CREs) together with other factors as protein complexes. Currently there are no integrative analytical approaches that can predict which complexes are formed on chromatin. Here, we describe a computational methodology to systematically capture protein complexes and infer their impact on gene expression. We applied our method to three human cell types, identified thousands of CREs, identified known or undescribed complexes recruited to these CREs, and determined the role of the complexes as activators or repressors. Importantly, we found that the predicted complexes have a higher number of physical interactions between their members than expected by chance. Our work provides a mechanism for developing hypotheses about gene regulation via binding partners, and deciphering the interplay between combinatorial binding and gene expression.
    Preview · Article · Apr 2013 · Genome Research
    • "By establishing a threshold (e.g. Z.K = −2), standardized connectivity distributions can be used in a quantitative and unbiased fashion to identify and remove outlying samples, which may reflect hidden factors that can influence the results of genomic experiments [24] (this approach is particularly useful when the number of samples is large, making it difficult to distinguish outlying samples in a dendrogram). Analogously, one can also make use of other network concepts as described below. "
    ABSTRACT: Genomic datasets generated by new technologies are increasingly prevalent in disparate areas of biological research. While many studies have sought to characterize relationships among genomic features, commensurate efforts to characterize relationships among biological samples have been less common. Consequently, the full extent of sample variation in genomic studies is often under-appreciated, complicating downstream analytical tasks such as gene co-expression network analysis. Here we demonstrate the use of network methods for characterizing sample relationships in microarray data generated from human brain tissue. We describe an approach for identifying outlying samples that does not depend on the choice or use of clustering algorithms. We introduce a battery of measures for quantifying the consistency and integrity of sample relationships, which can be compared across disparate studies, technology platforms, and biological systems. Among these measures, we provide evidence that the correlation between the connectivity and the clustering coefficient (two important network concepts) is a sensitive indicator of homogeneity among biological samples. We also show that this measure, which we refer to as cor(K,C), can distinguish biologically meaningful relationships among subgroups of samples. Specifically, we find that cor(K,C) reveals the profound effect of Huntington's disease on samples from the caudate nucleus relative to other brain regions. Furthermore, we find that this effect is concentrated in specific modules of genes that are naturally co-expressed in human caudate nucleus, highlighting a new strategy for exploring the effects of disease on sets of genes. These results underscore the importance of systematically exploring sample relationships in large genomic datasets before seeking to analyze genomic feature activity. We introduce a standardized platform for this purpose using freely available R software that has been designed to enable iterative and interactive exploration of sample networks.
    Full-text · Article · Jun 2012 · BMC Systems Biology
    • "Insights into the genetic architecture of common diseases gained from several successful studies indicate that, in high-dimensional phenotypic data sets, no single summary measure can account for the majority of phenotypic variation; certain combinations of traits will prove to be more informative than individual measures, or even the complete set of measures, alone (Bloss et al., 2010; Houle, 2010). By the conclusion of these studies highly specific correlations between genetic and phenotypic variation may be obvious (Oti et al., 2009; Xu et al., 2009), but for particularly complex traits such as neurocognitive phenotypes it is rarely evident at the outset which particular combinations of phenotypic measures should be considered together (Houle, 2010). Phenomics, the systematic standardization of measures hypothesized to represent the complete phenotypic space for a given biological system, and their assessment in all members of a study population, has been proposed as a framework for organizing genome-level phenotype-genotype association studies of complex traits (Bilder et al., 2009a) (Figure 1). "
    ABSTRACT: Elucidating the molecular mechanisms underlying quantitative neurocognitive phenotypes will further our understanding of the brain's structural and functional architecture and advance the diagnosis and treatment of the psychiatric disorders that these traits underlie. Although many neurocognitive traits are highly heritable, little progress has been made in identifying genetic variants unequivocally associated with these phenotypes. A major obstacle to such progress is the difficulty in identifying heritable neurocognitive measures that are precisely defined and systematically assessed and represent unambiguous mental constructs, yet are also amenable to the high-throughput phenotyping necessary to obtain adequate power for genetic association studies. In this perspective we compare the current status of genetic investigations of neurocognitive phenotypes to that of other categories of biomedically relevant traits and suggest strategies for genetically dissecting traits that may underlie disorders of brain and behavior.
    Preview · Article · Oct 2010 · Neuron