Automated multidimensional phenotypic profiling using large public microarray repositories

Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
Proceedings of the National Academy of Sciences (Impact Factor: 9.67). 07/2009; 106(30):12323-8. DOI: 10.1073/pnas.0900883106
Source: PubMed


Phenotypes are complex and difficult to quantify in a high-throughput fashion. The lack of comprehensive phenotype data can prevent or distort genotype-phenotype mapping. Here, we describe "PhenoProfiler," a computational method that enables in silico phenotype profiling. Drawing on the principle that similar gene expression patterns are likely to be associated with similar phenotype patterns, PhenoProfiler supplements the missing quantitative phenotype information for a given microarray dataset based on other well-characterized microarray datasets. We applied our method to 587 human microarray datasets covering >14,000 samples, and confirmed that the predicted phenotype profiles are highly consistent with true phenotype descriptions. PhenoProfiler offers several unique capabilities: (i) automated, multidimensional phenotype profiling, facilitating the analysis and treatment design of complex diseases; (ii) the extrapolation of phenotype profiles beyond provided classes; and (iii) the detection of confounding phenotype factors that could otherwise bias biological inferences. Finally, because no direct comparisons are made between gene expression values from different datasets, the method can use the entire body of cross-platform microarray data. This work has produced a compendium of phenotype profiles for the National Center for Biotechnology Information (NCBI) GEO datasets, which can facilitate an unbiased understanding of the transcriptome-phenome mapping. The continued accumulation of microarray data will further increase the power of PhenoProfiler, by increasing the variety and the quality of phenotypes to be profiled.
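The principle stated above, that similar expression patterns tend to accompany similar phenotype patterns while no expression values are compared directly across datasets, can be illustrated with a minimal sketch. The function names (phenotype_signature, transfer_phenotype) and the scoring scheme below are hypothetical illustrations of that principle, not the published PhenoProfiler algorithm: a rank-based gene signature is learned within a phenotype-annotated dataset, and samples in a second dataset are scored against it after per-gene standardization within their own dataset, so only within-dataset quantities are ever compared.

    import numpy as np
    from scipy import stats

    def phenotype_signature(expr, phenotype):
        """Per-gene Spearman correlation with a quantitative phenotype.

        expr: (genes x samples) matrix from a phenotype-annotated dataset.
        phenotype: per-sample phenotype values for that dataset.
        Rank-based, so only within-dataset orderings are used.
        """
        pheno_ranks = stats.rankdata(phenotype)
        gene_ranks = np.apply_along_axis(stats.rankdata, 1, np.asarray(expr, float))
        return np.array([stats.pearsonr(g, pheno_ranks)[0] for g in gene_ranks])

    def transfer_phenotype(query_expr, signature):
        """Relative phenotype profile for samples in an unannotated dataset.

        query_expr: (genes x samples) matrix row-matched to the signature's
        genes, possibly from a different platform. Expression is z-scored per
        gene within the query dataset, so raw values from different datasets
        are never compared directly.
        """
        z = stats.zscore(np.asarray(query_expr, float), axis=1)
        return z.T @ signature / len(signature)

    # Toy usage with simulated data (500 shared genes, two datasets):
    rng = np.random.default_rng(0)
    sig = phenotype_signature(rng.normal(size=(500, 40)), rng.normal(size=40))
    predicted_profile = transfer_phenotype(rng.normal(size=(500, 30)), sig)

The output is a relative ordering of the query samples along the profiled phenotype dimension rather than an absolute phenotype value, consistent with the abstract's emphasis on avoiding cross-dataset expression comparisons.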


    • "By establishing a threshold (e.g. Z.K = −2), standardized connectivity distributions can be used in a quantitative and unbiased fashion to identify and remove outlying samples, which may reflect hidden factors that can influence the results of genomic experiments [24] (this approach is particularly useful when the number of samples is large, making it difficult to distinguish outlying samples in a dendrogram). Analogously, one can also make use of other network concepts as described below. "
    ABSTRACT: Genomic datasets generated by new technologies are increasingly prevalent in disparate areas of biological research. While many studies have sought to characterize relationships among genomic features, commensurate efforts to characterize relationships among biological samples have been less common. Consequently, the full extent of sample variation in genomic studies is often under-appreciated, complicating downstream analytical tasks such as gene co-expression network analysis. Here we demonstrate the use of network methods for characterizing sample relationships in microarray data generated from human brain tissue. We describe an approach for identifying outlying samples that does not depend on the choice or use of clustering algorithms. We introduce a battery of measures for quantifying the consistency and integrity of sample relationships, which can be compared across disparate studies, technology platforms, and biological systems. Among these measures, we provide evidence that the correlation between the connectivity and the clustering coefficient (two important network concepts) is a sensitive indicator of homogeneity among biological samples. We also show that this measure, which we refer to as cor(K,C), can distinguish biologically meaningful relationships among subgroups of samples. Specifically, we find that cor(K,C) reveals the profound effect of Huntington's disease on samples from the caudate nucleus relative to other brain regions. Furthermore, we find that this effect is concentrated in specific modules of genes that are naturally co-expressed in human caudate nucleus, highlighting a new strategy for exploring the effects of disease on sets of genes. These results underscore the importance of systematically exploring sample relationships in large genomic datasets before seeking to analyze genomic feature activity. We introduce a standardized platform for this purpose using freely available R software that has been designed to enable iterative and interactive exploration of sample networks.
    BMC Systems Biology 06/2012; 6(1):63. DOI:10.1186/1752-0509-6-63 · 2.44 Impact Factor
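    The excerpt above flags outlying samples by thresholding standardized connectivity (e.g. Z.K = -2), and the abstract proposes cor(K,C) as an indicator of sample homogeneity. The sketch below is a minimal implementation of those quantities under one common set of assumptions, namely a weighted sample network with adjacency a_ij = ((1 + cor_ij)/2)^beta and the weighted clustering coefficient used in WGCNA-style analyses; the cited R software may use different adjacency choices and defaults.

        import numpy as np

        def sample_network_stats(expr, beta=2):
            """Connectivity-based sample QC on a (genes x samples) expression matrix.

            Assumes adjacency a_ij = ((1 + cor_ij) / 2) ** beta with the diagonal
            zeroed; other adjacency functions are possible. Returns connectivity K,
            the weighted clustering coefficient C, standardized connectivity Z_K,
            and cor(K, C).
            """
            cor = np.corrcoef(np.asarray(expr, float).T)   # sample-sample Pearson correlations
            A = ((1 + cor) / 2) ** beta
            np.fill_diagonal(A, 0.0)

            K = A.sum(axis=1)                              # connectivity of each sample
            closed = np.einsum('ij,jk,ki->i', A, A, A)     # weighted closed triples around each sample
            C = closed / (K ** 2 - (A ** 2).sum(axis=1))   # weighted clustering coefficient
            Z_K = (K - K.mean()) / K.std(ddof=1)           # standardized connectivity
            return K, C, Z_K, np.corrcoef(K, C)[0, 1]

        # Flag putative outliers with the excerpt's threshold, e.g. Z_K < -2:
        # K, C, Z_K, kc = sample_network_stats(expr)
        # outliers = np.where(Z_K < -2)[0]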
    • "Insights into the genetic architecture of common diseases gained from several successful studies indicate that, in high-dimensional phenotypic data sets, no single summary measure can account for the majority of phenotypic variation; certain combinations of traits will prove to be more informative than individual measures, or even the complete set of measures, alone (Bloss et al., 2010; Houle, 2010). By the conclusion of these studies highly specific correlations between genetic and phenotypic variation may be obvious (Oti et al., 2009; Xu et al., 2009), but for particularly complex traits such as neurocognitive phenotypes it is rarely evident at the outset which particular combinations of phenotypic measures should be considered together (Houle, 2010). Phenomics, the systematic standardization of measures hypothesized to represent the complete phenotypic space for a given biological system, and their assessment in all members of a study population, has been proposed as a framework for organizing genome-level phenotype-genotype association studies of complex traits (Bilder et al., 2009a) (Figure 1). "
    ABSTRACT: Elucidating the molecular mechanisms underlying quantitative neurocognitive phenotypes will further our understanding of the brain's structural and functional architecture and advance the diagnosis and treatment of the psychiatric disorders that these traits underlie. Although many neurocognitive traits are highly heritable, little progress has been made in identifying genetic variants unequivocally associated with these phenotypes. A major obstacle to such progress is the difficulty in identifying heritable neurocognitive measures that are precisely defined and systematically assessed and represent unambiguous mental constructs, yet are also amenable to the high-throughput phenotyping necessary to obtain adequate power for genetic association studies. In this perspective we compare the current status of genetic investigations of neurocognitive phenotypes to that of other categories of biomedically relevant traits and suggest strategies for genetically dissecting traits that may underlie disorders of brain and behavior.
    Neuron 10/2010; 68(2):218-30. DOI:10.1016/j.neuron.2010.10.007 · 15.05 Impact Factor
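    The excerpt above argues that particular combinations of traits can be more informative than individual measures or the complete set of measures. As one hedged illustration, not the specific analyses of the cited studies, the sketch below derives composite phenotype measures from a standardized trait battery with principal component analysis, a common way to form such combinations.

        import numpy as np

        def composite_traits(traits, n_components=2):
            """Composite phenotype measures from a (subjects x measures) trait battery.

            Each measure is z-scored so scale differences do not dominate; the
            leading principal components are linear combinations of traits that
            capture the most phenotypic variation, one simple way to build
            composite measures for downstream association tests.
            """
            X = np.asarray(traits, float)
            z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
            u, s, vt = np.linalg.svd(z, full_matrices=False)
            scores = u[:, :n_components] * s[:n_components]    # subject-level composite scores
            loadings = vt[:n_components].T                     # each trait's weight in each composite
            explained = (s ** 2) / (s ** 2).sum()
            return scores, loadings, explained[:n_components]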
    ABSTRACT: Calculation of voltages in a mine electrical distribution system using a power-flow program may fail to yield meaningful results if the iterative power-flow procedure fails to converge. A modified power-flow technique that converges for most of the mine electrical power-flow input data, even though the same input data do not yield a convergent solution with a traditional Newton power-flow algorithm, is presented. When no operable solution to the power-flow problem exists, the results tend to indicate the location of the modeling error that is responsible for nonconvergence. An extensive case study using data for a mine electrical power system is presented to demonstrate the robustness and error identification properties of the power-flow algorithm.
    IEEE Industry Applications Society Annual Meeting, 1989; DOI:10.1109/IAS.1989.96842
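    The abstract above reports a modified power-flow procedure that converges where a traditional Newton algorithm fails, without spelling out the modification. As a generic illustration of one standard way to improve Newton convergence, and not the paper's specific technique, the sketch below adds backtracking damping to a Newton iteration for a nonlinear mismatch system f(x) = 0.

        import numpy as np

        def damped_newton(f, jac, x0, tol=1e-8, max_iter=100):
            """Newton iteration for f(x) = 0 with backtracking step control.

            Shrinking the step until the residual norm decreases is one common
            way to keep Newton-type solvers (such as those for power-flow
            mismatch equations) from diverging on difficult cases; it is not
            the specific modification described in the cited paper.
            """
            x = np.asarray(x0, dtype=float)
            for _ in range(max_iter):
                fx = f(x)
                if np.linalg.norm(fx) < tol:
                    break
                step = np.linalg.solve(jac(x), -fx)
                alpha = 1.0
                while alpha > 1e-4 and np.linalg.norm(f(x + alpha * step)) >= np.linalg.norm(fx):
                    alpha *= 0.5                            # backtrack until the residual shrinks
                x = x + alpha * step
            return x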