Content-based microarray search using differential expression profiles

Department of Bioengineering, Stanford University School of Medicine, CA, USA.
BMC Bioinformatics (Impact Factor: 2.58). 12/2010; 11(1):603. DOI: 10.1186/1471-2105-11-603
Source: PubMed


With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations.
We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3.
Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
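
The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each experiment has already been reduced to a vector of per-gene differential expression scores with matching per-gene p-values, it omits the matrix decomposition step entirely, and the weighting rule in `p_weights` is a hypothetical choice made for this example. All function names are illustrative.

```python
import numpy as np


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y under non-negative per-gene weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()                 # normalize weights to sum to 1
    dx = x - np.sum(w * x)          # center each profile on its weighted mean
    dy = y - np.sum(w * y)
    cov = np.sum(w * dx * dy)
    return cov / np.sqrt(np.sum(w * dx ** 2) * np.sum(w * dy ** 2))


def p_weights(p_query, p_target):
    # ASSUMPTION (not from the paper): a gene counts more when its p-value
    # is small in both experiments. Other weighting schemes are possible.
    p_query = np.asarray(p_query, dtype=float)
    p_target = np.asarray(p_target, dtype=float)
    return (1.0 - p_query) * (1.0 - p_target)


def rank_experiments(query_scores, query_p, db_scores, db_p):
    """Return database indices sorted from most to least similar, plus scores."""
    sims = np.array([
        weighted_pearson(query_scores, scores, p_weights(query_p, pvals))
        for scores, pvals in zip(db_scores, db_p)
    ])
    order = np.argsort(-sims)       # descending similarity
    return order, sims


# Toy data: 3 database experiments measured on the same 4 genes.
query = np.array([2.0, -1.0, 0.5, -3.0])
query_p = np.array([0.01, 0.20, 0.50, 0.001])
db = np.array([
    [-2.0, 1.0, -0.5, 3.0],
    [1.8, -0.9, 0.7, -2.5],
    [0.1, 0.2, -0.1, 0.3],
])
db_p = np.array([
    [0.01, 0.30, 0.60, 0.01],
    [0.02, 0.10, 0.40, 0.005],
    [0.70, 0.80, 0.90, 0.60],
])
order, sims = rank_experiments(query, query_p, db, db_p)
print(order, sims)
```

No thresholding is applied anywhere: every gene contributes a score, and the weights only change how much each gene influences the correlation. With uniform weights the function reduces to the ordinary Pearson correlation.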


Available from: Russ B Altman
  • Source
    • "A representative example is to compute differential expression profiles of case vs. control, use the correlation between activity profiles as the measure of relevance, and retrieve the experiments with the highest correlations (e.g. Engreitz et al., 2010). This requires auxiliary information about the experiments, namely case and control labels of experiment samples, and possibly additional a priori defined sets of important genes. "
    ABSTRACT: Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case vs. control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. We then show empirically that inference for the full probabilistic model can be approximated with good performance using the computationally fast k-means clustering algorithm. The suggested metric for retrieval using clusterings is the normalized information distance. The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method. Availability: The method can be implemented using only standard k-means and normalized information distance, available in many standard statistical software packages.
    Full-text · Article · May 2015 · Bioinformatics
  • Source
    • "Methods for cross-study integration of gene expression data have tended to focus on differential expression in well-matched control and experimental samples [12], because approaches based on correlation or absolute profiles [13] are dominated by laboratory and platform variability in cross-study analyses [14]. The ability to leverage public data to address platform-effects has been demonstrated most recently by the Gene Expression Barcode (GEB) and Gene Expression Commons, both of which define absolute gene expression scores based on a background distribution [15,16]. "
    ABSTRACT: New strategies to combat complex human disease require systems approaches to biology that integrate experiments from cell lines, primary tissues and model organisms. We have developed Pathprint, a functional approach that compares gene expression profiles in a set of pathways, networks and transcriptionally-regulated targets. It can be applied universally to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. We show that Pathprints combine mouse and human blood developmental lineage, and develop new prognostic indicators in Acute Myeloid Leukemia. The code and resources are available at
    Full-text · Article · Jul 2013 · Genome Medicine
  • Source
    • "Moreover, these approaches are fundamentally limited by the fact that the text-based description of a study and its results contains only a fraction of the information in the actual gene expression data. Data-driven or content-based approaches to information retrieval or meta-analysis (Caldas et al., 2009; Engreitz et al., 2011; Fujibuchi et al., 2007; Hu and Agarwal, 2009; Huang et al., 2010; Hunter et al., 2001; Kapushesky et al., 2009; Kupershmidt et al., 2010; Lamb et al., 2006; Segal et al., 2004) have a high potential for discovering novel and biologically meaningful relationships between the studied tissues, organisms and biological conditions, since similarities between studies are derived from shared expression patterns. "

    Full-text · Dataset · Nov 2012