Content-based microarray search using differential expression profiles

Department of Bioengineering, Stanford University School of Medicine, CA, USA.
BMC Bioinformatics (Impact Factor: 2.67). 12/2010; 11(1):603. DOI: 10.1186/1471-2105-11-603
Source: PubMed

ABSTRACT With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations.
We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3.
Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
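
The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the p-weighting, so the scheme below (each gene weighted by the summed -log10 p-values of the query and candidate profiles) is an assumption made for illustration, not the authors' formula. The function names `weighted_pearson`, `p_weights`, and `rank_experiments` are hypothetical, and the matrix decomposition step is omitted.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of each profile.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances (the common 1/total factor cancels).
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(vx * vy)


def p_weights(p_query, p_candidate, floor=1e-300):
    """ASSUMED weighting: sum of -log10 p-values, so genes that are
    significant in either profile count more. Not taken from the paper."""
    return [
        -math.log10(max(pq, floor)) - math.log10(max(pc, floor))
        for pq, pc in zip(p_query, p_candidate)
    ]


def rank_experiments(query_scores, query_p, library):
    """Rank library experiments by weighted correlation with the query.

    library maps an experiment name to (scores, p_values), with genes in
    the same order as the query profile.
    """
    ranked = []
    for name, (scores, p_values) in library.items():
        w = p_weights(query_p, p_values)
        ranked.append((name, weighted_pearson(query_scores, scores, w)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Because every measured gene contributes a score and a weight, no significance threshold is needed, which mirrors the threshold-free profiles the abstract describes.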

Available from: Russ B Altman, Aug 27, 2015
  • Source
    • "Moreover, these approaches are fundamentally limited by the fact that the text-based description of a study and its results contains only a fraction of the information in the actual gene expression data. Data-driven or content-based approaches to information retrieval or meta-analysis (Caldas et al., 2009; Engreitz et al., 2011; Fujibuchi et al., 2007; Hu and Agarwal, 2009; Huang et al., 2010; Hunter et al., 2001; Kapushesky et al., 2009; Kupershmidt et al., 2010; Lamb et al., 2006; Segal et al., 2004) have a high potential for discovering novel and biologically meaningful relationships between the studied tissues, organisms and biological conditions, since similarities between studies are derived from shared expression patterns."
    ABSTRACT: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continue to be deposited in public databases, it is becoming important to develop search engines that enable the retrieval of relevant studies given a query study. While retrieval systems based on meta-data already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential of uncovering novel biological insights. We propose an information retrieval method based on differential expression. Our method deals with arbitrary experimental designs and performs competitively with alternative approaches, while making the search results interpretable in terms of differential expression patterns. We show that our model yields meaningful connections between biological conditions from different studies. Finally, we validate a previously unknown connection between malignant pleural mesothelioma and SIM2s suggested by our method, via real-time polymerase chain reaction in an independent set of mesothelioma samples. Supplementary data and source code are available from http://www.ebi.ac.uk/fg/research/rex.
    Bioinformatics 11/2011; 28(2):246-53. DOI:10.1093/bioinformatics/btr634 · 4.62 Impact Factor
  • Source
    • "There is not a universally regarded best clustering algorithm and so the actual clustering function employed by R users depends on their preferences and data structure. Based on the survey replies, the authors' experience with their own data, and its practical use in the field ([8],[9],[10],[11]), the Partitioning Around Medoids (PAM) R clustering algorithm was chosen for inclusion in SPRINT. "
    ABSTRACT: R is a free statistical programming language commonly used for the analysis of high-throughput microarray and other data. It is currently unable to easily utilise multiprocessor architectures without substantial changes to existing R scripts. Further, working with large volumes of data often leads to slow processing and even memory allocation faults. A recent survey highlighted clustering algorithms as both computation- and data-intensive bottlenecks in post-genomic data analyses. These algorithms aim to sort numeric vectors (such as gene expression profiles) into groups by minimising vector distances within groups and maximising them between groups. This paper describes the optimisation and parallelisation of a popular clustering algorithm, partitioning around medoids (PAM), for the Simple Parallel R INTerface (SPRINT). SPRINT allows R users to exploit high performance computing systems without expert knowledge of such systems. This paper reports on a serial optimisation of the original code and a subsequent parallel implementation. The parallel implementation enables the processing of data sets that exceed the available physical memory and can yield, depending on the data set, over 100-fold increase in performance.
    High Performance Computing and Simulation (HPCS), 2011 International Conference on; 08/2011
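
The clustering idea summarised in this abstract (grouping vectors around representative data points so that within-group distances are small) can be sketched as a naive, serial k-medoids routine. This is an illustrative sketch only: it is not the SPRINT or R `pam` implementation, it replaces PAM's BUILD phase with a simple "first k points" initialisation (an assumption made for brevity), and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist=euclidean):
    """Naive k-medoids with a PAM-style swap phase.

    Returns (medoid indices, cluster label per point, total cost).
    ASSUMPTION: medoids start as the first k points instead of PAM's
    BUILD phase; the swap loop then accepts any cost-reducing swap.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    medoids = list(range(k))
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try replacing medoid i with a non-medoid candidate.
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    # Assign each point to the cluster of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels, best
```

Every swap evaluation recomputes the full cost, which is the kind of repeated distance work that makes PAM expensive on large expression data sets and a natural target for optimisation and parallelisation.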