Content-based microarray search using differential expression profiles.

Department of Bioengineering, Stanford University School of Medicine, CA, USA.
BMC Bioinformatics (Impact Factor: 3.02). 01/2010; 11:603. DOI: 10.1186/1471-2105-11-603
Source: DBLP

ABSTRACT With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations.
We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3.
Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
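
The abstract names a p-weighted Pearson correlation for comparing differential expression profiles but does not give its formula. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each experiment supplies a per-gene score and a per-gene p-value over the same ordered gene set, and that each gene's weight is the product `(1 - p_query) * (1 - p_target)`. The function names `weighted_pearson`, `p_weights`, and `rank_experiments` are hypothetical, and the matrix decomposition step mentioned in the abstract is omitted.

```python
import numpy as np


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-gene weights w."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    if not (x.shape == y.shape == w.shape):
        raise ValueError("x, y and w must have the same shape")
    total = w.sum()
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    w = w / total                      # normalize weights to sum to 1
    mx = np.sum(w * x)                 # weighted means
    my = np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    vx = np.sum(w * (x - mx) ** 2)
    vy = np.sum(w * (y - my) ** 2)
    if vx == 0 or vy == 0:
        return 0.0                     # a constant profile carries no signal
    return float(cov / np.sqrt(vx * vy))


def p_weights(p_query, p_target):
    """ASSUMPTION: weight each gene by (1 - p) in both experiments."""
    return (1.0 - np.asarray(p_query, dtype=float)) * (
        1.0 - np.asarray(p_target, dtype=float)
    )


def rank_experiments(query_scores, query_p, library):
    """Rank library experiments by weighted correlation with the query.

    library maps an experiment name to a (scores, p_values) pair whose
    genes are in the same order as the query profile.
    """
    results = []
    for name, (scores, pvals) in library.items():
        w = p_weights(query_p, pvals)
        results.append((name, weighted_pearson(query_scores, scores, w)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

With this weighting, genes that are confidently differentially expressed in both experiments dominate the correlation, and no hard significance threshold is needed, which matches the threshold-free intent described above.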

  • Source
    ABSTRACT: Meta-analysis of gene expression array databases has the potential to reveal information about gene function. Gene-gene interactions may be inferred from gene expression information, but such meta-analysis is often limited to a single microarray platform. To address this limitation, we developed a gene-centered approach to analyze differential expression across thousands of gene expression experiments and created the CO-Regulation Database (CORD) to determine which genes are correlated with a queried gene. Using the GEO and ArrayExpress databases, we analyzed over 120,000 group-by-group experiments from gene microarrays to determine the correlating genes for over 30,000 different genes or hypothesized genes. CORD output data is presented for sample queries with a focus on genes with well-known interaction networks, including p16 (CDKN2A), vimentin (VIM), and MyoD (MYOD1). CDKN2A, VIM, and MYOD1 all displayed gene correlations consistent with known interacting genes. We developed a facile, web-enabled program to determine gene-gene correlations across different gene expression microarray platforms. Using well-characterized genes, we illustrate how CORD's identification of co-expressed genes contributes to a better understanding of a gene's potential function. The website is found at
    PLoS ONE 01/2014; 9(3):e90408. · 3.53 Impact Factor
  • Source
    ABSTRACT: Recently, comparison of drug responses on gene expression has been a major approach to identifying the functional similarity of drugs. Previous studies have mostly focused on a single feature, the expression differences of individual genes. We provide a more robust and accurate method to compare the functional similarity of drugs by diversifying the features of comparison in gene expression and considering the sample dependent variations. For differentially expressed gene measurement, we modified the conventional t-test to normalize variations in diverse experimental conditions of individual samples. To extract significant differentially co-expressed gene modules, we searched maximal cliques among the co-expressed gene network. Finally, we calculated a combined similarity score by averaging the two scaled scores from the above two measurements. This method shows significant performance improvement in comparison to other approaches in the test with Connectivity Map data. In the test to find the drugs based on their own expression profiles with leave-one-out cross validation, the proposed method showed an area under the curve (AUC) score of 0.99, which is much higher than scores obtained with previous methods, ranging from 0.71 to 0.93. In the drug networks, we could find well clustered drugs having the same target proteins and novel relations among drugs implying the possibility of drug repurposing. Inclusion of the features of a co-expressed module provides more implications to infer drug action. We propose that this method be used to find collaborative cellular mechanisms associated with drug action and to simply identify drugs having similar responses.
    Healthcare informatics research. 01/2014; 20(1):52-60.
  • Source
    ABSTRACT: Microarrays have been useful in understanding various biological processes by allowing the simultaneous study of the expression of thousands of genes. However, the analysis of microarray data is a challenging task. One of the key problems in microarray analysis is the classification of unknown expression profiles. Specifically, the often large number of non-informative genes on the microarray adversely affects the performance and efficiency of classification algorithms. Furthermore, the skewed ratio of samples to variables poses a risk of overfitting. Thus, in this context, feature selection methods become crucial to select relevant genes and, hence, improve classification accuracy. In this study, we investigated feature selection methods based on gene expression profiles and protein interactions. We found that in our setup, the addition of protein interaction information did not contribute to any significant improvement of the classification results. Furthermore, we developed a novel feature selection method that relies exclusively on observed gene expression changes in microarray experiments, which we call "relative Signal-to-Noise ratio" (rSNR). More precisely, the rSNR ranks genes based on their specificity to an experimental condition by comparing intrinsic variation, i.e. variation in gene expression within an experimental condition, with extrinsic variation, i.e. variation in gene expression across experimental conditions. Genes with low variation within an experimental condition of interest and high variation across experimental conditions are ranked higher and help improve classification accuracy. We compared different feature selection methods on two time-series microarray datasets and one static microarray dataset. We found that the rSNR generally performed better than the other methods.
    PLoS ONE 01/2013; 8(10):e76561. · 3.53 Impact Factor
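
The last abstract above describes rSNR only in words: genes rank higher when they vary little within a condition of interest and vary a lot across conditions. The sketch below illustrates that idea under stated assumptions and is not the published rSNR formula. It assumes extrinsic variation is the standard deviation of per-condition mean expression and intrinsic variation is the standard deviation within the chosen condition. The names `relative_snr` and `rank_genes` and the `eps` stabilizer are hypothetical.

```python
import numpy as np


def relative_snr(expr, labels, condition, eps=1e-9):
    """Score each gene as extrinsic variation over intrinsic variation.

    expr      : array of shape (genes, samples)
    labels    : condition label for each sample
    condition : the condition of interest
    ASSUMPTION: extrinsic = std of per-condition means,
                intrinsic = std within `condition`.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    conditions = np.unique(labels)
    # Mean expression of every gene in every condition: (genes, conditions)
    means = np.stack(
        [expr[:, labels == c].mean(axis=1) for c in conditions], axis=1
    )
    extrinsic = means.std(axis=1)
    intrinsic = expr[:, labels == condition].std(axis=1)
    return extrinsic / (intrinsic + eps)   # eps avoids division by zero


def rank_genes(expr, labels, condition):
    """Return gene indices sorted from most to least condition-specific."""
    scores = relative_snr(expr, labels, condition)
    return np.argsort(-scores)
```

A feature selection step would then keep the top-ranked gene indices before training a classifier.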
