A new type of stochastic dependence revealed in gene expression data

Department of Probability and Statistics, Charles University.
Statistical Applications in Genetics and Molecular Biology (Impact Factor: 1.13). 02/2006; 5(1):Article7. DOI: 10.2202/1544-6115.1189
Source: PubMed


Modern methods of microarray data analysis are biased towards selecting those genes that display the most pronounced differential expression. The magnitude of differential expression does not necessarily indicate biological significance, and other criteria are needed to supplement the information on differential expression. Three large sets of microarray data on childhood leukemia were analyzed by an original method introduced in this paper. Our analysis deciphered a new type of stochastic dependence between expression levels in gene pairs, which we term Type A dependence. This modulation-like, unidirectional dependence between expression signals arises when the expression of a "gene-modulator" is stochastically proportional to that of a "gene-driver". More than 35% of all pairs formed from 12,550 genes were conservatively estimated to belong to this type. Some genes tend to form Type A relationships with the overwhelming majority of genes. However, this picture is not static: the composition of Type A gene pairs may undergo dramatic changes when two phenotypes are compared. The ability to identify genes that act as "modulators" provides a potential strategy for prioritizing candidate genes.
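The abstract's "stochastic proportionality" can be read in more than one way. The following minimal Python sketch illustrates one plausible reading, in which a hypothetical "modulator" equals the "driver" multiplied by an independent positive random factor; all variable names, distributions and parameters are illustrative assumptions, not the paper's actual model or estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical "driver" expression signal (log-normal, a common model for
# microarray intensities); purely illustrative.
driver = rng.lognormal(mean=5.0, sigma=0.5, size=n)

# "Modulator" = driver scaled by a positive random factor that is independent
# of the driver -- one plausible reading of "stochastically proportional".
factor = rng.lognormal(mean=0.0, sigma=0.3, size=n)
modulator = factor * driver

# The pair is strongly correlated ...
print("corr(driver, modulator):       ", np.corrcoef(driver, modulator)[0, 1])
# ... yet the ratio modulator/driver carries no information about the driver,
# giving the unidirectional, modulation-like signature described above.
print("corr(driver, modulator/driver):", np.corrcoef(driver, modulator / driver)[0, 1])
```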

    • "However, observing low overlap across biomarker lists identified from different high-throughput datasets is highly likely because the sample sizes of current studies are often insufficient to fully capture large biological variations [6], [26]. Because complex diseases are often characterised by many functionally correlated molecular changes [58], [59], we have proposed consistency scores for evaluating the reproducibility of disease biomarker discovery at the systems biology level [38], [60]. In the future, by applying these consistency scores, we plan to evaluate the reproducibility of DE peaks detected in different MS-based studies for a disease, an approach that is currently limited by the fact that few MS datasets for cancer are publicly available [61]. "
    ABSTRACT: There has been much interest in differentiating diseased and normal samples using biomarkers derived from mass spectrometry (MS) studies. However, biomarker identification for specific diseases has been hindered by irreproducibility. Specifically, the peak profile extracted from a dataset for biomarker identification depends on the data pre-processing algorithm, and no widely accepted standard has yet been agreed upon. In this paper, we investigated the consistency of biomarker identification using differentially expressed (DE) peaks from peak profiles produced by three widely used average-spectrum-dependent pre-processing algorithms applied to SELDI-TOF MS data for prostate and breast cancers. Our results revealed two factors that affect the consistency of DE peak identification across algorithms. First, some DE peaks selected from one peak profile were not detected as peaks at all in other profiles. Second, the statistical power to identify DE peaks in large peak profiles with many peaks may be low because of the large number of tests and the small number of samples. Furthermore, we demonstrated that the DE peak detection power in large profiles could be improved by stratified false discovery rate (FDR) control (a generic sketch of this idea appears after this entry), thereby increasing the reproducibility of DE peak detection. Comparing and evaluating pre-processing algorithms in terms of reproducibility can elucidate the relationships among different algorithms and help in selecting a pre-processing algorithm. The DE peaks selected from small peak profiles with few peaks tend to be reproducibly detected in large peak profiles, which suggests that a suitable pre-processing algorithm should produce enough peaks to identify useful and reproducible biomarkers.
    PLoS ONE 10/2011; 6(10):e26294. DOI:10.1371/journal.pone.0026294 · 3.23 Impact Factor
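Since the entry above credits stratified false discovery rate (FDR) control with improving DE peak detection in large profiles, here is a generic Python sketch of stratified Benjamini-Hochberg control, assuming peaks have already been given p-values and stratum labels. The function names, the BH procedure as the per-stratum test, and the choice of strata are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Standard BH step-up within one stratum; returns a boolean rejection mask."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

def stratified_bh(pvals, strata, alpha=0.05):
    """Apply BH separately within each stratum (e.g. peak-intensity bins)
    and combine the per-stratum decisions."""
    pvals = np.asarray(pvals, dtype=float)
    strata = np.asarray(strata)
    reject = np.zeros(pvals.size, dtype=bool)
    for s in np.unique(strata):
        idx = np.nonzero(strata == s)[0]
        reject[idx] = benjamini_hochberg(pvals[idx], alpha=alpha)
    return reject
```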
    • "The underlying assumption is that in a disease the majority of genes are not differentially expressed and the numbers of up-regulated and down-regulated genes are roughly equal (Quackenbush, 2006). However, this assumption is rarely checked, so it is never certain that the data are properly normalized , especially when accumulated evidences suggest that the gene expression pattern could be globally altered in a complex disease (Qiu et al., 2005; Klebanov et al., 2006; Zhang et al., 2008, 2009). "
    ABSTRACT: When using microarray data to study a complex disease such as cancer, it is common practice to normalize the data so that all arrays have the same distribution of probe intensities regardless of the biological groups of the samples (a minimal sketch of this forcing, and of how it can erase a genuine global shift, appears after this entry). The assumption underlying such normalization is that in a disease the majority of genes are not differentially expressed (DE) genes and that the numbers of up- and down-regulated genes are roughly equal. However, accumulating evidence suggests that gene expression could be widely altered in cancer, so the sensitivity of biological discoveries to violations of the normalization assumption needs to be evaluated. Here, we analyzed 7 large Affymetrix datasets of pair-matched normal and cancer samples collected in the NCBI GEO database. We showed that in 6 of these 7 datasets the medians of perfect match (PM) probe intensities increased in the cancer state, and that the increases were significant in three datasets, suggesting that the assumption of equal median probe intensities across arrays regardless of biological group might be misleading. We then evaluated the effects of three of the most widely used normalization algorithms (RMA, MAS5.0 and dChip) on the selection of DE genes by comparing them with LVS, which relies less on the above assumption. The results showed that RMA, MAS5.0 and dChip may produce many falsely down-regulated DE genes while missing many up-regulated DE genes. At least for cancer studies, normalizing all arrays to have the same distribution of probe intensities regardless of biological group might therefore be misleading: normalizations based on this unreliable assumption may distort the biological differences between normal and cancer samples. The LVS algorithm might perform relatively well because it relies less on the assumption. Our results also indicate that genes may be widely up-regulated in most human cancers.
    Computational biology and chemistry 06/2011; 35(3):126-30. DOI:10.1016/j.compbiolchem.2011.04.006 · 1.12 Impact Factor
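To make the questioned assumption concrete, the sketch below applies generic quantile normalization (the core of forcing all arrays to share one intensity distribution, as RMA-style pipelines do) to simulated data in which the "cancer" arrays carry a genuine 1.5-fold global up-shift; the normalization erases that shift. The simulated data, the shift, and the quantile_normalize helper are illustrative assumptions, not any of the cited pipelines (RMA, MAS5.0, dChip or LVS) verbatim.

```python
import numpy as np

def quantile_normalize(X):
    """Force every column (array) of X to share one empirical distribution.
    X: genes x arrays intensity matrix (ties handled naively)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # per-array ranks
    reference = np.sort(X, axis=0).mean(axis=1)         # mean of sorted columns
    return reference[ranks]

rng = np.random.default_rng(1)
normal = rng.lognormal(5.0, 1.0, size=(1000, 5))            # 5 "normal" arrays
cancer = rng.lognormal(5.0, 1.0, size=(1000, 5)) * 1.5      # globally up-shifted "cancer" arrays
X = np.hstack([normal, cancer])

Xn = quantile_normalize(X)
print("medians before:", np.median(X[:, :5]), np.median(X[:, 5:]))    # groups differ
print("medians after: ", np.median(Xn[:, :5]), np.median(Xn[:, 5:]))  # forced to be equal
```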
    • "However, until more comprehensive investigations have been performed for the multivariate procedures, it seems prudent to examine the false positive rate in 1,000 permutations of their data before accepting the results at face value. Other investigators have explored more complex procedures involving stochastic dependence [33], while deeper explorations of random-set methods [34] may be valuable, but still essentially rely on independence assumptions. Several other methods are reviewed and compared in [35-37]. "
    ABSTRACT: Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon. We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence-assumption-based gene set testing procedure produces very high false positive rates when applied to datasets in which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach (a generic sample-permutation sketch appears after this entry) properly controls false positive rates, leading to more parsimonious, high-confidence gene set findings that should facilitate pathway-based interpretation of the microarray data. These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling-based gene set testing criteria in the peer-reviewed biomedical literature.
    BMC Genomics 10/2010; 11(1):574. DOI:10.1186/1471-2164-11-574 · 3.99 Impact Factor
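The "array resampling approach" credited above with controlling false positives can be illustrated by permuting whole arrays (sample labels) rather than genes, which keeps inter-gene correlation intact under the null. The Python sketch below is a generic permutation gene set test in that spirit; the set-level statistic, function names and parameters are assumptions, not the authors' exact method.

```python
import numpy as np

def gene_set_score(X, labels, gene_idx):
    """Set-level statistic: mean |Welch t| over the genes in the set.
    X: genes x samples matrix; labels: array of 0/1 group labels."""
    a = X[gene_idx][:, labels == 0]
    b = X[gene_idx][:, labels == 1]
    t = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(
        a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return np.abs(t).mean()

def resampling_gene_set_pvalue(X, labels, gene_idx, n_perm=1000, seed=0):
    """Permute sample labels (whole arrays), preserving inter-gene correlation
    under the null -- unlike independence-assumption gene set tests."""
    rng = np.random.default_rng(seed)
    observed = gene_set_score(X, labels, gene_idx)
    hits = sum(gene_set_score(X, rng.permutation(labels), gene_idx) >= observed
               for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)
```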