A new type of stochastic dependence revealed in gene expression data.

Department of Probability and Statistics, Charles University.
Statistical Applications in Genetics and Molecular Biology (Impact Factor: 1.52). 02/2006; 5:Article7. DOI: 10.2202/1544-6115.1189
Source: PubMed

ABSTRACT Modern methods of microarray data analysis are biased towards selecting those genes that display the most pronounced differential expression. The magnitude of differential expression does not necessarily indicate biological significance, and other criteria are needed to supplement the information on differential expression. Three large sets of microarray data on childhood leukemia were analyzed by an original method introduced in this paper. A new type of stochastic dependence between expression levels in gene pairs was deciphered by our analysis. This modulation-like unidirectional dependence between expression signals arises when the expression of a "gene-modulator" is stochastically proportional to that of a "gene-driver". More than 35% of all pairs formed from 12550 genes were conservatively estimated to belong to this type (Type A). There are genes that tend to form Type A relationships with the overwhelming majority of genes. However, this picture is not static: the composition of Type A gene pairs may undergo dramatic changes when comparing two phenotypes. The ability to identify genes that act as "modulators" provides a potential strategy for prioritizing candidate genes.
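
The phrase "stochastically proportional" can be illustrated with a small simulation. The sketch below is not the authors' method; it assumes one plausible reading of the abstract, namely that the modulator's expression equals the driver's expression multiplied by an independent positive random factor. All names and parameter values (`simulate_pair`, the log-normal parameters, the sample size) are hypothetical and chosen only for illustration. Under this assumption, the log-ratio of the two signals should be roughly uncorrelated with the driver's log-expression, even though the two log-expression levels are strongly correlated.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_pair(n, seed=0):
    """Simulate a hypothetical driver/modulator pair.

    Assumption (not taken from the paper): modulator = driver * factor,
    where the factor is positive and independent of the driver.
    """
    rng = random.Random(seed)
    driver = [rng.lognormvariate(5.0, 0.5) for _ in range(n)]
    factor = [rng.lognormvariate(0.0, 0.3) for _ in range(n)]
    modulator = [x * z for x, z in zip(driver, factor)]
    return driver, modulator


driver, modulator = simulate_pair(2000)
log_x = [math.log(x) for x in driver]
log_y = [math.log(y) for y in modulator]
log_ratio = [ly - lx for lx, ly in zip(log_x, log_y)]

# The expression levels themselves are strongly correlated ...
r_levels = pearson(log_x, log_y)
# ... but the log-ratio carries almost no information about the driver.
r_ratio = pearson(log_x, log_ratio)

print(f"corr(log X, log Y)       = {r_levels:.3f}")
print(f"corr(log X, log Y/X)     = {r_ratio:.3f}")
```

The asymmetry is what makes such a dependence unidirectional in this toy setting: the ratio is uncorrelated with the driver but, in general, not with the modulator.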

Available from: Lev Klebanov, Aug 02, 2014
  • Source
    ABSTRACT: When using microarray data to study a complex disease such as cancer, it is common practice to normalize the data so that all arrays have the same distribution of probe intensities regardless of the biological groups of the samples. The assumption underlying such normalization is that in a disease the majority of genes are not differentially expressed (DE) and the numbers of up- and down-regulated genes are roughly equal. However, accumulating evidence suggests that gene expression could be widely altered in cancer, so we need to evaluate the sensitivity of biological discoveries to violation of the normalization assumption. Here, we analyzed 7 large Affymetrix datasets of pair-matched normal and cancer samples collected in the NCBI GEO database. We showed that in 6 of these 7 datasets the medians of perfect match (PM) probe intensities increased in the cancer state, and the increases were significant in three datasets, suggesting that the assumption that all arrays have the same median probe intensities regardless of the biological groups of samples might be misleading. We then evaluated the effects of the three most widely used normalization algorithms (RMA, MAS5.0 and dChip) on the selection of DE genes by comparing them with LVS, which relies less on the above-mentioned assumption. The results showed that using RMA, MAS5.0 and dChip may produce many false down-regulated DE genes while missing many up-regulated DE genes. At least for cancer studies, normalizing all arrays to have the same distribution of probe intensities regardless of the biological groups of samples might be misleading. Thus, most current normalizations, being based on unreliable assumptions, may distort biological differences between normal and cancer samples. The LVS algorithm might perform relatively well because it relies less on the above-mentioned assumption. Our results also indicate that genes may be widely up-regulated in most human cancers.
    Computational biology and chemistry 06/2011; 35(3):126-30. DOI:10.1016/j.compbiolchem.2011.04.006 · 1.60 Impact Factor
  • Source
    ABSTRACT: Motivation: According to current consistency metrics such as percentage of overlapping genes (POG), lists of differentially expressed genes (DEGs) detected from different microarray studies for a complex disease are often highly inconsistent. This irreproducibility problem also exists in other high-throughput post-genomic areas such as proteomics and metabolism. A complex disease is often characterized by many coordinated molecular changes, which should be considered when evaluating the reproducibility of discovery lists from different studies. Results: We proposed the metrics percentage of overlapping genes-related (POGR) and normalized POGR (nPOGR) to evaluate the consistency between two DEG lists for a complex disease, considering correlated molecular changes rather than only counting gene overlaps between the lists. Based on microarray datasets of three diseases, we showed that although the POG scores for DEG lists from different studies of each disease are extremely low, the POGR and nPOGR scores can be rather high, suggesting that the apparently inconsistent DEG lists may be highly reproducible in the sense that they are actually significantly correlated. Assessing different discovery results for a disease by the POGR and nPOGR scores will reduce the uncertainty of microarray studies. The proposed metrics could also be applicable in many other high-throughput post-genomic areas. Contact: guoz@ems.hrbmu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 06/2009; 25(13):1662-8. DOI:10.1093/bioinformatics/btp295 · 4.62 Impact Factor
  • Source
    ABSTRACT: MOTIVATION: Differentially expressed gene (DEG) lists detected from different microarray studies for the same disease are often highly inconsistent. Even in technical replicate tests using identical samples, DEG detection still shows very low reproducibility. It is often believed that current small microarray studies will largely introduce false discoveries. RESULTS: Based on a statistical model, we show that even in technical replicate tests using identical samples, it is highly likely that the selected DEG lists will be very inconsistent in the presence of small measurement variations. Therefore, the apparently low reproducibility of DEG detection from current technical replicate tests does not indicate low quality of microarray technology. We also demonstrate that heterogeneous biological variations existing in real cancer data will further reduce the overall reproducibility of DEG detection. Nevertheless, in small subsamples from both simulated and real data, the actual false discovery rate (FDR) for each DEG list tends to be low, suggesting that each separately determined list may comprise mostly true DEGs. Rather than simply counting the overlaps of the discovery lists from different studies for a complex disease, novel metrics are needed for evaluating the reproducibility of discoveries characterized by correlated molecular changes. Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 09/2008; 24(18):2057-63. DOI:10.1093/bioinformatics/btn365 · 4.62 Impact Factor
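
The overlap-counting metric named in the abstracts above, percentage of overlapping genes (POG), can be sketched in a few lines. This is a minimal illustration under an assumed simple definition (the number of genes shared by two lists divided by the length of the reference list, times 100); the cited papers may define it with additional conditions, such as requiring the same direction of regulation. The function name `pog` and the gene lists are hypothetical.

```python
def pog(reference_list, other_list):
    """Percentage of genes in reference_list that also appear in other_list.

    Assumed definition for illustration only: shared genes divided by the
    size of the reference list. Duplicates within a list are ignored.
    """
    reference = set(reference_list)
    other = set(other_list)
    if not reference:
        raise ValueError("reference_list must contain at least one gene")
    return 100.0 * len(reference & other) / len(reference)


# Hypothetical DEG lists from two studies of the same disease.
study_1 = ["TP53", "MYC", "BRCA1", "EGFR"]
study_2 = ["MYC", "EGFR", "KRAS"]

pog_12 = pog(study_1, study_2)  # share of study 1's genes found in study 2
pog_21 = pog(study_2, study_1)  # share of study 2's genes found in study 1

print(f"POG(1 -> 2) = {pog_12:.1f}%")
print(f"POG(2 -> 1) = {pog_21:.1f}%")
```

Because the score is normalized by the reference list, it is not symmetric when the two lists differ in length, which is why both directions are computed here.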