Assessing Differential Expression in Two-Color Microarrays: A Resampling-Based Empirical Bayes Approach

Office of Public Health Studies, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, Hawaii, United States of America.
PLoS ONE (Impact Factor: 3.23). 11/2013; 8(11):e80099. DOI: 10.1371/journal.pone.0080099
Source: PubMed


Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and detecting methylation loci. Multiple testing methods in microarray data analysis aim to control both Type I and Type II error rates; however, real microarray data do not always satisfy the distributional assumptions of these methods. Smyth's ubiquitous parametric method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error rates. The Significance Analysis of Microarrays, another widely used microarray data analysis method, is based on a permutation test and is robust to non-normally distributed data; however, its fold change criteria are problematic and can critically alter the conclusions of a study, as a result of changes in the composition of the control data set used in the analysis. We propose a novel approach that combines resampling with empirical Bayes methods: the Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally distributed microarray data, but is also impervious to the fold change threshold, since no control data set selection is needed. Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across Smyth's parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods. Differences in false discovery rate control among the approaches are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates than Smyth's parametric method when data are not normally distributed. The Resampling-based empirical Bayes Methods also offer higher statistical power than the Significance Analysis of Microarrays method when the proportion of significantly differentially expressed genes is large, for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods are generalizable to the analysis of next-generation sequencing (RNA-seq) data.
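The general idea of pairing a variance-shrunken test statistic with resampling-based p-values can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a two-group design, a simple shrinkage of each gene's pooled variance toward the median variance (a stand-in for empirical Bayes moderation, not Smyth's estimator), permutation of sample labels as the resampling scheme, and Benjamini-Hochberg adjustment for false discovery rate control. All function names and the `prior_weight` parameter are hypothetical.

```python
import random
import statistics


def pooled_variance(x, y):
    """Pooled sample variance of two groups."""
    df = len(x) + len(y) - 2
    return ((len(x) - 1) * statistics.variance(x)
            + (len(y) - 1) * statistics.variance(y)) / df


def shrunken_t(x, y, prior_var, prior_weight=4.0):
    """t-like statistic whose variance is shrunk toward prior_var.

    prior_weight is an assumed, fixed number of "prior degrees of
    freedom"; a real empirical Bayes method would estimate it from data.
    """
    df = len(x) + len(y) - 2
    var = (prior_weight * prior_var + df * pooled_variance(x, y)) / (prior_weight + df)
    se = (var * (1.0 / len(x) + 1.0 / len(y))) ** 0.5
    return (statistics.fmean(x) - statistics.fmean(y)) / se


def all_stats(rows, labels):
    """Absolute shrunken statistics for every gene (one row per gene)."""
    groups = []
    for row in rows:
        x = [v for v, g in zip(row, labels) if g == 0]
        y = [v for v, g in zip(row, labels) if g == 1]
        groups.append((x, y))
    # Shrinkage target: median pooled variance across genes.
    prior_var = statistics.median(pooled_variance(x, y) for x, y in groups)
    return [abs(shrunken_t(x, y, prior_var)) for x, y in groups]


def permutation_pvalues(rows, labels, n_perm=200, seed=0):
    """Per-gene permutation p-values; labels are shuffled jointly for all genes."""
    rng = random.Random(seed)
    observed = all_stats(rows, labels)
    exceed = [0] * len(rows)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        for i, stat in enumerate(all_stats(rows, shuffled)):
            if stat >= observed[i]:
                exceed[i] += 1
    # Add-one correction keeps p-values strictly positive.
    return [(c + 1) / (n_perm + 1) for c in exceed]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Here `rows` would hold one list of expression values per gene and `labels` a 0/1 group indicator per sample. Because the null distribution of the statistic is built from the permuted data rather than from a normality assumption, heavy-tailed data affect the p-values less than they would under a parametric reference distribution, which is the motivation the abstract gives for combining the two ideas.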



Available from: Timothy De Ver Dye, Mar 24, 2014
  • Source
    ABSTRACT: Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption in the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples. We present a new approach to more efficiently assign resamples (such as bootstrap samples or permutations) within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem, and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running time overhead. In two experimental studies, a breast cancer microarray dataset and a genome wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate compared to the traditional uniform resample allocation. Our experiments demonstrate that using a more sophisticated allocation strategy can improve our inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, we gain more improvement in efficiency when the number of tests is large. R code for our algorithm and the shortcut method are available at
    BMC Bioinformatics 07/2009; 10(1):198. DOI:10.1186/1471-2105-10-198 · 2.58 Impact Factor
  • Source
    ABSTRACT: This article discusses specific assumptions necessary for permutation multiple tests to control the Familywise Error Rate (FWER). At issue is that, in comparing parameters of the marginal distributions of two sets of multivariate observations, validity of permutation testing is affected by all the parameters in the joint distributions of the observations. We show the surprising fact that, in the case of a linear model with i.i.d. errors such as in the analysis of Quantitative Trait Loci (QTL), this issue has no impact on control of FWER, if the test statistic is of a particular form. On the other hand, in the analysis of gene expression levels or multiple safety endpoints, unless some assumption connecting the marginal distributions of the observations to their joint distributions is made, permutation multiple tests may not control FWER.
    Biometrical Journal 10/2008; 50(5):756-66. DOI:10.1002/bimj.200710471 · 0.95 Impact Factor
  • Source
    ABSTRACT: DNA methylation patterns differ among children and adults and play an unambiguous role in several disease processes, particularly cancers. The origin of these differences is inadequately understood, and this is a question of specific relevance to childhood and adult cancer. DNA methylation levels at 26,485 autosomal CpGs were assayed in 201 newborns (107 African American and 94 Caucasian). Nonparametric analyses were performed to examine the relation between these methylation levels and maternal parity, maternal age, newborn gestational age, newborn gender, and newborn race. To identify the possible influences of confounding, stratification was performed by a second and third variable. For genes containing CpGs with significant differences in DNA methylation levels between races, analyses were performed to identify highly represented gene ontological terms and functional pathways. 13.7% (3623) of the autosomal CpGs exhibited significantly different levels of DNA methylation between African Americans and Caucasians; 2% of autosomal CpGs had significantly different DNA methylation levels between male and female newborns. Cancer pathways, including four (pancreatic, prostate, bladder, and melanoma) with substantial differences in incidence between the races, were highly represented among the genes containing significant race-divergent CpGs. At birth, there are significantly different DNA methylation levels between African Americans and Caucasians at a subset of CpG dinucleotides. It is possible that some of the epigenetic precursors to cancer exist at birth and that these differences partially explain the different incidence rates of specific cancers between the races.
    Birth Defects Research Part A Clinical and Molecular Teratology 08/2011; 91(8):728-36. DOI:10.1002/bdra.20770 · 2.09 Impact Factor