
Assessing Differential Expression in Two-Color

Microarrays: A Resampling-Based Empirical Bayes

Approach

Dongmei Li1*, Marc A. Le Pape2, Nisha I. Parikh2,3, Will X. Chen2, Timothy D. Dye2

1Office of Public Health Studies, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, Hawaii, United States of America, 2John A. Burns School of

Medicine, University of Hawaii, Honolulu, Hawaii, United States of America, 3Cardiovascular Division, The Queens Medical Center, Honolulu, Hawaii, United States of

America

Abstract

Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and

detecting methylation loci. Multiple testing methods in microarray data analysis aim at controlling both Type I and Type II

error rates; however, real microarray data do not always fit their distribution assumptions. Smyth’s ubiquitous parametric

method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error

rates. The Significance Analysis of Microarrays, another widely used microarray data analysis method, is based on a

permutation test and is robust to non-normally distributed data; however, the Significance Analysis of Microarrays method

fold change criteria are problematic, and can critically alter the conclusion of a study, as a result of compositional changes of

the control data set in the analysis. We propose a novel approach, combining resampling with empirical Bayes methods: the

Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally

distributed microarray data, but it is also impervious to fold change threshold since no control data set selection is needed.

Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across the

Smyth’s parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods.

Differences in false discovery rate control among the approaches are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates compared to Smyth’s parametric method when data are not normally distributed. The Resampling-based empirical Bayes Methods also offer higher statistical power than the Significance Analysis of Microarrays method when the proportion of significantly differentially expressed genes is large for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods are generalizable to next generation sequencing RNA-seq data analysis.

Citation: Li D, Le Pape MA, Parikh NI, Chen WX, Dye TD (2013) Assessing Differential Expression in Two-Color Microarrays: A Resampling-Based Empirical Bayes

Approach. PLoS ONE 8(11): e80099. doi:10.1371/journal.pone.0080099

Editor: Holger Fröhlich, University of Bonn, Bonn-Aachen International Center for IT, Germany

Received May 21, 2013; Accepted September 30, 2013; Published November 27, 2013

Copyright: © 2013 Li et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted

use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported in part by the National Institute of Minority Health and Health Disparities awards U54MD007584 (J. Hedges, PI),

G12MD007601 (M. Berry, PI), and U.S. Public Health Service grant P20GM103516 from the Centers of Biomedical Research Excellence program of the National

Institute of General Medical Sciences, National Institutes of Health. The funders had no role in study design, data collection and analysis, or preparation of the

manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: dongmeil@hawaii.edu

Introduction

Microarray technology is widely used to examine the activity

level of thousands of genes simultaneously in human cells to better

understand differential gene activation across diseases, such as

heart diseases, infectious diseases, mental illness, and health

disparities across ethnic groups. For example, DNA microarrays

are widely used for DNA methylation studies - which are

increasingly recognized as an important biological factor in

ethnicity-based health disparities. A recent study shows that significantly different DNA methylation levels at birth between Caucasians and African Americans partially explain the differences in incidence rates of specific cancers between ethnicities [1].

DNA methylation experiments typically use single channel or

two-color microarrays for detecting DNA methylation differences

between different groups. Smyth’s parametric model (PM) [2], one of the most frequently used and most powerful models for two-color microarray data analysis, is available through the lmFit and eBayes functions in the limma package of the open source Bioconductor/R software. The traditional approach to microarray analysis is

the ordinary t-statistic [3]. However, a large t-statistic may result

from an unrealistically small standard deviation. Thus, genes with

small sample variances are more likely to have large t-statistics

even when they are not differentially expressed. Both Tusher et al.

[4] and Efron et al. [5] modified the ordinary t-statistic to have

penalized t-statistics by adding a penalty to the standard deviation.

The penalty in Tusher’s method is chosen to minimize the sample

variation coefficient, while Efron et al. chose the penalty as the

90th percentile of the sample standard deviation values. In

simulation studies, Lönnstedt and Speed [6] showed that both

forms of penalized t-statistics were far superior to the ordinary t-

statistic for selecting differentially expressed genes. They further

PLOS ONE | www.plosone.org 1November 2013 | Volume 8 | Issue 11 | e80099


modified the penalized t-statistics through a parametric empirical

Bayes approach using a simple mixture of normal models and a

conjugate prior, and showed that their empirical Bayes method

had both lower Type I error rates and Type II error rates

compared to the penalized t-statistics.

Smyth developed the hierarchical model of Lönnstedt and Speed into a practical approach for general microarray experiments with an arbitrary number of treatments and RNA samples, using a moderated t-statistic that follows a t-distribution with augmented degrees of freedom. Smyth also showed in simulation studies that the moderated t-statistic has the largest area under the Receiver Operating Characteristic curve, with both lower Type I and lower Type II error rates compared to ordinary t-statistics, Efron’s penalized t-statistics, and Lönnstedt and Speed’s empirical Bayes statistic. However, Smyth’s method calculates p-values based on the t-distribution, which could generate incorrect p-values for non-normally distributed microarray data. Another widely used microarray data analysis method, Significance Analysis of Microarrays (SAM) [4], is based on a permutation test and is robust to departures from the t-distribution. However, fold change threshold

selection in the SAM method is problematic as different fold

change criteria can critically alter the conclusions of a study,

resulting from compositional changes of the control data set in the

analysis [7]. As such, to reduce false discovery rates for non-

normally distributed microarray data, we propose a novel

approach combining resampling with empirical Bayes methods:

Resampling-based empirical Bayes Methods (RBMs). This ap-

proach is impervious to fold change criteria as no control data set

selection is needed; furthermore, this novel approach is general-

izable to next generation sequencing RNA-seq data analysis.

Methods

Ethics Statement

The data used in this paper to illustrate the false discovery rate control of PM and RBMs were collected in accordance with the University of

Hawaii IRB CHS #20067 terms of approval for a placental DNA

methylation study deemed exempt from federal regulations pertain-

ing to the protection of human research participants. Authority for

exemption is documented in Title 45, Code of Federal Regulations,

Part 46. The methylation microarray data have been deposited in

NCBI’s Gene Expression Omnibus [8] and are accessible through

GEO Series accession number GSE49504 (http://www.ncbi.nlm.

nih.gov/geo/query/acc.cgi?acc=GSE49504).

FDR, Sensitivity, and Specificity

Suppose we are testing $m$ null hypotheses simultaneously in a DNA microarray study. Among the $m$ hypotheses, $m_0$ are true. For any multiple testing procedure that rejects $R$ of the $m$ null hypotheses, we use $V$ to denote the number of falsely rejected true null hypotheses (false discoveries) among the $R$ rejections, and $S$ to denote the number of true rejections among the $R$ rejections ($R = V + S$). Table 1 shows the possible outcomes when testing $m$ null hypotheses simultaneously.

The framework of false discovery rate (FDR) was proposed by

Soric [9] for quantifying the statistical significance based on the

rate of false discoveries. The formal definition of FDR was

proposed by Benjamini and Hochberg [10] as:

$$\mathrm{FDR} = E\left(\frac{V}{R} \,\middle|\, R > 0\right)\Pr(R > 0). \tag{1}$$

For a discovery-based microarray study, FDR is generally

recognized as an appropriate multiple testing error rate with 5% as

the most commonly used cutoff value. When comparing different

methods for microarray data analysis, high sensitivity and

specificity are often desired properties of a good microarray

analysis method. Sensitivity is defined as the probability of

rejecting a non-true null hypothesis, while specificity is defined

as the probability of failing to reject a true null hypothesis. The

sensitivity and specificity of a multiple testing procedure can be

calculated as follows:

$$\text{sensitivity} = \frac{S}{m - m_0}, \tag{2}$$

$$\text{specificity} = \frac{U}{m_0} = 1 - \frac{V}{m_0}. \tag{3}$$

Sensitivity relates to a test’s ability to identify positive results

(giving the proportion of true positives) and is also a definition of

power in multiple testing. A test with a high sensitivity has a low

type II error rate and high power.

Specificity relates to a test’s ability to identify negative results. A

test with high specificity has a low type I error rate which is

important to control for.
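The quantities in equations (2) and (3) can be computed directly from the four counts in Table 1. The sketch below is only an illustration: the function name `error_rates` and the example counts are hypothetical and do not come from the study. It also reports the realized false discovery proportion V/R for a single data set; the FDR in equation (1) is the expectation of this proportion.

```python
def error_rates(U, V, T, S):
    """Summarize one multiple testing outcome using the counts of Table 1.

    U: true nulls not rejected      V: true nulls rejected (false discoveries)
    T: non-true nulls not rejected  S: non-true nulls rejected
    """
    m0 = U + V            # number of true null hypotheses
    m1 = T + S            # number of non-true null hypotheses (m - m0)
    R = V + S             # total number of rejections
    sensitivity = S / m1 if m1 > 0 else float("nan")
    specificity = U / m0 if m0 > 0 else float("nan")
    # Realized false discovery proportion; taken as 0 when nothing is rejected.
    fdp = V / R if R > 0 else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "fdp": fdp, "rejections": R}


# Hypothetical counts for m = 500 hypotheses with m0 = 450 true nulls.
rates = error_rates(U=440, V=10, T=5, S=45)
print(rates)
```

In a simulation study, these summaries would be averaged over the simulated data sets to estimate sensitivity, specificity, and FDR.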

Table 1. Possible outcomes of testing m null hypotheses.

                           Number not rejected   Number rejected   Total
True null hypotheses       U                     V                 m0
Non-true null hypotheses   T                     S                 m - m0
Total                      m - R                 R                 m

m is the total number of null hypotheses.
doi:10.1371/journal.pone.0080099.t001

Figure 1. Commonly used experimental designs for two-color microarrays.

doi:10.1371/journal.pone.0080099.g001

Analysis of Two-Color Microarrays


Resampling-based multiple testing procedures

Resampling-based multiple testing procedures are widely used in genomic studies to identify differential gene expression and to conduct genome-wide association studies (GWAS), particularly when the distribution of the test statistics is non-normal or unknown. Resampling-based multiple testing procedures can also take into account dependence among p-values or test statistics. Commonly used resampling techniques include permutation tests and bootstrap methods.

A permutation test is a type of non-parametric statistical significance test in which the distribution of the test statistic under the null hypothesis is constructed by calculating the test statistic for all possible rearrangements of the observations, or for a fixed number of random rearrangements (usually 1000 or more), under the null hypothesis. The theory of permutation tests is based on work done by Fisher and Pitman in the 1930s [11]. Permutation tests are distribution-free, and can provide exact p-values even when the sample size is small. Westfall and Young [12] elaborated upon multiple testing procedures using the permutation test, and it has been shown that the permutation test has strong control of the multiple testing error rate under the marginal-determining-joint condition [13].
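As an illustration of the permutation idea for a single hypothesis, the following sketch computes a two-sided permutation p-value for a difference in group means. It is a minimal example under stated assumptions, not the procedure used in this paper: the function name `permutation_pvalue`, the choice of test statistic, and the use of random rather than exhaustive rearrangements are all illustrative.

```python
import random
from statistics import mean


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    The group labels are shuffled n_perm times; the p-value is the
    proportion of shuffles whose statistic is at least as extreme as
    the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        stat = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if stat >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical log-ratios for two groups of four arrays each.
group_a = [5.1, 5.3, 5.2, 5.4]
group_b = [1.0, 1.2, 0.9, 1.1]
print(permutation_pvalue(group_a, group_b, n_perm=1000, seed=1))
```

With four observations per group there are only 70 distinct label assignments, so the attainable p-values are coarse; this is one reason small-sample permutation results should be read with care.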

The bootstrap method, first introduced by Efron [14] and further discussed by Efron and Tibshirani [15], is a way of approximating the sampling distribution from just one sample. Instead of taking many simple random samples from the population to find a sample statistic’s sampling distribution, the bootstrap method repeatedly resamples with replacement from one random sample. Efron [14] showed that the bootstrap method provides an asymptotically unbiased estimate for the variance of a sample median, and for error rates in a linear discrimination problem, outperforming cross-validation. Freedman [16] conclusively showed that the bootstrap approximation to the distribution of least squares estimates is valid. Finally, Hall [17] showed that the bootstrap method’s reduction of coverage probability error, from $O(n^{-1/2})$ to $O(n^{-1})$, makes the bootstrap method one order of magnitude more accurate than the delta method. The p-values computed by the bootstrap method are less exact than the p-values obtained from the permutation method, but it has been proved that the p-values estimated by the bootstrap method converge asymptotically to the true p-values [18].
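The resampling-with-replacement step described above can be sketched in a few lines. The example below estimates the standard error of a sample median, the setting mentioned for Efron [14]; the function name `bootstrap_se_median` and the data are hypothetical, and the sketch is not taken from the authors' code.

```python
import random
from statistics import median, stdev


def bootstrap_se_median(sample, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of the sample median.

    Each bootstrap resample draws len(sample) values with replacement
    from the one observed sample; the spread of the resampled medians
    approximates the sampling variability of the median.
    """
    rng = random.Random(seed)
    n = len(sample)
    medians = [median(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return stdev(medians)


# Hypothetical expression values for one gene.
values = [0.2, 1.4, 0.9, 2.1, 1.7, 0.5, 1.1, 1.3, 0.8, 1.9]
print(bootstrap_se_median(values))
```

Unlike a permutation, a bootstrap resample may contain the same observation several times, which is why bootstrap p-values are approximate rather than exact.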

Significance Analysis of Microarrays (SAM) procedure

The Significance Analysis of Microarrays (SAM) was first

introduced by Tusher et al. [4] to identify statistically significant

differences in gene expression by assimilating a set of gene-specific

t tests. In SAM, each gene is assigned a score on the basis of its

difference in gene expression relative to the standard deviation of

repeated measurements for that gene. A scatter plot of the

observed relative difference, versus the expected relative difference

estimated by the permutation method, is then used to select

statistically significant genes based on a pre-determined threshold.

The SAM procedure can be summarized as follows.

• Compute a test statistic $t_i$ for each gene $i$ ($i = 1, \ldots, g$).

• Compute order statistics $t_{(i)}$ such that $t_{(1)} \leq t_{(2)} \leq \cdots \leq t_{(g)}$.

• Perform $B$ permutations of the responses/covariates $y_1, \ldots, y_n$. For each permutation $b$, compute the permuted test statistics $t_{i,b}$ and the corresponding order statistics $t_{(1),b} \leq t_{(2),b} \leq \cdots \leq t_{(g),b}$.

• From the $B$ permutations, estimate the expected values of the order statistics by $\bar{t}_{(i)} = \frac{1}{B}\sum_{b=1}^{B} t_{(i),b}$.

• Form a quantile-quantile (Q-Q) plot (SAM plot) of the observed $t_{(i)}$ versus the expected $\bar{t}_{(i)}$.

• For a given threshold $\Delta$, start at the origin and move up to the right to find the first $i = i_1$ such that $t_{(i)} - \bar{t}_{(i)} > \Delta$. All genes past $i_1$ are called significant positives. Similarly, start at the origin and move down to the left to find the first $i = i_2$ such that $\bar{t}_{(i)} - t_{(i)} > \Delta$. All genes past $i_2$ are called significant negatives. Define the upper cut point $\mathrm{Cut}_{up}(\Delta) = \min\{t_{(i)}: i \geq i_1\} = t_{(i_1)}$, and the lower cut point $\mathrm{Cut}_{low}(\Delta) = \max\{t_{(i)}: i \leq i_2\} = t_{(i_2)}$.

• For a given threshold, the expected number of false rejections $E(V)$ is estimated by computing the number of genes with $t_{i,b}$ above $\mathrm{Cut}_{up}(\Delta)$ or below $\mathrm{Cut}_{low}(\Delta)$ for each of the $B$ permutations, and averaging the numbers over the $B$ permutations.

• A threshold $\Delta$ is chosen to control the Fdr ($\mathrm{Fdr} = E(V)/r$) under the complete null hypothesis, at an acceptable nominal level.

In our simulation studies, the SAM procedure is implemented through the sam function in Bioconductor’s siggenes package.
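The steps of the SAM procedure listed above can be sketched as follows. This is a simplified illustration and not the siggenes implementation: the function names are hypothetical, the permuted statistics are assumed to have been computed already, and the "origin" is taken to be the first position at which the expected order statistic is non-negative, which is an assumption made for this sketch.

```python
def expected_order_statistics(perm_stats):
    """Average the sorted permuted statistics over the B permutations.

    perm_stats: list of B lists, each holding g permuted test statistics.
    """
    B = len(perm_stats)
    sorted_perms = [sorted(p) for p in perm_stats]
    g = len(sorted_perms[0])
    return [sum(sp[i] for sp in sorted_perms) / B for i in range(g)]


def sam_cutpoints(observed, expected, delta):
    """Return (cut_up, cut_low); None means no gene passes on that side."""
    t = sorted(observed)
    g = len(t)
    # Assumed "origin": first index with a non-negative expected statistic.
    origin = next((i for i in range(g) if expected[i] >= 0), g)
    cut_up = next((t[i] for i in range(origin, g)
                   if t[i] - expected[i] > delta), None)
    cut_low = next((t[i] for i in range(origin - 1, -1, -1)
                    if expected[i] - t[i] > delta), None)
    return cut_up, cut_low


def expected_false_rejections(perm_stats, cut_up, cut_low):
    """Average count of permuted statistics beyond the cut points."""
    counts = []
    for p in perm_stats:
        n = sum(1 for s in p
                if (cut_up is not None and s > cut_up)
                or (cut_low is not None and s < cut_low))
        counts.append(n)
    return sum(counts) / len(counts)


# Tiny hypothetical example with g = 5 genes and B = 2 permutations.
observed = [-3.0, -0.5, 0.0, 0.4, 4.0]
perm_stats = [[-1.0, -0.5, 0.0, 0.5, 1.0], [1.0, 0.5, 0.0, -0.5, -1.0]]
expected = expected_order_statistics(perm_stats)
cut_up, cut_low = sam_cutpoints(observed, expected, delta=1.5)
print(cut_up, cut_low, expected_false_rejections(perm_stats, cut_up, cut_low))
```

Dividing the estimated number of false rejections by the number of genes called significant gives the Fdr estimate for the chosen threshold.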

Linear models and empirical Bayes method (PM)

In general, let $y_g^T = (y_{g1}, \ldots, y_{gn})$ denote the log-ratios of two-color dye intensities, or the log-intensities for single color data, which have been suitably normalized in a microarray experiment. The log-ratios of the two-color intensities or log-intensities for single color data can be expressed in a linear model format as follows:

$$E(y_g) = X\beta_g, \tag{4}$$

where $X$ is a design matrix of full column rank and $\beta_g$ is a coefficient vector. The commonly used designs of a two-color microarray experiment described in Kerr and Churchill [19] are displayed in Figure 1. Each rectangle represents a DNA/RNA sample, with C denoting control and T denoting treatment samples. Each arrow denotes a microarray. The DNA/RNA sample on the left of the arrow will be dyed with cy3 (green dye) and the DNA/RNA sample on the right of the arrow will be dyed with cy5 (red dye). Design (a) in Figure 1 examines whether the log2 differences of red dye intensities (T) and green dye intensities (C) between treatment and control samples are equal to 0. Design (b) swaps the two dyes and generates two log2 differences, T-C and C-T. The design matrix $X$ for design (b) is

$$X = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

The design matrix $X$ for design (c) and design (d) could be as follows:

$$X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

for design (c), and

$$X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix}$$

for design (d).
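To make the role of the design matrix concrete, the sketch below fits the linear model of equation (4) for one gene by ordinary least squares, using the design matrices given above for designs (b) and (d). The helper `least_squares` is a hypothetical name written for this illustration, it assumes $X$ has full column rank, and the log-ratios are invented; limma performs this fit (with weights) through lmFit.

```python
def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination.

    X: list of n rows with p entries each (full column rank assumed).
    y: list of n observed log-ratios for one gene.
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         + [sum(X[k][i] * y[k] for k in range(n))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Design (b), dye swap: two arrays measuring T-C and C-T.
X_b = [[1], [-1]]
print(least_squares(X_b, [0.8, -1.0]))

# Design (d): three arrays, two coefficients.
X_d = [[1, 0], [0, 1], [1, -1]]
print(least_squares(X_d, [1.0, 0.4, 0.6]))
```

For the dye-swap design the estimate reduces to half the difference of the two log-ratios, which is how the swap cancels a gene-specific dye effect.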


Unlike two-color microarrays, single color microarrays usually have a single expression value for each gene and each array. The design matrix for single color microarrays can be formed exactly as in classic linear model settings based on biological factors in microarray experiments.

We assume $\mathrm{var}(y_g) = W_g\sigma_g^2$ and $\mathrm{var}(\hat{\beta}_g) = V_g s_g^2$. $W_g$ is a known non-negative definite weight matrix. $W_g$ may contain diagonal weights of zero for missing values in $y_g$. $\sigma_g^2$ is the unknown error variance for $y_g$. $\hat{\beta}_g$ is the estimated coefficient vector. $V_g$ is a positive definite matrix not depending on $s_g^2$, which is the residual variance for the $g$th gene. Let $v_{gj}$ be the $j$th diagonal element of $V_g$. Smyth [2] makes the following distributional assumptions:

$$\hat{\beta}_{gj} \mid \beta_{gj}, \sigma_g^2 \sim N(\beta_{gj}, v_{gj}\sigma_g^2) \tag{5}$$

and

$$s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g}\chi^2_{d_g} \tag{6}$$

where $d_g$ is the residual degrees of freedom for the linear model for gene $g$. The ordinary t-statistic under these assumptions is

$$t_{gj} = \frac{\hat{\beta}_{gj}}{s_g\sqrt{v_{gj}}} \tag{7}$$

which follows an approximate t-distribution on $d_g$ degrees of freedom.

A prior distribution on $\sigma_g^2$ is assumed as in equation (8), with prior estimator $s_0^2$ and $d_0$ degrees of freedom estimated from the data by equating empirical to expected values for the first two moments of $\log s_g^2$. The log scale is used because the moments of $\log s_g^2$ are finite for any degrees of freedom and its distribution is more nearly normal than that of $s_g^2$, so that moment estimation is likely to be more efficient.

$$\frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_0^2}\chi^2_{d_0} \tag{8}$$

The posterior mean of $\sigma_g^2$ given $s_g^2$ is

$$\tilde{s}_g^2 = E(\sigma_g^2 \mid s_g^2) = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g} \tag{9}$$

Then the moderated t-statistic, based on a hybrid classical/Bayes approach, is defined by:

$$\tilde{t}_{gj} = \frac{\hat{\beta}_{gj}}{\tilde{s}_g\sqrt{v_{gj}}} \tag{10}$$

The p-value for testing $H_0: \beta_{gj} = 0$ based on the moderated t-statistic can be calculated from the t-distribution with $d_g + d_0$ degrees of freedom. Appropriate quadratic forms of the moderated t-statistics follow F-distributions and can be used to test hypotheses about any set of contrasts simultaneously. Smyth’s method is available through the limma package in Bioconductor, and is widely used for two-color microarray data analysis.
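The effect of the variance shrinkage in equations (9) and (10) is easiest to see numerically. The sketch below computes the ordinary and moderated t-statistics for a single gene; the function name `moderated_t` and all input values are hypothetical, and the prior parameters $s_0^2$ and $d_0$ are simply supplied as inputs here, whereas limma's eBayes estimates them from all genes.

```python
from math import sqrt


def moderated_t(beta_hat, s2, v, d_g, s0_sq, d0):
    """Ordinary and moderated t-statistics for one coefficient of one gene.

    beta_hat: estimated coefficient      s2: residual variance s_g^2
    v: diagonal element v_gj             d_g: residual degrees of freedom
    s0_sq, d0: prior variance and prior degrees of freedom (assumed known)
    """
    s2_tilde = (d0 * s0_sq + d_g * s2) / (d0 + d_g)   # equation (9)
    t_ordinary = beta_hat / (sqrt(s2) * sqrt(v))      # equation (7)
    t_moderated = beta_hat / (sqrt(s2_tilde) * sqrt(v))  # equation (10)
    return t_ordinary, t_moderated, d_g + d0


# A gene with an unrealistically small sample variance (hypothetical values).
t_ord, t_mod, df = moderated_t(beta_hat=1.0, s2=0.01, v=0.5,
                               d_g=6, s0_sq=0.25, d0=4)
print(t_ord, t_mod, df)
```

Because the posterior variance is pulled toward the prior value, a gene whose sample variance happens to be very small receives a much less extreme moderated statistic than its ordinary t-statistic.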

Resampling and empirical Bayes methods (RBMs)

To carry out the permutation/bootstrap test based on the moderated t-statistics proposed by Smyth [2], we proceed as follows:

• Compute the moderated t-statistics $\tilde{t}_{gj}$ based on the observed data set for each gene $g$.

• Permute/bootstrap the original data in a way that matches the null hypothesis to get permuted/bootstrapped resamples, and construct the reference distribution using the moderated t-statistics or p-values calculated from the permuted/bootstrapped resamples.

• Calculate the critical value of a level $\alpha$ test based on the upper $\alpha$ percentile of the reference distribution, or obtain the p-value by computing the proportion of permutation/bootstrap test statistics or p-values that are as extreme as, or more extreme than, the observed moderated t-statistic or p-value.
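The steps above can be sketched for a simple two-group comparison using the test statistic based permutation variant. This is an illustrative sketch under several assumptions and is not the authors' released R code: the function names are hypothetical, the prior parameters `s0_sq` and `d0` are fixed rather than estimated from the data, the reference distribution is built separately for each gene, and the same random relabelling is applied to all genes so that between-gene correlation is preserved.

```python
import random
from math import sqrt
from statistics import mean, variance


def moderated_t_two_group(x, y, s0_sq, d0):
    """Moderated t-statistic for a two-group mean difference."""
    n1, n2 = len(x), len(y)
    d_g = n1 + n2 - 2
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / d_g
    s2_tilde = (d0 * s0_sq + d_g * s2) / (d0 + d_g)  # shrunken variance
    return (mean(x) - mean(y)) / sqrt(s2_tilde * (1 / n1 + 1 / n2))


def rbm_permutation_pvalues(data, n1, s0_sq=1.0, d0=4.0, n_perm=1000, seed=0):
    """Permutation p-values based on moderated t-statistics.

    data: list of genes, each a list of n1 + n2 values (group 1 first).
    """
    rng = random.Random(seed)
    n = len(data[0])
    observed = [abs(moderated_t_two_group(g[:n1], g[n1:], s0_sq, d0))
                for g in data]
    hits = [0] * len(data)
    idx = list(range(n))
    for _ in range(n_perm):
        rng.shuffle(idx)  # one relabelling shared by all genes
        for k, g in enumerate(data):
            perm = [g[i] for i in idx]
            t_perm = abs(moderated_t_two_group(perm[:n1], perm[n1:], s0_sq, d0))
            if t_perm >= observed[k]:
                hits[k] += 1
    return [h / n_perm for h in hits]


# Two hypothetical genes, four arrays per group.
data = [[5.0, 5.2, 4.9, 5.1, 1.0, 1.1, 0.9, 1.2],
        [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]]
print(rbm_permutation_pvalues(data, n1=4, n_perm=1000, seed=1))
```

A bootstrap variant would replace the shuffle with resampling with replacement from data centered to satisfy the null hypothesis; that step is not shown here.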

The p-values for the p-value based permutation/bootstrap methods are calculated according to the following formula:

$$p_i = \Pr(P_l \leq p_i \mid H_M). \tag{11}$$

Similarly, the p-values for the test statistics-based permutation/bootstrap methods are calculated from the following formula:

$$p_i = \Pr(|T_l| \geq |t_i| \mid H_M). \tag{12}$$

In the above formulas, $H_M$ denotes the complete null hypothesis and $P_l$ denotes the random variable for the raw p-value of the $l$th hypothesis.

Depending on the resampling method (either permutation or bootstrap) and the p-value calculation method (either test statistics or p-values), four RBM methods are proposed: RBM test statistic based permutation method (TSBP); RBM test statistic based bootstrap method (TSBB); RBM p-value based permutation method (PBP); RBM p-value based bootstrap method (PBB). Both

Figure 2. Sensitivity, specificity, total rejection, and estimated false discovery rate comparisons between the RBMs and the PM for normally distributed gene expression data. Blue: PM; Grey: SAM; Red: RBM test statistic based permutation method; Orange: RBM p-value based permutation method; Green: RBM test statistic based bootstrap method; Purple: RBM p-value based bootstrap method. Figure 2a: sample size n=4 in each group; Figure 2b: sample size n=6 in each group; Figure 2c: sample size n=12 in each group; Figure 2d: sample size n=24 in each group; Figure 2e: sample size n=48 in each group.

doi:10.1371/journal.pone.0080099.g002


Table 2. Comparison of sensitivities for all six methods.

Distribution   n      p1     PM      SAM     TSBP    TSBB    PBP     PBB
Normal         n=4    0.10   0.980   0.680   0.820   0.880   0.840   0.580
                      0.25   0.952   0.848   0.800   0.864   0.840   0.608
                      0.50   0.952   0.928   0.840   0.888   0.848   0.672
                      0.75   0.957   0.664   0.883   0.901   0.877   0.717
                      0.90   0.958   0.556   0.900   0.907   0.878   0.680
               n=6    0.10   1.000   0.900   0.940   0.940   0.940   0.700
                      0.25   1.000   0.944   0.952   0.960   0.952   0.776
                      0.50   0.996   0.952   0.944   0.952   0.956   0.792
                      0.75   0.995   0.667   0.949   0.952   0.957   0.805
                      0.90   0.989   0.556   0.944   0.944   0.956   0.811
               n=12   0.10   1.000   0.980   0.980   0.980   0.980   0.840
                      0.25   1.000   0.992   0.992   0.992   0.992   0.904
                      0.50   0.992   0.988   0.988   0.988   0.988   0.932
                      0.75   0.992   0.669   0.984   0.987   0.984   0.925
                      0.90   0.993   0.558   0.980   0.982   0.973   0.913
               n=24   0.10   1.000   1.000   1.000   1.000   1.000   0.980
                      0.25   1.000   1.000   1.000   1.000   1.000   0.984
                      0.50   1.000   1.000   1.000   1.000   1.000   0.984
                      0.75   0.997   0.672   0.997   0.997   0.997   0.981
                      0.90   0.998   0.562   0.998   0.998   0.998   0.973
               n=48   0.10   1.000   1.000   1.000   1.000   1.000   1.000
                      0.25   1.000   0.992   1.000   0.992   1.000   0.992
                      0.50   1.000   0.992   1.000   0.996   1.000   0.988
                      0.75   1.000   0.667   0.997   0.995   0.997   0.989
                      0.90   1.000   0.556   0.996   0.996   0.996   0.987
Log Normal     n=4    0.10   0.960   0.520   0.780   0.820   0.780   0.480
                      0.25   0.944   0.768   0.784   0.808   0.792   0.544
                      0.50   0.948   0.884   0.792   0.820   0.796   0.592
                      0.75   0.949   0.664   0.819   0.840   0.829   0.608
                      0.90   0.951   0.556   0.844   0.860   0.829   0.602
               n=6    0.10   0.960   0.840   0.880   0.880   0.900   0.720
                      0.25   0.984   0.896   0.920   0.912   0.936   0.672
                      0.50   0.984   0.916   0.916   0.920   0.928   0.704
                      0.75   0.984   0.667   0.907   0.912   0.912   0.723
                      0.90   0.987   0.556   0.909   0.918   0.909   0.727
               n=12   0.10   1.000   0.900   0.900   0.900   0.900   0.780
                      0.25   1.000   0.968   0.944   0.944   0.944   0.840
                      0.50   1.000   0.968   0.952   0.952   0.952   0.844
                      0.75   0.997   0.669   0.952   0.955   0.952   0.864
                      0.90   0.998   0.558   0.947   0.949   0.947   0.860
               n=24   0.10   1.000   0.960   0.980   0.980   0.980   0.940
                      0.25   1.000   0.976   0.976   0.976   0.976   0.928
                      0.50   1.000   0.988   0.984   0.984   0.984   0.916
                      0.75   1.000   0.672   0.987   0.987   0.987   0.923
                      0.90   1.000   0.560   0.989   0.989   0.982   0.918
               n=48   0.10   1.000   0.980   1.000   1.000   1.000   0.980
                      0.25   1.000   0.984   0.992   0.992   0.992   0.968
                      0.50   0.996   0.976   0.980   0.980   0.980   0.952
                      0.75   0.997   0.667   0.981   0.981   0.981   0.949


the PM and the RBMs follow the empirical Bayes approach

proposed by Efron et al. [5], thus controlling for false discovery

rates in the data analysis.

Simulation data sets

In our simulation studies, each data set includes 1000 independently generated samples of two groups with an equal sample size of 4, 6, 12, 24, or 48 per group. The sample sizes of 4 and 6 represent small sample size scenarios, 12 and 24 represent medium sample size scenarios, and 48 represents large sample size scenarios. The total number of genes ($m$) is set to 500, with the fraction of differentially expressed genes ($p_1 = m_1/m$) equal to 10%, 25%, 50%, 75%, and 90% to cover a wide range of scenarios. In the two-group comparisons, the gene expression level on the log2 scale is generated randomly, either from a multivariate normal distribution with $\mu = 0$ and $\sigma = 1$, or from a multivariate log normal distribution with $\log\mu = 0$ and $\log\sigma = 1$, or from a mixed normal distribution (80% of the data follow a normal distribution with $\mu = 0$ and $\sigma = 1$, and 20% of the data follow a normal distribution with $\mu = 2$ and $\sigma = 1$), with random correlations between genes to mimic the correlations in real microarray data. Mean differences between groups are set to 1, and standard deviations are randomly generated from a scaled chi-square distribution with 4 degrees of freedom. The number of permutation/bootstrap resamples is set at 1000. Three reasons led us to choose 1000 permutations. The first was to match the default number of permutations used in most statistical software packages such as Bioconductor and IBM SPSS. The second was that a larger number of permutations was originally used in our simulation study, with no significantly different results. Indeed, with 1000 permutations the smallest possible p-value is 0.001, and the uncertainty near p = 0.05 is about 1%; as our approach already controlled for FDR and no further multiplicity adjustment was needed, 1000 permutations were deemed sufficient. The third and final reason was to reduce the computational load. The significance level was set at 5%. The R code for our resampling and empirical Bayes methods is publicly available from http://www.hawaii.edu/publichealth/faculty/profile/li.html.
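A simplified version of the data-generating scheme described above can be sketched as follows. The sketch is an illustration only and differs from the study in ways that should be kept in mind: genes are generated independently (no between-gene correlation), every gene has unit scale rather than a standard deviation drawn from a scaled chi-square distribution, and the function names `simulate_gene` and `simulate_dataset` are hypothetical.

```python
import random


def simulate_gene(n, shifted, dist, rng, effect=1.0):
    """Generate one gene for two groups of size n.

    shifted: whether group 2 receives the mean difference `effect`.
    dist: "normal", "lognormal", or "mixed" (80% N(0,1), 20% N(2,1)).
    """
    def draw():
        if dist == "normal":
            return rng.gauss(0.0, 1.0)
        if dist == "lognormal":
            return rng.lognormvariate(0.0, 1.0)
        if dist == "mixed":
            if rng.random() < 0.2:
                return rng.gauss(2.0, 1.0)
            return rng.gauss(0.0, 1.0)
        raise ValueError("unknown distribution: " + dist)

    group1 = [draw() for _ in range(n)]
    group2 = [draw() + (effect if shifted else 0.0) for _ in range(n)]
    return group1, group2


def simulate_dataset(m=500, p1=0.10, n=4, dist="normal", seed=0):
    """Return m genes; the first round(m * p1) are differentially expressed."""
    rng = random.Random(seed)
    m1 = round(m * p1)
    return [simulate_gene(n, g < m1, dist, rng) for g in range(m)]


dataset = simulate_dataset(m=500, p1=0.25, n=6, dist="mixed", seed=42)
print(len(dataset), len(dataset[0][0]), len(dataset[0][1]))
```

Because the truly differentially expressed genes are known by construction, the counts in Table 1 can be tallied for each method after testing.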

Table 2. Cont.

Distribution   n      p1     PM      SAM     TSBP    TSBB    PBP     PBB
(Log Normal)   (n=48) 0.90   0.998   0.556   0.982   0.982   0.982   0.953
Mixed Normal   n=4    0.10   0.920   0.820   0.900   0.920   0.880   0.840
                      0.25   0.896   0.912   0.864   0.912   0.848   0.816
                      0.50   0.932   0.960   0.900   0.936   0.900   0.860
                      0.75   0.933   0.667   0.923   0.939   0.920   0.864
                      0.90   0.927   0.556   0.940   0.942   0.931   0.853
               n=6    0.10   0.940   0.920   0.940   0.940   0.940   0.920
                      0.25   0.944   0.944   0.944   0.944   0.944   0.928
                      0.50   0.964   0.980   0.964   0.964   0.964   0.936
                      0.75   0.973   0.667   0.973   0.973   0.973   0.947
                      0.90   0.967   0.556   0.962   0.967   0.962   0.944
               n=12   0.10   1.000   1.000   1.000   1.000   1.000   1.000
                      0.25   1.000   0.984   1.000   1.000   1.000   0.984
                      0.50   0.992   0.996   0.996   0.996   0.996   0.984
                      0.75   0.992   0.669   0.992   0.995   0.992   0.987
                      0.90   0.993   0.558   0.991   0.991   0.991   0.982
               n=24   0.10   1.000   1.000   1.000   1.000   1.000   1.000
                      0.25   1.000   0.992   1.000   1.000   1.000   1.000
                      0.50   0.996   1.000   1.000   1.000   1.000   0.996
                      0.75   0.997   0.675   1.000   1.000   1.000   0.997
                      0.90   0.998   0.562   1.000   1.000   1.000   0.996
               n=48   0.10   1.000   1.000   1.000   1.000   1.000   1.000
                      0.25   1.000   1.000   1.000   1.000   1.000   1.000
                      0.50   1.000   1.000   1.000   1.000   1.000   0.996
                      0.75   0.997   0.669   0.997   0.997   0.997   0.995
                      0.90   0.998   0.558   0.998   0.998   0.998   0.996

p1: Proportion of differentially expressed genes.
TSBP: RBM Test statistic based permutation method.
TSBB: RBM Test statistic based bootstrap method.
PBP: RBM p-value based permutation method.
PBB: RBM p-value based bootstrap method.
doi:10.1371/journal.pone.0080099.t002


Table 3. Comparison of estimated false discovery rates for all six methods.

Distribution   n      p1     PM      SAM     TSBP    TSBB    PBP     PBB
Normal         n=4    0.10   0.310   0.056   0.305   0.279   0.311   0.094
                      0.25   0.131   0.036   0.160   0.136   0.173   0.050
                      0.50   0.048   0.017   0.079   0.047   0.113   0.018
                      0.75   0.025   0.004   0.052   0.034   0.089   0.011
                      0.90   0.014   0.000   0.036   0.029   0.053   0.013
               n=6    0.10   0.451   0.082   0.405   0.397   0.413   0.205
                      0.25   0.219   0.071   0.222   0.211   0.222   0.094
                      0.50   0.104   0.056   0.113   0.109   0.125   0.048
                      0.75   0.036   0.012   0.058   0.058   0.093   0.032
                      0.90   0.011   0.000   0.025   0.021   0.049   0.016
               n=12   0.10   0.207   0.039   0.269   0.246   0.269   0.046
                      0.25   0.074   0.008   0.101   0.114   0.101   0.009
                      0.50   0.020   0.012   0.020   0.028   0.024   0.000
                      0.75   0.005   0.000   0.026   0.021   0.029   0.000
                      0.90   0.005   0.000   0.018   0.018   0.025   0.010
               n=24   0.10   0.243   0.000   0.243   0.254   0.243   0.058
                      0.25   0.107   0.000   0.107   0.114   0.107   0.032
                      0.50   0.035   0.000   0.035   0.039   0.042   0.012
                      0.75   0.013   0.000   0.021   0.016   0.026   0.003
                      0.90   0.002   0.000   0.007   0.007   0.007   0.000
               n=48   0.10   0.333   0.038   0.286   0.306   0.286   0.091
                      0.25   0.155   0.000   0.120   0.139   0.126   0.031
                      0.50   0.053   0.008   0.053   0.057   0.053   0.020
                      0.75   0.016   0.000   0.016   0.018   0.016   0.008
                      0.90   0.004   0.000   0.009   0.009   0.009   0.005
Log Normal     n=4    0.10   0.902   0.071   0.316   0.317   0.418   0.040
                      0.25   0.758   0.040   0.177   0.144   0.369   0.015
                      0.50   0.511   0.018   0.083   0.060   0.321   0.013
                      0.75   0.257   0.004   0.055   0.040   0.188   0.066
                      0.90   0.101   0.000   0.028   0.028   0.077   0.029
               n=6    0.10   0.904   0.087   0.405   0.397   0.511   0.100
                      0.25   0.753   0.051   0.212   0.203   0.415   0.046
                      0.50   0.504   0.038   0.119   0.115   0.350   0.049
                      0.75   0.253   0.004   0.056   0.055   0.199   0.103
                      0.90   0.101   0.000   0.029   0.021   0.087   0.068
               n=12   0.10   0.900   0.022   0.274   0.237   0.318   0.000
                      0.25   0.750   0.016   0.106   0.092   0.192   0.000
                      0.50   0.499   0.032   0.025   0.025   0.201   0.000
                      0.75   0.249   0.000   0.022   0.017   0.156   0.015
                      0.90   0.098   0.000   0.016   0.012   0.064   0.033
               n=24   0.10   0.900   0.000   0.246   0.258   0.290   0.000
                      0.25   0.750   0.000   0.096   0.103   0.252   0.000
                      0.50   0.500   0.012   0.035   0.035   0.212   0.004
                      0.75   0.250   0.000   0.021   0.019   0.136   0.014
                      0.90   0.100   0.000   0.002   0.002   0.058   0.012
               n=48   0.10   0.900   0.000   0.296   0.306   0.315   0.000
                      0.25   0.750   0.008   0.139   0.133   0.190   0.000
                      0.50   0.500   0.024   0.058   0.054   0.134   0.000
                      0.75   0.249   0.000   0.019   0.021   0.109   0.000


Results

Simulation studies were conducted to compare the sensitivity, specificity, total rejections, and false discovery rate across the PM, the SAM, and the RBMs. The simulation studies include situations for both normally distributed and non-normally (log normally and mixed normally) distributed microarray data.

Simulation results from normally distributed data

In terms of sensitivity (power), the PM shows very high sensitivity across all sample sizes and higher sensitivity than all other methods, even when the sample size is small, e.g., 4 or 6 in each group (Figure 2a, 2b and Table 2). Both the SAM and the PBB have lower sensitivity compared to other methods for small sample sizes. However, the sensitivity improves significantly as the sample size in each group increases for the PBB method, but not for the SAM method. The SAM method shows low sensitivity when the proportion of differentially expressed genes is over 50%, regardless of sample size (Figure 2 and Table 2). All other RBM methods show good sensitivity levels, comparable to the PM method when $n \geq 6$ in each group (Figure 2b, 2c, 2d, 2e and Table 2).

All methods show comparable specificity when sample size is

large (Figure 2c, 2d, and 2e). Both the PBB and the SAM methods

show slightly higher specificity than the PM method even when

sample size is small (Figure 2a and 2b). Other RBMs perform

similarly to the PM when the proportion of differentially expressed

genes is less than 50%, and sample size is small.

The number of total rejections for all methods shows trends similar to those for sensitivity (Figure 2). The RBMs have a slightly lower number of total rejections compared to the PM. The SAM has a number of total rejections comparable to the RBMs when the proportion of differentially expressed genes is less than 50%. As expected, the SAM has a lower number of total rejections due to its lower sensitivity compared to all other methods when the proportion of differentially expressed genes is over 50% across all sample sizes.

For false discovery rate control, the SAM method has the most

conservative control rate among all methods compared for all

sample sizes (Table 3). The conservativeness of the SAM method

slightly increases with sample size. Both the SAM and the PBB

methods have much lower estimated false discovery rates

compared to the PM and other RBM methods when the

Table 3. Cont.

Distribution   n      p1     PM      SAM     TSBP    TSBB    PBP     PBB
                      0.90   0.098   0.004   0.005   0.005   0.045   0.005
Mixed Normal   n=4    0.10   0.894   0.000   0.274   0.303   0.397   0.045
                      0.25   0.745   0.042   0.156   0.156   0.373   0.029
                      0.50   0.488   0.016   0.100   0.086   0.344   0.014
                      0.75   0.242   0.016   0.075   0.046   0.212   0.082
                      0.90   0.094   0.000   0.039   0.032   0.089   0.054
               n=6    0.10   0.899   0.021   0.338   0.309   0.460   0.000
                      0.25   0.747   0.033   0.151   0.139   0.362   0.000
                      0.50   0.489   0.020   0.077   0.077   0.347   0.013
                      0.75   0.240   0.000   0.042   0.045   0.207   0.085
                      0.90   0.094   0.000   0.020   0.016   0.094   0.062
               n=12   0.10   0.898   0.020   0.390   0.383   0.484   0.020
                      0.25   0.747   0.016   0.188   0.183   0.298   0.008
                      0.50   0.497   0.004   0.088   0.088   0.318   0.012
                      0.75   0.247   0.000   0.053   0.051   0.195   0.066
                      0.90   0.097   0.000   0.022   0.016   0.082   0.050
               n=24   0.10   0.900   0.000   0.342   0.333   0.390   0.000
                      0.25   0.750   0.000   0.150   0.150   0.256   0.000
                      0.50   0.501   0.064   0.064   0.067   0.288   0.004
                      0.75   0.251   0.056   0.041   0.039   0.192   0.026
                      0.90   0.100   0.023   0.018   0.018   0.085   0.035
               n=48   0.10   0.899   0.000   0.254   0.265   0.333   0.000
                      0.25   0.749   0.008   0.101   0.101   0.233   0.000
                      0.50   0.500   0.004   0.027   0.023   0.231   0.000
                      0.75   0.251   0.000   0.005   0.003   0.178   0.013
                      0.90   0.100   0.000   0.007   0.002   0.084   0.022

p1: Proportion of differentially expressed genes.

TSBP: RBM Test statistic based permutation method.

TSBB: RBM Test statistic based bootstrap method.

PBP: RBM p-value based permutation method.

PBB: RBM p-value based bootstrap method.

doi:10.1371/journal.pone.0080099.t003


proportion of differentially expressed genes is over 50% across

sample sizes (Figure 2).

In summary, the PBB method performs best when sample size is

large with high sensitivity, specificity, and low false discovery rate,

while the PM performs well on sensitivity and specificity across all

sample sizes, except for a slightly higher false discovery rate when

the proportion of differentially expressed genes is lower than 50%.

The SAM method also performs well with high sensitivity,

specificity, and low false discovery rate, except for low sensitivity

when the proportion of differentially expressed genes is over 50%

for all sample sizes.

Simulation results from non-normally distributed data

In many cases, microarray data are not normally distributed

and fail to be transformed to follow a normal distribution.

Therefore, we explored the sensitivity, specificity, total rejection,

and estimated false discovery rates of the PM, the SAM, and the

RBMs for both log normally and mixed normally distributed data.
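To make the two non-normal settings concrete, the sketch below draws values from a log normal distribution and from a two-component normal mixture and checks the direction of skew. All parameters (means, standard deviations, mixture weight) are hypothetical choices for illustration and are not the values used in the simulations reported here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skewness(x):
    """Moment-based sample skewness: third central moment / variance**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

n_values = 100_000

# Log normal draws: exp of a standard normal, which is right skewed.
log_normal = rng.lognormal(mean=0.0, sigma=1.0, size=n_values)

# Two-component normal mixture (hypothetical parameters): a narrow main
# component plus a minor component shifted to the left, giving left skew.
from_minor = rng.random(n_values) < 0.2
mixed_normal = np.where(
    from_minor,
    rng.normal(loc=-3.0, scale=1.0, size=n_values),
    rng.normal(loc=0.0, scale=0.5, size=n_values),
)

skew_log_normal = sample_skewness(log_normal)
skew_mixed_normal = sample_skewness(mixed_normal)
print(f"log normal skewness:   {skew_log_normal:.2f}")
print(f"mixed normal skewness: {skew_mixed_normal:.2f}")
```

A positive value indicates right skew and a negative value left skew, matching the qualitative description of the two settings.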

When data are log normally distributed (right skewed), the PM

method shows the highest sensitivity of all methods under

comparison across all sample sizes (Table 2). Both the SAM and

the PBB methods show lower sensitivity than other RBM methods.

The sensitivity of the PBB method improves as sample size

increases. However, the SAM method exhibits a much lower

sensitivity compared to all other methods when the proportion of

differentially expressed genes is over 50%. When data follow a

mixed normal distribution (left skewed and skinny tails), although

the same pattern is observed for all methods, the PBB method

shows improved sensitivity for all sample sizes. The SAM method

still has the lowest sensitivity when the proportion of differentially

expressed genes is greater than 50% for all sample sizes.

The specificities of the PM method are significantly lower than those of

all other methods, regardless of sample size and proportion of

differentially expressed genes, for both log normally distributed data

and mixed normally distributed data (Figure 3 and Figure 4). Both

the SAM method and the RBMs have high specificity across all

sample sizes, except that the PBP and the PBB methods have

decreased specificity for all sample sizes when the proportion of

differentially expressed genes is greater than 50% for both log

normally distributed and mixed normally distributed data.

The number of total rejections for all methods also shows

similar trends to sensitivity when data are either log normally or

mixed normally distributed (Figure 3 and Figure 4). In contrast to

normally distributed data, the PM method rejects almost all null

hypotheses even when the proportion of differentially expressed

genes is only 10% or 25% for all sample sizes. The number of

total rejections for the RBMs is close to the true number of

differentially expressed genes in the simulated data set. However,

the SAM method rejects far fewer null hypotheses than the true

number of false null hypotheses when the proportion of

differentially expressed genes is over 50% across all sample sizes

for either log normally or mixed normally distributed data.

The PM method’s false discovery rate is the highest of all

methods compared, for both log normally and mixed normally

distributed data. The estimated false discovery rates for the PM

method are significantly higher than those of all other methods, especially

for data characterized by a small proportion of differentially

expressed genes such as 10% or 25% (Table 3). Of all methods,

the PBB and the SAM methods show good control of false

discovery rates at a 5% level, even when the proportion of

differentially expressed genes is small and sample size is small.
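This section does not restate which procedure converts p-values into rejections at the 5% level. As an illustration only, the sketch below implements the Benjamini-Hochberg step-up procedure [10], one common way to control the false discovery rate; treating it as the procedure behind the results above is an assumption, and the p-values are made up.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking hypotheses rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    ranked = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * i / m for rank i
    below = ranked <= thresholds

    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank meeting its threshold.
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected

# Illustrative p-values (not from the study).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
bh_rejected = benjamini_hochberg(example_p, alpha=0.05)
print(int(bh_rejected.sum()), "hypotheses rejected")
```

Because the thresholds scale with rank, a p-value of 0.039 is not rejected here even though it is below 0.05 on its own.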

In summary, the PBB method performs better than any other

method in terms of sensitivity, specificity, and false discovery rate

control when data are not normally distributed. The SAM

method performs well, except for low sensitivity when the

proportion of differentially expressed genes is over 50% across

all sample sizes.

Real data example

Preterm birth, which is defined as birth occurring before 37

weeks of gestation, can be harmful to the short-term and long-term

health of the infant, and creates a large economic burden in the

US. The estimated lower boundary of annual societal economic

burden associated with preterm birth in the United States was in

excess of $26.2 billion in 2005. A recent study on the role of DNA

methylation in preterm birth indicates that DNA methylation is an

epigenetic risk factor in preterm birth which may influence the risk

of preterm birth, or result in changes predisposing a neonate to

adult-onset diseases [20].

A methylation two-color microarray study (2012) was conducted

at the University of Hawaii John A. Burns School of Medicine to

identify placental DNA methylation loci associated with preterm

delivery. The DNA of 9 women’s placental tissue (originating from

4 premature births and 5 term deliveries) was analyzed. The

gestational age distribution between the preterm delivery group

and the term delivery group was compared using a

permutation test, and no significant difference was found between

these two groups (p-value = 0.556). Placental tissue was sampled

from the decidual membrane rolls of previously formalin fixed

samples. The decidual portion of the placenta predominantly

consists of maternal tissue with a small percentage of fetal tissue.

Using the Illumina® Infinium BeadChip, which utilizes bisulfite

conversion at prespecified CpG sites across the genome, the

percentage of DNA methylation in each of 485,577 loci was

assessed. The “print-tip loess” normalization method was used to

correct for within-array dye and spatial effects, while single

channel quantile normalization was used to facilitate comparison

between arrays.
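The idea behind quantile normalization between arrays can be sketched as follows: each array's values are replaced by the mean, across arrays, of the values holding the same rank. This is a generic illustration that assumes a complete loci-by-arrays matrix and breaks ties by order rather than averaging them; it is not the specific software routine used in the study.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (arrays) of a features-by-arrays matrix.

    Simplified sketch: ties within a column are broken by position,
    not averaged, and missing values are not handled.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")      # per-array sort order
    sorted_x = np.take_along_axis(x, order, axis=0)   # each column sorted ascending
    reference = sorted_x.mean(axis=1)                 # mean value at each rank
    normalized = np.empty_like(x)
    # Write the reference value for rank i back to the position that held rank i.
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

# Toy matrix: 4 loci (rows) measured on 3 arrays (columns).
raw = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.0, 6.0],
                [4.0, 2.0, 8.0]])
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization every array has the same set of values, so differences between arrays reflect rank changes of individual loci rather than array-wide shifts in intensity.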

The histogram of methylation percentages across all loci shows that

the data are not normally distributed

(Figure 5). The RBMs, the SAM, and the PM were used to

identify differentially methylated loci between premature births

and term deliveries. Table 4 lists the total number of

methylation loci identified by each of the six methods.

According to the PM, over 98% of the loci are differentially

methylated between placental tissues from 4 preterm women and 5

term women. This result indicates that the FDR is not well

controlled by the PM. The SAM method identified no differentially

methylated loci between preterm deliveries and term

deliveries, which might be due to the high conservativeness and low

Figure 3. Sensitivity, specificity, total rejection, and estimated false discovery rate comparisons between the RBMs and the PM for log normally distributed gene expression data. Blue: PM; Grey: SAM; Red: RBM test statistic based permutation method; Orange: RBM p-value based permutation method; Green: RBM test statistic based bootstrap method; Purple: RBM p-value based bootstrap method. Figure 3a: sample size n=4 in each group; Figure 3b: sample size n=6 in each group; Figure 3c: sample size n=12 in each group; Figure 3d: sample size n=24 in each group; Figure 3e: sample size n=48 in each group.

doi:10.1371/journal.pone.0080099.g003


power of the SAM method. In contrast, the RBMs perform better

than both the PM and the SAM methods, as they reject a reasonable

number of methylation loci. The number of rejections by the

RBMs is comparable to the number of methylation loci identified

by a recent DNA methylation microarray study [21], which

identified 29 CpG sites among over 485,000 CpG sites associated

with spontaneous preterm birth, independent of gestational ages.

Discussion

The sensitivities, specificities, total rejections, and FDR controls

were compared across the PM, the SAM, and the RBM methods

through simulation studies. The simulation results showed that this

novel approach offers significantly higher specificity and lower

false discovery rates than the PM method for non-normally

distributed data. This approach also offers higher

statistical power than the SAM method when the proportion of

significantly differentially expressed genes is large for both

normally and non-normally distributed data. A real methylation

microarray example was introduced to compare FDR controls

across all methods. The RBMs rejected less than 1% of the

methylation loci, which is comparable to the number of

methylation loci identified by a recent study [21]. However, the PM

rejected over 98% of the methylation loci, and the SAM method

rejected none of the methylation loci.
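The proportions quoted above follow from the counts reported in Table 4 and the total of 485,577 loci. A short check of the arithmetic, using only those reported numbers:

```python
# Counts of identified loci per method, as reported in Table 4.
total_loci = 485_577
discoveries = {
    "PM": 476_554,
    "SAM": 0,
    "TSBP": 34,
    "TSBB": 35,
    "PBP": 119,
    "PBB": 304,
}

# Share of all loci declared differentially methylated by each method.
proportions = {method: count / total_loci for method, count in discoveries.items()}

for method, share in proportions.items():
    print(f"{method}: {share:.4%} of loci rejected")
```

The PM share exceeds 98%, while each of the four RBM variants stays well below 1%, which is the contrast drawn in the text.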

Our resampling-based empirical Bayes approach combined the

resampling methods used by Westfall and Young [12] and Pollard

and Van Der Laan [18] with the empirical Bayes method used by

Smyth [2]. The robustness of the resampling methods to

parametric distribution assumptions on test statistics was incorporated

into the empirical Bayesian part of Smyth’s method, which

made the test statistics more robust to violations of normality

assumptions, and less affected by either underestimated or

overestimated sample variances compared to the ordinary test

statistics used in the resampling procedures.
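The general idea of pairing a variance-shrinking statistic with resampling can be sketched as below. This is a simplified illustration, not the RBM algorithms themselves: the moderated variance follows the form in Smyth [2], (d0·s0² + d·s²)/(d0 + d), but the prior degrees of freedom `d0` and prior variance `s0_sq` are passed in as assumed constants instead of being estimated from the data, and the function names and toy data are hypothetical.

```python
import numpy as np

def moderated_t(x, labels, d0, s0_sq):
    """Two-group moderated t-statistics for a genes-by-samples matrix.

    labels is a boolean array (True = group 1). d0 and s0_sq are the prior
    degrees of freedom and prior variance, ASSUMED known in this sketch.
    """
    g1, g2 = x[:, labels], x[:, ~labels]
    n1, n2 = g1.shape[1], g2.shape[1]
    d = n1 + n2 - 2                                   # residual degrees of freedom
    pooled_var = ((n1 - 1) * g1.var(axis=1, ddof=1)
                  + (n2 - 1) * g2.var(axis=1, ddof=1)) / d
    shrunk_var = (d0 * s0_sq + d * pooled_var) / (d0 + d)
    diff = g1.mean(axis=1) - g2.mean(axis=1)
    return diff / np.sqrt(shrunk_var * (1.0 / n1 + 1.0 / n2))

def permutation_p_values(x, labels, d0, s0_sq, n_perm=500, seed=0):
    """Two-sided permutation p-values obtained by shuffling group labels."""
    rng = np.random.default_rng(seed)
    observed = np.abs(moderated_t(x, labels, d0, s0_sq))
    exceed = np.zeros(x.shape[0])
    for _ in range(n_perm):
        permuted = rng.permutation(labels)            # random relabelling of samples
        exceed += np.abs(moderated_t(x, permuted, d0, s0_sq)) >= observed
    # Add one to numerator and denominator so p-values are never exactly zero.
    return (exceed + 1.0) / (n_perm + 1.0)

# Toy data: 50 genes, 6 samples per group, gene 0 shifted in group 1.
rng_data = np.random.default_rng(1)
expr = rng_data.normal(size=(50, 12))
groups = np.array([True] * 6 + [False] * 6)
expr[0, groups] += 5.0

perm_p = permutation_p_values(expr, groups, d0=4.0, s0_sq=1.0, n_perm=500)
print("p-value for the shifted gene:", perm_p[0])
```

Because the reference distribution comes from relabelled data rather than a t distribution, the p-values do not rely on the normality assumption, while the shrunken variance keeps genes with unusually small sample variances from dominating.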

The PBB method and the SAM method always control the

FDR at lower levels compared to other RBMs and the PM, for

normally, lognormally, and mixed normally distributed data. The

PM has a very large false discovery rate when microarray data are

not normally distributed and the proportion of differentially

expressed genes is small. FDR controls achieved by other RBMs

(i.e., the PBP, the TSBP, and the TSBB methods) are significantly

better than those achieved by the PM, but never as good as those

achieved by the PBB method and the SAM method when data are

not normally distributed. However, for normally distributed data,

the performance of all other RBMs is similar to the PM.

Overall, the PBB method has the highest sensitivity, specificity,

and the best FDR controls of all methods compared in this paper.

Furthermore, the PBB method has much higher sensitivity than

the SAM method when the proportion of differentially expressed

genes is large, and much better FDR control than the PM,

especially when data are not normally distributed. The RBM

methods are computationally more intensive than Smyth’s method

as a result of the resampling approach; however, the computational

efficiency of the RBM methods could be greatly improved

through a Bayesian algorithm that would allocate

the number of resamples more efficiently based on p-values [22].

A vexing issue with Smyth’s approach to microarray

analysis is its propensity to generate erroneous findings, especially

when normality assumptions are violated. Our results show that

the PBB method significantly improves microarray data analysis

when normality assumptions are violated, and promotes accurate

interpretation of findings from microarray studies. As Larsson

pointed out, the fold change criteria in the SAM method are

problematic and can critically alter the conclusion of a study

due to compositional changes of the control data set in the analysis

[7]. As it turns out, our approach is not affected by the fold change

threshold since no selection for the control data set is needed.

Although the Resampling-based empirical Bayes Methods focus

on two-color microarray data analysis, this novel approach could

also be applied to single color oligonucleotide microarrays, and

generalized to next generation sequencing RNA-seq data analysis.

Figure 4. Sensitivity, specificity, total rejection, and estimated false discovery rate comparisons between the RBMs and the PM for mixed normally distributed gene expression data. Blue: PM; Grey: SAM; Red: RBM test statistic based permutation method; Orange: RBM p-value based permutation method; Green: RBM test statistic based bootstrap method; Purple: RBM p-value based bootstrap method. Figure 4a: sample size n=4 in each group; Figure 4b: sample size n=6 in each group; Figure 4c: sample size n=12 in each group; Figure 4d: sample size n=24 in each group; Figure 4e: sample size n=48 in each group.

doi:10.1371/journal.pone.0080099.g004

Figure 5. Histogram of percentage of methylation at 485,577 loci.

doi:10.1371/journal.pone.0080099.g005

Table 4. Total discoveries comparison between the PM and the RBMs for 485,577 loci.

Methods             PM        SAM   TSBP   TSBB   PBP   PBB
Total discoveries   476,554   0     34     35     119   304

TSBP: RBM Test statistic based permutation method.

TSBB: RBM Test statistic based bootstrap method.

PBP: RBM p-value based permutation method.

PBB: RBM p-value based bootstrap method.

doi:10.1371/journal.pone.0080099.t004


Acknowledgments

Many thanks to the reviewers for their thorough review and insightful

comments, which helped us improve this manuscript.

Author Contributions

Conceived and designed the experiments: DL MALP. Performed the

experiments: DL MALP NIP TDD. Analyzed the data: DL MALP. Wrote

the paper: DL MALP. Performed the simulation studies: DL. Provided

specimens to the assessment: NIP TDD. Edited the manuscript: DL MALP

NP WXC TDD.

References

1. Adkins RM, Krushkal J, Tylavsky FA, Thomas F (2011) Racial differences in

gene-specific DNA methylation levels are present at birth. Birth Defects Res A Clin

Mol Teratol 91: 728–36.

2. Smyth GK (2004) Linear models and empirical Bayes for assessing differential

expression in microarray experiments. Statistical Applications in Genetics and

Molecular Biology 3: Article 3.

3. Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for

identifying differentially expressed genes in replicated cDNA microarray

experiments. Statistica Sinica 12: 111–139.

4. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays

applied to the ionizing radiation response. Proceedings of the National Academy

of Sciences 98: 5116–5121.

5. Efron B, Tibshirani RJ, Storey JD, Tusher V (2001) Empirical Bayes analysis of

a microarray experiment. Journal of the American Statistical Association 96: 1151–

1160.

6. Lönnstedt I, Speed TP (2002) Replicated microarray data. Statistica Sinica 12:

31–46.

7. Larsson O, Wahlestedt C, Timmons JA (2005) Considerations when using the

significance analysis of microarrays (SAM) algorithm. BMC Bioinformatics 6:

129–134.

8. Edgar R, Domrachev M, Lash AE (2002) Gene expression omnibus: NCBI gene

expression and hybridization array data repository. Nucleic Acids Res 30: 207–

210.

9. Soric B (1989) Statistical discoveries and effect-size estimation. Journal of the

American Statistical Association 84: 608–610.

10. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical

and powerful approach to multiple testing. Journal of the Royal Statistical

Society Series B (Methodological) 57: 289–300.

11. Good PI (2005) Permutation, parametric and bootstrap tests of hypotheses.

Springer, 3rd edition.

12. Westfall PH, Young SS (1993) Resampling-based multiple testing: examples and

methods for P-Value adjustment. New York: Wiley.

13. Calian V, Li D, Hsu JC (2008) Partitioning to uncover conditions for

permutation tests to control multiple testing error rates. Biometrical Journal

50: 756–766.

14. Efron B (1979) Bootstrap methods: Another look at the jackknife. The Annals of

Statistics 7: 1–26.

15. Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. Chapman &

Hall/CRC.

16. Freedman DA (1981) Bootstrapping regression models. The Annals of Statistics

9: 1218–1228.

17. Hall P (1986) On the bootstrap and confidence intervals. The Annals of Statistics

14: 1431–1452.

18. Pollard KS, van der Laan MJ (2005) Resampling-based multiple testing:

Asymptotic control of type I error and applications to gene expression data.

Journal of Statistical Planning and Inference 125: 85–100.

19. Kerr MK, Churchill GA (2001) Experimental design for gene expression

microarrays. Biostatistics 2: 183–201.

20. Menon R, Conneely KN, Smith AK (2012) DNA methylation: an epigenetic risk

factor in preterm birth. Reproductive Sciences 19: 6–13.

21. Parets SE, Conneely KN, Kilaru V, Fortunato SJ, Syed TA, et al. (2013) Fetal

DNA methylation associates with early spontaneous preterm birth and gestational

age. PLoS One 8: e67489.

22. Jensen TG, Soi S, Wang L (2009) A Bayesian approach to efficient differential

allocation for resampling-based significance testing. BMC Bioinformatics 10:

198.
