Vol. 25 no. 6 2009, pages 751–757
Statistical methods of background correction for Illumina
Yang Xie1,2,∗, Xinlei Wang3 and Michael Story2,4
1Division of Biostatistics, Department of Clinical Sciences, 2Simmons Cancer Center, University of Texas
Southwestern Medical Center, 3Department of Statistical Science, Southern Methodist University and
4Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, USA
Received on September 26, 2008; revised on December 18, 2008; accepted on January 17, 2009
Advance Access publication February 4, 2009
Associate Editor: David Rocke
Motivation: Advances in technology have made different microarray
platforms available. Among the many, Illumina BeadArrays are
relatively new and have captured significant market share. With
BeadArray technology, high data quality is generated from low
sample input at reduced cost. However, the analysis methods
for Illumina BeadArrays are far behind those for Affymetrix
oligonucleotide arrays, and so need to be improved.
Results: In this article, we consider the problem of background
correction for BeadArray data. One distinct feature of BeadArrays is
that for each array, the noise is controlled by over 1000 bead types
conjugated with non-specific oligonucleotide sequences. We extend
the robust multi-array analysis (RMA) background correction model
to incorporate the information from negative control beads, and
consider three commonly used approaches for parameter estimation,
namely, non-parametric, maximum likelihood estimation (MLE) and
Bayesian estimation. The proposed approaches, as well as the
existing background correction methods, are compared through
simulation studies and a data example. We find that the maximum
likelihood and Bayes methods seem to be the most promising.
Supplementary information: Supplementary data are available at
Illumina BeadArrays are long oligonucleotide arrays. Compared
with Affymetrix oligonucleotide arrays, Illumina expression
BeadArrays often generate data of similar quality, require less
sample input and reduce the cost of array experiments (Shi et al.,
2006). These features have allowed the BeadArray platform to
capture a significant market share. BeadArray technology has
been applied to a variety of genomic studies including expression
profiling, single-nucleotide polymorphism genotyping and copy
number variation analysis. In this article, we will focus on genome-
wide expression data.
The essential element of BeadArray technology is the 3-micron
silica bead coated with hundreds of thousands of copies of a
specific oligonucleotide sequence, which will be referred to as one
bead type. For genome-wide expression arrays, the majority of
∗To whom correspondence should be addressed.
the genes are represented by one bead type, which is a specific
50mer oligonucleotide sequence derived from the National Center
for Biotechnology Information (NCBI) reference sequences. Each bead
type is replicated about 30 times on each array to increase the
reproducibility and the quality of the data. Besides gene sequences,
Illumina also allocates more than 1000 control bead types to each
array, which do not correspond to any expressed sequences in the
genome. So the associated beads are not expected to hybridize to
any genes in the RNA samples. They serve as negative controls for
the non-specific binding or background noise in an experiment.
Microarray data preprocessing plays a crucial role in obtaining
valid biological results from array experiments (Bolstad et al., 2003;
Durbin et al., 2002; Irizarry et al., 2003; Li and Wong, 2001; Lin
et al., 2008). Data preprocessing methods for cDNA spotted arrays
and Affymetrix oligonucleotide arrays have undergone rigorous
testing and validation, and these methods have been shown to
greatly improve the quality of data. In contrast, few statistical
methods have been developed for processing BeadArray data.
Dunning et al. (2008) discussed some important statistical issues in
processing and analyzing Illumina BeadArray data. Lin et al. (2008)
proposed a variance-stabilizing transformation (VST) method that
takes advantage of the technical replicates available on an Illumina
microarray to preprocess BeadArray data. In this article, we will
focus on the background correction step for Illumina BeadArrays,
the purpose of which is to remove the non-specific signal from total
signal. Background correction is the main step separating various
microarray data preprocessing methodologies (Irizarry et al., 2006).
Furthermore, it is platform specific because of the different sources
of noise for each platform (Irizarry et al., 2003; Ritchie et al.,
2007). Although the methodologies in background correction for
cDNA and Affymetrix arrays are well studied (Irizarry et al., 2006;
Ritchie et al., 2007), few have been proposed for BeadArrays.
Several inherent platform differences make statistical methods for
processing Illumina BeadArrays different from other platforms.
First, more than 1000 negative control bead types are allocated
on each Illumina array to control background noises. Second,
the negative control beads on Illumina arrays are conjugated to
non-specific sequences that are not associated with gene-specific
probes; this is different from Affymetrix mismatch probes, each of
which has one single nucleotide different from the corresponding
perfect match (PM) probe. Third, the number of replicated beads
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
Y.Xie et al.
for each bead type is random, 30 on average, and the replicated
beads have the same sequence for one bead type.
As the most commonly used software for preprocessing
BeadArray data, BeadStudio (Illumina, CA, USA) provides only
two options for background correction: no background correction
and background subtraction using the average value of the negative
control beads. However, a large number of negative data values
can be generated by using the background subtraction approach.
For example, in our motivating leukemia study (Ding et al., 2008),
half of the probe values on one array became negative after the
background subtraction step. Without using a more advanced data
transformation method, such as VST by Lin et al. (2008), the probes
with negative values need to be excluded for the purpose of log
transformation and it will result in the loss of large amounts of
information on the arrays. On the other hand, without background
correction, the commonly used measure for downstream analysis,
fold change (i.e. the expression ratios between two experimental
groups), can be compressed. For illustration, let E and C represent
the true expression value for the treatment group and the control
group, respectively; and let B be the noise value for both groups.
Suppose we are interested in the ratio R=E/C. Without background
correction, we would use the observed total intensities, E+B and
C+B, to calculate the ratio R′=(E+B)/(C+B). It is easy to show
that R′ is always biased toward 1 compared with R, which results in fewer
genes identified as differentially expressed than expected. Dunning
et al. (2008) also concluded that the background normalization
recommended by Illumina can introduce substantial variability into
the data and increase the number of false positives, and that a large
number of low expression values become negative and cannot be log2 transformed.
to BeadStudio background correction methodologies.
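The compression of fold change without background correction can be checked with simple arithmetic. The following is a minimal sketch; the function name and the intensity values are hypothetical and chosen only for illustration.

```python
def fold_changes(e, c, b):
    """Return (R, R') for true signals e and c and a shared background b.

    R  = e / c               is the true expression ratio.
    R' = (e + b) / (c + b)   is the ratio computed from uncorrected intensities.
    """
    true_ratio = e / c
    observed_ratio = (e + b) / (c + b)
    return true_ratio, observed_ratio


# Hypothetical up-regulated gene: fourfold change on a background of 100.
r_up, r_up_obs = fold_changes(e=400.0, c=100.0, b=100.0)

# Hypothetical down-regulated gene: fourfold decrease on the same background.
r_down, r_down_obs = fold_changes(e=50.0, c=200.0, b=100.0)

print(r_up, r_up_obs)      # the observed ratio is closer to 1 than the true ratio
print(r_down, r_down_obs)  # same compression in the other direction
```

In both directions the uncorrected ratio lies between the true ratio and 1, which is the compression described in the text.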
for Illumina BeadArrays: beadarray (Dunning et al., 2007) and
lumi (Du et al., 2008). In those packages, there are three choices
for background correction: no background correction, background
subtraction and Normexp background correction. Normexp uses the
same normal-exponential convolution model as the robust multi-
array analysis (RMA) background correction, and we refer to this as
the RMA model in our article. RMA is a popular R package used to
preprocess Affymetrix oligonucleotide array data. The background
correction step in RMA uses a normal-exponential convolution
model to fit the PM probes on the array. Empirical evidence shows
that RMA background correction works well in practice (Bolstad
et al., 2003; Irizarry et al., 2003; Wu et al., 2004). Recently,
Silver et al. (2009) proposed using saddlepoint approximation to
estimate parameters for the normal-exponential convolution model,
and they showed that the approach achieved good performance in
background correction for two-color microarray data. This normal-
exponential convolution model with saddlepoint approximation
method is now available in the beadarray package. Ding et al.
(2008) proposed a model-based background correction (MBCB) method to
incorporate information from negative control data generated with
Illumina BeadArrays. Ding et al. (2008) showed that MBCB can
lead to more precise determination of gene expression and better
biological interpretation of Illumina BeadArray data. In this article,
we will further improve parameter estimation of the MBCB model.
Fig. 1. Smoothed histograms: (A) the distributions of observed gene expression intensities from all arrays in the leukemia study; (B) the distribution of the observed intensities of genes and negative controls on a single array.
We will consider and compare three methods for parameter
estimation including non-parametric, maximum likelihood and
Bayesian approaches. A complete mathematical development and
and data files used for this article will be available through the
Bioconductor web site. We will use both simulation studies and
a real data example to illustrate and compare these methods, and
provide guidance for their practical use.
The statistical model we use is motivated by a leukemia study,
which will be described in Section 5. Figure 1A shows the smoothed
histograms of the gene intensities observed from all BeadArrays
in the study. It is not surprising that the distributions appear to
be similar to the Affymetrix signal distributions (Bolstad et al.,
2003). Figure 1B shows one example of the smoothed histograms of
intensities for both the genes and negative controls on a single array,
from which we can see that the distribution of the negative controls
might be approximated by a normal distribution. In addition, the
modes of the genes and the negative controls are close to each other,
with the mode of the genes being larger than that of the controls.
These observations motivated us to extend the convolution model
in the RMA background correction method for Affymetrix arrays to
the BeadArray signal intensities. That is, observed intensity = true
signal + background noise, where the true signal intensities (if not
zero) are modeled by an exponential distribution with mean α and
the background noise is modeled by a normal distribution with
finite mean µ and variance σ². We further assume that µ and σ²
BeadArray data background correction
take appropriate values (e.g. µ>2σ) so that negative background
noise occurs rarely and has a negligible impact. Note that an
additive model is used here, which was considered for modeling
the background noise component by several researchers (Cui et al.,
model in the context of background correction.
Throughout the article, we use i to index the genes, and j to index
the negative controls. For i=1,...,I, we have
$$ X_i = S_i + B_i, \qquad S_i \sim \mathrm{Exp}(\text{mean } \alpha), \quad B_i \sim N(\mu, \sigma^2), \tag{1} $$
where $X_i$ is the observed intensity, $S_i$ is the signal intensity and $B_i$
is the noise intensity for gene i; and I is the number of the genes.
For j=1,...,J, we have
$$ X_{0j} = B_{0j}, \qquad B_{0j} \sim N(\mu, \sigma^2), \tag{2} $$
where $X_{0j}$ is the observed intensity, and $B_{0j}$ is the noise intensity for
negative control j; and J is the number of the negative controls. Note
that all $X_i$'s and $X_{0j}$'s are assumed to be independent; and for each
negative control bead, the signal intensity $S_{0j}$ is naturally assumed
to be zero.
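To make the model concrete, the following minimal sketch simulates one array under the stated assumptions: gene intensities are an exponential signal with mean α plus normal noise, and negative controls are normal noise only. The function name, the seed and the parameter values are illustrative assumptions, not part of the original study.

```python
import random
import statistics


def simulate_array(alpha, mu, sigma, n_genes, n_controls, seed=0):
    """Simulate gene and negative-control intensities under the convolution model."""
    rng = random.Random(seed)
    # Gene intensity = exponential signal (mean alpha) + normal background noise.
    genes = [rng.expovariate(1.0 / alpha) + rng.gauss(mu, sigma)
             for _ in range(n_genes)]
    # Negative controls carry background noise only (signal assumed to be zero).
    controls = [rng.gauss(mu, sigma) for _ in range(n_controls)]
    return genes, controls


genes, controls = simulate_array(alpha=100.0, mu=200.0, sigma=25.0,
                                 n_genes=45000, n_controls=1000)
# The gene mean should be close to alpha + mu, the control mean close to mu.
print(statistics.fmean(genes), statistics.fmean(controls))
```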
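Under (1), for the i-th gene, the marginal density of $X_i$ is given by
$$ p(x_i \mid \mu, \sigma^2, \alpha) = \frac{1}{\alpha} \exp\left( \frac{\sigma^2}{2\alpha^2} - \frac{x_i - \mu}{\alpha} \right) \Phi\left( \frac{x_i - \mu - \sigma^2/\alpha}{\sigma} \right), \tag{3} $$
where φ(·) and Φ(·) are the probability density function (pdf) and
cumulative distribution function (cdf) of a standard normal distribution.
The conditional distribution of the true signal $S_i$ given the total
intensity $X_i = x_i$ can be given by
$$ f(s_i \mid X_i = x_i) = \frac{\phi\left( (s_i - a_i)/b \right)}{b\, \Phi(a_i/b)}, \qquad s_i > 0, \tag{4} $$
where $a_i = x_i - \mu - \sigma^2/\alpha$ and $b = \sigma$. Then the expected true signal
given the observed intensity can be given by
$$ E(S_i \mid X_i = x_i) = a_i + b\, \frac{\phi(a_i/b)}{\Phi(a_i/b)}. \tag{5} $$
As in the RMA method, $E(S_i \mid X_i = x_i)$ will be used as the background
corrected intensity for gene i.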
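The background-corrected intensity can be computed gene by gene once α, µ and σ are known. The sketch below assumes the conditional mean takes the usual form for a normal distribution truncated to positive values, a + b·φ(a/b)/Φ(a/b) with a = x − µ − σ²/α and b = σ; the function name and the numerical values are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: provides pdf (phi) and cdf (Phi)


def corrected_intensity(x, alpha, mu, sigma):
    """Expected true signal given observed intensity x.

    Assumes x is not so far below mu that Phi(a / b) underflows to zero.
    """
    a = x - mu - sigma ** 2 / alpha
    b = sigma
    z = a / b
    return a + b * _STD.pdf(z) / _STD.cdf(z)


# A bright probe: the correction is close to subtracting mu + sigma^2 / alpha.
bright = corrected_intensity(300.0, alpha=100.0, mu=200.0, sigma=25.0)
# A probe below the background mean still receives a positive corrected value.
dim = corrected_intensity(150.0, alpha=100.0, mu=200.0, sigma=25.0)
print(bright, dim)
```

Unlike plain background subtraction, the corrected value stays positive for probes below the background mean, so a log transformation remains possible.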
It is worth mentioning that (5) is different from that used in
the original RMA background correction method (Bolstad, 2004),
$$ E(S_i \mid X_i = x_i) = a_i + b\, \frac{\phi(a_i/b) - \phi\left( (x_i - a_i)/b \right)}{\Phi(a_i/b) + \Phi\left( (x_i - a_i)/b \right) - 1}. \tag{6} $$
This is because in the original RMA method, $B_i$'s were assumed to
be non-negative; however, for algebraic simplicity, the normalizing
constant of the truncated normal distribution was ignored when
calculating $p(x_i)$ and the subsequent $E(S_i \mid X_i = x_i)$ in (6). In contrast,
the $B_i$'s in our model can take negative values but with a very
small probability so that we can ignore the effect of negative values.
Both approaches lead to closed-form formulas. However, we prefer
using (3) and (5) because the mathematical correctness of $p(x_i)$
plays an important role in obtaining the correct maximum likelihood
estimates.
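The practical difference between the two conditional means can be explored numerically. The sketch below assumes the RMA-style formula commonly quoted for the normal-exponential model, in which the signal is restricted to the interval (0, x), and compares it with the version in which the signal is only restricted to be positive. Both function names and all parameter values are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()


def signal_unbounded_noise(x, alpha, mu, sigma):
    """Conditional mean when the noise may be negative (signal only needs s > 0)."""
    a, b = x - mu - sigma ** 2 / alpha, sigma
    return a + b * _STD.pdf(a / b) / _STD.cdf(a / b)


def signal_rma_style(x, alpha, mu, sigma):
    """Conditional mean when the noise is non-negative (signal in (0, x))."""
    a, b = x - mu - sigma ** 2 / alpha, sigma
    numerator = _STD.pdf(a / b) - _STD.pdf((x - a) / b)
    denominator = _STD.cdf(a / b) + _STD.cdf((x - a) / b) - 1.0
    return a + b * numerator / denominator


# When mu is large relative to sigma the two formulas are practically identical.
gap_large_mu = abs(signal_unbounded_noise(300.0, 100.0, 200.0, 25.0)
                   - signal_rma_style(300.0, 100.0, 200.0, 25.0))
# When mu is only 2 * sigma the extra terms are no longer negligible.
gap_small_mu = abs(signal_unbounded_noise(60.0, 100.0, 50.0, 25.0)
                   - signal_rma_style(60.0, 100.0, 50.0, 25.0))
print(gap_large_mu, gap_small_mu)
```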
In the original RMA convolution model, there is no closed form
to estimate the parameters α, µ and σ2. The numerical maximum
likelihood algorithms cannot converge well, as pointed out by the
authors themselves and others (Bolstad et al., 2003; McGee and
Chen, 2006). Therefore, an ad hoc approach is used to estimate
the parameters. In Bolstad et al. (2003), it is described as follows.
First, a non-parametric density function is fitted from the observed
intensities, and the mode of this density is used to estimate µ. Then
the lower tail of the density (to the left of the mode) is used to
estimate σ and the right tail is used to estimate α. Based on the
program in Bioconductor, the parameters are estimated in a slightly
different way: one first obtains the overall mode m0 of the whole
density; then obtains the local mode m1 from the left tail of the
density (to the left of m0) and uses m1 to estimate µ; and finally, one
uses the left tail of the density (to the left of m1) to estimate σ and
the right tail (to the right of m1) to estimate α.
The model described in Section 2 incorporates extra information
from negative controls, which can easily solve the problem
encountered in the RMA parameter estimation. Let $\theta^T = (\alpha, \mu, \sigma^2)$ denote the
vector of parameters in our model. We will discuss
three methods for estimating θ below.
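A non-parametric estimator
A simple method to estimate θ is to estimate α using all the
observations, but estimate µ and σ² using observations from
negative control beads only. Noting that $E X_i = \alpha + \mu$ and $E X_{0j} = \mu$,
an unbiased estimator of α can be given by $\tilde\alpha = \bar X - \bar X_0$, where
$\bar X = \sum_{i=1}^{I} X_i / I$ and $\bar X_0 = \sum_{j=1}^{J} X_{0j} / J$. If we further estimate µ and
σ² using the sample mean and variance of $(X_{0j})_{j=1}^{J}$, then an unbiased
estimator $\tilde\theta = (\tilde\alpha, \tilde\mu, \tilde\sigma^2)$ is given by
$$ \tilde\alpha = \bar X - \bar X_0, \qquad \tilde\mu = \bar X_0, \qquad \tilde\sigma^2 = \frac{1}{J-1} \sum_{j=1}^{J} (X_{0j} - \bar X_0)^2, \tag{7} $$
with variances
$$ \mathrm{Var}(\tilde\alpha) = \frac{\alpha^2 + \sigma^2}{I} + \frac{\sigma^2}{J}, \qquad \mathrm{Var}(\tilde\mu) = \frac{\sigma^2}{J}, \qquad \mathrm{Var}(\tilde\sigma^2) = \frac{2\sigma^4}{J-1}. \tag{8} $$
An approximate interval estimator of α can be given by
$\tilde\alpha \pm Z \cdot \widehat{SD}(\tilde\alpha)$, where
$\widehat{SD}(\tilde\alpha) = \sqrt{(\tilde\alpha^2 + \tilde\sigma^2)/I + \tilde\sigma^2/J}$
and Z is the appropriate standard normal quantile. Also, an
exact confidence interval of α can be computed using the fact
that $\tilde\alpha$ is equivalent to the sum of independent Gamma(I, α/I) and
N(0, σ²/I + σ²/J) random variables.
Though simple in nature, $\tilde\theta$ might be attractive for practitioners.
First of all, $\tilde\theta$ is a non-parametric estimator of θ that requires
no distributional assumptions on either the true signals or the background
noise. Second, $(\tilde\alpha, \tilde\mu)$ is indeed the least squares estimator of (α, µ),
which minimizes the sum of squared distances
$$ \sum_{i=1}^{I} (X_i - \alpha - \mu)^2 + \sum_{j=1}^{J} (X_{0j} - \mu)^2. $$
More importantly, $\tilde\theta$ is very easy to compute; and the associated
variances can be estimated readily by plugging $\tilde\theta$ in (8). Note that
we use the terminology ‘non-parametric’ in this article to refer to
the parameter estimation but not the model assumptions.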
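The non-parametric estimator only needs sample means and a sample variance, so it can be sketched in a few lines. In the sketch below the function name and the toy intensities are hypothetical; the standard deviation of the α estimate uses the plug-in form sqrt((α̃² + σ̃²)/I + σ̃²/J).

```python
import math
import statistics


def nonparametric_estimate(genes, controls):
    """Moment-based estimates of alpha, mu and sigma^2 from one array."""
    n_genes, n_controls = len(genes), len(controls)
    mu = statistics.fmean(controls)            # background mean from controls only
    sigma2 = statistics.variance(controls)     # sample variance, divisor J - 1
    alpha = statistics.fmean(genes) - mu       # mean signal = gene mean - control mean
    # Plug-in standard deviation of the alpha estimate.
    sd_alpha = math.sqrt((alpha ** 2 + sigma2) / n_genes + sigma2 / n_controls)
    return {"alpha": alpha, "mu": mu, "sigma2": sigma2, "sd_alpha": sd_alpha}


# Toy intensities, far smaller than a real array, chosen for easy checking.
est = nonparametric_estimate(genes=[310.0, 290.0, 300.0],
                             controls=[190.0, 210.0, 200.0])
print(est)
```

An approximate 95% interval for α would then be est["alpha"] ± 1.96 · est["sd_alpha"], under the usual large-sample normal approximation.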
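The maximum likelihood estimation (MLE)
When estimating µ and σ², the non-parametric estimator $\tilde\theta$ ignores
the information expressed through the genes and this leads to great
simplicity in calculation. However, since I ≫ J, one may improve
the estimation of µ and σ² by also using the information from the genes.
In order to do so, it is natural to consider the maximum likelihood
estimator, due to its well-established theoretical properties.
The likelihood function is given by
$$ L(\theta) = \prod_{i=1}^{I} p(x_i \mid \mu, \sigma^2, \alpha) \prod_{j=1}^{J} \frac{1}{\sigma} \phi\left( \frac{x_{0j} - \mu}{\sigma} \right), $$
where $p(x_i \mid \mu, \sigma^2, \alpha)$ is the marginal density of the gene intensities in (3).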
The maximization of the log likelihood function l(θ) over θ
has no closed form solution. The MLE $\hat\theta = (\hat\alpha, \hat\mu, \hat\sigma^2)$ can be
computed numerically through the Newton–Raphson algorithm,
which iteratively updates the parameter estimate using
$$ \theta_{k+1} = \theta_k - \left[ \ddot{l}(\theta_k) \right]^{-1} \dot{l}(\theta_k), \qquad k = 0, 1, \ldots, $$
where $\dot{l}$ denotes the vector of the first-order derivatives of l(θ), $\ddot{l}$
denotes the Hessian matrix of the second-order derivatives of l(θ),
and $\theta_0$ is the initial value. Here, we use our non-parametric estimate
$\tilde\theta$ for $\theta_0$. We note that the good initial point, plus the fact that $\dot{l}$ and $\ddot{l}$
are both available in closed forms, leads to quick convergence of the
algorithm in our problem setting. Also, the covariance matrix of $\hat\theta$ is
estimated routinely through $-\left[ \ddot{l}(\hat\theta) \right]^{-1}$, so that we can construct an
interval estimator of θ readily. We also show that for large samples,
$(\hat\mu, \hat\sigma^2)$ can improve estimation efficiency of $(\tilde\mu, \tilde\sigma^2)$. The details can
be seen in the Supplementary Material.
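As an illustration of the objective being maximized, the sketch below evaluates the log likelihood for given parameter values. It assumes the normal-exponential marginal density for genes and a normal density for negative controls. It does not implement the Newton–Raphson iteration or the closed-form derivatives, and the function name, seed and simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()


def log_likelihood(alpha, mu, sigma, genes, controls):
    """Log likelihood of (alpha, mu, sigma) given gene and control intensities."""
    total = 0.0
    for x in genes:
        z = (x - mu - sigma ** 2 / alpha) / sigma
        # log of (1/alpha) * exp(sigma^2/(2 alpha^2) - (x - mu)/alpha) * Phi(z);
        # the floor guards against log(0) when Phi(z) underflows.
        total += (-math.log(alpha) + sigma ** 2 / (2.0 * alpha ** 2)
                  - (x - mu) / alpha + math.log(max(_STD.cdf(z), 1e-300)))
    for x0 in controls:
        # log of the N(mu, sigma^2) density at x0.
        total += (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
                  - 0.5 * ((x0 - mu) / sigma) ** 2)
    return total


# Hypothetical simulated data with alpha = 100, mu = 200, sigma = 25.
rng = random.Random(1)
genes = [rng.expovariate(1.0 / 100.0) + rng.gauss(200.0, 25.0) for _ in range(2000)]
controls = [rng.gauss(200.0, 25.0) for _ in range(200)]

ll_true = log_likelihood(100.0, 200.0, 25.0, genes, controls)
ll_wrong = log_likelihood(50.0, 230.0, 25.0, genes, controls)
print(ll_true > ll_wrong)
```

A numerical optimizer started from the non-parametric estimate could be applied to this function; that would be a substitute for, not a reproduction of, the closed-form Newton–Raphson scheme described in the text.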
A Bayesian approach
Bayesian approaches are frequently used in microarray data analysis
due to the flexibility of incorporation of complicated data structure
and prior information (Reilly et al., 2003; Xiao et al., 2006). Here,
we describe a Bayesian method for parameter estimation in our model.
The joint posterior distribution for the parameters α, µ and σ² is
$$ p(\alpha, \mu, \sigma^2 \mid X, X_0) \propto \prod_{i=1}^{I} p(x_i \mid \mu, \sigma^2, \alpha) \prod_{j=1}^{J} \frac{1}{\sigma} \phi\left( \frac{x_{0j} - \mu}{\sigma} \right) \pi(\alpha, \mu, \sigma^2), $$
where $\pi(\alpha, \mu, \sigma^2)$ is the prior density function, which can easily
incorporate useful information available from direct knowledge
or previous experiments. Here, we consider weak, but proper
independent prior distributions for the three parameters, to represent
the case that no real prior information is available, for the purpose
of fair comparison in our numerical experiments. For µ, we choose
a normal prior with a sufficiently large variance to make the
mean irrelevant. For both α and σ², we use the conjugate prior
with vague information, that is, the inverse gamma distribution.
The posterior distribution p(α,µ,σ2|X,X0) is not analytically
tractable. So a Metropolis–Hastings (MH) procedure is used to
simulate samples from the posterior distribution, which is outlined
in the Supplementary Material. We now proceed to discuss the
background correction method under the Bayesian approach. As
indicated in Equation (4), the conditional distribution of $S_i \mid X_i$
is a truncated normal distribution, $S_i \mid X_i = x_i, \mu, \sigma^2, \alpha \sim N(x_i - \mu - \sigma^2/\alpha, \sigma^2)$
where $0 < S_i < X_i$. In the Markov chain Monte Carlo
(MCMC) procedure, we can simulate a sample of $S_i$ from this distribution
for each gene and then use the average of the $S_i \mid X_i$ samples as the
background corrected signal for gene i. An alternative way is to
calculate the posterior means of the parameters, and then plug them
into Equation (5) to get the background corrected signals. These two
approaches lead to almost identical results, and we use the second
approach in our numerical experiments for comparison with the other methods.
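Drawing the signal from its truncated normal conditional distribution can be sketched with inverse-cdf sampling. The example below holds the parameters fixed at hypothetical values, whereas in a full MCMC run they would be the current posterior draws. It uses the bounds 0 < S < x as stated in the text, and the function name is an assumption for illustration.

```python
import random
from statistics import NormalDist


def sample_signal(x, alpha, mu, sigma, rng):
    """Draw S from N(x - mu - sigma^2/alpha, sigma^2) truncated to (0, x)."""
    dist = NormalDist(mu=x - mu - sigma ** 2 / alpha, sigma=sigma)
    lower, upper = dist.cdf(0.0), dist.cdf(x)
    # Uniform draw on the cdf range of the truncation interval.
    u = lower + (upper - lower) * rng.random()
    # Keep u strictly inside (0, 1), as required by inv_cdf.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    # Clamp to guard against rounding at the interval ends.
    return min(max(dist.inv_cdf(u), 0.0), x)


rng = random.Random(0)
draws = [sample_signal(300.0, 100.0, 200.0, 25.0, rng) for _ in range(1000)]
# The average of the draws approximates the background-corrected signal.
print(sum(draws) / len(draws))
```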
We have discussed three methods for estimating θ based on
the model in (1) and (2). In this section, we first compare the
performance of the three methods in parameter estimation and
background correction under various parameter settings.
In our first experiment, we set α to be 20, 50 or 100; µ to be
100, 150 or 200; and σ to be 25, 35 or 45. For each possible θ
value, we simulated 100 datasets and for each dataset, we generated
45000 observations for genes from (1), and 1000 observations for
negative controls from (2). Supplementary Table 1 reports the mean
squared errors (MSE) of parameter estimation for each method.
Table 1 reports the MSEs of background corrected intensities.
Table 1. MSE of background corrected intensities
RMA NP MLE B RAW NES
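Under each setting, for example, the MSE of the MLE is defined as
$\sum_{t=1}^{100} (\hat\theta_t - \theta)^2 / 100$, where $\hat\theta_t$ is the estimate from the
t-th simulated dataset; and the MSE of background corrected
intensities is defined as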
The MSEs for RMA, non-parametric and Bayes estimators are
defined similarly. We also compare the MSEs of background
corrected intensities for raw data without any background correction
and the normexp model with saddlepoint estimation. In the tables,
NP stands for the non-parametric estimator $\tilde\theta$, B stands for the Bayes
estimator $\check\theta$ and NES stands for the normexp model with saddlepoint
estimation. RMA, MLE and RAW are self-explanatory.
We first discuss the performance in parameter estimation. From
Supplementary Table 1, we can see that the RMA estimator is
substantially worse than the NP, MLE and B estimators. The overall
performance of MLE and B are very similar. MLE and B work very
well for estimating α and µ, and the estimates are nearly as good
as the truth, as indicated by the close-to-zero MSE values. There is
actually not much difference in estimating α and µ among MLE,
NP and B, though MLE and B appear to always be slightly better.
However, MLE and B are much better than NP for estimating σ². As
to background correction, we can conclude easily from Table 1 that
For RMA, NP, MLE and B methods, we also calculated the
biases of parameter estimation (e.g. the bias of MLE is defined as
$\sum_{t=1}^{100} (\hat\theta_t - \theta) / 100$), and the average computing time over all the
settings. RMA is seriously biased in all the settings, while all the
other methods appear to be unbiased. In terms of computing speed,
NP is the fastest, followed by RMA and MLE, and B requires much
more time (i.e., the average time in seconds is 0.00, 0.23, 1.46 and
26.18 for NP, RMA, MLE and B, respectively).
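The summary statistics used in the first experiment are simple averages over replicate datasets. The sketch below enumerates the parameter grid described in the text and defines the MSE and bias of a set of replicate estimates; the function names and the three toy estimates are hypothetical.

```python
from itertools import product


def mse(estimates, truth):
    """Mean squared error of replicate estimates around the true value."""
    return sum((est - truth) ** 2 for est in estimates) / len(estimates)


def bias(estimates, truth):
    """Average signed deviation of replicate estimates from the true value."""
    return sum(est - truth for est in estimates) / len(estimates)


# All combinations of alpha, mu and sigma used in the first experiment.
settings = list(product([20, 50, 100], [100, 150, 200], [25, 35, 45]))
print(len(settings))

# Toy replicate estimates of a parameter whose true value is 100.
toy_estimates = [99.0, 101.0, 102.0]
print(mse(toy_estimates, 100.0), bias(toy_estimates, 100.0))
```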
In practice, the normality assumption for background noise may not
always hold. In our second experiment, we test the robustness of the
estimators when the normality assumption is violated.
We generated true signal intensities from exponential(100), and
background noise from Gamma(a,b). Here, (a,b) takes values
(64,3.125), (32.65,6.125) and (19.75,10.125) so that the mean µ of
the background noise is fixed at 200, and the SD σ takes values 25,
35 and 45, respectively. Supplementary Figure 1 shows clearly that
the gamma distribution with the largest variance deviates the most
from the normality assumption.
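The stated Gamma settings can be checked against the target mean and SD. The sketch below assumes that Gamma(a, b) is parameterized by shape a and scale b, so that the mean is a·b and the SD is sqrt(a)·b; the function name is hypothetical.

```python
import math


def gamma_mean_sd(shape, scale):
    """Mean and standard deviation of a Gamma(shape, scale) distribution."""
    return shape * scale, math.sqrt(shape) * scale


gamma_settings = [(64, 3.125), (32.65, 6.125), (19.75, 10.125)]
moments = [gamma_mean_sd(shape, scale) for shape, scale in gamma_settings]
for (shape, scale), (mean, sd) in zip(gamma_settings, moments):
    print(f"shape={shape}, scale={scale}: mean={mean:.2f}, sd={sd:.2f}")
```

Under this parameterization the three settings give means of about 200 and SDs of about 25, 35 and 45, in agreement with the text.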
For each (a,b), we generated 100 datasets and proceeded with the
parameter estimation and background correction as if the underlying
true model had been (1) and (2). The top panel of Table 2 reports
the MSEs of background corrected intensities defined in (11). Note
that in (11), the first expectation was calculated under the assumed
model, while the second expectation was calculated under the true
data generating model.
better than RMA. They work reasonably well, especially when the
deviation from normality is not big. As the deviation gets larger, the
performance worsens. Clearly, NP is the best amongst all methods.
when estimating parameters.
Another model assumption is that the distribution of Bi’s (the
control sequences). Ideally, the noise sources should be the same for
scanned simultaneously. In this sense, the assumption is reasonable.
However, it is possible that the negative control beads also contain
some weak signals because of the sequence selection, which may
lead to the higher intensity levels of the negative controls compared
negative controls is greater than the mean intensity of the random
noise. In our third experiment, we check the performance of the
methods under this scenario.
We generated true signal intensities from exponential(100), the
true background noise from normal distributions with mean µ
and variance σ2, and the negative control intensities from normal
distributions with mean µ0 and variance σ². We set µ to be 100 or
of Table 2 reports the MSEs of background corrected intensities
defined in (11).
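The effect of a shifted negative-control mean on the non-parametric estimator can be worked out directly. As a sketch, write $\delta = \mu_0 - \mu$ for the shift (this definition of δ is an assumption made here for illustration), and recall that the non-parametric estimator uses $\tilde\mu = \bar X_0$ and $\tilde\alpha = \bar X - \bar X_0$:

$$ E(\tilde\mu) = E(\bar X_0) = \mu + \delta, \qquad E(\tilde\alpha) = E(\bar X) - E(\bar X_0) = (\alpha + \mu) - (\mu + \delta) = \alpha - \delta. $$

Under this assumption the background mean is overestimated by δ and the mean signal is underestimated by the same amount, which suggests why an estimator that relies on negative controls alone is sensitive to this kind of violation.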
Table 2 shows that NP, MLE and B perform better than RMA even
when the noise distributions are different for negative controls and
genes. When δ increases, the MSEs increase for NP, MLE and B. The
performance of RMA does not depend on δ because its estimation
uses information from genes only. On the other hand, NP is the most
sensitive to δ, because its estimator ˜ µ is calculated from negative
controls only. MLE performs the best in all the cases, which is
slightly better than B. Although both MLE and B use the information
Table 2. Robustness checking for MSE of background corrected intensities
under Gamma background noise (top panel) and under different distributions
of background noise (bottom panel) for genes and negative controls
We use a leukemia study to examine the different approaches of
parameter estimation and the subsequent background correction. In
to study the leukemogenic process, which have been proved to be
potential tools for studying the pathogenesis of leukemia in humans.
Illumina Mouse-6 V1 BeadChip mouse whole-genome expression
arrays were used to obtain the gene expression profiles of acute
myeloid leukemia (AML) samples from irradiated CBA mice who
subsequently developed AML and the control samples. In this type
of array, 46120 genes and 1655 negative controls are randomly
allocated on each array. The goal of the study was to identify
the genes that are differentially expressed between leukemia and control
samples. Ding et al. (2008) described the experiment in detail and
demonstrated that using a background correction strategy enhanced
the biological findings of the study.
We applied our methods to all the arrays in the study.
Supplementary Table 1 gives an example of the point estimates and
standard errors of the parameters for one array. Note that the RMA
method does not provide standard errors for parameter estimation.
The variances of the parameter estimates are small for NP, MLE and
B. This is expected because the information from over 40000 genes
As in our simulation studies, MLE and B give similar results that
are different from those of RMA.
To compare the performance of different methods, reverse
transcriptase-polymerase chain reaction (RT–PCR) experiments
were conducted on randomly selected genes in the study. RT-PCR
is a highly sensitive technique for the detection and quantification
of mRNA (messenger RNA) level, and so is regarded as the gold
standard for gene expression levels. However, RT-PCR experiments
are relatively time and labor intensive since they measure the
Table 3. Comparison with the RT-PCR results in the leukemia study
expression level for one gene at one time. In our study, RT-PCR
experiments were limited to 14 genes because of cost. We applied
the raw data without background correction (RAW), the background
subtraction method as mentioned in Section 1, RMA, NP, MLE and
B for background correction, and then used the quantile–quantile
normalization to remove the systematic variation amongst arrays.
Note that after background subtraction, 12 out of the 14 genes have
negative expression values for either AML or control samples. So
log ratios of gene expression levels between leukemia samples and
control samples were calculated for each method. The method that
generates the results most consistent with the RT-PCR results
is considered the best method.
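Quantile normalization forces every array to share the same empirical distribution. The following is a generic minimal sketch of the technique, not the implementation used in the study; it assumes arrays of equal length and breaks ties by position, and the function name and toy values are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        # Indices of the array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized


# Two toy arrays with three probes each.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(normalized)
```

After normalization both arrays contain the same set of values, assigned according to each probe's rank within its own array.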
Supplementary Figure 2 indicates that for the randomly
selected 14 genes, the background corrected expression levels
from BeadArrays are highly correlated with the RT-PCR results.
Background correction using MLE and B leads to the results most
consistent with RT-PCR, and RAW performs the worst. We notice
that without background correction, the log ratios are closer to 0
compared with the RT-PCR results, which is consistent with the
data compression problem discussed in Section 1.
Table 3 summarizes the association between the background
corrected microarray results and the RT-PCR results based on linear
regression, in which
the response variable is the log10 ratio of gene expression between
leukemia and normal tissues generated by microarrays with the
different methods of background correction, and the independent
variable is the log10 ratio of gene expression between leukemia and
normal tissues generated by RT-PCR. If a background correction
method generates consistent results with the RT-PCR method, then
the slope of the corresponding linear regression model is expected
to be close to 1. The slopes for RT-PCR versus MLE and B are
both 0.87, and the slope for RT-PCR versus NP is 0.95, RT-PCR
versus RMA is 1.20 and RT-PCR versus RAW is 0.52. In addition,
using the RT-PCR results as the gold standard, the MSEs for MLE
and B are smallest, followed by NP, and last RMA and RAW. All
of the above indicate that applying background correction methods
can increase the BeadArray data quality and amongst the methods,
it appears that MLE and B are the best, with NP next, and RMA the
worst. Also note that if we use the background subtraction method
provided by BeadStudio software, most information for those 14
genes is lost. The normexp method with saddlepoint approximation
(NES) implemented in the beadarray package was also applied to
the leukemia study, and it did not generate results more consistent
with RT-PCR (MSE=0.19 and β=0.54) than the
methods we described in this article.
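The agreement between microarray and RT-PCR log ratios can be summarized by a regression slope and a mean squared difference. The sketch below fits ordinary least squares with RT-PCR as the independent variable; treating the MSE as the mean squared difference between paired log ratios is an assumption about how the MSE was defined, and the function name and toy log ratios are hypothetical rather than values from the study.

```python
def slope_and_mse(rtpcr, microarray):
    """Least-squares slope of microarray on RT-PCR log ratios, plus their MSE."""
    n = len(rtpcr)
    mean_x = sum(rtpcr) / n
    mean_y = sum(microarray) / n
    sxx = sum((x - mean_x) ** 2 for x in rtpcr)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rtpcr, microarray))
    slope = sxy / sxx
    # Mean squared difference, treating RT-PCR as the gold standard.
    mse = sum((y - x) ** 2 for x, y in zip(rtpcr, microarray)) / n
    return slope, mse


# Toy log10 ratios in which the microarray compresses every ratio by half.
slope, mse = slope_and_mse(rtpcr=[-1.0, 0.0, 1.0, 2.0],
                           microarray=[-0.5, 0.0, 0.5, 1.0])
print(slope, mse)
```

A slope well below 1, as in this toy example, is the pattern expected when log ratios are compressed toward 0.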
The Illumina BeadArray platform has become increasingly
important because it is carefully designed to control noise
and variation. However, the statistical methodology development
for this platform is far behind that for Affymetrix and cDNA arrays,
and there is ample space for improvement. In this article, we have
described model-based background correction methods for Illumina
BeadArrays. Built on the RMA convolution model, our model
incorporates the information from over 1000 negative control beads,
and improves the efficiency of background correction significantly,
compared with the existing methods. We have considered three
methods, namely, non-parametric, MLE and Bayes methods, for
parameter estimation. All these methods have their own merits and
are better than the RMA estimation method. The non-parametric
method is very simple, and fast in calculation, which can provide
a good starting point for the other two methods. The MLE is
attractive in theory and has the best estimation efficiency overall.
The Bayes method can incorporate
complicated data structure and prior information into the model,
which is very useful when extra sources of data are available. Note
that the methods of background correction compared in this article
focus on removing background noise from auto fluorescence of the
non-specific oligonucleotide on a spot. Because these methods do
not intend to address local background noise, they can be applied to
both bead-level data and summarized bead-type data. Our real data
example does not show much benefit of applying the methods to
The model we used relies on two assumptions: the background
noise is assumed to be normally distributed, and has the same
distribution for both gene and negative control beads. The
assumptions were made based on the empirical distributions from
real data examples that we have been exposed to, as well as
mathematical convenience. We caution that the assumptions might
be violated for BeadArrays in some experiments. To lessen this
concern, we have examined the robustness of our methods through
simulations where the assumptions did not hold and through the
leukemia study where the truth is not known. We find that our
these issues for potentially better results.
Funding: National Institute of Health (UL1RR024982, and NASA
(NSCORS NNJ05HD36G, NAG9-1569).
Conflict of Interest: none declared.
Bolstad,B.M. et al. (2003) A comparison of normalization methods for high density
oligonucleotide array data based on variance and bias. Bioinformatics, 19,
Bolstad,B.M. (2004) Low Level Analysis of High-density Oligonucleotide Array
Data: Background, Normalization and Summarization. Dissertation. University of
Cui,X. et al. (2003) Transformations for cDNA microarray data. Stat. Appl. Genet. Mol.
Biol., 2, Article 4.
Ding,L.H. et al. (2008) Enhanced identification and biological validation of differential
gene expression via Illumina whole-genome expression arrays through the use
of the model-based background correction methodology. Nucleic Acids Res.,
Du,P. et al. (2008) lumi: a pipeline for processing Illumina microarray. Bioinformatics,
Dunning,M.J. et al. (2007) beadarray: R classes and methods for Illumina bead-based
data. Bioinformatics, 23, 2183–2184.
Dunning,M.J. et al. (2008) Statistical issues in the analysis of Illumina data. BMC
Bioinformatics, 9, 85.
Durbin,B.P. et al. (2002) A variance-stabilizing transformation for gene-expression
microarray data. Bioinformatics, 18(Suppl. 1), S105–S110.
Huber,W. et al. (2002) Variance stabilization applied to microarray data calibration and
to the quantification of differential expression. Bioinformatics, 18(Suppl. 1).
Irizarry,R.A. et al. (2003) Exploration, normalization, and summaries of high density
oligonucleotide array probe level data. Biostatistics, 4, 249–264.
Irizarry,R.A. et al. (2006) Comparison of Affymetrix GeneChip expression measures.
Bioinformatics, 22, 789–794.
Li,C. and Wong,W.H. (2001) Model-based analysis of oligonucleotide arrays:
expression index computation and outlier detection. Proc. Natl Acad. Sci. USA,
Lin,S.M. et al. (2008) Model-based variance-stabilizing transformation for Illumina
microarray data. Nucleic Acids Res., 36, e11.
McGee,M. and Chen,Z. (2006) Parameter estimation for the exponential-normal
convolution model for background correction of affymetrix genechip data.
Stat. Appl. Genet. Mol. Biol., 5, 24.
Reilly,C. et al. (2003) A method for normalizing microarrays using the genes that are
not differentially expressed. JASA, 98, 868–878.
Ritchie,M.E. et al. (2007) A comparison of background correction methods for two-
colour microarrays. Bioinformatics, 23, 2700–2707.
Shi,L. et al. (2006) The MicroArray Quality Control (MAQC) project shows inter- and
intraplatform reproducibility of gene expression measurements. Nat. Biotechnol.,
Silver,J.D. et al. (2009) Microarray background correction: maximum likelihood
estimation for the normal-exponential convolution. Biostatistics.
Wu,Z. et al. (2004) A model based background adjustment for oligonucleotide
expression arrays. JASA, 99, 909–917.
Xiao,G. et al. (2006) Operon information improves gene expression estimation for
cDNA microarrays. BMC Genomics, 7, 87.