Faster cyclic loess: normalizing RNA arrays via linear models
Karla V. Ballman∗, Diane E. Grill, Ann L. Oberg and
Terry M. Therneau
Division of Biostatistics, Mayo Clinic College of Medicine, Rochester, MN 55905, USA
Received on January 9, 2004; revised and accepted on February 20, 2004
Advance Access publication May 27, 2004
Motivation: Our goal was to develop a normalization
technique that yields results similar to cyclic loess normalization
and with speed comparable to quantile normalization.
Results: Fastlo yields normalized values similar to cyclic loess
and quantile normalization and is fast; it is at least an order of
magnitude faster than cyclic loess and approaches the speed
of quantile normalization. Furthermore, fastlo is more versatile
than both cyclic loess and quantile normalization because it is
model-based.
Availability: The Splus/R function for fastlo normalization is
available from the authors.
High-density gene expression array technology allows
investigators to obtain quantitative measurement of the
specimen. There are two major types of microarray technologies:
spotted cDNA and oligonucleotide arrays. Expression
data obtained from either type of microarray technology has
measurement error or variation. One type of variability is due
to biological differences between specimen samples. This
is what is of interest to the investigator; the investigator
would like to know which genes are differentially expressed
among different biological samples (e.g. between cancerous
and normal kidney tissues, among B-cells challenged
with different agents in culture). Systematic variation also
affects the measured gene expression level. Many sources
contribute to this type of variation and are found in every
microarray experiment. Sources include, but are not limited to,
the array manufacturing process, the preparation of
the biological sample, the hybridization of the sample to the
(Hartemink et al., 2001) provide a more complete discussion of the
process. The purpose of normalization is to minimize the
systematic variations in the measured gene expression levels
among different array hybridizations to allow the comparison
of expression levels across arrays and so that biological
differences can be more easily identified.
∗To whom correspondence should be addressed.
There are numerous methods for normalizing gene expres-
sion data. Generally, normalization methods use a scaling
are applied to the raw intensities of the spots (gene sequence
GeneChip array) on the microarray to produce normalized or
scaled intensities. Types of normalization techniques include
mean correction (Richmond and Somerville, 2000), non-
linear models (Yang et al., 2002), linear combination of
factors (Alter et al., 2000), and Bayesian methods (Newton
of a baseline array, perform the best (Bolstad et al., 2003). We
tend to favor two commonly used non-linear methods: cyclic
loess normalization and quantile normalization. Both tech-
niques are non-linear and perform normalization on the set of
arrays as a whole without specifying a reference array. Over-
all, we prefer cyclic loess because it is not as aggressive in
its normalization as is quantile normalization; however, cyclic
loess is relatively slow for even a moderately sized set of
arrays. Quantile normalization, on the other hand, is very
fast for even large sets of arrays. The goal of our invest-
igation was to develop a method that produced normalized
faster—on the order of the speed of quantile normalization.
The result is a new normalization method called fastlo. Since
normalization is performed on the raw intensity values of the
spots on the arrays, the methods discussed here are applic-
able to both major types of array technology: spotted cDNA
and oligonucleotide. However, our focus is on data arising
from GeneChip arrays and it should be noted that there are
cDNA arrays (Yang et al., 2002).
Bioinformatics vol. 20 issue 16 © Oxford University Press 2004; all rights reserved.
at University of Portland on May 23, 2011
Fig. 1. MA plot for 10000 randomly selected probes. x-axis: average of the log (base 2) values; y-axis: difference of the log (base 2) values.
In the next section we discuss cyclic loess normalization
and useful insights that can be gained from viewing it as a
parallel algorithm. Section 3 describes a new type of normalization,
fast linear loess (fastlo), arising from the observation
that cyclic loess is essentially a smoothing function coupled
with a very simple linear model. In Section 4, we review
quantile normalization. The performances of the three dif-
ferent normalization techniques are compared in Section 5.
Comparisons are made using simulated data and real data,
one of the benchmark sets for Affymetrix GeneChip expres-
sion measures (Cope et al., 2003). The simple linear model
underlying the fastlo method is extended in Section 6. We
close with a discussion of our findings in Section 7.
Most methods of normalization, including the methods dis-
change expression levels under the conditions being studied.
ratio of expression values between two conditions is one or
sion is zero for a typical gene. This is biologically plausible
for many studies. However, if there is good reason to believe
this assumption is not true for a particular study, then the
normalization methods described here are not appropriate.
A fundamental graphical tool for the analysis of gene expres-
sion array data is the M versus A plot (MA plot); here M is
the difference in log expression values and A is the average
of log expression values (Dudoit et al., 2002). Figure 1 con-
tains an MA plot for a random sample of 10000 probes from
two (unnormalized) GeneChip arrays with a loess smoother
show a point cloud scattered about the M = 0 axis. In other
words, the loess smoother would be a horizontal line at 0 for
ideally normalized data.
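The M and A values defined above can be computed directly from the raw intensities of two arrays. The following is a minimal Python sketch for illustration only; the function name ma_values and the sample intensities are assumptions and are not taken from the authors' Splus/R software.

```python
import math

def ma_values(x, y):
    """Return (M, A) for two equal-length lists of positive raw intensities."""
    if len(x) != len(y):
        raise ValueError("arrays must have the same number of spots")
    m, a = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        m.append(lx - ly)          # M: difference of the log2 values
        a.append((lx + ly) / 2.0)  # A: average of the log2 values
    return m, a

# Illustrative intensities for three spots on two arrays (made-up values).
array1 = [64.0, 256.0, 1024.0]
array2 = [32.0, 256.0, 4096.0]
M, A = ma_values(array1, array2)
print(M)  # [1.0, 0.0, -2.0]
print(A)  # [5.5, 8.0, 11.0]
```

A spot with M = 0 has the same measured intensity on both arrays; plotting M against A for every spot gives the MA plot.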
Cyclic loess normalizes two arrays at a time by applying a
correction factor obtained from a loess curve fit through the
MA plot of the two arrays; call the curve f(x). For example,
consider the circled point in Figure 1. This point corresponds
level at this spot would be reduced by 1/2 the distance of
f(x) from the y = 0 line; in other words, f(x)/2 would be
subtracted from the expression level of this spot on one of the
arrays. The expression value for this spot on the other array
would be increased by f(x)/2. After correction, the loess curve
through the MA plot for this pair of arrays would be horizontal. One
pass of the cyclic loess algorithm consists of performing this
pairwise normalization on all distinct pairs of arrays. Passes
of the algorithm continue until the computed corrections of a
completed pass are essentially zero.
In summary, to perform the cyclic loess algorithm begin
with the log2 of the spot expression intensities arranged as a
matrix with one column per array and one row per array spot
and proceed through the steps below.
(1) Choose two arrays and construct an MA plot of the data.
The x-axis is the mean probe expression value of the
two arrays and the y-axis is the difference (one point
for each spot).
(2) Fit a smooth loess curve f(x) through the data.
(3) Subtract f(x)/2 from the first array and add f(x)/2 to
the second.
(4) Repeat until all distinct pairs have been compared.
(5) Repeat until the algorithm converges.
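The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper loess is a simplified local linear smoother with tricube weights standing in for a full loess fit, and the names cyclic_loess, span, tol and max_passes, as well as the sample data, are hypothetical.

```python
from itertools import combinations

def loess(x, y, span=0.75):
    """Simplified loess stand-in: local linear fit with tricube weights at each x."""
    n = len(x)
    k = max(2, min(n, int(round(span * n))))
    fitted = []
    for x0 in x:
        dist = sorted(abs(xi - x0) for xi in x)
        h = dist[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def cyclic_loess(Y, span=0.75, tol=1e-3, max_passes=10):
    """Y: list of columns (one per array) of log2 intensities; returns a normalized copy."""
    Y = [col[:] for col in Y]
    for _ in range(max_passes):
        largest = 0.0
        for i, j in combinations(range(len(Y)), 2):    # step (4): all distinct pairs
            a = [(u + v) / 2 for u, v in zip(Y[i], Y[j])]  # step (1): x-axis, mean
            m = [u - v for u, v in zip(Y[i], Y[j])]        # step (1): y-axis, difference
            f = loess(a, m, span)                          # step (2): smooth curve f(x)
            for s, fs in enumerate(f):                     # step (3): split the correction
                Y[i][s] -= fs / 2
                Y[j][s] += fs / 2
            largest = max(largest, max(abs(fs) for fs in f))
        if largest < tol:                                  # step (5): corrections near zero
            break
    return Y

# Made-up example: three arrays that differ only by constant offsets.
base = [6.0 + 0.25 * s for s in range(24)]
Y = [[b + 1.0 for b in base], base[:], [b - 0.5 for b in base]]
N = cyclic_loess(Y)
```

The smoother here costs on the order of the square of the number of spots per fit, so the sketch is suitable only for small examples.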
In practice, the pairs of arrays are chosen by a method that
systematically cycles through all pairs. A drawback of cyclic
loess is the amount of time required to normalize a set of
data; the number of loess smooths per pass grows quadratically as the number of arrays
increases. Typically, two or three passes through the complete
cycle are required for convergence. Likely, cyclic loess would
converge faster if pairings went in a more balanced order.
a loess smooth would still be required for a relatively large
number of array pairs.
Further examination of the cyclic loess algorithm reveals
some interesting facts. First, the algorithm preserves the row
means of the data matrix, Y. At any given step, one number in
a row is increased by f(x)/2 and another is decreased by the
same amount. Second, if all the values in one of the columns
algorithm (the scaled intensities on each array) are changed
only by the addition of a constant. This is because any one of
the plots on which the smooths are based is identical but for
the labeling of its axes, and thus any given smooth is changed
only by a constant. One pass through the algorithm requires
n(n − 1)/2 loess smooths, one per distinct pair of the n arrays, on all the spots on the array.
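Two of the facts above are easy to check numerically: a pairwise correction leaves the row mean unchanged, and the number of smooths in one pass equals the number of distinct pairs of arrays. The short Python sketch below is illustrative only; the row values, the correction f_x and the function name smooths_per_pass are made-up assumptions.

```python
from math import comb

def smooths_per_pass(n_arrays):
    """Number of distinct array pairs, i.e. loess smooths in one cyclic loess pass."""
    return comb(n_arrays, 2)

for n in (5, 10, 50):
    print(n, smooths_per_pass(n))  # 5 10, then 10 45, then 50 1225

# One row of the data matrix (one spot on three arrays) and a correction f(x).
row = [7.25, 6.75, 7.5]
f_x = 0.5
mean_before = sum(row) / len(row)
row[0] -= f_x / 2  # first array of the pair is decreased by f(x)/2
row[1] += f_x / 2  # second array of the pair is increased by f(x)/2
mean_after = sum(row) / len(row)
print(mean_before == mean_after)  # True
```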
2.2 Parallel loess
Cyclic loess is inherently parallel in nature. Viewing it in this
manner may provide insights that allow computational time
savings. Imagine that we had a parallel machine, so that all the
pairwise normalizations could be done simultaneously. Once
K.V.Ballman et al.
do differ among tissue initially thought to be similar, e.g. a
molecular subtype within brain cancer tissue, quantile nor-
malization may be too aggressive and attenuate differences of
interest. These issues are yet to be resolved.
An advantage of fastlo not shared by the other methods is
that it is model-based. As demonstrated above, this allows
us to normalize an entire set of arrays to be compared as
a group—i.e. no need to perform separate normalizations
for subgroups. This is useful in situations where it is not
clear to what degree the biological variation between the
groups and the microarray experimental process contribute
to the systematic variation among the arrays. This model
can be easily extended to incorporate other variables that
are part of the underlying experimental design such as sub-
ject demographic variables, time, dose level, etc. Other
estimation options are also open, e.g. shrinkage of the treat-
ment coefficients toward zero in an experiment where it was
felt that only a small percentage of genes are differentially
expressed.
Finally, a more robust estimator could be used to generate
ŷ. If the number of arrays to be normalized is relatively small,
the mean estimate can be significantly influenced by a single
outlier. We explored this using the median of spot i across
all arrays to generate the reference array. The results were
not promising; the median produced a considerably biased estimate.
Recall the data to be normalized is a matrix of arrays as
columns and spot intensities as rows. Fastlo iterates between
fits to the rows (ŷ) and loess smooths on the columns of the
array. It appears that the same type of estimator must be used
for row and column operations in the algorithm. We have not
pursued this further.
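The iteration between row fits and column smooths described above can be sketched in Python as follows. This is a hedged illustration, not the authors' Splus/R function. It assumes that ŷ is the row mean (the reference array), that each array's deviation from ŷ is smoothed against ŷ and then subtracted, and that a simplified local linear smoother (the hypothetical helper smooth) can stand in for loess. The names fastlo, span, tol and max_iter and the sample data are likewise assumptions.

```python
def smooth(x, y, span=0.75):
    """Simplified loess stand-in: local linear fit with tricube weights at each x."""
    n = len(x)
    k = max(2, min(n, int(round(span * n))))
    fitted = []
    for x0 in x:
        dist = sorted(abs(xi - x0) for xi in x)
        h = dist[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def fastlo(Y, span=0.75, tol=1e-3, max_iter=10):
    """Y: list of columns (one per array) of log2 intensities; returns a normalized copy."""
    Y = [col[:] for col in Y]
    n_spots = len(Y[0])
    for _ in range(max_iter):
        # Row operation: reference array y-hat as the mean of each spot across arrays.
        yhat = [sum(col[s] for col in Y) / len(Y) for s in range(n_spots)]
        largest = 0.0
        for col in Y:
            # Column operation: smooth this array's deviation from y-hat against y-hat.
            dev = [v - r for v, r in zip(col, yhat)]
            f = smooth(yhat, dev, span)
            for s, fs in enumerate(f):
                col[s] -= fs
            largest = max(largest, max(abs(fs) for fs in f))
        if largest < tol:  # stop once the computed corrections are essentially zero
            break
    return Y

# Made-up example: three arrays that differ only by constant offsets.
base = [6.0 + 0.25 * s for s in range(24)]
Y = [[b + 1.0 for b in base], base[:], [b - 0.5 for b in base]]
N = fastlo(Y)
```

Under these assumptions each iteration needs one smooth per array rather than one per pair of arrays, which is consistent with the speed advantage claimed for fastlo.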
Overall, fastlo (1) produces normalized results similar to
cyclic loess (as well as quantile) normalization, (2) is consid-
erably faster than cyclic loess, and (3) has added versatility
above quantile and cyclic loess normalization through its
connection with linear models.
Alter,O., Brown,P.O. and Botstein,D. (2000) Singular value decom-
position for genome-wide expression processing and modeling.
Proc. Natl Acad. Sci., USA, 97, 10101–10106.
Bolstad,B., Irizarry,R., Astrand,M. and Speed,T. (2003) A comparison
of normalization methods for high density oligonucleotide
array data based on bias and variance. Bioinformatics, 19,
Cope,L.M., Irizarry,R.A., Jaffee,H., Wu,Z. and Speed,T.P. (2003)
A benchmark for Affymetrix GeneChip expression measures.
Bioinformatics, 1, 1–13.
Dudoit,S., Yang,Y.H., Callow,M.J. and Speed,T.P. (2002) Stat-
istical methods for identifying differentially expressed genes
in replicated cDNA microarray experiments. Stat. Sinica, 12,
Hartemink,A., Gifford,D., Jaakkola,T. and Young,R. (2001) Max-
imum likelihood estimation of optimal scaling factors for expres-
Irizarry,R., Hobbs,B., Collins,F., Beazer-Barclay,Y., Antonellis,K.,
Scherf,U. and Speed,T. (2003) Exploration, normalization, and
summaries of high density oligonucleotide array probe level data.
Biostatistics, 4, 249–264.
Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and
Tsui,K.W. (2001) On differential variability of expression ratios:
improving statistical inference about gene expression changes
from microarray data. J. Comput. Biol., 8, 37–52.
microarrays. Curr. Opin. Plant Biol., 3, 108–116.
Yang,Y.H., Dudoit,S., Luu,P., Lin,D.M., Peng,V., Ngai,J. and
Speed,T. (2002) Normalization for cDNA microarray data: a
robust composite method addressing single and multiple slide
systematic variation. Nucleic Acids Res., 30, e15.