
BIOINFORMATICS

Vol. 20 no. 16 2004, pages 2778–2786

doi:10.1093/bioinformatics/bth327

Faster cyclic loess: normalizing RNA arrays via

linear models

Karla V. Ballman∗, Diane E. Grill, Ann L. Oberg and

Terry M. Therneau

Division of Biostatistics, Mayo Clinic College of Medicine, Rochester, MN 55905, USA

Received on January 9, 2004; revised and accepted on February 20, 2004

Advance Access publication May 27, 2004

ABSTRACT

Motivation: Our goal was to develop a normalization technique that yields results similar to cyclic loess normalization and with speed comparable to quantile normalization.

Results: Fastlo yields normalized values similar to cyclic loess and quantile normalization and is fast; it is at least an order of magnitude faster than cyclic loess and approaches the speed of quantile normalization. Furthermore, fastlo is more versatile than both cyclic loess and quantile normalization because it is model-based.

Availability: The Splus/R function for fastlo normalization is available from the authors.

Contact: ballman@mayo.edu

∗To whom correspondence should be addressed.

1 INTRODUCTION

High-density gene expression array technology allows investigators to obtain quantitative measurements of the expression levels for tens of thousands of genes in a biological specimen. There are two major types of microarray technologies: spotted cDNA and oligonucleotide arrays. Expression data obtained from either type of microarray technology have measurement error or variation. One type of variability is due to biological differences between specimen samples. This is what is of interest to the investigator; the investigator would like to know which genes are differentially expressed among different biological samples (e.g. between cancerous and normal kidney tissues, or among B-cells challenged with different agents in culture). Systematic variation also affects the measured gene expression level. Many sources contribute to this type of variation and are found in every microarray experiment. Sources include, but are not limited to, the array manufacturing process, the preparation of the biological sample, the hybridization of the sample to the array, and the quantification of the spot intensities. Hartemink et al. (2001) provide a more complete discussion of the sources of systematic variation in the microarray experimental process. The purpose of normalization is to minimize the systematic variations in the measured gene expression levels among different array hybridizations to allow the comparison of expression levels across arrays and so that biological differences can be more easily identified.

There are numerous methods for normalizing gene expres-

sion data. Generally, normalization methods use a scaling

function to correct for experimental variation. These functions

are applied to the raw intensities of the spots (gene sequence

on spotted arrays or oligonucleotide probes on the Affymetrix

GeneChip array) on the microarray to produce normalized or

scaled intensities. Types of normalization techniques include

mean correction (Richmond and Somerville, 2000), non-

linear models (Yang et al., 2002), linear combination of

factors (Alter et al., 2000), and Bayesian methods (Newton

et al., 2001). There is compelling evidence that non-linear normalization methods, which are not dependent upon the choice of a baseline array, perform the best (Bolstad et al., 2003). We

tend to favor two commonly used non-linear methods: cyclic

loess normalization and quantile normalization. Both tech-

niques are non-linear and perform normalization on the set of

arrays as a whole without specifying a reference array. Over-

all, we prefer cyclic loess because it is not as aggressive in

its normalization as is quantile normalization; however, cyc-

lic loess is relatively slow for even a moderate sized set of

arrays. Quantile normalization, on the other hand, is very

fast for even large sets of arrays. The goal of our invest-

igation was to develop a method that produced normalized

values similar to those of cyclic loess but would be considerably

faster—on the order of the speed of quantile normalization.

The result is a new normalization method called fastlo. Since

normalization is performed on the raw intensity values of the

spots on the arrays, the methods discussed here are applic-

able to both major types of array technology: spotted cDNA

and oligonucleotide. However, our focus is on data arising

from GeneChip arrays and it should be noted that there are

additional considerations when normalizing two-color spotted

cDNA arrays (Yang et al., 2002).

Bioinformatics vol. 20 issue 16 © Oxford University Press 2004; all rights reserved.

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 23, 2011

[Figure 1: MA plot with superimposed loess curve f(x). x-axis: average of the log (base 2) values; y-axis: difference of the log (base 2) values.]

Fig. 1. MA plot for 10000 randomly selected probes.

In the next section we discuss cyclic loess normalization and useful insights that can be gained from viewing it as a parallel algorithm. Section 3 describes a new type of normalization, fast linear loess (fastlo), arising from the observation

that cyclic loess is essentially a smoothing function coupled

with a very simple linear model. In Section 4, we review

quantile normalization. The performances of the three dif-

ferent normalization techniques are compared in Section 5.

Comparisons are made using simulated data and real data,

one of the benchmark sets for Affymetrix GeneChip expres-

sion measures (Cope et al., 2003). The simple linear model

underlying the fastlo method is extended in Section 6. We

close with a discussion of our findings in Section 7.

2 CYCLIC NORMALIZATION

Most methods of normalization, including the methods discussed below, assume that the vast majority of the genes do not change expression levels under the conditions being studied. In other words, they assume that the average (geometric mean) ratio of expression values between two conditions is one or, equivalently, the average (arithmetic mean) log ratio of expression is zero for a typical gene. This is biologically plausible for many studies. However, if there is good reason to believe this assumption is not true for a particular study, then the normalization methods described here are not appropriate.

2.1 Cyclic loess

A fundamental graphical tool for the analysis of gene expression array data is the M versus A plot (MA plot); here M is the difference in log expression values and A is the average of log expression values (Dudoit et al., 2002). Figure 1 contains an MA plot for a random sample of 10000 probes from two (unnormalized) GeneChip arrays with a loess smoother superimposed. The MA plot for ideally normalized data would show a point cloud scattered about the M = 0 axis. In other words, the loess smoother would be a horizontal line at 0 for ideally normalized data.

Cyclic loess normalizes two arrays at a time by applying a correction factor obtained from a loess curve fit through the MA plot of the two arrays; call the curve f(x). For example, consider the circled point in Figure 1. This point corresponds to a spot i on each array. On one array, the observed expression

level at this spot would be reduced by 1/2 the distance of

f(x) from the y = 0 line; in other words, f(x)/2 would be

subtracted from the expression level of this spot on one of the

arrays. The expression value for this spot on the other array

would be increased by f(x)/2. After correction, the MA plot

for this particular pair of arrays would be horizontal. One

pass of the cyclic loess algorithm consists of performing this

pairwise normalization on all distinct pairs of arrays. Passes

of the algorithm continue until the computed corrections of a

completed pass are essentially zero.
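The pairwise step described above can be sketched on a toy example. This is only an illustration: the function names `ma_values` and `apply_pairwise_correction` and the three-spot arrays are hypothetical, and the fitted curve is taken to equal the observed differences, which is only reasonable for noiseless data.

```python
def ma_values(y1, y2):
    """Return (A, M): per-spot average and difference of two log2 arrays."""
    A = [(u + v) / 2 for u, v in zip(y1, y2)]
    M = [u - v for u, v in zip(y1, y2)]
    return A, M


def apply_pairwise_correction(y1, y2, f):
    """Subtract f/2 from the first array and add f/2 to the second."""
    z1 = [u - fi / 2 for u, fi in zip(y1, f)]
    z2 = [v + fi / 2 for v, fi in zip(y2, f)]
    return z1, z2


# Toy log2 intensities: the first array is 0.5 higher at every spot.
y1 = [8.0, 9.5, 11.0]
y2 = [7.5, 9.0, 10.5]

A, M = ma_values(y1, y2)          # A = [7.75, 9.25, 10.75], M = [0.5, 0.5, 0.5]
f = M                             # assumption: noiseless case, curve equals M
z1, z2 = apply_pairwise_correction(y1, y2, f)
A_new, M_new = ma_values(z1, z2)  # M_new = [0.0, 0.0, 0.0]
```

After the correction both arrays sit at the original per-spot average, so the MA plot of the corrected pair is flat at M = 0.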

In summary, to perform the cyclic loess algorithm begin

with the log2 of the spot expression intensities arranged as a

matrix with one column per array and one row per array spot

and proceed through the steps below.

(1) Choose two arrays and generate an MA plot of the data.

The x-axis is the mean probe expression value of the

two arrays and the y-axis is the difference (one point

for each spot).

(2) Fit a smooth loess curve f(x) through the data.

(3) Subtract f(x)/2 from the first array and add f(x)/2 to

the second.

(4) Repeat until all distinct pairs have been compared.

(5) Repeat until the algorithm converges.
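The steps above can be sketched as follows. This is a minimal illustration, not the authors' Splus/R function: `local_linear_smooth` is a simple tricube-weighted local linear fit standing in for loess (no robustness iterations), and the names, the `frac` span, the tolerance and the pass limit are all assumptions. Arrays are stored as a list of columns, each with at least three spots.

```python
from itertools import combinations


def local_linear_smooth(x, y, frac=0.5):
    """Tricube-weighted local linear fit at every x (a loess-like stand-in)."""
    k = max(3, int(frac * len(x)))            # neighbours used in each local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12      # local bandwidth
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def cyclic_loess(Y, smooth=local_linear_smooth, tol=1e-3, max_passes=20):
    """Y is a list of columns (one list of log2 intensities per array)."""
    Y = [col[:] for col in Y]                 # work on a copy
    for _ in range(max_passes):
        biggest = 0.0
        for j, k in combinations(range(len(Y)), 2):    # all distinct pairs
            A = [(u + v) / 2 for u, v in zip(Y[j], Y[k])]
            M = [u - v for u, v in zip(Y[j], Y[k])]
            f = smooth(A, M)
            for i, fi in enumerate(f):
                Y[j][i] -= fi / 2             # step (3): split the correction
                Y[k][i] += fi / 2
                biggest = max(biggest, abs(fi))
        if biggest < tol:                     # corrections essentially zero
            break
    return Y
```

Because every step removes f(x)/2 from one array and adds it to another, the row means of the data matrix are unchanged by this sketch, whatever smoother is plugged in.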

In practice, the pairs of arrays are chosen by a method that

systematically cycles through all pairs. A drawback of cyclic

loess is the amount of time required to normalize a set of

data; the time grows quadratically with the number of arrays, since n arrays give n(n−1)/2 distinct pairs. Typically, two or three passes through the complete cycle are required for convergence. Likely, cyclic loess would converge faster if pairings went in a more balanced order. However, the time savings would not be considerable because

a loess smooth would still be required for a relatively large

number of array pairs.

Further examination of the cyclic loess algorithm reveals

some interesting facts. First, the algorithm preserves the row

means of the data matrix, Y. At any given step, one number in

a row is increased by f(x)/2 and another is decreased by the

same amount. Second, if all the values in one of the columns

are increased or decreased by a constant, the final results of the

algorithm (the scaled intensities on each array) are changed

only by the addition of a constant. This is because any one of

the plots on which the smooths are based is identical but for

the labeling of its axes, and thus any given smooth is changed

only by a constant. One pass through the algorithm requires

$\binom{n}{2}$ loess smooths on all the spots on the array.
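To make the cost per pass concrete, the number of pairwise smooths can be tabulated for a few array counts. The helper name `cyclic_smooths_per_pass` is hypothetical; the count itself is just the binomial coefficient stated above.

```python
from math import comb


def cyclic_smooths_per_pass(n):
    """Number of loess smooths in one pass of cyclic loess over n arrays."""
    return comb(n, 2)


counts = {n: cyclic_smooths_per_pass(n) for n in (4, 10, 50)}
# counts == {4: 6, 10: 45, 50: 1225}
```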

2.2 Parallel loess

Cyclic loess is inherently parallel in nature. Viewing it in this

manner may provide insights that allow computational time

savings. Imagine that we had a parallel machine, so that all the pairwise normalizations could be done simultaneously. Once each of the pairwise corrections is obtained, then the correction for spot i on array j would be the ‘average’ correction for this spot across all array pairs containing array j. Essentially, the correction to the spots on array j would be the average of all computed corrections of the spots for pairs containing array j. To simplify the logic, we consider the pairing of each array with itself, as well as both orderings of the pairs. This means that for n arrays, there are $n^2$ pairings. Here, and throughout the remaining text, Y is used to denote the matrix of the $\log_2$

intensity values where the column j corresponds to an array

and row i corresponds to a spot.

As the simplest example, consider the noiseless case where

each column of Y, corresponding to an array, differs from any

other by a constant and array ‘0’ is an imaginary reference

representing the true expression level. In other words, the

intensity of the spot $i,j$, $y_{ij}$, can be expressed as $y_{ij} = y_{i0} + c_j$. The ideal correction of all spots for a given chip is $c_j - \bar{c}$ (a horizontal loess curve) and the average correction for array 1 from the parallel algorithm is

$$\frac{1}{n}\sum_{j=1}^{n} \frac{c_1 - c_j}{2} = \frac{c_1 - \bar{c}}{2}.$$

That is, the average correction is 1/2 of what it should be. As a result, we define the correction step for the $i$-th chip in parallel loess to be $(2/n)\sum_j f_{ij}$, where $f_{ij}$ is the smooth for the plots of chips $i$ and $j$.

Now consider a simple simulation where each column of the data matrix (each array) differs from any other by a constant plus a symmetrically distributed noise term. Array ‘0’ is the set of 5000 true intensities, which were randomly selected from a uniform distribution with range from 0 to 10. The $\log_2$ intensity levels for the four arrays in the experiment, $y_{ij}$, are derived from array ‘0’ as follows: $y_{ij} = y_{i0} + j + e_{ij}$, where $e_{ij} \sim t_8$ ($i = 1,2,\ldots,5000$ and $j = 1,2,3,4$). A t-distribution was selected for the error term (noise distribution) because it is symmetric and produces more outliers than a normal distribution. In this case, the true correction for array $j$ is $j - \bar{j} = j - 2.5$. The four pairwise corrections for a particular spot involving array 1 have expectations of 0, −1.0, −2.0 and −3.0. The average of the four corrections for a particular spot is −1.5. Figure 2 shows the four computed corrections for each spot from parallel loess involving array 1 as well as the average of the corrections across arrays. This suggests averaging the corrections to get an overall update, rather than applying each of the separate corrections to the data in turn. Figure 2 also includes the corrections for array 1 after applying cyclic loess. For this case performing the smooths cyclically or in parallel produces equivalent results. For this figure, and all remaining figures, a subset of randomly selected points is plotted to thin the plot and better display the relevant structure.

Using the parallel version would probably be faster than cyclic loess. However, we did not intend to obtain computational speed savings by developing a parallel version of cyclic loess.

[Figure 2: computed correction (y-axis) versus x (x-axis). Series: 1vs1, 1vs2, 1vs3, 1vs4, avg, cyclic.]

Fig. 2. The computed corrections for array 1 from the 4 pairwise smooths for a parallel loess (1v1, 1v2, 1v3, 1v4), for the average of the pairwise corrections (avg), and for cyclic loess (cyclic).

Rather, we used this construct to gain insight into how cyclic

loess works and how it might be made faster for non-parallel

machines.
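The four-array simulation described above can be reproduced along the following lines. This is a sketch under stated assumptions: the function name `simulate_arrays`, the seed and the use of Python's `random` module are illustrative choices, and the t noise is drawn as a standard normal divided by the square root of a scaled chi-square variate.

```python
import random


def simulate_arrays(p=5000, n=4, df=8, seed=0):
    """Return (truth, Y) where Y[j-1][i] = truth[i] + j + t_df noise."""
    rng = random.Random(seed)
    truth = [rng.uniform(0, 10) for _ in range(p)]   # array '0'

    def t_noise():
        # Student t with df degrees of freedom: N(0,1) / sqrt(chi2_df / df);
        # a chi-square with df degrees of freedom is Gamma(shape=df/2, scale=2).
        chi2 = rng.gammavariate(df / 2, 2)
        return rng.gauss(0, 1) / (chi2 / df) ** 0.5

    Y = [[y0 + j + t_noise() for y0 in truth] for j in range(1, n + 1)]
    return truth, Y


truth, Y = simulate_arrays()
n = len(Y)

# Empirical array effects relative to the overall mean (expected: j - 2.5).
col_means = [sum(col) / len(col) for col in Y]
grand_mean = sum(col_means) / n
empirical_effects = [m - grand_mean for m in col_means]

# Expected pairwise differences for array 1 against arrays 1..4.
pairwise_expected = [1 - j for j in range(1, n + 1)]      # [0, -1, -2, -3]
average_expected = sum(pairwise_expected) / n             # -1.5
```

The empirical effects should land close to −1.5, −0.5, 0.5 and 1.5, and the average of the expected pairwise differences for array 1 matches its true correction of −1.5.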

Parallel loess can be shown to be unbiased for this simple

case. Let $y_{ij} = \alpha_i + \beta_j + \epsilon_{ij}$; $\alpha_i$ are the true probe values ($y_{i0}$ in the simulation), $\beta_j$ the simple array effects, and $\epsilon_{ij}$ the error. Assume that $a$, $b$ are two vectors of constants, and a smooth of $Yb$ on $Ya$ is to be used to estimate the correction. Since the correction should be constant, we want $Ya$ and $Yb$ to be uncorrelated, i.e. that there be no linear bias in the smooth. The covariance matrix of $Y$ has elements of $\sigma^2_\alpha + \sigma^2_\epsilon$ on the diagonal, and $\sigma^2_\alpha$ off the diagonal, where these are the variances of $\alpha$ and $\epsilon$ above. Simple algebra shows the covariance of $Ya$ and $Yb$ to be

$$\sigma^2_\epsilon \sum_j a_j b_j + \sigma^2_\alpha \Big(\sum_j a_j\Big)\Big(\sum_j b_j\Big).$$

Since $\sigma^2_\alpha$ and $\sigma^2_\epsilon$ are unknown for a given problem, linear bias in the plot is avoided by choosing $a$ and $b$ so that both terms are zero. For parallel loess, the plot for arrays 1 and 2 has $a = (1/2, 1/2, 0, \ldots, 0)$ and $b = (1, -1, 0, \ldots, 0)$, clearly satisfying the criteria, and likewise for any other pair of arrays $i,j$. Other choices of $a$ and $b$ will be explored below.
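The "simple algebra" can be written out as follows. This is a sketch that assumes the array effects $\beta_j$ are fixed constants (so they do not enter the covariance) and that the $\alpha_i$ and $\epsilon_{ij}$ are mutually independent; the row index $i$ is held fixed throughout.

```latex
\begin{aligned}
\operatorname{Cov}\big((Ya)_i,\,(Yb)_i\big)
  &= \sum_{j}\sum_{k} a_j b_k \operatorname{Cov}(y_{ij},\, y_{ik}) \\
  &= (\sigma^2_\alpha + \sigma^2_\epsilon)\sum_{j} a_j b_j
     + \sigma^2_\alpha \sum_{j \ne k} a_j b_k \\
  &= \sigma^2_\epsilon \sum_{j} a_j b_j
     + \sigma^2_\alpha \Big(\sum_{j} a_j\Big)\Big(\sum_{j} b_j\Big).
\end{aligned}
```

For the parallel loess pair with $a = (1/2, 1/2, 0, \ldots, 0)$ and $b = (1, -1, 0, \ldots, 0)$, the first sum is $1/2 - 1/2 = 0$ and $\sum_j b_j = 0$, so both terms vanish.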

2.2.1 Parallel loess variant one

An obvious variant of the parallel cyclic loess algorithm is to replace the n loess smooths, each on p points, with a single loess smooth on np points. In other words, the corrections to be applied to spots on array j would be obtained by placing all the points in the MA plots containing array j (n of them) into a single plot and performing a single loess smooth on this plot. Recall, the corrections for array j are obtained in parallel loess by averaging the n pairwise corrections, obtained from the n MA plots; and recall that the corrections are just twice the smooth value. So if the smoother is a linear operator, which loess is other than outlier rejection passes, then the average of the smooths will be equal to a single smooth applied to a plot with all the derived data points, i.e. the points from the n MA plots involving array j all placed on a single plot. Because of the very large amount of data on microarray gene expression arrays, outlier rejection is not normally an important issue. Specifically, when there is a large quantity of data over a given range, no point, even if extreme, can exert much influence.

[Figure 3: two panels of computed correction (y-axis) versus x (x-axis). Legend: # parallel variant 1, x parallel, O cyclic.]

Fig. 3. The computed corrections for arrays 2 and 3 from the example of 4 arrays. The corrections are from a single smooth on all the derived 20000 data points (parallel variant 1), from parallel loess (parallel), and from cyclic loess (cyclic).

Figure 3 compares the computed corrections for the array

spots on two arrays from our four array example introduced

above: true expression values for the array shifted by a con-

stant plus symmetrically distributed error. The computed

corrections displayed are those obtained from cyclic loess,

from the pure parallel version of cyclic loess, and from a

single loess smooth on the plot containing all the points from

the four MA plots involving the array to be normalized (par-

allel variant 1). Recall that for the pure parallel version of

cyclic loess, the correction for a particular point is found by

averaging the corrections obtained from the n MA plots. As

expected, all three methods produced equivalent results.

This parallel variant 1 reduces the number of loess smooths

that are computed. One pass through the data with parallel

variant 1 requires only $n$ loess smooths compared to the $\binom{n}{2}$ loess smooths required by cyclic loess. However, each smooth for this parallel variant 1 is performed on $np$ versus $p$ points. Whether this version of parallel loess is faster than cyclic loess likely depends on the implementation details of each algorithm.

2.2.2 Parallel loess variant two

The next idea is to reduce the number of points in the plot containing all the points from the MA plots of parallel loess involving array j from np to p. Since there are p spots on an array, the intent is to replace each collection of n points per spot i (one per each pairwise MA plot: array j with array 1, array j with array 2, ..., array j with array n) with a single point, which summarizes the collection of n points for spot i. The new plot would contain p ‘summary’ points and we would perform a smooth on this plot to obtain the spot corrections for array j.

A way to produce a point for spot i that summarizes the n

points for spot i that involve array j is to set the x-coordinate

equal to the average of the n x-coordinates:

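$$\frac{1}{n}\left[\frac{y_{i1} + y_{i1}}{2} + \frac{y_{i1} + y_{i2}}{2} + \cdots + \frac{y_{i1} + y_{in}}{2}\right] = \frac{y_{i1} + \bar{y}_{i\cdot}}{2}.$$

The expression is written here for array $j = 1$. Here $\bar{y}_{i\cdot}$ is the row mean for the $i$th row of the data, which is equal to the mean expression value of spot $i$ across the arrays. The vertical position for spot $i$ on array $j$ is the average of the y-coordinates:

$$\frac{1}{n}\left[(y_{i1} - y_{i1}) + (y_{i1} - y_{i2}) + \cdots + (y_{i1} - y_{in})\right] = y_{i1} - \bar{y}_{i\cdot}.$$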
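The two identities above are easy to check numerically for a single spot. In this sketch the helper name `summary_point` and the four intensities are made up for illustration; index 0 plays the role of array 1.

```python
def summary_point(row, j=0):
    """Average the n pairwise MA coordinates of one spot for array j."""
    n = len(row)
    x = sum((row[j] + v) / 2 for v in row) / n   # mean of the n x-coordinates
    y = sum(row[j] - v for v in row) / n         # mean of the n y-coordinates
    return x, y


row = [7.0, 8.0, 9.0, 10.0]          # log2 intensities of one spot on 4 arrays
row_mean = sum(row) / len(row)       # 8.5

x, y = summary_point(row)
# x == (row[0] + row_mean) / 2 == 7.75
# y == row[0] - row_mean == -1.5
```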

The next step is to determine corrections for the spots on

chip j by fitting the loess smooth on this set of p points. This

would be repeated for each of the n−1 other chips; a total of

n smooths need to be computed for one pass through all the

data, similar to the parallel loess variant 1 above. However,

the number of points for each loess smooth is p for this variant

compared to np for variant 1.

Repeating the bias computation found in Section 2.2, the

plot for array 1 versus array 2 has vertical and horizontal axes

$Ya$ and $Yb$, respectively, with $a = (1 - 1/n, -1/n, \ldots, -1/n)$ and $b = (1 + 1/n, 1/n, \ldots, 1/n)/2$. This yields $\sum_j a_j = 0$, $\sum_j b_j = 1$ and $\sum_j a_j b_j = (n-1)/n$, leading to positive correlation between $Ya$ and $Yb$ and a potentially biased correction. Figure 4 compares the computed corrections produced by this variant of parallel loess (parallel variant two), parallel loess, and cyclic loess for two of the four arrays in our simple data example. Clearly, the bias is severe. In the next section we use a linear models argument to motivate a similar plot, but with x-coordinate of $\bar{y}_{i\cdot}$, leading to $b = (1/n, \ldots, 1/n)$ and $\sum_j a_j b_j = 0$ and an unbiased estimate.
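The two quantities that control the covariance, $\sum_j a_j b_j$ and $(\sum_j a_j)(\sum_j b_j)$, can be evaluated directly for the weight vectors discussed above. The function names below are hypothetical. One caveat: evaluating the variant-two vectors exactly as written gives $(n-1)/(2n)$ for the cross term rather than $(n-1)/n$; either way it is positive, which is all the bias argument needs.

```python
def bias_terms(a, b):
    """Return (sum_j a_j*b_j, (sum_j a_j)*(sum_j b_j)) for weight vectors a, b."""
    cross = sum(x * y for x, y in zip(a, b))
    product_of_sums = sum(a) * sum(b)
    return cross, product_of_sums


def variant_two_vectors(n):
    """Weights for the plot of y_1 - ybar against (y_1 + ybar)/2."""
    a = [1 - 1 / n] + [-1 / n] * (n - 1)
    b = [(1 + 1 / n) / 2] + [1 / (2 * n)] * (n - 1)
    return a, b


def row_mean_vectors(n):
    """Weights for the plot of y_1 - ybar against ybar."""
    a = [1 - 1 / n] + [-1 / n] * (n - 1)
    b = [1 / n] * n
    return a, b


n = 4
cross_v2, prod_v2 = bias_terms(*variant_two_vectors(n))   # cross term > 0
cross_rm, prod_rm = bias_terms(*row_mean_vectors(n))      # both terms zero
```

With the row mean on the horizontal axis both terms are zero, so the smooth carries no linear bias; with the variant-two axis the cross term stays positive.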


[Figure 4: two panels of computed correction (y-axis) versus x (x-axis). Legend: # parallel var 2, x parallel, O cyclic.]

Fig. 4. The computed corrections for arrays 2 and 4 of the simulated example of 4 arrays. The corrections are from the smooth of the 5000 derived points (parallel var 2), from parallel loess (parallel), and from cyclic loess (cyclic).

3 FAST LINEAR LOESS

Cyclic loess can be conceptualized as a smooth (loess), coupled with a very simple linear model. Consider the case of two arrays, 1 and 2, with gene expression levels represented by $y_{i1}$ and $y_{i2}$ for $i = 1,\ldots,p$, and the simplest possible linear model for the data $y_{ij} = \alpha_i + \epsilon_{ij}$, an intercept for each spot. The solution is of course $\hat{\alpha}_i = \bar{y}_{i\cdot}$. In the MA plot for the two arrays, the x-coordinate is $(y_{i1} + y_{i2})/2 = \bar{y}_{i\cdot} = \hat{y}_{i\cdot}$, the y-coordinate is $y_{i1} - y_{i2} = 2(y_{i1} - \hat{y}_{i\cdot})$, and the adjustment to spot $i$ on the first chip is 1/2 the height of the loess smooth on this plot at the x-coordinate corresponding to spot $i$.

Extending this idea to the n array case suggests creating a modified MA plot for each array j. The modified MA plot consists of array j and a constructed ‘average’ array; the intensity level of spot i on the ‘average’ array is equal to the average intensity of spot i across all arrays for $i = 1,\ldots,p$. The computed corrections for array j would be obtained from the loess smooth on the points of this plot. This would be done for each of the $j = 1,\ldots,n$ arrays. This variant of cyclic loess, called fastlo, requires n loess smooths each on p points for one pass through the data. Obviously, this is considerably faster than cyclic loess, which requires $\binom{n}{2}$ loess smooths.

The steps of fastlo given below are done on the $\log_2$ of the spot intensities:

(1) Create the vector $\hat{y}_{i\cdot}$ = the row mean of Y. Note that this is the same as creating an ‘average’ array.

(2) Plot $\hat{y}$ versus $(y_j - \hat{y})$ for each array j. This plot has one point for each spot (modified MA plot).

(3) Fit a loess curve f(x) through the data.

(4) Subtract f(x) from array j.

(5) Repeat for all remaining arrays.

(6) Repeat until the algorithm converges.
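The steps above can be sketched as follows. This is not the authors' Splus/R function: `local_linear_smooth` is a tricube-weighted local linear fit used as a stand-in for loess (no outlier downweighting), and the function names, the `frac` span, the tolerance and the pass limit are assumptions. Arrays are stored as a list of columns, each with at least three spots, and the row means are computed once per pass, as in step (1).

```python
def local_linear_smooth(x, y, frac=0.5):
    """Tricube-weighted local linear fit at every x (a loess-like stand-in)."""
    k = max(3, int(frac * len(x)))            # neighbours used in each local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12      # local bandwidth
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def fastlo(Y, smooth=local_linear_smooth, tol=1e-3, max_passes=5):
    """Y is a list of columns (one list of log2 intensities per array)."""
    Y = [col[:] for col in Y]
    n, p = len(Y), len(Y[0])
    for _ in range(max_passes):
        # Step (1): the 'average' array, i.e. the row means of Y.
        ybar = [sum(col[i] for col in Y) / n for i in range(p)]
        biggest = 0.0
        corrected = []
        for col in Y:                          # steps (2)-(5), one smooth per array
            resid = [v - m for v, m in zip(col, ybar)]
            f = smooth(ybar, resid)
            corrected.append([v - fi for v, fi in zip(col, f)])
            biggest = max(biggest, max(abs(fi) for fi in f))
        Y = corrected
        if biggest < tol:                      # step (6): stop once corrections vanish
            break
    return Y
```

Because the stand-in smoother is linear in its response, the corrections across arrays sum to approximately zero at each spot, so the row means are left unchanged by a pass of this sketch.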

A further interesting aspect of fastlo is that it requires only

1 or at most 2 iterations to converge. If there is no outlier

downweighting, then the loess smoother will be a linear operator and the average of the smooths will be the smooth of all the points:

$$\frac{1}{n}\sum_{j=1}^{n} \mathrm{Sm}(y_j - \bar{y}) \approx \mathrm{Sm}\Big(\frac{1}{n}\sum_{j=1}^{n} y_j - \bar{y}\Big) = \mathrm{Sm}(0) = 0,$$

where Sm is the smoother. If this holds, then for any given row of Y, some elements increase and some decrease, but the mean stays the same. If the row means do not change, the algorithm has converged.

4 QUANTILE NORMALIZATION

Quantile normalization makes the overall distribution of values for each array identical, while preserving the overall distribution of the values. It consists of two steps.

(1) Create a mapping between ranks and values. For rank

1 find the n values, one per array, that are the smallest

value on the array, and save their average. Similarly for

rank 2 and the second smallest values, and on up to the

n largest values, one per array.

(2) For each array, replace the actual values with these

averages.
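The two steps above can be sketched as follows. The function name `quantile_normalize` and the toy data are illustrative, and ties within an array are broken by sort order here, which is a simplifying assumption rather than something the text specifies.

```python
def quantile_normalize(Y):
    """Y is a list of columns (one list of values per array), all of equal length."""
    n, p = len(Y), len(Y[0])

    # Step (1): average the r-th smallest value across arrays, for every rank r.
    sorted_cols = [sorted(col) for col in Y]
    rank_means = [sum(sc[r] for sc in sorted_cols) / n for r in range(p)]

    # Step (2): replace each value by the average that belongs to its rank.
    out = []
    for col in Y:
        order = sorted(range(p), key=col.__getitem__)   # spot indices by rank
        new_col = [0.0] * p
        for rank, i in enumerate(order):
            new_col[i] = rank_means[rank]
        out.append(new_col)
    return out


Y = [[5.0, 2.0, 3.0],
     [4.0, 1.0, 6.0]]
normalized = quantile_normalize(Y)
# normalized == [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Every array ends up with the same set of values (1.5, 3.5, 5.5 in this toy case) while the within-array ordering of the spots is kept.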

As mentioned, this produces identical distributions of val-

ues on each array; quite an aggressive normalization process.

On the other hand, quantile normalization is extremely fast—

it only requires a sort of the arrays and a computation of
