# Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference.

**ABSTRACT** We propose an algorithm for selecting and clustering genes according to their time-course or dose-response profiles using gene expression data. The proposed algorithm is based on the order-restricted inference methodology developed in statistics. We describe the methodology for time-course experiments although it is applicable to any ordered set of treatments. Candidate temporal profiles are defined in terms of inequalities among mean expression levels at the time points. The proposed algorithm selects genes when they meet a bootstrap-based criterion for statistical significance and assigns each selected gene to the best fitting candidate profile. We illustrate the methodology using data from a cDNA microarray experiment in which a breast cancer cell line was stimulated with estrogen for different time intervals. In this example, our method was able to identify several biologically interesting genes that previous analyses failed to reveal.



BIOINFORMATICS

Vol. 19 no. 7 2003, pages 834–841

DOI: 10.1093/bioinformatics/btg093


Shyamal D. Peddada¹,∗, Edward K. Lobenhofer², Leping Li¹, Cynthia A. Afshari²,³,†, Clarice R. Weinberg¹ and David M. Umbach¹

¹Biostatistics Branch, ²Gene Regulation Group, Laboratory of Molecular Carcinogenesis and ³NIEHS Microarray Center, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA

Received on August 8, 2002; revised on November 14, 2002; accepted on November 29, 2002


Contact: peddada@embryo.niehs.nih.gov

INTRODUCTION

A number of methods have been proposed for selecting genes that exhibit interesting changes in expression between classes of samples. Depending on the data available, any of these methods can be employed to select genes that are differentially expressed across time points. Examples of these methods include the standard two-sample t-test and its modifications (Golub et al., 1999; Long et al., 2001; Tusher et al., 2001) and a confidence interval method (Chen et al., 1997). A different approach to the selection of a subset of discriminative genes is the multivariate genetic algorithm/k-nearest neighbors methodology (Li et al., 2001a,b).

∗To whom correspondence should be addressed.

†Present address: Amgen Inc., Thousand Oaks, CA 91320, USA.

An important application of microarray technology is to study patterns of gene expression across a series of time points or of dose levels. The premise is that genes sharing similar expression profiles might be functionally related or co-regulated. Therefore, microarray data may provide insight into gene–gene interactions, gene function and pathway identification. In toxicogenomics, these studies can also provide information about the dynamic responses of cells (tissues) to chemical insults (Hamadeh et al., 2001). We focus on time-course studies although our methodology is applicable to dose–response studies as well. None of the previously mentioned methods, however, takes advantage of the ordering in a time-course study. In contrast to those methods, explicit use of the temporal ordering should allow more sensitive detection of genes that display a consistent pattern over time.

Some authors have developed correlation-based methods for clustering genes with similar temporal profiles (Chu et al., 1998; Heyer et al., 1999). Chu et al. (1998) applied their methodology to assign genes from a yeast cell line to seven temporal patterns of expression. Their approach pre-identifies a few candidate temporal profiles along with a sample of three to eight genes per profile. Using these template genes, they estimate the mean expression at each time point for each profile. Thus, each candidate profile is defined by an estimated time-course curve. Each remaining gene is then either assigned to one of the candidate profiles or left unassigned, depending upon the magnitude of the correlation coefficient between the gene’s experimentally determined profile and each of the candidate profiles. Heyer et al. (1999), employing a jack-knifed correlation coefficient, also proposed a procedure for clustering genes from time-course experiments. Although their basic procedure did not require candidate profiles, they describe a modification where the clustering algorithm is seeded with candidate profiles.

[Figure 1: two panels, (a) and (b), each plotting gene expression against time (hours) for Gene 1 (o) and Gene 2 (+). Panel (a): correlation coefficient = 0.58. Panel (b): correlation coefficient = 0.87.]

Fig. 1. (a) Two genes with similar profiles (maxima at 24 hours) that may not be clustered together by correlation-based methods. (b) Two genes with different profiles (monotone versus up–down) that are likely to be clustered together by correlation-based methods.

Published by Oxford University Press

Correlation-based procedures using candidate profiles require the scientist to specify expression levels defining each profile in advance. This requirement means that researchers must estimate each candidate profile using a small sample of handpicked genes. More importantly, the clustering that results depends upon the genes initially chosen as templates; thus, important genes may be missed. Furthermore, the sample size for each correlation coefficient is the number of time points, not the number of actual observations. The correlation coefficient may not be a reliable measure of association when an experiment has few time points. Moreover, a large correlation coefficient does not necessarily indicate two similarly shaped profiles, nor does a small correlation coefficient necessarily indicate differently shaped profiles. Figure 1a and b each show hypothetical mean expression levels for two genes at six time points. The two genes in Figure 1a arguably display similar patterns, in that both attain a peak value at the 4th time point. Yet their correlation coefficient is only 0.58, suggesting that they might not be grouped together by correlation-based methods. On the other hand, the two genes in Figure 1b display apparently different patterns over time. One increases monotonically whereas the other has a peak at the 4th time point, yet they have a high correlation coefficient of 0.87 and would likely be clustered together by a correlation-based approach. Thus, correlation-based methods may either miss some important genes or cluster genes with different profiles.
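The point that a high correlation need not mean a shared shape can be checked with a small numerical sketch. The expression values below are invented for illustration (they are not the values plotted in Figure 1), and `pearson` and `peak_index` are hypothetical helper names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def peak_index(means):
    """1-based position of the largest mean (the first one if tied)."""
    return means.index(max(means)) + 1

# Hypothetical mean expression levels at six time points (assumed values).
monotone = [0.0, 1.0, 2.0, 3.0, 3.2, 3.4]  # keeps rising
up_down = [0.0, 1.0, 2.0, 3.0, 2.8, 2.6]   # peaks at the 4th time point

r = pearson(monotone, up_down)
print(f"correlation = {r:.3f}")
print("peaks at time points", peak_index(monotone), "and", peak_index(up_down))
```

With only six time points, the shared early rise dominates the coefficient, so the two series correlate strongly even though one is monotone and the other peaks at the 4th time point.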

Herein, we propose an alternative methodology to select and cluster genes using the ideas of order-restricted inference, where estimation makes use of known inequalities among parameters. The first step is to define potential candidate profiles of interest and to express them in terms of inequalities between the expected gene expression levels at various time points. For a given candidate profile, we estimate the mean expression level of each gene using the procedure developed in Hwang and Peddada (1994). The best fitting profile for a given gene is then selected using the goodness-of-fit criterion and the bootstrap test procedure developed in Peddada et al. (2001). A pair of genes $g_1$ and $g_2$ fall into the same cluster if all the inequalities between the expected expression levels at various time points are the same, that is, if they follow the same temporal profile. In this sense, the genes of Figure 1a are similar, while those of Figure 1b are not. Our procedure is less restrictive than those that define profiles via pre-specified expression levels because only the general shape of the profile is needed.

METHODOLOGY

Suppose a time-course experiment includes $T$ time points denoted by $1, 2, \ldots, T$, and at each time point there are $M$ arrays, each with $G$ genes. Let $Y_{igt}$ denote the $i$th expression measurement taken on gene $g$ at time point $t$. Let $\bar{Y}_{gt}$ denote the sample mean of gene $g$ at time point $t$ and let $\bar{Y}_g = (\bar{Y}_{g1}, \bar{Y}_{g2}, \ldots, \bar{Y}_{gT})'$. The unknown true mean expression level of gene $g$ at time point $t$ is $E(\bar{Y}_{gt}) = \mu_{gt}$. Inequalities between the components of $\mu_g = (\mu_{g1}, \mu_{g2}, \ldots, \mu_{gT})'$ define the true profile for gene $g$. Our procedure seeks to match a gene’s true profile, estimated from the observed data, to one of a specified set of candidate profiles.

Examples of inequality profiles are given below. For simplicity, we often drop the subscript $g$.

Null profile: $C_0 = \{\mu \in \mathbb{R}^T : \mu_1 = \mu_2 = \cdots = \mu_T\}$.

Monotone increasing profile (simple order):

$$C = \{\mu \in \mathbb{R}^T : \mu_1 \le \mu_2 \le \cdots \le \mu_T\} \tag{1}$$

(with at least one strict inequality). One may similarly define a monotone decreasing profile by replacing $\le$ by $\ge$ in Equation (1).

Up–down profile with maximum at $i$ (umbrella order):

$$C = \{\mu \in \mathbb{R}^T : \mu_1 \le \mu_2 \le \cdots \le \mu_i \ge \mu_{i+1} \ge \cdots \ge \mu_T\} \tag{2}$$

(with at least one strict inequality among $\mu_1 \le \mu_2 \le \cdots \le \mu_i$ and one among $\mu_i \ge \mu_{i+1} \ge \cdots \ge \mu_T$). Genes satisfying this profile have mean expression values non-decreasing in time up to time point $i$ and non-increasing thereafter. One may similarly define a down–up profile with minimum at $i$.

Cyclical profile with minima at $1$, $j$, and $T$ and maxima at $i$ and $k$:

$$C = \{\mu \in \mathbb{R}^T : \mu_1 \le \mu_2 \le \cdots \le \mu_i \ge \mu_{i+1} \ge \cdots \ge \mu_j \le \mu_{j+1} \le \cdots \le \mu_k \ge \mu_{k+1} \ge \cdots \ge \mu_T\} \tag{3}$$

(with at least one strict inequality among each monotone sub-profile). Cyclical profiles may be important in long time-course experiments where the mean expression value could oscillate.

Incomplete inequality profiles:

$$C = \{\mu \in \mathbb{R}^T : \mu_1 \le \mu_2 \le \cdots \le \mu_i,\ \mu_{i+1} \ge \cdots \ge \mu_j,\ \mu_{j+1} \le \cdots \le \mu_k,\ \mu_{k+1} \ge \cdots \ge \mu_T\} \tag{4}$$

(with at least one strict inequality among each monotone sub-profile). Profiles like Equation (4) are useful when the investigator is unable to specify inequalities between certain means.

For compactness, we drop $\mu \in \mathbb{R}^T$ and the phrase ‘with a strict inequality’.
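As an illustration of how such profiles reduce to simple inequality checks, the sketch below tests whether a vector of means belongs to the null, monotone increasing, or up–down profile. The function names are hypothetical, and the checks apply to a given vector of means; they are not the estimation procedure itself.

```python
def is_null(mu):
    """Null profile: all means equal."""
    return all(m == mu[0] for m in mu)

def is_monotone_increasing(mu):
    """Simple order: mu_1 <= ... <= mu_T with at least one strict inequality."""
    pairs = list(zip(mu, mu[1:]))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def is_up_down(mu, i):
    """Umbrella order with maximum at 1-based time point i (1 < i < T).

    Requires a non-decreasing arm up to i and a non-increasing arm after i,
    each containing at least one strict inequality.
    """
    rising, falling = mu[:i], mu[i - 1:]
    return (is_monotone_increasing(rising)
            and is_monotone_increasing(falling[::-1]))

means = [0.2, 0.4, 0.8, 0.5]  # four time points, largest mean at time point 3
print(is_up_down(means, 3), is_up_down(means, 2))
```

A monotone decreasing or down–up check follows by reversing the inequalities, for example by applying the same functions to the negated means.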

DEFINITION 1. Two parameters in a given profile are said to be linked if the inequality between them is specified a priori.

DEFINITION 2. For a given profile, a parameter is said to be nodal if it is linked with every other parameter in the profile.

For example, $\mu_i$ is the only nodal parameter in Equation (2) while there are no nodal parameters in Equation (3).

DEFINITION 3. Define the $\ell_\infty$ norm of an estimated profile as the maximum difference between the estimates of two linked parameters.

Other norms could replace $\ell_\infty$. Our choice is motivated by its connection to well-known procedures for order-restricted inference. For example, Williams’ test for trend in normal means (Williams, 1977) and Dunnett’s multiple comparison test procedure (Dunnett, 1955) are based on the $\ell_\infty$ norm. In the case of profile Equation (2), $\ell_\infty = \max\{\hat{\mu}_i - \hat{\mu}_1, \hat{\mu}_i - \hat{\mu}_T\}$, where $\hat{\mu}_j$ is an estimate of $\mu_j$, $j = 1, 2, \ldots, T$.

DEFINITION 4. An inequality sub-profile $C_i$ within a profile $C$ is described by the inequalities between the components of the sub-vector $\mu_i = (\mu_{i_1}, \mu_{i_2}, \ldots, \mu_{i_s})$, where $\{i_1, i_2, \ldots, i_s\} \subseteq \{1, 2, \ldots, T\}$.
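Definitions 1 to 3 can be made concrete for the up–down profile with a short sketch. The helper names are hypothetical, and the norm is computed as the largest absolute difference over linked pairs, which agrees with Definition 3 on the assumption that the estimates already satisfy the profile’s inequalities. The two estimate vectors are the ones reported in Example 2.1.

```python
from itertools import combinations

def umbrella_links(T, i):
    """Linked pairs (1-based indices) for an up-down profile with maximum at i.

    Two parameters are linked when both lie on the rising arm (1..i) or
    both lie on the falling arm (i..T), so their order is known a priori.
    """
    rising = range(1, i + 1)
    falling = range(i, T + 1)
    return set(combinations(rising, 2)) | set(combinations(falling, 2))

def nodal_parameters(T, links):
    """Indices of parameters linked with every other parameter."""
    def linked(a, b):
        return (min(a, b), max(a, b)) in links
    return [a for a in range(1, T + 1)
            if all(linked(a, b) for b in range(1, T + 1) if b != a)]

def l_inf_norm(estimates, links):
    """Largest absolute difference between estimates of two linked parameters."""
    return max(abs(estimates[a - 1] - estimates[b - 1]) for a, b in links)

links_c1 = umbrella_links(4, 2)  # maximum at time point 2
links_c2 = umbrella_links(4, 3)  # maximum at time point 3
print(nodal_parameters(4, links_c1))
print(l_inf_norm([0.2, 0.6, 0.6, 0.5], links_c1))
print(l_inf_norm([0.2, 0.4, 0.8, 0.5], links_c2))
```

For the umbrella order this reduces to $\max\{\hat{\mu}_i - \hat{\mu}_1, \hat{\mu}_i - \hat{\mu}_T\}$, because the peak is the only nodal parameter and the extreme differences are attained at the two ends.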

The proposed algorithm

STEP 1. Pre-specify a collection of candidate profiles. Denote these profiles by $C_1, C_2, \ldots, C_p$.

EXAMPLE 2.1. Suppose an experiment consists of four time points at 1, 2, 3 and 4 hours, and we are interested in identifying genes belonging to either of the following profiles: $C_1 = \{\mu \in \mathbb{R}^T : \mu_1 \le \mu_2 \ge \mu_3 \ge \mu_4\}$, $C_2 = \{\mu \in \mathbb{R}^T : \mu_1 \le \mu_2 \le \mu_3 \ge \mu_4\}$.

For each gene $g$, $g = 1, 2, \ldots, G$, perform the following steps.

STEP 2. Obtain the estimates of $\mu_{g1}, \mu_{g2}, \ldots, \mu_{gT}$ under each of the candidate profiles $C_1, C_2, \ldots, C_p$ using Hwang and Peddada (1994). See the Appendix for details.

EXAMPLE 2.1 (CONTINUED). Suppose the sample mean expression levels of a gene at the four time points are 0.2, 0.4, 0.8, and 0.5, respectively. Assume the sample sizes are equal for all time points.

Estimation under $C_1$. Since $\mu_2$ is the only nodal parameter in $C_1$, we first estimate $\mu_2$. Maintaining all the known inequalities in $C_1$ and assigning arbitrary inequalities where they are unknown, we take $\mu_4 \le \mu_1 \le \mu_3 \le \mu_2$. Using formula (A2) in the appendix, we obtain $\hat{\mu}_2 = 0.6$. Now estimate $\mu_3$ and $\mu_4$, nodal parameters in the sub-profile $\mu_4 \le \mu_3 \le \mu_2$, and $\mu_1$, a nodal parameter in the sub-profile $\mu_1 \le \mu_2$. Using Equations (A3) and (A4) in the appendix, we obtain $\hat{\mu}_1 = 0.2$, $\hat{\mu}_3 = 0.6$, $\hat{\mu}_4 = 0.5$.

Estimation under $C_2$. In this case, $\hat{\mu}_1 = 0.2$, $\hat{\mu}_2 = 0.4$, $\hat{\mu}_3 = 0.8$, $\hat{\mu}_4 = 0.5$.

STEP 3. For each $C_i$, $i = 1, 2, \ldots, p$, compute $\ell_\infty^{g(i)}$. Let $r$ be such that $\ell_\infty^{g(r)} = \max_i \ell_\infty^{g(i)}$.

EXAMPLE 2.1 (CONTINUED). Here $\ell_\infty^{g(1)} = 0.4$, $\ell_\infty^{g(2)} = 0.6$, hence $\ell_\infty^{g(r)} = 0.6$ and $r = 2$.

STEP 4 (BOOTSTRAP NULL DISTRIBUTION). Assuming that the true means and variances are the same at every time point, draw $N$ bootstrap samples. Each bootstrap sample is obtained as follows. Combine the actual observations from all the time points into a vector of length $MT$ and draw $T$ simple random samples with replacement, each of size $M$. Repeat Steps 2 and 3 for each bootstrap sample. This results in a bootstrap distribution for $\max_i \ell_\infty^{g(i)}$, which is used for testing

$$H_0 : \mu \in C_0, \qquad H_a : \mu \in \bigcup_{i=1}^{p} C_i. \tag{5}$$
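Returning to the estimation in Step 2, the value $\hat{\mu}_2 = 0.6$ in Example 2.1 can be checked numerically. The sketch below uses the standard pool-adjacent-violators algorithm for a simple order; this is a swapped-in, well-known technique and not the Hwang and Peddada (1994) formulas, which appear in an appendix not reproduced here. The function name `pava` and the equal weights are assumptions.

```python
def pava(values, weights=None):
    """Least-squares fit of a non-decreasing sequence (pool adjacent violators)."""
    if weights is None:
        weights = [1.0] * len(values)
    # Each block holds [weighted mean, total weight, number of points pooled].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fit = []
    for mean, _, n in blocks:
        fit.extend([mean] * n)
    return fit

# Sample means from Example 2.1, indexed by time point.
ybar = {1: 0.2, 2: 0.4, 3: 0.8, 4: 0.5}
# Ordering used in the example for the nodal parameter: mu4 <= mu1 <= mu3 <= mu2.
order = [4, 1, 3, 2]
fit = pava([ybar[t] for t in order])
mu2_hat = fit[-1]  # fitted value for mu2, the last parameter in the ordering
print(round(mu2_hat, 6))
```

The fitted value for the nodal parameter agrees with the example. The remaining fitted values do not (for instance, this fit pools the first two entries, whereas the example reports $\hat{\mu}_1 = 0.2$), which is consistent with the example re-estimating the other parameters within their own sub-profiles rather than reading them off a single isotonic fit.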

[Figure 2: nine panels of estimated profiles plotted against time (axis from 0 to 50). Panel titles: Decreasing, max. at 1 hr; Up–Down, max. at 4 hrs; Up–Down, max. at 12 hrs; Up–Down, max. at 24 hrs; Up–Down, max. at 36 hrs; Down–Up, min. at 4 hrs; Down–Up, min. at 12 hrs; Down–Up, min. at 24 hrs; Down–Up, min. at 36 hrs.]

Fig. 2. Estimated profiles of the selected top 50 genes. Curves represent order-restricted estimates of mean log expression ratios. Vertical lines correspond to the six time points.

STEP 5. Assign gene $g$ to profile $C_r$ if $\ell_\infty^{g(r)} \ge z^*_\alpha$, where $z^*_\alpha$ is the upper $\alpha$th percentile of the bootstrap distribution derived in Step 4. If $\ell_\infty^{g(r)} < z^*_\alpha$ or if two profiles are tied then do not classify $g$ into any of the $p$ profiles.

STEP 6. Repeat Steps 2–5 with every gene.
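The resampling in Step 4 and the decision rule in Step 5 can be sketched as follows for a single gene. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data are simulated, and `range_statistic` is a simple stand-in for $\max_i \ell_\infty^{g(i)}$, which in the procedure described here would be computed from the order-restricted estimates of Steps 2 and 3.

```python
import random

def bootstrap_null(observations, T, M, statistic, n_boot, seed=0):
    """Bootstrap distribution of `statistic` under the null profile.

    observations: all M*T measurements for one gene, pooled across time points.
    statistic: callable mapping a list of T sample means to a number.
    """
    rng = random.Random(seed)
    null_stats = []
    for _ in range(n_boot):
        # T simple random samples with replacement, each of size M.
        means = [sum(rng.choices(observations, k=M)) / M for _ in range(T)]
        null_stats.append(statistic(means))
    return null_stats

def upper_percentile(null_stats, alpha):
    """Value exceeded by roughly a fraction alpha of the bootstrap statistics."""
    ordered = sorted(null_stats)
    k = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    return ordered[k]

def range_statistic(means):
    """Stand-in statistic (an assumption): spread of the unrestricted means."""
    return max(means) - min(means)

# Simulated data for one gene: T = 4 time points, M = 8 arrays per time point.
rng = random.Random(1)
T, M = 4, 8
shift = [0.0, 0.5, 1.5, 0.5]  # hypothetical up-down signal
by_time = [[rng.gauss(s, 0.3) for _ in range(M)] for s in shift]
pooled = [y for sample in by_time for y in sample]

observed = range_statistic([sum(s) / M for s in by_time])
null = bootstrap_null(pooled, T, M, range_statistic, n_boot=2000)
z_star = upper_percentile(null, alpha=0.05)
selected = observed >= z_star
print(selected)
```

Passing the statistic in as a callable keeps the resampling separate from the estimation, so an implementation of the order-restricted estimates could replace the stand-in without changing the bootstrap loop.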

STEP 7 (OPTIONAL). Some genes selected by the above process may have small mean expression levels at each time point. Some investigators may want to restrict attention to those genes that have large expression levels at one time point at least. If so, we suggest the following filtering process. If the data are centered, then for each gene $g$ selected after Step 6, let $t_g = \sum_{i=1}^{T} \hat{\mu}_i^2$; alternatively, let

$$t_g = \sum_{i=1}^{T} (\hat{\mu}_i - \bar{\hat{\mu}})^2, \qquad \text{where } \bar{\hat{\mu}} = \frac{1}{T} \sum_{i=1}^{T} \hat{\mu}_i.$$

Large values of $t_g$ indicate that the mean expression of gene $g$ is high for at least one time point. Arrange the genes in descending order of $t_g$ and retain the top $R$ genes.
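The filter in Step 7 amounts to scoring each selected gene and keeping the highest scores. A minimal sketch, with hypothetical function names and invented estimates for three genes:

```python
def spread_score(mu_hat, centered=False):
    """t_g: sum of squares of the estimates.

    If the data are already centered, squares are taken about zero;
    otherwise about the mean of the estimates.
    """
    center = 0.0 if centered else sum(mu_hat) / len(mu_hat)
    return sum((m - center) ** 2 for m in mu_hat)

def top_genes(estimates_by_gene, R, centered=False):
    """Return the R gene identifiers with the largest t_g, in descending order."""
    ranked = sorted(estimates_by_gene,
                    key=lambda g: spread_score(estimates_by_gene[g], centered),
                    reverse=True)
    return ranked[:R]

# Invented order-restricted estimates for three selected genes.
estimates = {
    "geneA": [0.2, 0.4, 0.8, 0.5],
    "geneB": [0.1, 0.1, 0.2, 0.1],
    "geneC": [0.0, 1.0, 2.0, 0.5],
}
print(top_genes(estimates, R=2))
```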

APPLICATION TO BREAST CANCER CELL LINE DATA

We illustrate the proposed methodology using log-transformed relative expression data from Lobenhofer et al. (2002). In that study, the MCF-7 breast cancer cell line was treated with 17β-estradiol or ethanol (vehicle control). Samples were harvested at 1, 4, 12, 24, 36 and 48 hours after treatment. At each time point $M = 8$ hybridizations were performed. Each array consisted of $G = 1900$ genes. For each gene, we assumed that the variance of the log relative expression was homoscedastic over time. For each gene $g$, we tested Equation (5) where the alternative hypothesis is the union of the following 10 profiles: monotone decreasing, $C_1$; monotone increasing, $C_2$; four up–down profiles with maxima at 4, 12, 24, 36 hours, $C_3$–$C_6$, respectively; and four down–up profiles with minima at 4, 12, 24, and 36 hours, $C_7$–$C_{10}$, respectively. Using Steps 1–6 with $N = 50000$, we obtained 124 genes with a p-value $\le 0.0025$. Of these, 10 were clustered into $C_1$, 14 into $C_2$, four into $C_3$, 31 into $C_4$, 12 into $C_5$, one into $C_7$, nine into $C_8$, 34 into $C_9$ and nine into $C_{10}$. Applying Step 7 we selected the top 50 genes among these 124. These 50 genes display nine of the 10 candidate profiles (Table 1, Fig. 2).


Table 1. Genes classified according to inequality profile

| Clone ID | Gene name | Functional category | Previously identified |
|---|---|---|---|
| **Decreasing with maximum at 1 hour** | | | |
| 417226 | v-myc viral oncogene homolog | Transcription/Chromatin Structure | Yes |
| **Up–Down with maximum at 4 hours** | | | |
| 110022 | Cyclin D1 | Cell Cycle | Yes |
| 428733 | Protein kinase C, delta | Cellular Signaling | Yes |
| 362059 | Laminin, alpha 3, kalinin, epiligrin | Extracellular Matrix/Cell Structure | Yes |
| 417503 | EST | Unknown | Yes |
| 248613 | v-myb viral oncogene homolog | Transcription/Chromatin Structure | No |
| **Up–Down with maximum at 12 hours** | | | |
| 563187 | CDC6 | Cell Cycle | Yes |
| 321207 | Polymerase (DNA directed), epsilon | DNA Replication/Repair | Yes |
| 196676 | Replication factor C (activator 1) 4 | DNA Replication/Repair | No |
| **Up–Down with maximum at 24 hours** | | | |
| 129140 | MAD2L1 | Cell Cycle | Yes |
| 248008 | Deoxythymidylate kinase | Cell Cycle | Yes |
| 489092 | Deoxythymidylate kinase | Cell Cycle | No |
| 285427 | CSE1L | Cell Cycle | Yes |
| 359119 | CDC28 protein kinase 2 | Cell Cycle | Yes |
| 415639 | Serine/threonine kinase 15 | Cell Cycle | Yes |
| 488059 | Tubulin, gamma 1 | Cell Cycle | Yes |
| 563809 | CDC20 | Cell Cycle | Yes |
| 293274 | Cyclin-dependent kinase inhibitor 3 | Cell Cycle | No |
| 49950 | Flap structure-specific endonuclease 1 | DNA Replication/Repair | Yes |
| 346838 | Minichromosome maintenance deficient 3 | DNA Replication/Repair | Yes |
| 359465 | Dihydrofolate reductase | DNA Replication/Repair | Yes |
| 487757 | Ligase I, DNA, ATP-dependent | DNA Replication/Repair | Yes |
| 49940 | Replication factor C (activator 1) 5 | DNA Replication/Repair | No |
| 52713 | Vitronectin | Extracellular Matrix/Cell Structure | Yes |
| 339075 | Karyopherin alpha 2 | Protein Degradation/Synthesis/Targeting | Yes |
| 136609 | v-myb homolog-like 1 | Transcription/Chromatin Structure | Yes |
| 198205 | v-myb homolog-like 2 | Transcription/Chromatin Structure | Yes |
| 229509 | Coagulation factor V | Miscellaneous | No |
| 200573 | EST | Unknown | Yes |
| 366842 | EST | Unknown | No |
| **Up–Down with maximum at 36 hours** | | | |
| 264117 | Cathepsin D | Cell Cycle | Yes |
| 150163 | Neuropeptide Y receptor Y1 | Cellular Signaling | Yes |
| 238545 | ADP-ribosylation factor-like 3 | Cellular Signaling | Yes |
| 242182 | Protein kinase inhibitor beta | Cellular Signaling | Yes |
| 509614 | High-mobility group protein 1 | Transcription/Chromatin Structure | Yes |
| 510595 | Lactate dehydrogenase A | Miscellaneous | Yes |
| 470480 | Autocrine motility factor receptor | Miscellaneous | No |
| **Down–Up with minimum at 4 hours** | | | |
| 487407 | Insulin induced gene 1 | Miscellaneous | Yes |
| **Down–Up with minimum at 12 hours** | | | |
| 361381 | Myeloid cell leukemia sequence 1 | Apoptosis | Yes |
| 145093 | Myeloid cell leukemia sequence 1 | Apoptosis | No |
| 485875 | EFEMP1 | Extracellular Matrix/Cell Structure | Yes |
| 34821 | CHRNA 4 | Miscellaneous | Yes |
| **Down–Up with minimum at 24 hours** | | | |
| 359191 | Protein kinase H11 | Cellular Signaling | Yes |
| 180789 | Low density lipoprotein-related protein 1 | Protein Degradation/Synthesis/Targeting | Yes |
| 162479 | E74-like factor 3 | Transcription/Chromatin Structure | Yes |
| 430235 | H2B histone family, member Q | Transcription/Chromatin Structure | Yes |
| 545242 | STAT 1 | Transcription/Chromatin Structure | Yes |
| 268652 | p21/CIP 1 | Cell Cycle | No |
| **Down–Up with minimum at 36 hours** | | | |
| 29682 | Protein kinase C binding protein 1 | Cellular Signaling | Yes |
| 365147 | v-erb-b2 homolog 2 | Cellular Signaling | No |

The confidence-interval approach of Chen et al. (1997) identified 105 genes that demonstrated estrogen-responsive expression (Lobenhofer et al., 2002). Of these 105, 39 were also among our top 50. Most of the 39 genes selected in common are involved in cell cycle progression and DNA replication (Lobenhofer et al., 2002), reflecting the known sensitivity of MCF-7 cells to estrogen.

Most of our 11 newly identified genes also display typical phenotypes for estrogen-treated MCF-7 cells. For example, one initial step in DNA replication is the binding of a complex of proteins (known as replication factor C) to DNA in order to recruit other proteins necessary for DNA synthesis. The confidence-interval approach identified one subunit (replication factor C 3) as being regulated by estrogen. Our order-restricted-inference approach identified an additional two components of the complex (replication factors C 4 and C 5) as having increased levels of expression at time points when the estrogen-stimulated cells are undergoing DNA synthesis. Another interesting observation was the decreased expression of cyclin-dependent kinase inhibitor 1A (also known as p21 and Cip1), as shown previously by Prall et al. (2001). This inhibitory gene functions not only in the cell cycle at the transition from the G1 into the S phase (during which genome replication occurs) but also in the process of DNA synthesis. Therefore, the fact that estrogen induces MCF-7 cells to divide supports the finding that a gene that inhibits this process would be repressed. Finally, several genes were represented by two different spots (clones) on the microarray chips. Using the confidence-interval approach, deoxythymidylate kinase (248008) and myeloid cell leukemia sequence 1 (361381) were seen to be regulated by estrogen. Interestingly, the order-restricted-inference approach not only identified these genes as exhibiting altered expression in the presence of estrogen but also identified them based on two different spots that represent the same genes (Clone IDs 489092 and 145093; Table 1). These findings illustrate that our methodology can identify genes whose estrogen responses are biologically interpretable.

A simulation experiment

We investigated the false positive rate of our procedure using a small simulation study. To generate unpatterned null data, we created 48 new observations for each gene by randomly assigning the 48 original observations (with replacement), eight to each of the six time points. By this device, we generated 1900 genes whose true underlying profiles lack any systematic pattern. Our simulations suggest that our methodology provides fairly accurate Type I error rates and tends to be conservative for smaller levels of significance. For example, corresponding to a nominal level of 0.0025, our estimated Type I error was 0.0005; and, for a nominal level of 0.05, our estimated Type I error was 0.049.

DISCUSSION

In studies where experimental conditions have an inherent ordering, making use of ordering information can improve inference. In microarray experiments, the ability to exploit ordering information may be especially valuable because genes whose expression levels change in concert through time may be components of the same cellular process or may share regulatory elements. Yet, virtually none of the commonly used methods for analysis of microarray data takes account of time ordering. Those researchers who have recognized the importance of time-course information (Chu et al., 1998; Heyer et al., 1999) developed procedures based on correlation coefficients, an approach fundamentally different from ours. We have proposed an algorithm based on the statistical theory of order-restricted inference that makes explicit use of ordering information when selecting differentially expressed genes. Our approach selects genes whose expression levels through time are both significantly different from the null profile and similar to one of a set of pre-identified candidate profiles. Consequently, selected genes are naturally clustered into classes with similarly shaped profiles.

Our methodology is general and enjoys several desirable properties. First, the estimated mean expression levels, subject to an inequality profile, satisfy certain optimality properties discussed in Hwang and Peddada (1994). In particular, the estimator universally dominates the unrestricted maximum likelihood estimator. Secondly, genes are selected into clusters based in part on a statistical significance criterion. Groupings obtained using unsupervised methods such as cluster analytic algorithms cannot make claims about Type I error rates. A related and important feature of our procedure is that it can select genes with subtle but reproducible expression changes over time and, hence, uncover some genes that may not be detectable by other approaches. Our example illustrated this feature with respect to the approach of Chen et al. (1997).

Both our procedure and correlation-based procedures require that investigators pre-specify a set of candidate profiles. What exactly is required of the candidate profiles, however, differs sharply between the two approaches. With our procedure, one need only describe the shapes of profiles in terms of mathematical inequalities; whereas, with the correlation-based procedures, one must specify numerical values at each time point for each candidate profile. Since exact values at time points are rarely known a priori, correlation-coefficient-based procedures often use averages from selected small subsets of genes to establish candidate profiles. Those genes that establish the profiles are essentially exempted from the analysis, and they may be the only genes representing their profiles. Our candidate profiles are specified without reference to data from the study, and a candidate profile may turn out to be represented by no genes. In fact, no genes in the top 50 of our example followed the monotone non-decreasing profile. Thus, our methodology is much less restrictive than the correlation-coefficient-based alternative.

Kerr and Churchill (2001) have advocated investigating the reliability of clustering results from microarray studies. Although we have not formally examined the reliability of our clustering results in that sense, one can conceive of embedding our procedure into a general bootstrapping framework similar in spirit to their approach.

Investigators might be interested in distinguishing more subtle patterns than considered here. For example, Chu et al. (1998) display two candidate profiles (Early I and Early II in their Figure 4(b)) that rise to a maximum at 7 hours and then decrease. Thus, both are up–down profiles with a maximum at 7 hours and would not be distinguished by the candidate profiles that we have described. These two profiles differ, however, in that one rises rapidly after the first time point then more slowly to the peak whereas the other exhibits a rapid rise after the second time point. Our approach could be adapted to distinguish such sub-profiles by imposing order restrictions on suitable differences among mean expression levels.

Although the procedure that we described is designed for genes with a constant variance through time, it can be generalized to handle situations when the variances change or are subject to order restrictions themselves. In such situations, the estimation of mean gene expression outlined in this paper may be modified along the lines of Shi (1994). The required modifications to the bootstrap described in Step 4 remain a subject for future investigation.

In conclusion, we believe that methods of analysis that exploit the ordering of treatments to improve estimation will become increasingly valuable for time-course and dose–response microarray experiments. Our approach based on order-restricted inference should improve gene selection and clustering whenever treatments are inherently ordered.

ACKNOWLEDGEMENTS

The authors thank Drs D. Dunson, K. Kerr, F. Parham, N. Walker and R. Wolfinger for their careful reading of this manuscript and for their useful comments that improved the presentation.

REFERENCES

Chen,Y., Dougherty,E. and Bittner,M. (1997) Ratio-based decisions

and the quantitative analysis of cDNA microarray images. J.

Biomed. Optics, 2, 364–374.

Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O.

and Herskowitz,I. (1998) The transcriptional program of sporu-

lation in budding yeast. Science, 282, 699–705.

Dunnett,C. (1955) A multiple comparison procedure for comparing

several treatments with a control. J. Amer. Statist. Assoc., 50,

1096–1121.

Golub,T., Slonim,D., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.,

Coller,H., Loh,M., Downing,J., Caligiuri,M., Bloomfield,C.

and Lander,E. (1999) Molecular classification of cancer:

class discovery and class prediction by gene expression monitor-

ing. Science, 286, 531–537.

Hamadeh,H., Bushel,P., Paules,R. and Afshari,C. (2001) Discovery

in toxicology: mediation by gene expression array technology. J.

Biochem. Molec. Tox., 15, 231–242.

Heyer,L.J., Kruglyak,S. and Yooseph,S. (1999) Exploring expres-

sion data: identification and analysis of coexpressed genes.

Genome Res., 9, 1106–1115.

Hwang,J. and Peddada,S. (1994) Confidence interval estimation

subject to order restrictions. Ann. Statist., 22, 67–93.

Kerr,M. and Churchill,G. (2001) Bootstrapping cluster analysis: as-

sessing the reliability of conclusions from microarray experi-

ments. Proc. Natl Acad. Sci. USA, 98, 8961–8965.

Li,L., Weinberg,C., Darden,T. and Pedersen,L. (2001a) Gene se-

lection for sample classification based on gene expression data:

study of sensitivity to choice of parameters of the GA/KNN

method. Bioinformatics, 17, 1131–1142.

Li,L., Darden,T., Weinberg,C., Levine,A. and Pedersen,L. (2001b)

Gene assessment and sample classification for gene expression

data using a genetic algorithm/k-nearest neighbor method. Comb.

Chem. High Throughput Screening, 4, 727–739.



Lobenhofer,E., Bennett,L., Cable,P., Li,L., Bushel,P. and Afshari,C.

(2002) Regulation of DNA replication fork genes by 17beta-

estradiol. Molec. Endocrin., 16, 1215–1229.

Long,A., Mangalam,H., Chan,B., Tolleri,L., Hatfield,G. and Baldi,P.

(2001) Improved statistical inference from DNA microarray data

using analysis of variance and a Bayesian statistical framework.

Analysis of global gene expression in Escherichia coli K12. J.

Biol. Chem., 276, 19 937–19 944.

Peddada,S., Prescott,K. and Conaway,M. (2001) Tests for order

restrictions in binary data. Biometrics, 57, 1219–1227.

Prall,O., Carroll,J. and Sutherland,R. (2001) A low abundance pool

of nascent p21WAF1/Cip1 is targeted by estrogen to activate

cyclin E*Cdk2. J. Biol. Chem., 276, 45 433–45 442.

Shi,N. (1994) Maximum likelihood estimation of means and variances

from normal populations under simultaneous order restrictions.

J. Mult. Anal., 50, 282–293.

Tusher,V., Tibshirani,R. and Chu,G. (2001) Significance analysis of

microarrays applied to the ionizing radiation response. Proc. Natl

Acad. Sci. USA, 98, 5116–5121.

Williams,D. (1977) Some inference procedures for monotoni-

cally ordered normal means. Biometrika, 64, 9–14.

APPENDIX: ESTIMATION OF PARAMETERS

(HWANG AND PEDDADA, 1994)

There are two types of profiles, those with at least one

nodal parameter and those with no nodal parameters. We

first describe estimation for the former.

(A) PROFILES WITH AT LEAST ONE NODAL

PARAMETER

For a gene g at time t, suppose $\bar{Y}_t$ is the sample mean based on $n_t$ observations. We assume that $\mathrm{Var}(\bar{Y}_t) = \sigma^2 / n_t$. Repeat Steps A1–A4 described below until all parameters are estimated.

Estimation of nodal parameters in a given

inequality profile

STEP A1. Suppose $\mu_t$ is a nodal parameter in the profile. Maintaining all the known inequalities and assigning arbitrary inequalities among those parameters where the inequalities are unknown, one obtains a non-decreasing order of the form Equation (1). For $i = 1, 2, \ldots, T$, let the ordered true means be denoted by $\mu_{(i)}$, where $\mu_{(1)} \le \mu_{(2)} \le \cdots \le \mu_{(T)}$, and let the corresponding sample means and sample sizes be denoted by $\bar{Y}_{(i)}$ and $n_{(i)}$, respectively. Thus, for some $s$, $\mu_t \equiv \mu_{(s)}$.

EXAMPLE A1. Consider a profile with four parameters such that $\mu_1 \le \mu_2 \ge \mu_3 \ge \mu_4$. Here the only nodal parameter is $\mu_2$. Inequalities between $\mu_1, \mu_3$ and between $\mu_1, \mu_4$ are unknown. To estimate $\mu_2$, we may arrange the four parameters as $\mu_4 \le \mu_1 \le \mu_3 \le \mu_2$. Thus $\mu_4 \equiv \mu_{(1)}$, $\mu_1 \equiv \mu_{(2)}$, $\mu_3 \equiv \mu_{(3)}$, $\mu_2 \equiv \mu_{(4)}$.
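The identification of nodal parameters in Example A1 can be sketched in code. The sketch below assumes that a parameter is nodal when its inequality with every other parameter in the profile is known, either directly or by transitivity. The function names and the encoding of a profile as a list of index pairs are illustrative choices, not part of the original procedure.

```python
from itertools import product


def comparable_pairs(n_params, known_leq):
    """Return all pairs (i, j) for which mu_i <= mu_j can be deduced.

    known_leq holds pairs (i, j), 1-based, meaning mu_i <= mu_j is known.
    The result is the reflexive, transitive closure of those pairs.
    """
    leq = {(i, i) for i in range(1, n_params + 1)} | set(known_leq)
    # Warshall-style closure: the intermediate index k is the outermost loop.
    for k, i, j in product(range(1, n_params + 1), repeat=3):
        if (i, k) in leq and (k, j) in leq:
            leq.add((i, j))
    return leq


def nodal_parameters(n_params, known_leq):
    """List the indices whose order relative to every other index is known."""
    leq = comparable_pairs(n_params, known_leq)
    idx = range(1, n_params + 1)
    return [t for t in idx
            if all((t, j) in leq or (j, t) in leq for j in idx)]


# Example A1: mu1 <= mu2 >= mu3 >= mu4, written as "smaller, larger" pairs.
up_down = [(1, 2), (3, 2), (4, 3)]
print(nodal_parameters(4, up_down))  # [2]
```

For a simple monotone order such as `[(1, 2), (2, 3), (3, 4)]`, every parameter is comparable with every other, so all four indices are returned.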

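STEP A2. Estimate the nodal parameter $\mu_t \equiv \mu_{(s)}$ using the following formula:

$$\hat{\mu}_t \equiv \hat{\mu}_{(s)} = \min_{r \ge s} \, \max_{q \le s} \, \frac{\sum_{k=q}^{r} n_{(k)} \bar{Y}_{(k)}}{\sum_{k=q}^{r} n_{(k)}} \qquad \text{(A1)}$$

In the example, the estimate of $\mu_2 \equiv \mu_{(4)}$ is

$$\hat{\mu}_2 \equiv \hat{\mu}_{(4)} = \max_{q \le 4} \, \frac{\sum_{k=q}^{4} n_{(k)} \bar{Y}_{(k)}}{\sum_{k=q}^{4} n_{(k)}} \qquad \text{(A2)}$$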
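Formula (A1) can be evaluated directly by enumerating the pooled averages. The following is a minimal sketch, assuming the sample means and sample sizes have already been arranged in the order of Step A1. The function name `minmax_estimate` and the 0-based position `s` are conventions chosen for this illustration.

```python
def minmax_estimate(means, sizes, s):
    """Evaluate formula (A1) at position s (0-based) of the arranged order.

    means[k] and sizes[k] are the sample mean and sample size of the
    parameter placed at position k in the arrangement of Step A1.
    """
    def pooled(q, r):
        # Weighted average of the sample means at positions q..r inclusive.
        w = sizes[q:r + 1]
        y = means[q:r + 1]
        return sum(n * m for n, m in zip(w, y)) / sum(w)

    T = len(means)
    # min over r >= s of the max over q <= s of the pooled average.
    return min(max(pooled(q, r) for q in range(s + 1))
               for r in range(s, T))


# Three arranged means that violate the assumed order at positions 1 and 2.
print(minmax_estimate([1.0, 3.0, 2.0], [1, 1, 1], 1))  # 2.5
```

When `s` is the last position, as in (A2), the outer minimum runs over a single value of `r`, and the formula reduces to a maximum over pooled averages that end at the nodal parameter.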

STEP A3. Once a parameter is estimated, in all future calculations replace its sample mean $\bar{Y}$ by its estimated value $\hat{\mu}$ from Step A2, and its sample size $n$ by $B$, where $B \to \infty$.

Estimation of non-nodal parameters

STEP A4. To estimate a non-nodal parameter $\mu_t$, identify the largest sub-profile having $\mu_t$ as a nodal parameter. Using the data corresponding to the sub-profile, estimate $\mu_t$ by applying Steps A1–A3.

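EXAMPLE A1 (CONTINUED). The largest sub-profile in which $\mu_3$ is nodal is also the largest in which $\mu_4$ is nodal: $\mu_4 \le \mu_3 \le \mu_2$. Hence $\mu_3$ and $\mu_4$ are estimated using formulae derived from (A1):

$$\hat{\mu}_3 = \min\left\{ \max\left\{ \bar{Y}_3, \; \frac{n_3 \bar{Y}_3 + n_4 \bar{Y}_4}{n_3 + n_4} \right\}, \; \hat{\mu}_2 \right\}, \qquad \hat{\mu}_4 = \min\left\{ \bar{Y}_4, \; \frac{n_3 \bar{Y}_3 + n_4 \bar{Y}_4}{n_3 + n_4}, \; \hat{\mu}_2 \right\} \qquad \text{(A3)}$$

Note that $\mu_1$ is nodal in the sub-profile $\mu_1 \le \mu_2$. Hence, from (A1) we deduce:

$$\hat{\mu}_1 = \min\{ \bar{Y}_1, \; \hat{\mu}_2 \} \qquad \text{(A4)}$$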
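The closed-form estimates (A2)–(A4) for Example A1 can be collected into one short routine. This is a sketch under the assumptions of the example only: the profile is fixed, and so is the arbitrary arrangement used for the nodal parameter. The function name and the numerical inputs are hypothetical and chosen only for illustration.

```python
def estimate_example_a1(ybar, n):
    """Estimate means under the four-parameter profile of Example A1.

    ybar and n are dicts keyed by time point 1..4 holding the sample
    means and sample sizes of a single gene.
    """
    def pooled(ts):
        # Sample-size-weighted average of the sample means at time points ts.
        return sum(n[t] * ybar[t] for t in ts) / sum(n[t] for t in ts)

    # (A2): nodal parameter mu2, using the arrangement (4, 1, 3, 2)
    # from Example A1; the maximum runs over averages ending at mu2.
    order = [4, 1, 3, 2]
    mu2 = max(pooled(order[q:]) for q in range(len(order)))

    # (A3): mu3 and mu4 from the sub-profile on time points 4, 3, 2,
    # with mu2 already replaced by its estimate (Step A3).
    p34 = pooled([3, 4])
    mu3 = min(max(ybar[3], p34), mu2)
    mu4 = min(ybar[4], p34, mu2)

    # (A4): mu1 from the sub-profile on time points 1, 2.
    mu1 = min(ybar[1], mu2)
    return {1: mu1, 2: mu2, 3: mu3, 4: mu4}


# Hypothetical sample means in which time point 3 exceeds time point 2.
ybar = {1: 2.0, 2: 3.0, 3: 4.0, 4: 1.0}
n = {1: 2, 2: 2, 3: 2, 4: 2}
print(estimate_example_a1(ybar, n))  # {1: 2.0, 2: 3.5, 3: 3.5, 4: 1.0}
```

In this illustration the sample means at time points 2 and 3 violate the profile, so both estimates are pulled to their pooled average of 3.5, while the estimates at time points 1 and 4 stay at their sample means.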

(B) PROFILES WITH NO NODAL PARAMETERS

STEP B1. Identify the largest sub-profile with at least

one nodal parameter. Estimate all parameters of the sub-

profile using Steps A1–A4.

STEP B2. Identify the next largest sub-profile contain-

ing at least one nodal parameter. Using Steps A1–A4, es-

timate all parameters in the sub-profile. Repeat Step B2

until all parameters in the profile are estimated.
