BioMed Central

BMC Bioinformatics

Open Access

Research

Determination of the minimum number of microarray experiments

for discovery of gene expression patterns

Fang-Xiang Wu*1,2, WJ Zhang1,2 and Anthony J Kusalik2,3

Address: 1Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada, 2Division of Biomedical

Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada and 3Department of Computer Science, University of Saskatchewan,

Saskatoon, SK, S7N 5C9, Canada

Email: Fang-Xiang Wu* - faw341@mail.usask.ca; WJ Zhang - chris.zhang@usask.ca; Anthony J Kusalik - kusalik@cs.usask.ca

* Corresponding author

Abstract

Background: One type of DNA microarray experiment is discovery of gene expression patterns

for a cell line undergoing a biological process over a series of time points. Two important issues

with such an experiment are the number of time points, and the interval between them. In the

absence of biological knowledge regarding appropriate values, it is natural to question whether the

behaviour of progressively generated data may by itself determine a threshold beyond which

further microarray experiments do not contribute to pattern discovery. Additionally, such a

threshold implies a minimum number of microarray experiments, which is important given the cost

of these experiments.

Results: We have developed a method for determining the minimum number of microarray

experiments (i.e. time points) for temporal gene expression, assuming that the span between time

points is given and the hierarchical clustering technique is used for gene expression pattern

discovery. The key idea is a similarity measure for two clusterings which is expressed as a function

of the data for progressive time points. While the experiments are underway, this function is

evaluated. When the function reaches its maximum, this indicates that the set of experiments has reached a

saturated state. Therefore, further experiments do not contribute to the discrimination of

patterns.

Conclusion: The method has been verified with two previously published gene expression

datasets. For both experiments, the number of time points determined with our method is less

than in the published experiments. It is noted that the overall approach is applicable to other

clustering techniques.

Background

Recent advances in microarray technologies [1,2] and

genome sequencing have allowed the expression level of

thousands of genes to be monitored in parallel. This tool

provides important new information for the fundamental

understanding of biological processes at the molecular

from Symposium of Computations in Bioinformatics and Bioscience (SCBB06) in conjunction with the International Multi-Symposiums on Computer and

Computational Sciences 2006 (IMSCCS|06)

Hangzhou, China. June 20–24, 2006

Published: 12 December 2006

BMC Bioinformatics 2006, 7(Suppl 4):S13 doi:10.1186/1471-2105-7-S4-S13

Supplement: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). Editors: Youping Deng, Jun Ni. Research. http://www.biomedcentral.com/content/pdf/1471-2105-7-S4-info.pdf

© 2006 Wu et al; licensee BioMed Central Ltd

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

level. Such understanding has proved very useful in medical

diagnosis, disease treatment, and drug design. From the

viewpoint of data analysis, microarray experiments may

be categorized into (1) classification of subjects into vari-

ous subtypes based on gene expressions, (2) discovery of

gene expression patterns over a set of conditions, and (3)

discovery of gene expression patterns for a cell line over a

series of time points while a biological process is under-

way. This paper concerns the third category of experi-

ments, which may also be called temporal gene

expression pattern discovery. The biological significance

of understanding temporal gene expression patterns has

been well recognized; see Eisen et al. [3].

An important feature of this category of microarray exper-

iments is dependency among gene expression data corre-

sponding to different time points. An important issue is

thus the specification of time points, including their

number and the span between them. In the absence of

knowledge from the biologist about this specification,

one naturally questions whether the behaviour of data

generated from a progressive microarray experiment may

help by itself determine a "cut off", beyond which further

microarray experiments do not contribute to the discrimination

of gene expression patterns. Additionally, such a

cut-off value implies the minimum number of microarray

experiments, which is important because these experi-

ments are costly in terms of time, reagents, and well-

trained technicians [4,5]. Therefore, it is useful to develop

a method for determining the minimum number of time

points required to obtain useful patterns for this class of

microarray experiments. The study reported in this paper

develops such a method. Initially we assume that the

interval between consecutive time points is given (preset).

Later in the document, this assumption is discussed fur-

ther.

A clustering technique is typically used for discovering

patterns in gene expression data. There are many clustering

algorithms available in the literature for gene expression

profiling, including hierarchical clustering [6], K-means

clustering [7], self-organizing maps [8] and mixture

model-based clustering [9]. In this study, we adopt the hierarchical

clustering technique. However, the method for

determination of the minimum number of microarray

experiments can be similarly developed with other clus-

tering techniques.

There appear to be only a few related studies in the literature.

The most closely related work is that of Hwang et al. [5],

in which a method was developed for determining the

minimum sample size in the context of supervised classification

of gene expression data. Their method addresses

the problem for the first category of experiments.

Hwang et al. [5] assumed that different

samples were statistically independent, and thus they

were able to apply power analysis in statistics. However,

their method cannot be applied to the third category of

experiments because temporal gene expression data

depend on each other across time points and thus

are not statistically independent. Other related studies

include those of Lee et al. [10] and Pan et al. [11]. Lee et

al. [10] studied a generic problem of determining the

number of replicates needed for producing high-quality

gene expression data. Their method was based on a mixture

probability density function with two normal distributions.

Pan et al. [11] further extended Lee et al.'s model

into a mixture of a number of normal distributions.

There are two key ideas in this paper. First, a statistics-

based similarity measure for two clusterings produced

with the hierarchical clustering technique is defined. Sec-

ond, a procedure is developed for determining whether an

experiment after time point m - 1 further contributes to the

identification of patterns. The procedure compares two

clusterings based on data over the first m - 1 and m time

points.

Results and Discussions

To evaluate the proposed method, a program implement-

ing it was run on two datasets: the fibroblast dataset and

the cdc15 dataset (see the "Methods" section for the

details about these datasets). The function c(m) is

employed to measure the similarity of two clusterings

based on expression data over the first m and m - 1 time

points (see the "Methods" section for the definition). Figures

2 and 3 depict the profiles of c(m) with respect to the

number of time points m for the fibroblast dataset and for

the cdc15 dataset, respectively. Correspondingly, Tables 2

and 3 list the numerical values of c(m). It can be seen from

these two figures that the c(m) values in both datasets ini-

tially increase monotonically with respect to the number

of time points, reach a maximum, and then appear to ran-

domly fluctuate thereafter.

Table 2 and Figure 2 show that c(m) reaches an initial

maximum when data from the first 9 time points are used

to cluster genes and then appears to randomly fluctuate

when more data are added. Therefore, it is reasonable to

claim that nine is the minimum number of time points

necessary for clustering genes for the fibroblast experiment.

This result matches very well with the fact that the

data from the first nine time points of the fibroblast experiment were

collected over the first 16 hours after serum stimulation, and

the period of cell division is 16 hours (see Table 1). It

should be noted that to detect the maximum, the tenth

data sample needs to be added. Thus in fact, the whole

experiment requires 10 time points.

Table 3 and Figure 3 show that c(m) reaches an initial

maximum when data from the first 8 time points are used

to cluster genes and then appears to randomly fluctuate

when more data are added. Again it is reasonable to claim

that eight is the minimum number of time points neces-

sary for clustering genes in the cdc15 experiment. This

result also correlates very well with the fact that data from

the first eight time points were collected over the first 100

minutes after cdc15-based synchronization, and the

period of cell division is about 100 minutes (see Table 1).

For the same reason as in the fibroblast case, the cdc15

experiment actually requires 9 time points.
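
The reading of the c(m) profiles described above can be sketched in a few lines of Python. This is only an illustration: the c(m) values are copied from Tables 2 and 3, while the helper names `first_maximum` and `experiments_required` are hypothetical, and "first maximum" is taken here to mean the last value before the first drop in c(m).

```python
# c(m) values copied from Tables 2 and 3, keyed by the number of time points m.
FIBROBLAST_C = {4: 0.6162, 5: 0.6715, 6: 0.7786, 7: 0.8486, 8: 0.8830,
                9: 0.9030, 10: 0.8775, 11: 0.9184, 12: 0.8732}
CDC15_C = {4: 0.6030, 5: 0.6940, 6: 0.7545, 7: 0.7713, 8: 0.8847,
           9: 0.8562, 10: 0.8670, 11: 0.8305, 12: 0.8187, 13: 0.9046,
           14: 0.9023, 15: 0.8659, 16: 0.7913, 17: 0.8834, 18: 0.8835,
           19: 0.9201, 20: 0.8600, 21: 0.9063, 22: 0.9120, 23: 0.9466}


def first_maximum(c_by_m):
    """Return the smallest m whose c(m) is followed by a lower value, or None.

    Hypothetical helper: the first drop in c(m) is taken as the signal
    that the preceding value was the first maximum.
    """
    ms = sorted(c_by_m)
    for m, m_next in zip(ms, ms[1:]):
        if c_by_m[m_next] < c_by_m[m]:
            return m
    return None


def experiments_required(c_by_m):
    """Time points actually run: the first maximum plus one confirming point."""
    m_star = first_maximum(c_by_m)
    return None if m_star is None else m_star + 1


if __name__ == "__main__":
    for name, table in (("fibroblast", FIBROBLAST_C), ("cdc15", CDC15_C)):
        print(name, first_maximum(table), experiments_required(table))
```

The extra time point returned by `experiments_required` reflects the remark above that the maximum can only be recognised once the next sample has been added.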

The function d(k, m) is defined to measure the similarity

of two k-partition clusterings based on expression data

over the first m and m - 1 time points and obtained at a

proper level of the corresponding hierarchical clusterings

(see the "Methods" section for the definition). In the following,

we examine the behaviour of d(k, m) by setting k

= 3, 4, ..., 10, respectively. This is very important as we want

to understand how the number of clusters, k, could affect

the results, specifically, whether the minimum number of

microarray experiments obtained with c(m) is applicable

to different partitions. Figure 4 shows the profiles of d(k,

m) for the fibroblast experiment for the various values of

k. When the fourth sample is added, the probability that

any two gene pairs are clustering-invariant for partition

with 3 clusters is only about 60% while the probability for

the partition with 10 clusters is about 80%. It is found that

the probabilities for all partitions increase as more

data are added. For instance, when the seventh sample is

added d(3, m) reaches its initial maximum, and at this

Figure 3

Profile of function c(m) with respect to the number of time

points for the cdc15 experiment.

Figure 1

A dendrogram of hierarchical clustering of 5 objects.

The numbers on the horizontal axis represent the indices of

objects, and the numbers on the vertical axis represent the

distance between any two objects connected. In this example,

objects 1 and 3 are merged in the first level; objects 4 and

5 are merged in the second level; objects (1, 3) and 2 are

merged in the third level; and finally objects (1, 2, 3) and (4,

5) are merged in the fourth level. When the dendrogram is cut at

the fourth level, two clusters (1, 2, 3) and (4, 5) are obtained;

when the dendrogram is cut at the third level, three clusters

(1, 3), (2) and (4, 5) are obtained; and so on.

Figure 2

Profile of function c(m) with respect to the number of time

points for the fibroblast experiment.

point, the probability that any two gene pairs are cluster-

ing-invariant is about 95%. When the eighth sample is

added d(10, m) reaches a maximum for the first time, and

at this point, the probability that any two gene pairs are

clustering-invariant is about 94%. For other partitions,

their corresponding values of d(k, m) reach their initial

maxima before the ninth sample is added (see Figure 4).

It is interesting to observe from the above discussion

regarding the behaviour of d(k, m) that fewer samples may

be needed to obtain a k-partition when the number of

clusters k is known as a priori information. This seems to

be reasonable as more clusters require more discriminant

features (i.e. more samples). However, there may exist a

kind of 'saturated' k (i.e., the number of clusters), beyond

which the increase of k will not call for more samples. For

instance, in the case of the fibroblast experiment, such a saturated

number of clusters is 7, as the same number of

samples (i.e., 8) is required for numbers of clusters more

than 7. A similar situation can be observed for the cdc15

dataset.
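
To make the notion of a clustering-invariant gene pair concrete, the sketch below counts the fraction of object pairs that two partitions treat in the same way (both together or both apart). This is an assumption made for illustration: it is a Rand-index style agreement, not the definition of d(k, m), which is given in the "Methods" section, and the function name `pair_agreement` is hypothetical.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Fraction of object pairs treated the same way by two partitions.

    A pair counts as invariant when both partitions put the two objects
    in one cluster, or both put them in different clusters. This is a
    Rand-index style stand-in, not the paper's d(k, m).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    invariant = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return invariant / len(pairs)
```

In the setting above, `labels_a` and `labels_b` would be the cluster labels of the same genes in the k-partitions obtained from the first m - 1 and the first m time points.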

Conclusion

The method proposed in this paper to determine the min-

imum number of time points required in DNA microarray

experiments for clustering genes has been shown to be

effective by analyzing two previously published datasets:

fibroblast and cdc15. These two datasets have temporal

gene expression profiles with definite periods; specifically

about 16 hours for the fibroblast and about 100 minutes

for the cdc15. The periodic behaviours of these two data-

sets were observed in the originating experiments; specifi-

cally the number of time points is 12 for the fibroblast

dataset, while the number of time points is 24 for the cdc15

dataset. With our method, we obtain the following num-

bers of microarray experiments: 10 for the fibroblast

experiment and 9 for the cdc15 experiment. These

minima imply a reduction in the number of time points (i.e.,

microarray experiments) from 12 to 10 for the fibroblast experiment

and from 24 to 9 for the cdc15 experiment.

Another finding is regarding the use of the average linkage

method of the hierarchical clustering technique with

Euclidean distance measure and the γc measure for cluster-

ing similarity employed in our method. Overall our com-

putational experiments have shown that such a

combination appears to work well for applications, which

is consistent with the result and conclusion obtained by

Dougherty et al. [12]. Last, the index D is able to give

detailed information about the object pair invariant prop-

erty, which can be useful when the number of clusters is

given perhaps by a biologist. In such a case, the number of

microarray experiments can be further reduced.
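
As an illustration of the combination named above (average linkage with Euclidean distance), the following is a deliberately naive pure-Python sketch. It is not the program used in this study; the function names `euclidean` and `average_linkage` and the toy data are assumptions, and the repeated distance computations make it unsuitable for large numbers of genes.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(profiles, k):
    """Agglomerate profiles until k clusters remain; return sorted index lists."""
    if not 1 <= k <= len(profiles):
        raise ValueError("k must be between 1 and the number of profiles")
    clusters = [[i] for i in range(len(profiles))]

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
    print(average_linkage(toy, 3))
```

Stopping the merging at k clusters corresponds to cutting the dendrogram at the level that yields a k-partition, which is the situation in which the number of clusters is supplied in advance.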

There are several limitations with this study at its present

stage. First, the span between two consecutive time points

obviously affects the minimum number of microarray

experiments required. The shorter the interval, the better

the resolution of a time-series gene expression profile. The

present study assumed that the interval is given. When the

underlying biological processes do not suggest an appro-

priate interval, the study presented by Langmead et al.

[13] is helpful. In it a computational approach was devel-

oped for determining a reasonable period between two

consecutive time points. It is expected that the combina-

tion of their method and the method presented in this

paper would further reduce the number of microarray

experiments necessary for pattern discovery.

One of the problems with our method is that beyond the

cut-off (i.e., the minimum number of experiments) both

c(m) and d(k, m) functions fluctuate, which seems to challenge

the legitimacy of our idea (i.e., to pick the first

maximum of c(m) or d(k, m)). Our experience is that such

fluctuations should not be statistically significant. While a

probabilistically sound proof of this statement is war-

ranted, it would require a large number of samples, the

cost of which is unaffordable by many labs.

Another problem lies in computational overhead with the

agglomerative hierarchical clustering techniques. For an

Table 2: c(m) for the fibroblast experiment. The value marked with * is the value at which c(m) reaches a maximum for the first time.

m      4        5        6        7        8        9        10       11       12
c(m)   0.6162   0.6715   0.7786   0.8486   0.8830   0.9030*  0.8775   0.9184   0.8732

Table 1: The summary of two datasets.

Name of dataset   # of time points   # of selected genes   Period
fibroblast        12                 517                   About 16 hours
cdc15             24                 813                   100 ± 10 minutes

experiment with a large number of genes, the run time to

perform a hierarchical clustering can be very long. There

are two solutions to this problem. One is from a compu-

tation perspective, e.g., introducing a parallel algorithm or

using a computer cluster. The other is to look into the bio-

logical process itself, screening for a small set of genes

which play dominant roles. For example, Hwang et al. [5]

first determined a set of dominant genes and then con-

structed a classifier with this reduced set of genes.

In conclusion, this study has produced a method for

determining the minimum number of microarray experi-

ments for collecting temporal gene expression data when

the interval of time points is pre-specified from a biologi-

cal viewpoint. The presented method works for the hierar-

chical clustering technique and for two situations: (1) the

number of clusters is not given, and (2) the number of

clusters is given. Although this method appears to be

more useful for cyclic gene expression profiles (notably,

the method has been validated with two sets of data

from cyclic cell division processes), it can also be used in

other situations. For example, it can be applied to microarray

experiments monitoring the development or growth

process of a cell.

Methods

Datasets and gene selection

Two previously published datasets from DNA microarray

experiments were used in this study. Iyer et al. [14] studied

the responses of human fibroblasts to serum and meas-

ured the temporal changes of 8613 human genes in

mRNA levels at 12 time points, ranging from 15 minutes

to 24 hours after serum stimulation. They selected 517 out

of 8613 genes for their study. A gene was selected

if either (i) its expression profile has at least

two log2-ratio values whose magnitudes are greater than

log2(2.2); or (ii) the standard deviation of its log2-transformed

expression values exceeds 0.7. Our first dataset

comes from Iyer et al.'s experiments and consists of

expression data of these 517 genes at 12 time points. This

dataset is available at http://genome-www.stanford.edu/serum.
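
The two selection criteria can be sketched as a small filter. This is an illustration under stated assumptions, not the code used by Iyer et al. or in this study: the input is assumed to be one gene's profile of log2-transformed expression ratios, the sample standard deviation is used because the text does not say which estimator was applied, and the names `is_selected`, `FOLD_THRESHOLD` and `SD_THRESHOLD` are hypothetical.

```python
import math
import statistics

FOLD_THRESHOLD = math.log2(2.2)  # magnitude threshold from criterion (i)
SD_THRESHOLD = 0.7               # standard deviation threshold from criterion (ii)


def is_selected(log2_ratios):
    """Return True if a gene's log2-ratio profile meets criterion (i) or (ii)."""
    # Criterion (i): at least two values whose magnitude exceeds log2(2.2).
    large = sum(1 for v in log2_ratios if abs(v) > FOLD_THRESHOLD)
    if large >= 2:
        return True
    # Criterion (ii). Assumption: sample standard deviation (n - 1 denominator).
    return len(log2_ratios) > 1 and statistics.stdev(log2_ratios) > SD_THRESHOLD


if __name__ == "__main__":
    print(is_selected([0.0, 1.5, -1.3, 0.1]))   # two large values
    print(is_selected([0.0, 0.1, -0.1, 0.2]))   # flat profile
```

Applying such a filter to every gene and keeping those for which it returns True would give the reduced gene set that is then clustered.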

Spellman et al. [15] studied the mitotic cell division cycle

of yeast and monitored more than 6000 ORFs of yeast

(Saccharomyces cerevisiae) at 24 time points in a cdc15-syn-

chronized experiment. The original dataset is available at

http://genome-www.stanford.edu/SVD. Our second data-

set is based on this dataset. We selected 813 out of 6113

genes using the same selection criteria as Iyer et al. [14]

together with the criterion that the expression profiles had

no missing data in the 24 arrays. Note that both datasets

describe cyclic cell division processes. Specifically, the

period of cell division in the human fibroblast cell-cycle is

about 16 hours while the period of cell division in the

yeast cdc15-synchronized cell cycle is 100 ± 10 minutes

[13-16]. Table 1 summarizes these two datasets.

Similarity Measures

Clustering algorithms for gene expression data need a

similarity/distance measure. The choice of a similarity/

distance measure may be as important as the choice of

clustering algorithms. Two types of measures are exten-

sively used in the comparison of gene expression profiles:

correlation coefficient and Euclidean distance. For the following

definitions assume that g1 = (g11, g12, ..., g1m) and g2

= (g21, g22, ..., g2m) represent the temporal expression profiles

for genes g1 and g2, respectively, where m is the

number of time points, and gij represents the expression

value of gene i at time point j.
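
In this notation, the two measures can be sketched as follows. The code is illustrative only: the function names are hypothetical, and the optional `offset1` and `offset2` parameters stand for the constants g1offset and g2offset that appear in the definition of the correlation coefficient. When they are left unset, the profile means are used, which gives the ordinary Pearson correlation coefficient.

```python
import math


def euclidean_distance(g1, g2):
    """Euclidean distance between two profiles over the same m time points."""
    if len(g1) != len(g2):
        raise ValueError("profiles must cover the same time points")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


def correlation(g1, g2, offset1=None, offset2=None):
    """Correlation of two profiles about the given offsets.

    With the default offsets (the profile means) this is the Pearson
    correlation coefficient.
    """
    if len(g1) != len(g2) or not g1:
        raise ValueError("profiles must be non-empty and of equal length")
    o1 = sum(g1) / len(g1) if offset1 is None else offset1
    o2 = sum(g2) / len(g2) if offset2 is None else offset2
    numerator = sum((a - o1) * (b - o2) for a, b in zip(g1, g2))
    denominator = (math.sqrt(sum((a - o1) ** 2 for a in g1))
                   * math.sqrt(sum((b - o2) ** 2 for b in g2)))
    if denominator == 0:
        raise ValueError("a profile equal to its offset everywhere "
                         "has no defined correlation")
    return numerator / denominator
```

A correlation-based distance is insensitive to the overall magnitude of a profile, whereas the Euclidean distance is not, which is one reason the choice between them matters for clustering.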

The correlation coefficient is defined as:

\[
r(g_1, g_2) = \frac{\sum_{j=1}^{m} (g_{1j} - g_{1\mathrm{offset}})(g_{2j} - g_{2\mathrm{offset}})}{\sqrt{\sum_{j=1}^{m} (g_{1j} - g_{1\mathrm{offset}})^2}\,\sqrt{\sum_{j=1}^{m} (g_{2j} - g_{2\mathrm{offset}})^2}}
\]

where gioffset (i = 1, 2) are two constants. The correlation

coefficient r(·,·) has the range of [-1, 1]; specifically, r(g1,

g2) = 1 means that genes g1 and g2 have a co-regulated

response to a biological process in the same direction, and

r(g1, g2) = -1 means that genes g1 and g2 have a co-regulated

response to a biological process in the opposite direction.

When gioffset (i = 1, 2) are set to the means of

expression profiles of genes g1 and g2, respectively, r(g1, g2)

Table 3: c(m) for the cdc15 experiment. The value marked with * is the value at which c(m) reaches a maximum for the first time.

m      4        5        6        7        8        9        10       11       12       13
c(m)   0.6030   0.6940   0.7545   0.7713   0.8847*  0.8562   0.8670   0.8305   0.8187   0.9046

m      14       15       16       17       18       19       20       21       22       23
c(m)   0.9023   0.8659   0.7913   0.8834   0.8835   0.9201   0.8600   0.9063   0.9120   0.9466