Replicated Microarray Data

October 2001
Statistica Sinica 12(1)

Authors:

Uppsala University

cDNA microarrays permit us to study the expression of thousands of genes simultaneously. They are now used in many different contexts to compare mRNA levels between two or more samples of cells. Microarray experiments typically give us expression measurements on a large number of genes, say 10,000-20,000, but with few, if any, replicates for each gene. Traditional methods using means and standard deviations to detect differential expression are not completely satisfactory in this context, and so a different approach seems desirable. In this paper we present an empirical Bayes method for analysing replicated microarray data. Data from all the genes in a replicate set of experiments are combined into estimates of parameters of a prior distribution. These parameter estimates are then combined at the gene level with means and standard deviations to form a statistic B which can be used to decide whether differential expression has occurred. The statistic B avoids the problems of using averages or t-statistics. The method is illustrated using data from an experiment comparing the expression of genes in the livers of SR-BI transgenic mice with that of the corresponding wild-type mice. In addition we present the results of a simulation study estimating the ROC curve of B and three other statistics for determining differential expression: the average and two simple modifications of the usual t-statistic. B was found to be the most powerful of the four, though the margin was not great. The data were simulated to resemble the SR-BI data. Keywords: cDNA microarray, differential expression, empirical Bayes, replication, ROC curve, t-statistic. Department of Mathematics, Uppsala University. Correspondence should be addressed to Ingrid Lönnstedt, telephone/fax +46-18-4712842/4713201, e...

Statistica Sinica 12(2002), 31-46

REPLICATED MICROARRAY DATA

Ingrid Lönnstedt and Terry Speed†

Uppsala University, †University of California, Berkeley and †Walter and Eliza Hall Institute

Abstract: cDNA microarrays permit us to study the expression of thousands of genes

simultaneously. They are now used in many diﬀerent contexts to compare mRNA

levels between two or more samples of cells. Microarray experiments typically give

us expression measurements on a large number of genes, say 10,000-20,000, but

with few, if any, replicates for each gene. Traditional methods using means and

standard deviations to detect diﬀerential expression are not completely satisfactory

in this context, and so a diﬀerent approach seems desirable. In this paper we

present an empirical Bayes method for analysing replicated microarray data. Data

from all the genes in a replicate set of experiments are combined into estimates of

parameters of a prior distribution. These parameter estimates are then combined

at the gene level with means and standard deviations to form a statistic B which

can be used to decide whether diﬀerential expression has occurred. The statistic

B avoids the problems of using averages or t-statistics. The method is illustrated

using data from an experiment comparing the expression of genes in the livers of

SR-BI transgenic mice with that of the corresponding wild-type mice. In addition

we present the results of a simulation study estimating the ROC curve of B and

three other statistics for determining diﬀerential expression: the average and two

simple modiﬁcations of the usual t-statistic. B was found to be the most powerful

of the four, though the margin was not great. The data were simulated to resemble

the SR-BI data.

Key words and phrases: cDNA microarray, diﬀerential expression, empirical Bayes,

replication, ROC curve, t-statistic.

1. Introduction

cDNA microarrays are used to compare gene expression in diﬀerent samples

of cells. They permit us to study the expression levels of thousands of genes

simultaneously. The technique has a wide range of applications including learn-

ing how genes interact, which genes are used in diﬀerent cell types, and which

genes change their expression in cells due to disease or drug stimuli. Microarray

experiments typically result in expression measurements from a large number of

genes (usually 10,000-20,000), but with few if any replicates for each gene (usu-

ally 1-10). Throughout this paper the data from one microarray are always a

comparison of the expression levels in two cell samples. For gene $i$ on array $j$, we use the value $M_{ij}$, where

$$M_{ij} = \log \frac{\text{expression level in sample 1}}{\text{expression level in sample 2}}. \tag{1}$$

The numerator and denominator are often referred to as the red and green intensities, $R_{ij}$ and $G_{ij}$, because of the experimental procedure described below. If there are $N$ genes on each array, and $n$ replicates (arrays), the complete set of data from the experiment consists of $(M_{ij})$, $i = 1, \ldots, N$ and $j = 1, \ldots, n$.

Our task is to determine which genes have diﬀerent expression levels in the two

samples, i.e., which genes have M -values genuinely diﬀerent from zero. This is

often expected to be true for only a small proportion (say 1%) of the genes. The

variation of the rest of the genes would then be due to chance.

Natural statistics which might be used to select the differentially expressed genes are the mean and standardized mean expression levels, $(M_{i\cdot})$ and $(t_i = M_{i\cdot}/SE_i)$, where $SE_i = s_i/\sqrt{n}$ is the standard error of $M_{i\cdot}$, $s_i$ being the standard deviation of the values $M_{ij}$, $j = 1, \ldots, n$.
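As a minimal sketch of the two traditional statistics just defined, the Python below computes the average, the standard error and the t-value for a single gene. The function name `mean_and_t` and the example values are illustrative assumptions, not part of the paper's software.

```python
import math
import statistics


def mean_and_t(m_values):
    """Return (average, standard error, t) for one gene's replicate M-values."""
    n = len(m_values)
    avg = statistics.mean(m_values)
    s = statistics.stdev(m_values)        # sample standard deviation (divisor n - 1)
    se = s / math.sqrt(n)                 # standard error of the average
    t = avg / se if se > 0 else float("nan")
    return avg, se, t


# Hypothetical replicate values, chosen only to illustrate the definitions.
stable_gene = [0.10, 0.11, 0.09, 0.10]    # small average, tiny spread
outlier_gene = [2.0, 0.1, -0.1, 0.0]      # average driven by one replicate

for name, values in [("stable", stable_gene), ("outlier", outlier_gene)]:
    avg, se, t = mean_and_t(values)
    print(name, round(avg, 3), round(se, 4), round(t, 2))
```

In this example the first gene has a small but very stable average and therefore a large t, while the second has a larger average that is driven by a single extreme replicate.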

There are problems with both of these traditional statistics. For example,

a large mean might be driven by an outlier, something which occurs quite fre-

quently in this context. A large t might arise because of a small denominator

SE, even though the mean itself is small. Because of the large number of genes

on each array, there will usually be genes with very small standard errors, and

some of these will have small means as well. A simple solution to this problem

is to discount genes with a small absolute mean whose standard errors are in

the bottom 1%, say. A more sophisticated statistic for use in this context has

recently been presented (Tusher, Tibshirani and Chu (2001)), slightly tuning the

t-statistic by adding a suitable constant to each SE. Here we introduce another

alternative based on the empirical Bayes approach. Data from all the genes in a

replicate set of experiments are combined into estimates of parameters of a prior

distribution. These parameter estimates are then combined at the gene level

with means and standard deviations to form a statistic B which is a Bayes log

posterior odds. B can then be used to determine if diﬀerential expression has

occurred. It avoids the problems of the average M and the t-statistic just men-

tioned. We also carry out a simulation comparison of the four diﬀerent methods,

relating the power to detect diﬀerentially expressed genes to the false positive

rate in an ROC (receiver operating characteristic) curve.

The paper is organized as follows. Sections 1.1-1.2 describe the scientiﬁc

background and procedure of the microarray experiments. In Section 2 the four

diﬀerent statistics are described, and they are illustrated in Section 4 using data

introduced in Section 3. The comparison is based on simulated datasets presented

in Section 5 together with the analysis. Our ﬁndings are summarized in Section

6. The derivation of our statistic B is presented in the Appendix.

1.1. Background on cDNA microarrays

A cell contains a complete set of its host’s genetic code (genes), stored in

a DNA molecule. Depending on the function of the cell, it uses diﬀerent genes,

so that brain cells and liver cells express diﬀerent genes. When a cell wants to

use a gene, the code of that gene is copied into messenger RNA (mRNA) in a

procedure called transcription. The mRNA then gets translated into a protein,

which is the functional product of most genes. Transcription occurs all the time

and for all the genes currently used by the cell, at diﬀerent levels. Microarray

experiments measure the concentration of mRNA ﬂoating around in a set of cells,

and a high concentration of mRNA for a given gene reﬂects a high expression

level of that gene. In practice, the levels of expression of a gene across two (or

more) cell samples are always compared, not measured in absolute terms on a

cDNA microarray.

1.2. Construction of a microarray

A string of mRNA is a single-stranded copy of the DNA sequence for a gene.

It is possible to construct a complementary or cDNA copy of an mRNA molecule

and further experimentation with the mRNA occurs via the cDNA copy.

A microarray is a glass slide on which thousands of spots of cDNA repre-

senting diﬀerent genes are printed using a robotic arrayer. The arrayer has a

grid of pins, or print tips, which can pick up sets of samples from cDNA clones

and print them onto the slide. Each spot will contain thousands of copies of the

cDNA fragment from one gene. Normally all the spots hold cDNA from diﬀer-

ent genes, but sometimes replicate spots are printed for each gene on the same

microarray.

The two cell samples to be compared for gene expression are often cells sub-

ject to some treatment and normal (non-treated) cells, tumour cells and normal

tissue, or just two diﬀerent kinds of tissue. For each of the two samples, the

mRNA is isolated and each sequence is labeled with a ﬂuorescent dye at the time

of conversion to cDNA. Usually, the treatment (or tumour) cDNA is labeled red

and the normal green.

By adding equal amounts of the two labeled cDNA samples to the microarray,

the sample cDNA will hybridize to the cDNA spots on the glass slide, i.e., it will

pair with its complementary fragments of cDNA on the slide. If a gene has a

higher level of expression in the treatment sample than in the normal, there will

be more red than green dye on its spot. The intensities of red and green dyes on

each spot are detected using a laser scanner. The red and green intensities of the

scanned image are the measurements which are the starting point of a statistical

analysis. These are combined into M-values as described by (1).

2. Statistics to be Compared

The data from one microarray experiment consist of $(M_{ij})$, $i = 1, \ldots, N$, $j = 1, \ldots, n$. We will assume that these are already normalized according to Yang, Dudoit, Luu and Speed (2001), that is, a smooth intensity-dependent adjustment is made to the log ratio value to remove red-green color bias. For each gene $g$, we ask whether its observed vector of M-values, $M_g$, provides evidence to suggest that the true M-value of that gene is different from zero.

We compare four diﬀerent methods of addressing this question. Each method

assigns a score to each gene, and the putative diﬀerentially expressed genes are

selected using a cutoﬀ value for that score. For none of the methods is it obvious

how to choose the cutoff in such a way as to control the type I error, or how

to assign a p-value to a score. To do these things we would need to know the

null distribution of our scores, and that is not so straightforward. Much of the

time this is not a serious limitation. Often the genes that would be signiﬁcant

are just a few, very extreme ones, many of which are already well known to the

investigators. Mostly what is wanted are a few more genes that are probably

diﬀerentially expressed, and there is generally a willingness to permit several of

these to be false positives in order to avoid missing too many of the true positives.

A better method is the one with the lower type I and type II errors over a range

of cutoﬀs.

The first statistic we consider is the average $(M_{g\cdot})$ for each gene $g$. In practice we are interested in comparing the absolute value of the average with (positive)

cutoﬀs. The average statistic will sometimes be referred to as M where it is

obvious that we mean the statistic rather than the separate M-values for each

gene and slide. M does not depend on the standard error of the genes, and hence

treats a highly variable gene in the same way as a stable one.

By dividing each average by its corresponding standard error, giving $(t_g = M_{g\cdot}/SE_g)$, variation across replicates can be accounted for in the usual way,

though frequently the number of degrees of freedom is small. Because of the

large number of genes included in microarray experiments, there will always be

some genes with a very small sum of squares across replicates, so that their

(absolute) t-values will be large whether or not their averages are large. Some of

these will turn out to be false positives for the t-statistic. To reduce this problem

in the comparison to come, we have chosen to ignore those genes whose standard

errors fall in the bottom 1% if their absolute average value is smaller than 0.01.

Tusher et al. (2001) have proposed a refinement of t which avoids the difficulty just mentioned. By adding a constant term to the denominator of the standardized average, the denominator is prevented from getting too small. The factor, $a_0$, is suggested by Efron, Tibshirani, Goss and Chu (2000) to be equal to the 90th percentile of the standard errors of all the genes. This is used throughout the current paper, although a different approach is suggested in Tusher, Tibshirani and Chu (2001). Hence we study the absolute value of $S_g = M_{g\cdot}/(s_g + a_0)$, where $s_g$ is the standard deviation for gene sample $(M_{gj})$, $j = 1, \ldots, n$.

Our empirical Bayes log posterior odds statistic is called B and is similar to S above, although the argument justifying it is more complex. We regard the $(M_{ij})$ as random variables from a normal distribution with mean $(\mu_i)$ and variance $(\sigma_i^2)$, something which is found empirically to be roughly the case. More fully, we suppose that, independently for all $i$ and $j$,

$$M_{ij} \mid \mu_i, \sigma_i^2 \sim N(\mu_i, \sigma_i^2). \tag{2}$$

Most genes have $\mu_i = 0$, but a small proportion $p$ of genes have some $\mu_i \neq 0$, indicated by $I_i = 1$ as opposed to $I_i = 0$. The parameters $(\mu_i, \sigma_i^2)$ are treated as i.i.d. realizations of random parameters with some prior distributions. We calculate the log posterior odds for gene $g$ to be differentially expressed,

$$B_g = \log \frac{\Pr(I_g = 1 \mid (M_{ij}))}{\Pr(I_g = 0 \mid (M_{ij}))}.$$

By assuming conjugate prior distributions for the variances of the genes, as well as for their means where not zero, an explicit formula for B is found to be

$$B_g = \log \left[ \frac{p}{1-p} \, \frac{1}{\sqrt{1+nc}} \left( \frac{a + s_g^2 + M_{g\cdot}^2}{a + s_g^2 + \dfrac{M_{g\cdot}^2}{1+nc}} \right)^{\nu + n/2} \right]. \tag{3}$$

Here $a$ and $\nu$ are hyperparameters in the inverse gamma prior for the variances, and $c$ is a hyperparameter in the normal prior of the nonzero means. For details see the Appendix (where we also include an application of B to replicated spots within as well as between microarray slides). In particular, $s_g^2$ is in this case the sum of squares over $n$, rather than over $n - 1$ as in t and S.

The only gene-specific part of B is the last ratio, which is always $\geq 1$ since $1/(1 + nc) < 1$. We deduce that increasing relative gene expression (and hence increasing $M_{g\cdot}^2$) increases $B_g$, and more so if the variance is small. If $M_{g\cdot}^2$ is small too, $a$ ensures that the ratio cannot be expanded by a very small variance. B will be illustrated further below.
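A minimal sketch of formula (3) for one gene is given below. It assumes natural logarithms, takes the hyperparameters as given rather than estimating them, and uses as defaults the values reported for the SR-BI data in Section 5.1; the function name `b_statistic` is hypothetical and this is not the code of the paper's software.

```python
import math


def b_statistic(m_values, p=0.01, a=0.040, nu=2.8, c=1.2):
    """Log posterior odds B for one gene, following formula (3)."""
    n = len(m_values)
    mean = sum(m_values) / n
    # Sum of squares divided by n (not n - 1), as noted in the text.
    s2 = sum((m - mean) ** 2 for m in m_values) / n
    # The only gene-specific part: a ratio that is always >= 1.
    ratio = (a + s2 + mean ** 2) / (a + s2 + mean ** 2 / (1 + n * c))
    return (math.log(p / (1 - p))
            - 0.5 * math.log(1 + n * c)
            + (nu + n / 2) * math.log(ratio))


# Hypothetical genes: same spread, different averages.
print(b_statistic([0.05, -0.05, 0.02, -0.02]))   # average near zero
print(b_statistic([1.05, 0.95, 1.02, 0.98]))     # average near one
```

When the average is exactly zero the ratio equals 1, so B reduces to the two gene-independent terms; this gives the floor of the parabola-like B versus average plot.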

3. Data

For the illustrations and simulations in this paper we restrict ourselves to one

dataset, the experiment comparing gene expression in liver cells from scavenger

receptor transgenic (SR-BI) mice to those of the corresponding wild-type control

mice, see Callow, Dudoit, Gong, Speed and Rubin (2000). Eight SR-BI mice

are all compared to a reference sample of pooled cDNA from the control mice,

resulting in 8 microarray data sets. After image processing and normalization

as described in Yang et al. (2001) and Buckley (2000), our M-values consisted

of the 8 sets of log ratios as in (1). There were 6,356 genes on each slide. The

resulting data are displayed in an average M vs average A plot in Figure 1a, where for each gene and array the A-value is the log of the geometric mean of the two intensity channels ($A = 0.5(\log R + \log G)$). The M vs A plot is often used for scientific reasons when displaying and analysing microarray data. Figure 1b displays the average M-values vs the log sample variances, to be compared to simulated data further below.
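For concreteness, the M and A values of a single spot can be computed from its two intensities as sketched below. Base-2 logarithms are an assumption here (a common convention for microarray log ratios; the base is not stated in this passage), and the function name `m_and_a` is hypothetical.

```python
import math


def m_and_a(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    Assumes base-2 logarithms and strictly positive intensities.
    """
    m = math.log2(red / green)                        # log ratio, as in (1)
    a = 0.5 * (math.log2(red) + math.log2(green))     # log of the geometric mean
    return m, a


# Hypothetical intensities: red four times as bright as green.
print(m_and_a(8.0, 2.0))
```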

[Figure 1: two panels, "Average M vs average A" and "Average M vs variance"; the second panel plots $M_\cdot$ against log Var(M).]

Figure 1. The first plot is the $M_\cdot$ vs $A_\cdot$ plot of the SR-BI data, showing the average M-value vs the average mean log dye intensity for each gene. The second plot displays $M_\cdot$ vs the log variance.

4. Illustration of the Statistics

Below we illustrate the function of B and how it diﬀers from M and t.

For illustrations of S we refer to Tusher, Tibshirani and Chu (2001) and Efron,

Tibshirani, Goss and Chu (2000).

The shape of a plot of $(B_g)$ vs $(M_{g\cdot})$ is usually parabolic, as we see for SR-BI in Figure 2. Most genes have an average value around zero, and these have

the minimum values of B. The larger the average expression level of a gene,

the larger is the chance of a high B. The actual B-level of genes with high

averages depends on the variance, so that the outer ends of the parabola will be

rather fuzzy. Two genes with exactly the same average might be far apart in B.

Ideally we would like to be able to say that genes with $B > 0$ have a significantly changed expression, but since we cannot be sure about $p$ a priori, or estimate it, this cutoff cannot be relied upon. The threshold has to be judged on a case by case basis. Figure 2 shows that the rationale for our method seems valid. We now take a closer look at the details and assess the number of false negatives and false positives in $M_{g\cdot}$ and $t_g = M_{g\cdot}/SE_g$.

[Figure 2: plot "B vs average M", with $M_\cdot$ on the horizontal axis.]

Figure 2. B vs $M_\cdot$ plot for SR-BI. B is our proposed statistic, the log posterior odds for each gene to be differentially expressed. It depends on the average as well as on the standard error for each gene.

The statistics M, t and B are illustrated for the SR-BI data in Figure 3,

where we have labelled sets of genes as in Table 1.

Genes are selected as extreme for M if $|M_{g\cdot}| > 0.5$, for t if $|t_g| > 4.5$ and for B if $B_g > -0.5$. These cutoffs are chosen so that there will be several genes in each set (except for S6) in order to illustrate the ideas. If our purpose is to identify differentially expressed genes with reasonable certainty, the cutoffs would probably need to be a bit larger.

Table 1. Sets of genes. A 1 indicates that the genes in the set are extreme for that statistic.

Set   M   t   B
S1    0   0   1
S2    0   1   0
S3    0   1   1
S4    1   0   0
S5    1   0   1
S6    1   1   0
S7    1   1   1
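The labelling in Table 1 amounts to reading the three indicators (M, t, B) as a binary number. A small sketch, with the cutoffs quoted in the text as default arguments and a hypothetical function name `gene_set`:

```python
def gene_set(mean, t, b, m_cut=0.5, t_cut=4.5, b_cut=-0.5):
    """Return the Table 1 set label ("S1".."S7") for one gene, or None.

    None means the gene is not extreme for any of the three statistics.
    """
    code = (4 * int(abs(mean) > m_cut)     # extreme for M
            + 2 * int(abs(t) > t_cut)      # extreme for t
            + int(b > b_cut))              # extreme for B
    return f"S{code}" if code else None


# Hypothetical genes, one per line: (average M, t, B).
for gene in [(0.1, 5.0, 0.0), (0.8, 1.0, -2.0), (1.0, 6.0, 1.0)]:
    print(gene, gene_set(*gene))
```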

[Figure 3: three panels, "Average M vs variance" ($M_\cdot$ against log Var(M)), "t vs average A" (t, the standardized $M_\cdot$, against $A_\cdot$) and "B vs average M" (B against $M_\cdot$).]

Figure 3. SR-BI for different statistics. The three plots are of average M vs the log variance, standardized average t vs the mean intensity, and the log posterior odds B vs the average expression level. The different sets of genes display the dependencies between the three statistics as well as the variance. 1 means extreme in B ($B > -0.5$) only; 2 means extreme in t ($|t| > 4.5$) only; 3 means extreme in B and t only; 4 means extreme in M only ($|M_\cdot| > 0.5$); 5 means extreme in M and B only; 6 means extreme in M and t only; 7 means extreme in all three statistics M, t and B.

Compare the locations of the diﬀerent sets of genes in the plots in Figure

3. The plots display the average M vs the variance (in fact the logarithm of the

sum of squares), t vs A (the average intensity), and B vs the average M again.

S7 is the set of genes which are high for all the three statistics, and this shows

clearly in the graphs. The set S3 is also clearly high for B due to the small

variances (large t’s), although the means are only moderately enhanced. These

genes would not have been selected by M, being false negatives there, but are

readily detected with B. S5 are genes that are extreme in M and B, but not in

t. This makes sense when we note that their t-values are actually rather large,

probably just below our cutoﬀ. Also, these are genes just above the borderline in

B, but there are not that many of them. Similarly, S1 are just above the cutoff

for B but neither for M nor for t. Both S1 and S5 are what we call false negatives

in t: they would not have been detected with t, but they are assumed to have

a true large (but not extreme) variance and a large (but not too large) average.

There are not always any genes in these sets, but there sometimes are, and they

should be selected. S2 are false positives for t: they have a small average but are

driven by tiny variances. It is reassuring to see that these are not extreme in B.

Finally, S4 is a large set of genes with a high average but with standard errors

that are too large for the result to be trusted. They are false positives for M, and

are properly downweighted in B. Note that the absence of genes in S6 conﬁrms

that genes high in both M and t are also high in B. It follows that by using B

we have avoided the main problems with false positives and false negatives in M

and t.

5. Methods

One hundred diﬀerent datasets were simulated and analysed for diﬀerentially

expressed genes using each of the four diﬀerent methods presented in Section 2.

The aim is to compare the type I and type II errors of the methods without

having to involve p-values, for the reasons mentioned above.

5.1. Simulation of datasets

The SR-BI data were used as a model for the simulated datasets, so that they all contained $n = 8$ replicates for each of $N = 6{,}356$ genes. The genes were modeled as independent, see the Discussion. Given the parameters, the replicates for each gene $i$ were produced as independent observations from a distribution $N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, N$, as in (2). For the variances, an inverse gamma prior distribution was used, common to all genes. For most genes, the expectation $\mu_i$ was fixed at zero, whereas for a proportion $p$ of the genes, a normal prior distribution was used instead, to produce truly differentially expressed genes. See (6) (Appendix) for details on the prior distributions. The hyperparameters were estimated from the SR-BI dataset according to the procedure in A.2 (Appendix), giving $\nu = 2.8$, $a = 0.040$ and $c = 1.2$. The proportion $p$ of differentially expressed genes was fixed at $p = 0.01$ throughout the simulations, as well as in the following analysis. An example of a simulated dataset is shown in Figure 4, where the genes with a true non-zero expectation are highlighted.
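The simulation scheme can be sketched as follows, using the priors written out in (6) of the Appendix and the hyperparameter values quoted above. This is an illustration under stated assumptions, not the authors' code: the function name `simulate_dataset` is hypothetical, each gene is marked as differentially expressed by an independent draw with probability p (the text only says that a proportion p of the genes is affected), and Python's standard `random` module stands in for whatever generator was actually used.

```python
import math
import random


def simulate_dataset(N=6356, n=8, p=0.01, nu=2.8, a=0.040, c=1.2, seed=0):
    """Simulate one dataset of N genes with n replicates each.

    Returns (data, is_de): data[i] is the list of n M-values for gene i and
    is_de[i] is True when gene i was given a non-zero expectation.
    """
    rng = random.Random(seed)
    data, is_de = [], []
    for _ in range(N):
        tau = rng.gammavariate(nu, 1.0)          # tau_i ~ Gamma(nu, 1)
        sigma2 = n * a / (2.0 * tau)             # from tau_i = n a / (2 sigma_i^2)
        de = rng.random() < p                    # assumption: independent indicator
        # Non-zero means come from N(0, c n a / (2 tau_i)) = N(0, c sigma_i^2).
        mu = rng.gauss(0.0, math.sqrt(c * sigma2)) if de else 0.0
        data.append([rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n)])
        is_de.append(de)
    return data, is_de


data, is_de = simulate_dataset(seed=1)
print(len(data), sum(is_de))    # number of genes, number truly differentially expressed
```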

[Figure 4: plot "Simulated average M vs variance", $M_\cdot$ against log Var(M), with the truly influenced genes highlighted.]

Figure 4. This is the $M_\cdot$ vs log variance plot corresponding to Figure 1b for one of the 100 simulated datasets. Highlighted genes are truly differentially expressed, i.e., they were simulated from a normal distribution with a non-zero mean.

5.2. Analysis

For each gene in each of the 100 simulated datasets, the four statistics M, t, S

and B (Section 2) were calculated. For B the estimation method in the Appendix

was used separately for each of the datasets, ignoring the known parameters used

in the simulations. For a range of cutoﬀ values for each statistic, the numbers

of false positive and false negative genes could be deduced for each of the 100

datasets. The observed numbers of false positive and false negative genes were

then averaged over the datasets for each cutoff value of each statistic. If $c = (c_1, \ldots, c_K)$ denotes the vector of cutoffs for the statistic M, the results for M are summarized by the vectors $r^+ = (r^+_1, \ldots, r^+_K)$ and $r^- = (r^-_1, \ldots, r^-_K)$ containing the average numbers of false positive and negative genes respectively. Similarly for t, S and B.
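The counting and averaging described above can be sketched as follows. The names `error_counts` and `average_counts` are hypothetical; scores are assumed to be oriented so that larger means more extreme (for M, t and S this means passing absolute values), and a gene is selected when its score exceeds the cutoff.

```python
def error_counts(scores, truth, cutoffs):
    """Return a list of (false positives, false negatives), one pair per cutoff.

    scores: one score per gene; truth: True for truly differentially
    expressed genes; a gene is selected when score > cutoff.
    """
    counts = []
    for cut in cutoffs:
        fp = sum(1 for s, t in zip(scores, truth) if s > cut and not t)
        fn = sum(1 for s, t in zip(scores, truth) if s <= cut and t)
        counts.append((fp, fn))
    return counts


def average_counts(per_dataset):
    """Average the (fp, fn) pairs over datasets, cutoff by cutoff."""
    k = len(per_dataset)
    n_cut = len(per_dataset[0])
    return [(sum(d[i][0] for d in per_dataset) / k,
             sum(d[i][1] for d in per_dataset) / k)
            for i in range(n_cut)]
```

Plotting the averaged false negatives against the averaged false positives, one curve per statistic, gives the kind of comparison shown in Figure 5.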

5.3. Results

The cutoﬀ vectors were chosen to give reasonable ranges in the results. For

each statistic, the number of false positives ranges from 0 to approximately 200

(out of the 6,356 genes) and the number of false negatives ranges from 30 to 60.

In the model used to simulate the datasets, the expectations of the activated

genes are normally distributed around zero. This implies that many of these

genes will have a negligible average, as do the false genes. We will not be able

to detect them using any method, unless we allow the number of false positive

genes to be unrealistically large. Thus the range of the numbers of false negatives

starts around 30 rather than 0.

The average numbers of false positive and false negative genes for all four

statistics are plotted against one another in Figure 5. This Receiver Operating

Characteristic (ROC) curve allows us to compare the type I and type II errors

for the statistics without involving p-values. The better of two methods is the

one with lower scores on both axes.

[Figure 5: two panels, "Compare errors" and "Close up", with axes "# false positives" and "# false negatives".]

Figure 5. Comparison of the four different statistics M, t, S and B for the 100 simulated datasets. For a certain cutoff value, each method defines the numbers of false negative and false positive genes in each of the simulated datasets. The lines reflect the averages of these numbers over a range of cutoffs.

It turns out that B is the best of the four statistics in the sense that it

minimizes the two error rates in our analysis. However, it is not very diﬀerent

from S. If we allow the number of false positives to exceed 200, the statistic S is

preferred to B, but following such a large number of genes up is hardly interesting

in practice. On the other hand, M and t are clearly found to have a larger error

rate than B and S.

6. Discussion and Conclusions

The problem of identifying diﬀerentially expressed genes using data from

replicated microarray experiments has been addressed. In contrast to the ap-

proaches of Roberts et al. [10] and Ideker et al. [7], both of whom have explicit

error-models for their data, we impose a lighter modelling structure on our ob-

servations. Our method uses information from all the genes in a series of experi-

ments to estimate a posterior odds score for each gene, indicating whether a gene

has changed its expression level. The resulting statistic B deals properly with

diﬃculties met by well-known statistics, such as the average or the standardized

average t. It is easy to use and computationally cheap.

We now need to make some remarks about our modelling assumptions and

philosophy. Although the discussion has been framed in either Bayesian or clas-

sical testing terms, we present our analysis primarily as a way of ranking genes.

The normality and independence assumptions we make are not meant to be taken

literally, rather, they are to be regarded as devices leading to a formula which

we believe improves upon the average and the t statistic in this context. While

the normality assumption for log ratios is probably a good ﬁrst approximation,

it would not be wise to base formal inference on it. We do not see our dis-

cussion as doing so because we make no probabilistic claims for our procedure.

Assuming independence of genes is clearly unrealistic. In reality, the unknown

dependence structure will include near complete dependence between essentially

duplicate genes, through varying degrees of dependence among genes which are

biologically related, to total independence. None of this implies that a statistic

such as B derived using independence assumptions is without value. We hope

that the way in which B is seen to overcome real problems with the average and

t supports our claims.

One drawback in using B is that we need a value for the prior proportion

of diﬀerentially expressed genes (see the Appendix). Although the rankings of

genes by their B values change only marginally for genes with large B when the

parameter estimates are altered, the actual scale changes. Thus we cannot rely

on any standard cutoﬀ value, such as B = 0, for the selection of diﬀerentially

expressed genes. However, the same diﬃculty arises when we use more traditional

statistics such as average M or t.

We have found B is an improvement over the average and t methods. This

conclusion is based on data simulated from the same model that we used to derive

our statistic B. Since this model describes the data produced in our microarray

experiments reasonably well, giving only a small proportion p of genes with non-

zero expectation, we are not too worried that it works in favour of B. However,

there are two points that deserve comment. First, since the simulation model

does not explicitly produce outliers in the data, the simulated data will contain

fewer such genes than real data. It follows that the gap between the average

M and the other methods can be expected to be even greater in practice than

what we see in Figure 5. Second, it is possible that the fact that $p$ is the same in

the simulations and the analysis might improve B slightly relative to the other

statistics. To investigate this we carried out the analysis with $p = 0.005$ as well

as $p = 0.05$, although the data sets were simulated with $p = 0.01$. The range of cutoffs we had to choose for B in these new cases differed from those used initially, but in the plot corresponding to Figure 5, the three lines (analysed using $p = 0.005$, $p = 0.01$ and $p = 0.05$ respectively) were hardly separable by eye. Each of them was the best when compared to the other three statistics. Note that this

observation also suggests that the statistic B is not very dependent on the choice

of the parameter p, as long as we can choose the cutoﬀ to suit the p-speciﬁc

results. In a way, we can regard B as providing a ranking of the genes with

respect to the posterior probability of each gene being diﬀerentially expressed. A

suitable cutoﬀ for the detection of these genes can then be chosen by the ranking

in combination with experimental preferences, e.g., so that it selects the top 50,

100 or 150 genes. The number of genes selected can depend on the size, aim,

background and follow-up plans of the experiment.

The log posterior odds method has a large potential for extensions beyond

the application to spots within and between microarray slides (see (8) in the

Appendix). It can be modiﬁed to apply to linear models across experiments,

including several cell samples. ANOVA could also be considered, for example, in

comparing the eﬀects of diﬀerent treatments.

The software producing these analyses is available at http://cran.r-project.org/src/contrib/PACKAGES.html#sma.

Acknowledgements

Our thanks go to the Swedish Foundation for International Cooperation in

Research and Higher Education (STINT), to David Freedman for many valuable

discussions, and to Tom Britton.

A. The Statistic B

A.1. Posterior odds B of diﬀerential expression

Let $N$ be the number of genes in a microarray experiment, $n$ the number of replicates for each gene (the number of microarray slides), and $M_{ij} = \log R_{ij} - \log G_{ij}$, $i = 1, \ldots, N$ and $j = 1, \ldots, n$, the log ratios of our red and green intensities for each gene. We regard the $M_{ij}$ as random variables from a normal distribution with mean $\mu_i$ and variance $\sigma_i^2$, so that, independently and identically,

$$M_{ij} \mid \mu_i, \sigma_i^2 \sim N(\mu_i, \sigma_i^2) \quad \text{for all } i. \tag{4}$$

Let I indicate whether the gene g is diﬀerentially expressed (µ

=0). For

each gene g we are interested in Pr(I

=1|(M

)), equivalently, in the log odds

B for this, B

=log

Pr(I

=1|(M

))

Pr(I

=0|(M

))

.ThusP (I

=1|(M

)) >P(I

=0|(M

)) if

44 INGRID L

ONNSTEDT AND TERRY SPEED

and only if B

> 0. (The parameters (µ

,σ

) are treated as i.i.d. realizations of

random parameters with some prior distributions.)
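For orientation, a log odds value can be mapped back to a posterior probability with the logistic function. The sketch below assumes that natural logarithms are used in the definition of the log odds; the function name `posterior_probability` is illustrative only.

```python
import math

def posterior_probability(B):
    """Convert log posterior odds B into Pr(I = 1 | data)."""
    # Numerically stable logistic function: avoids overflow for large |B|.
    if B >= 0:
        return 1.0 / (1.0 + math.exp(-B))
    e = math.exp(B)
    return e / (1.0 + e)

# B = 0 corresponds to even odds; positive B favours differential expression.
print(posterior_probability(0.0), posterior_probability(2.0))
```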

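Now, by Bayes' Theorem and independence across genes,

$$
\begin{aligned}
B_g &= \log \left[ \frac{p}{1-p} \, \frac{\Pr((M_{ij}) \mid I_g = 1)}{\Pr((M_{ij}) \mid I_g = 0)} \right] \\
&= \log \left[ \frac{p}{1-p} \, \frac{\Pr(\mathbf{M}_g \mid I_g = 1) \prod_{i \neq g} \Pr(\mathbf{M}_i \mid I_g = 1)}{\Pr(\mathbf{M}_g \mid I_g = 0) \prod_{i \neq g} \Pr(\mathbf{M}_i \mid I_g = 0)} \right] \\
&= \log \left[ \frac{p}{1-p} \, \frac{\Pr(\mathbf{M}_g \mid I_g = 1)}{\Pr(\mathbf{M}_g \mid I_g = 0)} \right],
\end{aligned}
\tag{5}
$$

where $\mathbf{M}_g$ is the vector of the $n$ measurements for gene $g$ and $p$ is the proportion of differentially expressed genes in the experiment, $p = \Pr(I_i = 1)$ for any $i$ in $1, \ldots, N$. We need to calculate $f_{I_g=1}(\mathbf{M}_g)$ and $f_{I_g=0}(\mathbf{M}_g)$.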

We usually have very few replicates for each gene (sometimes $n = 2$), but we are investigating many genes simultaneously (e.g., $N = 10{,}000$). To use all our knowledge about the means and variances, we collect the information gained from the complete set of genes in estimated joint prior distributions for them. We let the prior distribution of $1/\sigma_i^2$ be gamma, and that of $\mu_i$ given $\sigma_i^2$ be normal. This is a conjugate prior distribution, allowing us to calculate $B_g$ explicitly; see, e.g., Hartigan (1983). For integer degrees of freedom $\nu$ and scale parameters $a > 0$, $c > 0$, we set $\tau_i = na/(2\sigma_i^2)$ and suppose that

$$\tau_i \sim \Gamma(\nu, 1), \tag{6a}$$

$$\mu_i \mid \tau_i \; \begin{cases} = 0 & \text{if } I_i = 0, \\ \sim N\left(0, \dfrac{cna}{2\tau_i}\right) & \text{if } I_i = 1, \end{cases} \tag{6b}$$

for all $i = 1, \ldots, N$. The parameter $c$ expresses dependence between the priors for $\mu_i$ and $\tau_i$ and is necessary for the calculations. Our densities are then

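$$
\begin{aligned}
f(\tau_i) &= \frac{1}{\Gamma(\nu)} \tau_i^{\nu-1} e^{-\tau_i}, \\
f_{I_i=1}(\mu_i \mid \tau_i) &= (2\pi)^{-1/2} \left( \frac{cna}{2\tau_i} \right)^{-1/2} \exp\left\{ -\frac{1}{2} \frac{2\tau_i}{cna} \mu_i^2 \right\}, \\
f_{I_i=0}(\mu_i) &= \delta(0), \\
f(\mathbf{M}_i \mid \mu_i, \tau_i) &= (2\pi)^{-n/2} \left( \frac{na}{2\tau_i} \right)^{-n/2} \exp\left\{ -\frac{1}{2} \frac{2\tau_i}{na} \sum_{j=1}^{n} (M_{ij} - \mu_i)^2 \right\} \\
&= (2\pi)^{-n/2} \left( \frac{na}{2\tau_i} \right)^{-n/2} \exp\left\{ -\frac{1}{2} \frac{2\tau_i}{na} \left( \sum_{j=1}^{n} (M_{ij} - M_{i\cdot})^2 + n (M_{i\cdot} - \mu_i)^2 \right) \right\},
\end{aligned}
$$

and

$$
\begin{aligned}
f_{I_i=1}(\mathbf{M}_i) &= \iint f(\mathbf{M}_i, \mu_i, \tau_i) \, d\mu_i \, d\tau_i = \iint f(\mathbf{M}_i \mid \mu_i, \tau_i) \, f_{I_i=1}(\mu_i \mid \tau_i) \, f(\tau_i) \, d\mu_i \, d\tau_i, \\
f_{I_i=0}(\mathbf{M}_i) &= \iint f(\mathbf{M}_i \mid \mu_i, \tau_i) \, f_{I_i=0}(\mu_i \mid \tau_i) \, f(\tau_i) \, d\mu_i \, d\tau_i = \int f(\mathbf{M}_i \mid \mu_i = 0, \tau_i) \, f(\tau_i) \, d\tau_i.
\end{aligned}
$$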
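To make the hierarchy concrete, the following sketch draws data from the model defined by (4), (6a) and (6b). It assumes NumPy, treats the indicators as independent Bernoulli(p) draws that are independent of the precisions, and uses the fact that cna/(2τ) equals c times the gene variance. The function name and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

def simulate_log_ratios(N, n, p, nu, a, c, rng):
    """Draw an N x n matrix of log ratios from the hierarchical model."""
    tau = rng.gamma(shape=nu, scale=1.0, size=N)   # (6a): tau_i ~ Gamma(nu, 1)
    sigma2 = n * a / (2.0 * tau)                   # from tau_i = n a / (2 sigma_i^2)
    I = rng.random(N) < p                          # I_i = 1 with probability p (assumed independent)
    # (6b): mu_i = 0 if I_i = 0, else N(0, c n a / (2 tau_i)) = N(0, c sigma_i^2)
    mu = np.where(I, rng.normal(0.0, np.sqrt(c * sigma2)), 0.0)
    # (4): n replicate log ratios per gene around mu_i with variance sigma_i^2
    M = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(N, n))
    return M, I, mu, sigma2

rng = np.random.default_rng(0)
M, I, mu, sigma2 = simulate_log_ratios(N=20000, n=4, p=0.05, nu=3, a=0.04, c=2.0, rng=rng)
print(M.shape, I.mean())
```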

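The integrations of the joint densities are performed by identifying the posterior normal distribution of $\mu_i \mid \tau_i$,

$$N\left( \frac{nc \, M_{i\cdot}}{1+nc}, \; \frac{nac}{2\tau_i (1+nc)} \right)$$

for the case $I_i = 1$, and the posterior gamma distribution of $\tau_i$,

$$\Gamma\left( \nu + \frac{n}{2}, \; 1 + \frac{1}{a} \left( s_i^2 + \frac{M_{i\cdot}^2}{1+nc} \right) \right), \qquad s_i^2 = \frac{1}{n} \sum_{j=1}^{n} (M_{ij} - M_{i\cdot})^2,$$

so that both of these are integrated out. The results are scaled t-statistics,

$$
\begin{aligned}
f_{I_i=1}(\mathbf{M}_i) &= \frac{\Gamma(\nu + \frac{n}{2})}{\Gamma(\nu)} (2\pi)^{-n/2} \left( \frac{na}{2} \right)^{-n/2} (1+nc)^{-1/2} \left[ 1 + \frac{1}{a} \left( s_i^2 + \frac{M_{i\cdot}^2}{1+nc} \right) \right]^{-(\nu + \frac{n}{2})}, \\
f_{I_i=0}(\mathbf{M}_i) &= \frac{\Gamma(\nu + \frac{n}{2})}{\Gamma(\nu)} (2\pi)^{-n/2} \left( \frac{na}{2} \right)^{-n/2} \left[ 1 + \frac{1}{a} \left( s_i^2 + M_{i\cdot}^2 \right) \right]^{-(\nu + \frac{n}{2})}.
\end{aligned}
$$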
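The identification of the posterior normal distribution rests on completing the square in the mean. As a sketch, write $M_{i\cdot}$ for the gene average and $s_i^2$ for the sum of squared deviations from it divided by $n$. For the case $I_i = 1$, the likelihood and the normal prior together contribute the exponent $-(\tau_i/a)\,[\,s_i^2 + (M_{i\cdot} - \mu_i)^2 + \mu_i^2/(nc)\,]$, and the part depending on $\mu_i$ can be rearranged as follows:

```latex
\begin{aligned}
(M_{i\cdot} - \mu_i)^2 + \frac{\mu_i^2}{nc}
  &= \frac{1+nc}{nc}\,\mu_i^2 - 2\mu_i M_{i\cdot} + M_{i\cdot}^2 \\
  &= \frac{1+nc}{nc}\left( \mu_i - \frac{nc\,M_{i\cdot}}{1+nc} \right)^2
     + \frac{M_{i\cdot}^2}{1+nc}.
\end{aligned}
```

The first term gives a normal kernel in $\mu_i$ with mean $nc\,M_{i\cdot}/(1+nc)$, and integrating it out leaves the factor $(1+nc)^{-1/2}$. The remaining term $M_{i\cdot}^2/(1+nc)$ joins $s_i^2$ in the rate of the gamma kernel for $\tau_i$.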

Hence for a gene $g$ (from (5))

$$B_g = \log \left[ \frac{p}{1-p} \, \frac{1}{\sqrt{1+nc}} \left( \frac{a + s_g^2 + M_{g\cdot}^2}{a + s_g^2 + \frac{M_{g\cdot}^2}{1+nc}} \right)^{\nu + \frac{n}{2}} \right]. \tag{7}$$

The only gene-specific part of $B_g$ is the last ratio, which is always $\geq 1$ since $1/(1+nc) < 1$. We can deduce that an increasing differential expression (and hence an increasing $M_{g\cdot}^2$) increases $B_g$, all the more if the variance is small. If $M_{g\cdot}^2$ is small too, $a$ ensures that the ratio cannot be expanded by a very small variance.
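Expression (7) is straightforward to evaluate once the hyperparameters are given. The sketch below is one possible NumPy implementation, not the code of the sma package. It assumes natural logarithms and takes the gene variance with divisor n, as the derivation requires. The hyperparameter values and the three example genes are made up for illustration.

```python
import numpy as np

def log_posterior_odds(M, p, nu, a, c):
    """Evaluate expression (7) for every row (gene) of the N x n matrix M."""
    M = np.asarray(M, dtype=float)
    n = M.shape[1]
    mean = M.mean(axis=1)          # gene averages
    s2 = M.var(axis=1)             # variance with divisor n (ddof=0)
    num = a + s2 + mean**2
    den = a + s2 + mean**2 / (1.0 + n * c)
    return (np.log(p / (1.0 - p))
            - 0.5 * np.log1p(n * c)
            + (nu + n / 2.0) * (np.log(num) - np.log(den)))

M = np.array([[0.1, -0.1, 0.0, 0.1],     # log ratios near zero
              [1.9, 2.1, 2.0, 2.2],      # large and consistent
              [2.5, -1.5, 3.0, -2.0]])   # large but noisy
B = log_posterior_odds(M, p=0.01, nu=3, a=0.04, c=2.0)
print(B)
```

In this toy input the consistent gene receives by far the highest score, while the noisy gene is penalized through its large variance, in line with the discussion of the ratio above.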

Expression (7) is referred to as B in this paper. It is possible to generalize (7) so that it is valid for microarray experiments where each gene has $m$ replicated spots within each of $n$ slides, i.e., where we have one variance component within slides and one between slides. The result is then

$$B_g = \log \left[ \frac{p}{1-p} \, \frac{1}{\sqrt{1+mnc}} \left( \frac{a + \frac{1}{mn} (SSB_g + k \, SSW_g) + M_{g\cdot\cdot}^2}{a + \frac{1}{mn} (SSB_g + k \, SSW_g) + \frac{M_{g\cdot\cdot}^2}{1+mnc}} \right)^{\nu + \frac{mn}{2}} \right], \tag{8}$$

where $M_{g\cdot\cdot}$ are the overall averages for each gene, $SSB_g$ and $SSW_g$ are the sums of squares between and within slides, respectively, and $k$ is an unavoidable extra global parameter reflecting the ratio of SSB to SSW.
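The within-slide generalization can be sketched in the same way. The code below rests on several assumptions that the text does not spell out: the usual one-way ANOVA definitions of the between-slide and within-slide sums of squares for a balanced layout, a scaling of 1/(mn) on their weighted sum, and an exponent of ν + mn/2. The function name and the value of k are hypothetical.

```python
import numpy as np

def log_posterior_odds_spots(M, p, nu, a, c, k):
    """Evaluate expression (8); M has shape (N, n, m): genes x slides x spots."""
    M = np.asarray(M, dtype=float)
    N, n, m = M.shape
    slide_mean = M.mean(axis=2)                       # average over spots per slide
    grand_mean = slide_mean.mean(axis=1)              # overall average per gene
    # Assumed ANOVA definitions (balanced design):
    ssb = m * ((slide_mean - grand_mean[:, None]) ** 2).sum(axis=1)
    ssw = ((M - slide_mean[:, :, None]) ** 2).sum(axis=(1, 2))
    spread = (ssb + k * ssw) / (m * n)
    num = a + spread + grand_mean**2
    den = a + spread + grand_mean**2 / (1.0 + m * n * c)
    return (np.log(p / (1.0 - p))
            - 0.5 * np.log1p(m * n * c)
            + (nu + m * n / 2.0) * (np.log(num) - np.log(den)))

rng = np.random.default_rng(0)
M3 = rng.normal(size=(5, 4, 2))   # 5 genes, 4 slides, 2 spots per slide
print(log_posterior_odds_spots(M3, p=0.01, nu=3, a=0.04, c=2.0, k=1.7))
```

Under these assumptions the expression reduces to (7) when m = 1, since the within-slide sum of squares vanishes and the between-slide sum of squares divided by n is the gene variance.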

A.2. Estimation of hyperparameters of B

There are four global parameters in the model for B: $p$, $\nu$, $a$ and $c$. Unfortunately, it is difficult to estimate $(p, \nu, a, c)$. Therefore, we fix $p$ and estimate $\nu, a \mid p$ and $c \mid p, \nu, a$. The parameters $\nu$ and $a$ are such that $\tau_i = na/(2\sigma_i^2) \sim \Gamma(\nu, 1)$ for all $i$. We use the observed variance estimates $\left( \frac{1}{n-1} \sum_{j=1}^{n} (M_{ij} - M_{i\cdot})^2 \right)$ to estimate $\nu$ and $a$ by the method of moments.
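The text does not give the moment equations, so the following is only one possible moment-matching scheme and not necessarily the one used by the authors. Under the model the gene variance is inverse gamma with mean $(na/2)/(\nu-1)$ and second moment $(na/2)^2/((\nu-1)(\nu-2))$ for $\nu > 2$, while under normality the unbiased sample variance has the same mean and a second moment inflated by $(n+1)/(n-1)$. Solving these two equations gives the estimates below. The result for ν is real-valued and could be rounded if an integer is required.

```python
import numpy as np

def moment_estimates(M):
    """Moment-matching estimates of (nu, a) from an N x n matrix of log ratios."""
    M = np.asarray(M, dtype=float)
    n = M.shape[1]
    s2 = M.var(axis=1, ddof=1)                   # unbiased per-gene variances
    m1 = s2.mean()                               # estimates E[sigma^2]
    m2 = (s2**2).mean() * (n - 1) / (n + 1)      # estimates E[sigma^4]
    r = m2 / m1**2                               # equals (nu - 1) / (nu - 2) in the model
    if r <= 1.0:
        raise ValueError("variances too homogeneous for this moment match")
    nu = 2.0 + 1.0 / (r - 1.0)
    a = 2.0 * m1 * (nu - 1.0) / n
    return nu, a

# Check on simulated data with known (hypothetical) hyperparameters.
rng = np.random.default_rng(1)
N, n, nu_true, a_true = 400000, 8, 8.0, 0.05
tau = rng.gamma(nu_true, 1.0, size=N)
sigma2 = n * a_true / (2.0 * tau)
M_sim = rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(N, n))
nu_hat, a_hat = moment_estimates(M_sim)
print(nu_hat, a_hat)
```

Estimates of this kind rely on fourth moments of heavy-tailed quantities, so they can be unstable when the number of genes is small or ν is close to 2.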

The difficulty in estimating $c$ is that it occurs only in the distribution of genes which are differentially expressed ($I_i = 1$), and we do not know which ones these are. We have chosen to compare the observed normal density of the averages $(M_{i\cdot})_{i \in T}$, where $T$ is the given top proportion $p$ of genes with respect to B, with the observed normal density of all averages $M_{i\cdot}$. This leads naturally to an estimate of $c$.

In the absence of a satisfactory estimate, $p$ is fixed at some sensible value such as 0.01 or 0.001. The choice of $p$ does not change the shape of the B vs $M_{\cdot}$ plot very much, but it does move it up and down the y-axis. A consequence of this is that we cannot fix a cutoff (e.g., $B = 0$, as suggested for (5)) to select all the genes with a higher score as differentially expressed.
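The vertical movement can be quantified in a simplified setting. If ν, a and c are held fixed (an assumption made here for illustration; in the procedure described above c is re-estimated for each p), then p enters B only through the prior log odds, so changing p shifts every score by the same constant. The helper name below is hypothetical.

```python
import math

def prior_log_odds(p):
    """The term log(p / (1 - p)) through which p enters B directly."""
    return math.log(p / (1.0 - p))

# Shift in every B value when moving from p = 0.001 to p = 0.01,
# with the other hyperparameters held fixed.
shift = prior_log_odds(0.01) - prior_log_odds(0.001)
print(round(shift, 3))
```

A cutoff such as B = 0 therefore selects a different set of genes for each p, whereas a cutoff based on rank does not change under such a constant shift.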

References

Buckley, M. J. (2000). The Spot user's guide. CSIRO Mathematical and Information Sciences. http://www.cmis.csiro.au/IAP/Spot/spotmanual.htm.

Callow, M. J., Dudoit, S., Gong, E. L., Speed, T. P. and Rubin, E. M. (2000). Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res. 10, 2022-2029.

Efron, B., Tibshirani, R., Goss, V. and Chu, G. (2000). Microarrays and their use in a comparative experiment. Technical report, Stanford University.

Hartigan, J. A. (1983). Bayes Theory. Springer-Verlag, New York.

Ideker, T., Thorsson, V., Siegel, A. F. and Hood, L. E. (2000). Testing for differentially expressed genes by maximum likelihood analysis of microarray data. J. Computat. Biology 7, 805-817.

Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. J. Computat. Graph. Statist. 5, 299-314.

Roberts, C. J., Nelson, B., Marton, M. J., Stoughton, R., Meyer, M. R., Bennett, H. A., He, Y. D., Dai, H., Walker, W. L., Hughes, T. R., Tyers, M., Boone, C. and Friend, S. H. (2000). Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 287, 873-880.

Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. 98, 5116-5121.

Wang, Y. W., Dudoit, S., Luu, P. and Speed, T. P. (2001). Normalization for cDNA microarray data. SPIE BiOS, San Jose, California.

Department of Mathematics, Uppsala University, Box 480 S-751 06 Uppsala, Sweden.

E-mail: ingrid@math.uu.se

Department of Statistics, University of California, Berkeley, U.S.A.

E-mail: terry@stat.berkeley.edu

Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research,

Melbourne

E-mail: terry@wehi.edu.au

(Received July 2001; accepted October 2001)
