Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression

February 2002
Bioinformatics 18 Suppl 1(Suppl. 1):S96-104

DOI:10.1093/bioinformatics/18.suppl_1.S96

BIOINFORMATICS

Vol. 18 Suppl. 1 2002

Pages S96–S104

Variance stabilization applied to microarray data calibration and to the quantification of differential expression

Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka and Martin Vingron

Department of Molecular Genome Analysis, German Cancer Research Center, INF 280, Heidelberg, 69120, Germany and Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Dahlem, Berlin, 14195, Germany

Received on January 24, 2002; revised and accepted on March 30, 2002

ABSTRACT

We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x) = arsinh(a + bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-colour cDNA arrays and a series of Affymetrix oligonucleotide arrays.

Availability: Software is freely available for academic use as an R package at http://www.dkfz.de/abt0840/whuber

Contact: w.huber@dkfz.de

INTRODUCTION

Microarrays simultaneously measure transcript abundances for thousands of genes in a cell population or tissue sample. The measurement is performed by quantitating the fluorescence intensities from labeled sample cDNA that has hybridized to the probes on the array. Multiple samples of interest are processed either by labeling them with different dyes, and letting them hybridize simultaneously against a single array, or by labeling them with the same dye, and letting them hybridize separately against multiple arrays. In each case, statements about the relative abundance of a gene transcript in these samples can be made by comparing the corresponding fluorescence intensities. Due to variations in sample treatment, labeling, dye efficiency and detection, the fluorescence intensities can in general not be compared directly, but only after appropriate calibration, which is sometimes also called 'normalization'. One way of quantifying relative transcript abundance is the fold-change, that is, the ratio of calibrated intensities. As the intensities are associated with measurement error, the usefulness of the fold-change or of any other measure of relative abundance depends on knowing its error distribution: one needs to know whether, for example, a calculated ratio of 1.5 is noteworthy, or whether it is most likely just a chance fluctuation. To understand the error distribution, it is necessary to first consider that of the original spot intensities.

The analysis of replicate microarray data typically shows that the variance of the measured spot intensities increases with their mean. For high intensities, the coefficient of variation is approximately constant, that is, the standard deviation increases roughly linearly with the mean. In a pioneering paper Chen et al. (1997) built a model based on the assumption of a constant coefficient of variation, and derived the distribution of the ratios of intensities. The distribution has one parameter, the coefficient of variation, and according to the model is the same for all probes on the array. To fit their model to the intensity data from a two-colour cDNA array, they used a multiplicative calibration, which is estimated along with the coefficient of variation in an iterative algorithm. The model of Chen et al. (1997) motivates the use of logarithm-transformed intensities: ratios in the original data correspond to differences in the transformed data, the calibration amounts to a simple shift, and the constant coefficient of variation in the original data corresponds to an approximately constant standard deviation in the transformed data.

© Oxford University Press 2002

These concepts have been widely used in microarray data analysis. However, it has also become clear that for many data sets that are encountered in practice they are insufficient (e.g. Beißbarth et al. (2000); Hughes et al. (2000); Rocke and Durbin (2001); Newton et al. (2001); Baldi and Long (2001); Baggerly et al. (2001); Theilhaber et al. (2001)). The limitations mostly affect the data from weakly expressed genes. The significance of a ratio of, say, 1.5, is higher when it is observed in the high intensity range, than when it is observed in the low intensity range. Furthermore, many image quantization methods produce a certain fraction of non-positive intensities, for which ratios make no sense and the (real-valued) logarithm is not defined. Often, measurements below a threshold are dismissed, but it is unclear where to set the threshold and what to do with the missing values in the subsequent analysis. At the root of these problems lies the fact that with real microarray data the relationship between variance and mean typically is of a different form than that assumed by the model of Chen et al.

Another limitation is that Chen et al. consider only linear calibration transformations. With this, the data should lie along a straight line in the scatterplot of the log-transformed data. In many data sets, however, one observes deviations from the straight line, resulting in, for example, 'banana-shaped' scatter plots.

In order to overcome these limitations, we generalize the approach of Chen et al. (1997). A major component is a model for the distribution of measurement error that has been proposed by Rocke and Durbin (2001), which leads to a quadratic variance-versus-mean dependence. Based on this, we derive a parametric family of transformations of the measured intensities, such that the variance of the transformed intensities becomes approximately independent of the mean. Together with the calibration transformations, these are incorporated into a statistical model which allows for maximum likelihood estimation of its parameters. Moreover, this generalized model is formulated for an arbitrary number d of replicates, extending the setup of Chen et al. (1997), who considered the case of d = 2, with the two-colour cDNA array technology in mind. The case of d > 2 is relevant for the analysis of series of one-colour arrays, such as Affymetrix arrays, or cDNA membranes, and could also be useful for multi-colour slides, should this technology emerge.

The utility of the model is twofold: First, it allows the construction of a 'difference statistic' Δh whose variance does not depend on the mean intensity, and whose value is a measure of differential expression. Δh may be viewed as a generalization of the log-ratio, and the two coincide for the highly expressed genes. Second, our approach provides the calibration as part of the model fitting. This offers a model-based, interpretable solution to the problem of normalization and may be preferable to commonly used ad hoc procedures.

The estimation of the model parameters uses replicate data from either the different colour channels of one array, or from a series of one-colour arrays. The different samples need not be exact biological replicates. Rather, the samples should be biologically related closely enough that the expression of most genes does not change. We use a robust estimation technique, which seeks to ignore the differentially expressed genes, and fits the model only to that subset of data points (typically, 50 to 90%) that is closest to the model mean.

We validate our approach on experimental data. First we provide evidence for the claimed form of the variance-versus-mean dependence. After this, we look at the distribution of the proposed difference statistic Δh as a function of the mean spot intensity. We find that it is centered around zero, and has constant width along the whole intensity range. Finally, we evaluate our approach with respect to the identification of differentially expressed genes. This is accomplished by comparing how the power of standard statistical tests depends on the method used for calibration and quantification of differential expression.

THE MODEL

A microarray data set may be pictured as a rectangular table $y_{ki}$ of real numbers. The rows k correspond to the probes on the arrays, representing genes, and the columns i to the samples. The number n of probes may range from a few hundred to tens of thousands. The number of columns is d = 2 for the two-colour glass chip technology, and may range up to dozens or a few hundred for series of one-colour arrays. The values $y_{ki}$, with k = 1,...,n and i = 1,...,d, are the intensity data as produced by the image quantization software. Many programs estimate local background intensities, which may be subtracted.

Due to variations in experimental factors such as amount of sample mRNA, or labeling and hybridization efficiencies, the values $y_{ki}$ cannot directly be compared. We assume that the different columns (samples) can be brought onto the same scale through affine-linear mappings, parametrized by the 2d − 2 real-valued parameters $o_2, \ldots, o_d$ and $s_2, \ldots, s_d > 0$:

$$y_{ki} \mapsto \tilde{y}_{ki} = o_i + s_i y_{ki}, \qquad (1)$$

where i = 1,...,d, and $o_1 = 0$, $s_1 = 1$ without loss of generality. After this, one can calculate measures of differential expression, quantifying how much the intensity of a certain probe is different in one sample from another. For example, one may consider the difference between calibrated intensities, or the ratio. We use the general term difference statistic for such measures.

Fig. 1. Graph of the variance stabilizing transformation (4) (solid line), and of the logarithm function (dashed line). The histogram shows the intensity distribution of one colour channel on an 8400-element cDNA microarray. The parameters of the transformation (4) were estimated from the comparison with the intensities from the other colour channel.

For a non-differentially expressed gene, the values $\tilde{y}_{ki}$ for i = 1,...,d scatter around the true value according to the distribution of the measurement error of probe k. We may thus regard $\tilde{y}_{ki}$ as realizations of the random variables $Y_{ki}$ with mean $E(Y_{ki}) = u_k$ and variance $\mathrm{Var}(Y_{ki}) = v_k$. We assume that $v_k$ only depends on k through a quadratic function of the mean $u_k$ of the following form:

$$v_k = v(u_k) = (c_1 u_k + c_2)^2 + c_3, \quad \text{with } c_3 > 0. \qquad (2)$$

We will discuss the motivations for this assumption in the next section. The method of variance stabilization can be used to derive a transformation h such that the variance $\mathrm{Var}(h(Y_{ki}))$ is approximately independent of the mean $E(h(Y_{ki}))$. An expression for h is given by (Tibshirani, 1988)

$$h(y) = \int^{y} \frac{1}{\sqrt{v(u)}} \, du, \qquad (3)$$

and results from a linear approximation of $h(Y_{ki})$ around $h(u_k)$ ('Delta method'). Inserting (2) into (3) yields

$$h(y) = \gamma \, \mathrm{arsinh}(a + by), \qquad (4)$$

where the parameters of h are related to those of (2) through $\gamma = c_1^{-1}$, $a = c_2/\sqrt{c_3}$, and $b = c_1/\sqrt{c_3}$. A graph of the arsinh function is shown by the solid line in Figure 1. Relationships between the arsinh function and the logarithm are given by

$$\mathrm{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right), \qquad \lim_{x \to \infty} \left(\mathrm{arsinh}\, x - \log x - \log 2\right) = 0. \qquad (5)$$

Hence, for large intensities the transformation (4) becomes equivalent to the usual logarithmic transformation. However, unlike the logarithm, it does not have a singularity at zero, and continues to be smooth and real-valued in the range of small or negative intensities.
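The closed form (4) and the limit in (5) can be checked numerically. The following Python sketch is only an illustration and is not taken from the authors' R package; the coefficient values `C1`, `C2`, `C3` and all function names are hypothetical. It integrates $1/\sqrt{v(u)}$ with a simple midpoint rule and compares the result with $\gamma\,\mathrm{arsinh}(a + by)$, and it evaluates the gap between arsinh(x) and log(2x).

```python
import math

# Hypothetical coefficients of the variance function (2); c3 must be > 0.
C1, C2, C3 = 0.2, 5.0, 400.0


def v(u):
    """Quadratic variance-versus-mean function, Equation (2)."""
    return (C1 * u + C2) ** 2 + C3


def h_closed(y):
    """Closed form (4): gamma * arsinh(a + b*y)."""
    gamma = 1.0 / C1
    a = C2 / math.sqrt(C3)
    b = C1 / math.sqrt(C3)
    return gamma * math.asinh(a + b * y)


def h_numeric(y, y0=0.0, steps=20000):
    """Midpoint-rule approximation of the integral (3) from y0 to y."""
    width = (y - y0) / steps
    total = 0.0
    for j in range(steps):
        u = y0 + (j + 0.5) * width
        total += width / math.sqrt(v(u))
    return total


def log_gap(x):
    """arsinh(x) - log(2x), which tends to 0 as x grows, Equation (5)."""
    return math.asinh(x) - math.log(2.0 * x)


if __name__ == "__main__":
    for y in (-100.0, 50.0, 1000.0):
        # The two columns agree up to the integration constant h_closed(0).
        print(y, h_numeric(y), h_closed(y) - h_closed(0.0))
    for x in (1.0, 10.0, 1000.0):
        print(x, log_gap(x))
```

The integral in (3) is only defined up to an additive constant, which is why the comparison subtracts `h_closed(0.0)`. Negative arguments pose no problem, in line with the remark that the transformation stays real-valued for small or negative intensities.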

Now we apply the variance stabilizing transformation (4) to the calibrated data $\tilde{y}_{ki}$ from Equation (1) to obtain the transformation $y_{ki} \mapsto h(\tilde{y}_{ki}) = \gamma \, \mathrm{arsinh}(a + b(o_i + s_i y_{ki}))$. The parameter γ may be omitted since it is merely an overall scaling factor. Setting $a_i = a + b o_i$, $b_i = b s_i$, and

$$h_i(y_{ki}) = \mathrm{arsinh}(a_i + b_i y_{ki}), \qquad (6)$$

we can incorporate the calibration transformation (1), as well as the variance-versus-mean dependence (2), both together in the following statistical model:

$$h_i(Y_{ki}) = \mu_k + \varepsilon_{ki}, \quad k \in K. \qquad (7)$$

Here, K denotes the set of probes representing not differentially expressed genes, $\mu_k = E(h_i(Y_{ki}))$ is the mean, and the variance of the error term is constant,

$$E(\varepsilon_{ki}) = 0, \qquad \mathrm{Var}(\varepsilon_{ki}) = \sigma^2. \qquad (8)$$

The condition $E(\varepsilon_{ki}) = 0$ reflects the goal of calibration, whereas the common variance $\sigma^2$ of the error term is aimed at by variance stabilization. We fix the higher moments by assuming that the $\varepsilon_{ki}$ are i.i.d. normal. In Section Parameter Estimation we will provide a robust variant of the maximum likelihood estimator for the 2d parameters $a_i$ and $b_i$. Using the estimated transformations $\hat{h}_i$, the difference statistic that quantifies the change in expression between samples i and j of a gene represented by probe k is

$$\Delta h_{k;ij} = \hat{h}_i(y_{ki}) - \hat{h}_j(y_{kj}), \quad k = 1, \ldots, n. \qquad (9)$$

One may express Equation (9) in terms of the arsinh function:

$$\Delta h_{k;ij} = \mathrm{arsinh}(\hat{z}_{ki}) - \mathrm{arsinh}(\hat{z}_{kj}) = \log \frac{\hat{z}_{ki} + \sqrt{\hat{z}_{ki}^2 + 1}}{\hat{z}_{kj} + \sqrt{\hat{z}_{kj}^2 + 1}},$$

where $\hat{z}_{ki} = \hat{a}_i + \hat{b}_i y_{ki}$ and $\hat{z}_{kj} = \hat{a}_j + \hat{b}_j y_{kj}$ are the calibrated intensities. This shows that in the limit of large intensities, $\Delta h_{k;ij}$ coincides with the log-ratio, whereas for near-zero intensities, it approaches the difference $\hat{z}_{ki} - \hat{z}_{kj}$.
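The two limiting regimes of the difference statistic (9) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the parameter values `A_I`, `B_I`, `A_J`, `B_J` stand in for fitted estimates and are purely hypothetical, as is the function name `delta_h`.

```python
import math


def delta_h(y_i, y_j, a_i, b_i, a_j, b_j):
    """Difference statistic (9) for one probe and two samples i, j."""
    z_i = a_i + b_i * y_i  # calibrated intensity of sample i
    z_j = a_j + b_j * y_j  # calibrated intensity of sample j
    return math.asinh(z_i) - math.asinh(z_j)


# Hypothetical fitted parameters, for illustration only.
A_I, B_I = 0.1, 0.01
A_J, B_J = -0.2, 0.02

if __name__ == "__main__":
    # High intensities: delta_h is close to the log-ratio of calibrated values.
    y_i, y_j = 1.0e6, 2.0e5
    z_i, z_j = A_I + B_I * y_i, A_J + B_J * y_j
    print(delta_h(y_i, y_j, A_I, B_I, A_J, B_J), math.log(z_i / z_j))

    # Near-zero calibrated intensities: delta_h is close to z_i - z_j.
    y_i, y_j = 1.0, 11.0
    z_i, z_j = A_I + B_I * y_i, A_J + B_J * y_j
    print(delta_h(y_i, y_j, A_I, B_I, A_J, B_J), z_i - z_j)

    # Negative raw intensities are handled without special cases.
    print(delta_h(-50.0, 5.0, A_I, B_I, A_J, B_J))
```

No thresholding or replacement of non-positive values is needed before calling `delta_h`, which is the practical difference to a log-ratio.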

THE VARIANCE-VERSUS-MEAN DEPENDENCE

The basic assumption underlying the results of the previous section is that the variance $v_k$ depends on k as in Equation (2). First, this assumption implies that $v_k$ depends on k mainly through the mean intensity $u_k$, and that other factors such as sequence-specific effects, or effects associated with array geometry or the production process may be neglected. This would have to be verified from case to case, but appears to be plausible in many experiments. Second, we make a particular parametric ansatz, namely a quadratic function of the form (2). There are several motivations for this. One is provided by the following model for the measurement error of gene expression arrays (Rocke and Durbin, 2001):

$$Y = \alpha + \beta e^{\eta} + \nu, \qquad (10)$$

where β is the expression level in arbitrary units, α is an offset, and ν and η are additive and multiplicative error terms, respectively. ν and η are assumed to be independent and normally distributed with mean zero. This leads to

$$E(Y) = \alpha + m_\eta \beta, \qquad (11)$$

$$\mathrm{Var}(Y) = s_\eta^2 \beta^2 + s_\nu^2, \qquad (12)$$

where $m_\eta$ and $s_\eta^2$ are mean and variance of $e^{\eta}$, and $s_\nu^2$ is the variance of ν. Inserting Equation (11) into Equation (12) yields a quadratic expression of the form of Equation (2), and the relation between the parameters of model (10) and those of the variance stabilizing transformation (4) is given by $a = -\alpha s_\eta / (m_\eta s_\nu)$, $b = s_\eta / (m_\eta s_\nu)$, $\gamma = m_\eta / s_\eta$.

Fig. 2. Variance-versus-mean dependence v(u) in microarray data. Shown is the data from one mRNA sample, labeled both in red and green and hybridized against an 8400-element cDNA slide. The plots show the variance versus the mean (left), and the standard deviation versus the mean (right). The dots correspond to single-spot estimates $\hat{v}_k = (y_{k1} - y_{k2})^2/2$, $\hat{u}_k = (y_{k1} + y_{k2})/2$, the solid lines show a moving average. The axis units are arbitrary.

A further motivation for the quadratic ansatz (2) is provided by estimating v(u) directly from microarray data. A typical example is shown in Figure 2. The right plot shows how the assumption of constant coefficient of variation breaks down in the low intensity range: the curve has a non-zero intercept, that is, v(0) > 0, and its convexity is in agreement with the assumption that $c_3 > 0$ in Equation (2). Similar curves have been observed for many slides, and also for other levels of replication, e.g. with data from replicate spots on one array, or from replicate arrays. The essential features of these curves may be captured by parametrizing v(u) as a quadratic function of the form (2).
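A small simulation under the error model (10) can make the effect of the transformation concrete. The Python sketch below is an illustration under assumed values, not a reproduction of the paper's analysis: the offset `ALPHA` and the error scales `SD_LOG` and `SD_NU` are hypothetical, and the transformation parameters are computed from them using the relations between model (10) and transformation (4) rather than being estimated from data. For each expression level it reports the standard deviation of the raw intensities and of arsinh(a + bY).

```python
import math
import random
import statistics

ALPHA = 50.0   # offset alpha in model (10), hypothetical value
SD_LOG = 0.2   # standard deviation of eta, hypothetical value
SD_NU = 20.0   # standard deviation of nu, hypothetical value

# Mean and standard deviation of exp(eta) for eta ~ N(0, SD_LOG^2).
M_ETA = math.exp(SD_LOG ** 2 / 2.0)
S_ETA = math.sqrt((math.exp(SD_LOG ** 2) - 1.0) * math.exp(SD_LOG ** 2))

# Transformation parameters expressed through the parameters of model (10).
A = -ALPHA * S_ETA / (M_ETA * SD_NU)
B = S_ETA / (M_ETA * SD_NU)


def simulate(beta, n, rng):
    """Draw n intensities Y = alpha + beta * exp(eta) + nu."""
    draws = []
    for _ in range(n):
        eta = rng.gauss(0.0, SD_LOG)
        nu = rng.gauss(0.0, SD_NU)
        draws.append(ALPHA + beta * math.exp(eta) + nu)
    return draws


def spread_by_level(betas, n=5000, seed=1):
    """Return (sd of raw Y, sd of arsinh(A + B*Y)) for each expression level."""
    rng = random.Random(seed)
    result = []
    for beta in betas:
        y = simulate(beta, n, rng)
        h = [math.asinh(A + B * value) for value in y]
        result.append((statistics.stdev(y), statistics.stdev(h)))
    return result


if __name__ == "__main__":
    levels = (0.0, 100.0, 1000.0, 10000.0)
    for beta, (sd_y, sd_h) in zip(levels, spread_by_level(levels)):
        print(beta, round(sd_y, 1), round(sd_h, 3))
```

Under these assumptions the raw standard deviation grows by about two orders of magnitude across the levels, while the standard deviation on the transformed scale stays close to $s_\eta / m_\eta = 1/\gamma$.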

PARAMETER ESTIMATION

The parameters of the model (7) are estimated from data with a robust variant of maximum likelihood estimation. The detailed derivation, as well as results on convergence and identifiability, are described in Huber et al. (2002). Given the data $(y_{ki})$, k ∈ K, i = 1,...,d, the profile log-likelihood (Murphy and van der Vaart, 2000) for the parameters $a_1, b_1, \ldots, a_d, b_d$ is

$$-\frac{|K| d}{2} \log\left( \sum_{k \in K} \sum_{i=1}^{d} \left( h_i(y_{ki}) - \hat{\mu}_k \right)^2 \right) + \sum_{k \in K} \sum_{i=1}^{d} \log h_i'(y_{ki}), \qquad (13)$$

with $h_i$ as in Equation (6). For a fixed set of probes K, we maximize (13) numerically under the constraints $b_i > 0$. The set of probes K is determined iteratively by a version of least trimmed sum of squares (LTS) regression (Rousseeuw and Leroy, 1987). Briefly, K consists of those probes for which $r_k = \sum_{i=1}^{d} (h_i(y_{ki}) - \hat{\mu}_k)^2$ is smaller than an appropriate quantile of the $r_k$. The LTS regression addresses the fact that the data distribution of $y_{ki}$ is produced by a mixture from genes that are differentially expressed, and ones that are not.
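The two ingredients of the estimation, the objective (13) and the trimming step, can be sketched as follows. This Python code is a hedged illustration, not the authors' implementation: the function names and the toy data are hypothetical, $\hat{\mu}_k$ is taken to be the row mean of the transformed values (an assumption of this sketch), and the numerical maximization over the $a_i$, $b_i$ as well as the iteration between fitting and trimming are left out.

```python
import math


def h(y, a, b):
    """Transformation (6) for one sample: arsinh(a + b*y)."""
    return math.asinh(a + b * y)


def log_h_prime(y, a, b):
    """Logarithm of the derivative of (6) with respect to y (requires b > 0)."""
    return math.log(b) - 0.5 * math.log(1.0 + (a + b * y) ** 2)


def residual_ss(row, a, b):
    """r_k: squared deviations of the transformed row from its own mean."""
    values = [h(y, a[i], b[i]) for i, y in enumerate(row)]
    mu_hat = sum(values) / len(values)
    return sum((value - mu_hat) ** 2 for value in values)


def profile_loglik(data, a, b, subset):
    """Expression (13) evaluated on the probes whose indices are in subset."""
    d = len(a)
    ss = sum(residual_ss(data[k], a, b) for k in subset)
    jacobian = sum(log_h_prime(y, a[i], b[i])
                   for k in subset for i, y in enumerate(data[k]))
    return -0.5 * len(subset) * d * math.log(ss) + jacobian


def lts_subset(data, a, b, keep_fraction):
    """Indices of the probes with the smallest r_k (one LTS trimming step)."""
    n_keep = math.ceil(keep_fraction * len(data))
    order = sorted(range(len(data)), key=lambda k: residual_ss(data[k], a, b))
    return sorted(order[:n_keep])


# Toy data: sample 2 is roughly twice sample 1, except for the last probe.
DATA = [(10.0, 21.0), (50.0, 99.0), (200.0, 405.0),
        (1000.0, 1990.0), (5000.0, 10050.0), (200.0, 4000.0)]
A_CAL, B_CAL = (0.0, 0.0), (0.01, 0.005)  # accounts for the factor of two
B_RAW = (0.01, 0.01)                      # ignores the factor of two

if __name__ == "__main__":
    kept = lts_subset(DATA, A_CAL, B_CAL, keep_fraction=0.8)
    print(kept)
    print(profile_loglik(DATA, A_CAL, B_CAL, kept))
    print(profile_loglik(DATA, A_CAL, B_RAW, kept))
```

On the toy data the trimming step drops the last probe, which behaves like a differentially expressed gene, and the objective is larger for the parameters that account for the scale difference between the two samples.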

VALIDATION

In this section we investigate how the variance stabilization and calibration work on real data, and how useful the resulting difference statistic Δh is to quantify differential gene expression.

To visualize the variance-versus-mean dependence, we plot the difference between the gene expression data for a pair of samples against the rank of their mean. Plotting against the rank distributes the data evenly along the x-axis and thus facilitates the visualization of variance heterogeneity. We look at data from a cDNA microarray experiment where samples from closely neighbouring parts of a kidney tumor were labeled with green and red fluorescent dyes, respectively. The expression levels of almost all of the genes in the two samples are expected to be unchanged, so that observed differences should represent the distribution of Δh in the absence of differential expression.
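The coordinates of such a plot are simple to compute. The following Python sketch shows one way to do it; the function names are hypothetical, and the tie-breaking rule for equal means (input order) is an assumption of this sketch, since the text does not specify one.

```python
def ranks(values):
    """Rank of each value (1 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result


def rank_difference_points(sample_1, sample_2):
    """(rank of pairwise mean, difference) pairs, one per probe."""
    means = [(u + w) / 2.0 for u, w in zip(sample_1, sample_2)]
    diffs = [u - w for u, w in zip(sample_1, sample_2)]
    return list(zip(ranks(means), diffs))


if __name__ == "__main__":
    # Three made-up probes measured in two channels.
    for point in rank_difference_points([10.0, 500.0, 60.0], [20.0, 400.0, 40.0]):
        print(point)
```

The inputs may be raw intensities or any transformed version of them, so the same helper covers each of the transformations that are compared.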

Figure 3 shows such plots for six different types of data transformation. First, Figure 3a corresponds to applying no transformation at all. Clearly, the width of the difference distribution increases with the signal average. After applying the logarithm transformation, the data looks as in Figure 3b. Here, all intensities below 1 have been replaced by 1 before taking the logarithm. This results in the two bands of points on the left side, corresponding to probes where one of two values was below 1, and one was not. If instead we dismiss the non-positive measurements, we obtain Figure 3c. In addition, a non-linear calibration (Yang et al., 2001) has been applied. It has been argued that the problem of small, or non-positive intensity values is an artifact of the image analysis' local background estimation, and hence one might consider using the spot intensities without local background subtraction. The result is depicted in Figure 3d. In fact, over a wide range the difference distribution now happens to have a practically constant width. However, the distribution no longer follows a horizontal line. Instead of the logarithm transformation, another plausible choice is the rank transformation. This is shown in Figure 3e. Finally, Figure 3f shows the difference statistic Δh, as defined in Equation (9). The distribution is centered around the x-axis, and its width is constant along the whole range. Similarly good results have been observed with microarray expression data from many different sources.

We now turn to the question of how well the values of $\Delta h_{k;ij}$, for a given probe k and measured over many pairs of samples i and j, reflect potential differential expression of the gene represented by that probe. We compare this to other commonly used difference statistics: the log-ratio together with different normalization methods, and the difference of ranks. As test data, we consider two data sets that contain highly replicated expression data. Both data sets compare two biological conditions, the first one clear cell renal cell cancer with non-cancerous kidney cortex tissue, and the second one acute myeloid with acute lymphoblastic leukemia (Golub et al., 1999). Given that there is a large number of genes differentially expressed between the two conditions, we determined the number of those that were detected by a statistical test on the values of $\Delta h_{k;ij}$ and, in comparison, on the other difference statistics. Since the permutation test we used allows us to control the type I error, the number of genes detected indicates how well the various difference statistics do in fact represent differential expression.

Fig. 3. The difference between the two colour channels of a cDNA microarray versus the rank of their average. Plot a) shows the untransformed intensity data, plots b-f) show the effect of five different transformations (see text). The y-axes of plots b-d) correspond to the usual 'log ratio', the y-axis of plot f) to the difference statistic Δh as proposed in this article. Panel titles: a) y; b) log(y); c) log(y), loess; d) log(y), without local background subtraction; e) rank(y); f) h(y).

The first data set was produced at the Department of Molecular Genome Analysis at the German Cancer Research Center, using cDNA slides with about 4200 clones spotted in duplicate. Paired cancerous and non-cancerous tissue samples from 19 patients were used, and each tissue pair was hybridized against two slides, with the dyes swapped between repetitions, resulting in a total of 38 slides. From this data, we calculated (i) the difference statistic $\Delta h_{k;ij}$, as well as log-ratios. For the latter, in order to deal with the negative intensity values produced by subtracting the image analysis software's background estimates, four different rules were tried: (ii) ignore the background estimates, (iii) replace the negative values by 1 before taking the logarithm, (iv) subtract the 5%-quantile, then replace the remaining negative values by 1, and (v) flag them as missing values. For (ii)–(iv), a multiplicative calibration was estimated by the midpoint of the shorth of the uncalibrated log-ratios. The shorth of a univariate distribution is defined as the shortest interval containing half of the values, and for a unimodal distribution, its midpoint is a robust estimator of its mode. For (v), we used the local regression proposed by Yang et al. (2001), using the implementation in the R package sma (http://www.r-project.org) with default parameters. Finally, we calculated (vi) the rank differences. Each of the difference statistics (i)–(vi) was averaged over the two replicate spots, and over the two replicate arrays, resulting in one value per gene per patient, and hence in a matrix with 4200 rows, for the clones, and 19 columns, for the patients. The mean of each row was compared against its permutation distribution, obtained from performing random column-wise sign flips. We counted the number of genes that were at the extremes of their respective permutation distribution, as a function of the quantile α. The result is shown in Figures 4a and b. The test based on the difference statistic Δh uniformly had the highest power.
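The shorth midpoint used for the multiplicative calibration of the log-ratios is easy to compute from sorted values. The sketch below is a minimal Python illustration with a hypothetical function name; it takes "half of the values" to mean ceil(n/2) points, which is one common convention and an assumption here, since the text does not fix it.

```python
import math


def shorth_midpoint(values):
    """Midpoint of the shortest interval containing half of the values."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("shorth_midpoint needs at least one value")
    m = math.ceil(n / 2)  # number of points the interval must contain
    # Slide a window of m consecutive sorted values and keep the narrowest.
    start = min(range(n - m + 1), key=lambda i: x[i + m - 1] - x[i])
    return 0.5 * (x[start] + x[start + m - 1])


if __name__ == "__main__":
    # Made-up log-ratios: a cluster near 1 plus a few outlying values.
    log_ratios = [0.0, 0.9, 1.0, 1.1, 1.2, 5.0, 9.0, -4.0]
    print(shorth_midpoint(log_ratios))
```

Subtracting the returned value from all log-ratios would correspond to a multiplicative calibration on the original scale, and outlying values have little influence on it.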

Figures 4a and b correspond to two one-sided tests, testing the row mean of the expression matrix against the hypothesis that it is less than or equal to zero, or greater than or equal to zero, respectively. We chose this procedure in order to make the comparison insensitive to potential subtle biases in the estimation of the calibration parameters. Such biases could be caused by a difference in the number of up- and down-regulated genes, and could consequently lead to biases in any of the difference statistics (i)–(vi). However, they would have opposite effects on the number of detected genes in the two tests. The fact that the difference statistic Δh detects more genes in both one-sided tests verifies that its better performance is not related to such potential calibration errors.
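A one-sided permutation test based on column-wise sign flips can be sketched as follows. This is an illustrative Python version with hypothetical names, not the code used for Figure 4; the number of permutations, the random seed and the add-one correction of the p-value are assumptions of this sketch.

```python
import random


def sign_flip_pvalues(matrix, n_perm=2000, seed=0):
    """One-sided p-values for 'row mean > 0' from column-wise sign flips."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    observed = [sum(row) / n_cols for row in matrix]
    exceed = [0] * len(matrix)
    for _ in range(n_perm):
        # One random sign per column (patient), shared by all rows (genes).
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n_cols)]
        for g, row in enumerate(matrix):
            permuted = sum(s * value for s, value in zip(signs, row)) / n_cols
            if permuted >= observed[g]:
                exceed[g] += 1
    # Add-one correction so that no p-value is exactly zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]


if __name__ == "__main__":
    # Two made-up genes over 19 patients: one consistently up, one balanced.
    up = [1.0] * 19
    balanced = [1.0 if j % 2 == 0 else -1.0 for j in range(19)]
    print(sign_flip_pvalues([up, balanced]))
```

The test for downregulation follows by negating the matrix entries before the call. Counting the rows with a p-value below a chosen level α gives the kind of curve described for Figure 4.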

To evaluate our method with data from a different technological platform and experimental design, we used an expression data set measured on Affymetrix oligonucleotide arrays. It comprises 47 samples of acute myeloid leukemia and 25 samples of acute lymphoblastic leukemia (Golub et al., 1999). From the data matrix provided at Golub et al.'s (1999) website (http://www-genome.wi.mit.edu/mpr) we calculated calibrated and transformed data $h_i(y_{ki})$, with k = 1,...,7129 and i = 1,...,72. We used the data as is, with no further selection or thresholding, and ignored the A/M/P-flags that the Affymetrix software associated with each value. The simultaneous estimation of the 2d = 144 parameters posed no particular problem. In contrast, Golub et al. (1999) used a calibration method based on a linear regression, which in a pairwise fashion referenced arrays 2...38 to array 1, and arrays 40...72 to array 39. We used a two-sample permutation t-test to detect genes differentially expressed between AML and ALL. The result is shown in Figures 4c and d. Again, the test based on Δh has higher power.

Finally, an example of how the difference statistic Δh leads to more easily interpretable data displays is depicted in Figure 5. Since the distribution of Δh is independent of the mean intensity, observed values can directly be compared to the marginal empirical distribution, shown in the histogram to the right. A scale on the Δh axis may be defined through a robust measure of width $\sigma_{\Delta h}$ of the empirical distribution, as indicated in Figure 5. Note, however, that in general the null distribution of Δh is not known, and in the presence of an unknown subset of differentially expressed genes, it is also not easy to estimate it.

Fig. 4. Sensitivity and specificity of different methods for the quantification of differential expression. Top row: comparison of Δh to 4 methods based on log-ratios, and to the rank difference, on two-colour cDNA glass chip data. Bottom row: comparison of Δh to the procedure used in (Golub et al., 1999) on the AML/ALL data. The plots show the number of genes selected by permutation tests against the significance level α. The test based on the difference statistic Δh uniformly has the best power. Panel titles: a) test for upregulation (kidney data); b) test for downregulation (kidney data); c) test for upregulation (Golub data); d) test for downregulation (Golub data).

DISCUSSION

A long-standing problem in the analysis of microarray gene expression experiments is how to take into account the dependence of the standard deviation of a spot intensity on its mean. In a seminal paper by Chen et al. (1997), this relationship has been modelled as a linear function. A main consequence is the use of logarithmic ratios as a measure of differential expression. Here, we have shown that their approach, although alleviating the problem, does not solve it entirely. The main limitation of the log-ratio as a measure of differential expression is the dependence of its variability on the intensity. To address this fact, we propose the general approach of applying a variance stabilizing transformation in order to achieve a constant signal-to-noise ratio. This results in the difference statistic Δh, which displays approximately constant variance independent of the spot intensity, and replaces the log-ratio as a measure of differential gene expression.


[Figure 5: scatter plot of Δh (y-axis) against rank(average) (x-axis), with a histogram of Δh.]

Fig. 5. Display of the data from a two-colour cDNA slide, taken from the kidney data set. Analogous to Figure 3, the difference statistic Δh is plotted along the y-axis, and the rank of the average spot intensity along the x-axis. The variance of the measurement error is constant over the whole intensity range, and horizontal dotted lines are plotted at multiples of its estimated standard deviation. Note that the figure shows the complete intensity data from the slide, without any thresholding or truncation. The circles and triangles represent genes that have been found up- and down-regulated, respectively, in renal cell cancer in a previous study (Boer et al., 2001). Most of these are verified by the present slide, while for some, possibly due to biological or experimental variation, the value of Δh is close to zero.

Alternative approaches to this problem (e.g. Hughes et al. (2000); Baggerly et al. (2001); Theilhaber et al. (2001)) have also put forward quantitative models for this intensity-dependence, and propose to augment log-ratio values by ‘significance values’ calculated from such models.

Additionally, however, our approach takes into account the problem of calibration. Due to differential behaviour of dyes, or variations between samples and arrays, intensity measurements need to be brought onto a common scale before they can be compared. In alternative approaches, the estimation of the calibration parameters is complicated by the non-constant variances. Furthermore, the resulting ratios may depend sensitively on the calibration. These problems are overcome in our approach: the estimation of the calibration parameters is simplified through the use of a variance stabilizing transformation, and is an integral part of the overall model fitting. Furthermore, our approach is parsimonious with parameters. No non-parametric curve estimation is required, which helps to provide robustness and avoid overfitting. Finally, since the difference statistic Δh is simply obtained as the difference between the transformed data of the individual samples, our approach not only allows for the comparison of two samples, but can also be used without further effort for multivariate analyses comparing more than two samples.
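
The step from a two-sample comparison to a multivariate analysis can be sketched as follows. This is an illustration only: it assumes that each sample i has its own parameter pair (a_i, b_i) absorbing the affine-linear calibration, and all intensities and parameter values are invented. In practice the parameters would be estimated from the data.

```python
import numpy as np

def h(y, a, b):
    """Transformation of the form arsinh(a + b*y). With one (a, b) pair
    per column, broadcasting applies each pair to its own sample."""
    return np.arcsinh(a + b * np.asarray(y, dtype=float))

# Hypothetical intensities: rows are probes, columns are samples.
y = np.array([[ 120.0,  310.0,   95.0],
              [5000.0, 9800.0, 4100.0],
              [  40.0,   70.0,   35.0]])

# Hypothetical per-sample parameters (not estimated here).
a = np.array([-0.5, -0.4, -0.6])
b = np.array([0.010, 0.005, 0.012])

h_mat = h(y, a, b)                     # transformed data, one column per sample
delta_h = h_mat[:, 1] - h_mat[:, 0]    # two-sample difference statistic
# For more than two samples, the transformed matrix itself (here centred
# per probe) can be passed to clustering or other multivariate methods.
centred = h_mat - h_mat.mean(axis=1, keepdims=True)
```

The point of the sketch is that no pairwise reference is needed: once every sample has been transformed, any difference or multivariate summary is computed from the same matrix.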

In our model, the variance stabilizing transformation h turns out to be an arsinh function. This generalizes earlier results, as follows. Using a quadratic ansatz, the variance-versus-mean dependence at the basis of our approach has three parameters, which may be related to those of the model of Rocke and Durbin (2001). If in their model the additive noise component vanishes, the resulting limiting case turns out to be the logarithmic transformation with pseudocounts, $h(y) = \log(y + y_0)$, which has been used by various authors to overcome limitations of the logarithmic transformation (e.g. Beißbarth et al. (2000); Newton et al. (2001)). Furthermore, if both the constant and the linear term in the quadratic function vanish, our model turns into that of Chen et al. (1997).
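
The limiting cases named above can be made explicit with a short derivation. The three-parameter form of the variance function used here, with parameters c_1, c_2 and c_3, is an assumed parametrization chosen for illustration; it may differ in detail from the one used in the earlier sections. The transformation follows from the usual delta-method condition that h'(u) be proportional to 1/√v(u).

```latex
\begin{align*}
v(u) &= (c_1 u + c_2)^2 + c_3, \qquad c_3 > 0, \\
h(y) &= \int^{y} \frac{\mathrm{d}u}{\sqrt{v(u)}}
      = \frac{1}{c_1}\,\operatorname{arsinh}\!\left(\frac{c_1 y + c_2}{\sqrt{c_3}}\right), \\
h(y) &\approx \frac{1}{c_1}\,\log\!\left(y + \frac{c_2}{c_1}\right) + \text{const}
      \qquad (c_3 \to 0,\ c_1 y + c_2 > 0), \\
h(y) &\approx \frac{1}{c_1}\,\log y + \text{const}
      \qquad (c_2 = 0,\ c_3 \to 0).
\end{align*}
```

Under this parametrization, the second line has the form arsinh(a + by) up to an overall factor, with a = c_2/√c_3 and b = c_1/√c_3. The third line is a logarithm with pseudocount y_0 = c_2/c_1, and the fourth corresponds to a variance proportional to the squared mean, that is, a constant coefficient of variation. The additive constants cancel in the difference statistic.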

Our approach is based on the following main assumptions: First, the variance of the measurements on a probe mainly depends on the mean intensity, and the relationship may be described by a second order polynomial with negative discriminant. This is grounded in the analysis of a large number of experiments and is in agreement with the model of Rocke and Durbin (2001). Second, we assume that the relationship of measurements between samples is captured by an affine-linear transformation. While non-linear behaviour may be observed under certain conditions, it has been demonstrated (e.g. Ramdas et al. (2001)) that current-day microarray technology has a large, practically accessible working range in which intensities increase linearly with mRNA concentrations. It appears to us that in many cases apparent non-linearities that have been observed in the logarithmic plot (for an example, see Figure 3d) are an artifact of the logarithmic transformation, and disappear when using the appropriate affine-linear calibration. However, the general approach we proposed can easily be modified to incorporate a different class of calibration transformations or a different form of the variance-versus-mean dependence.
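
The first assumption can be illustrated with a small simulation. The sketch below assumes an error model with one additive and one multiplicative noise component, in the spirit of Rocke and Durbin (2001), and uses invented values for the offset and the two noise levels. The arsinh transformation is given the true simulation parameters rather than estimated ones, so this shows only the stabilizing effect, not the fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical offset, multiplicative noise level and additive noise level.
alpha, s_eta, s_eps = 50.0, 0.1, 20.0
n = 20000

def simulate(mu):
    """Draw n intensities around the true level mu with additive and
    multiplicative measurement error."""
    eta = rng.normal(0.0, s_eta, n)
    eps = rng.normal(0.0, s_eps, n)
    return alpha + mu * np.exp(eta) + eps

def h(y):
    """arsinh transformation matched to the simulated variance function
    v(u) ~ s_eta**2 * (u - alpha)**2 + s_eps**2."""
    return np.arcsinh(s_eta * (y - alpha) / s_eps)

levels = [0.0, 200.0, 2000.0, 20000.0]
sd_raw = [simulate(mu).std() for mu in levels]   # grows with intensity
sd_h = [h(simulate(mu)).std() for mu in levels]  # roughly constant
print(sd_raw)
print(sd_h)
```

On the raw scale the standard deviation grows by about two orders of magnitude across these levels, while after the transformation it remains close to the multiplicative noise level throughout.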

A third assumption concerns the statistical distribution of the intensity measurements. The variance-stabilized intensities per spot are assumed to be normally distributed. The parameter estimation draws on this assumption in particular near the centre of the distribution, but because of our use of a robust regression procedure, it should not be affected by possible deviations from normality in the tails of the distribution.

A crucial point in modelling and parameter estimation from data is identifiability. The transformations h have 2d parameters (cf. Equations (6) and (13)), which we need to determine from nd data points. d ranges from d = 2 up to a few dozen or hundred, while n is typically of the order of several thousand. Given this generally favourable relation between the amount of data and number of parameters, and according to our experience with simulations and jackknife sampling, the transformations are well identifiable.


One might see a drawback of our method in the fact that it measures expression differences in terms of a function h with two estimated, experiment-specific parameters a and b, while the log-ratio can be calculated directly from the calibrated data, with no further parameters, and is easily interpreted as a fold-change. However, for large intensities, the values of Δh and of the log-ratio coincide (cf. Equation (5)), irrespective of the values of the experiment-specific parameters, and hence Δh may just as well be interpreted as the logarithm of a fold change. For small intensities that are near the detection limit of the experiment, the values of Δh are contracted towards 0 in comparison to those of the log-ratio. The onset and magnitude of this contraction are governed by the parameters a and b, which in this way encode the intensity-dependent measurement error distribution of the experiment. We note that corresponding intensity-dependent thresholds are also used in the analysis of log-ratios, albeit usually in a less systematic manner.
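
A small numerical comparison may make the two regimes concrete. The sketch assumes already calibrated intensities, a single hypothetical parameter pair (a, b) shared by both samples, and the natural logarithm for the log-ratio. With a base-2 log-ratio the two quantities would differ by a constant factor at high intensities.

```python
import numpy as np

a, b = 0.0, 0.01   # hypothetical parameters, chosen for illustration only

def h(y):
    """arsinh transformation with the hypothetical parameters above."""
    return np.arcsinh(a + b * np.asarray(y, dtype=float))

def compare(y1, y2):
    """Return (delta_h, natural-log ratio) for two calibrated intensities."""
    return float(h(y2) - h(y1)), float(np.log(y2 / y1))

high = compare(20000.0, 40000.0)   # two-fold change far above the noise level
low = compare(20.0, 40.0)          # two-fold change near the detection limit
print(high)
print(low)
```

For the high-intensity pair the two numbers are practically identical, whereas for the low-intensity pair the same nominal two-fold change yields a Δh well below the log-ratio, which is the contraction towards 0 described above.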

Finally, and perhaps most importantly, our method also proves successful in the application to real data. It can typically be used off-the-shelf, without any particular tuning, and has been applied to different platforms, such as two-colour slides, Affymetrix chips, and radioactive membranes. As in the ANOVA approach by Kerr et al. (2000), calibration is done not necessarily for pairs of samples but simultaneously for a whole set. The simple error distribution of the transformed intensities h makes them particularly suitable as input for clustering or other multivariate analysis methods. Software is provided as an R package, which is freely available for academic use.

ACKNOWLEDGEMENTS

We thank Günther Sawitzki and Dirk Buschmann for fruitful discussions, and Rainer Spang for critical reading of the manuscript. Bernd Korn supplied the RZPD Oncoset of clones for the slides used in the kidney experiment, and Stephanie Süß provided expert technical assistance.

REFERENCES

Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N., Abruzzo,L.V. and Zhang,W. (2001) Identifying differentially expressed genes in cDNA microarray experiments. J. Comput. Biol., 8, 639–659.

Baldi,P. and Long,A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509–519.

Beißbarth,T., Fellenberg,K., Brors,B., Arribas-Prat,R., Boer,J.M., Hauser,N.C., Scheideler,M., Hoheisel,J.D., Schütz,G., Poustka,A. and Vingron,M. (2000) Processing and quality control of DNA array hybridization data. Bioinformatics, 16, 1014–1022.

Boer,J.M., Huber,W., Sültmann,H., Wilmer,F., von Heydebreck,A., Haas,S., Korn,B., Gunawan,B., Vente,A., Füzesi,L., Vingron,M. and Poustka,A. (2001) Identification and classification of differentially expressed genes in renal cell carcinoma by expression profiling on a global human 31 500-element cDNA array. Genome Res., 11, 1861–1870.

Chen,Y., Dougherty,E.R. and Bittner,M.L. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt., 2, 364–374.

Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Huber,W., von Heydebreck,A., Sültmann,H., Poustka,A. and Vingron,M. A model for the calibration of microarray data and the quantification of differential expression. Preprint available on request.

Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D., Kidd,M.J., King,A.M., Meyer,M.R., Slade,D., Lum,P.Y.Y., Stepaniants,S.B., Shoemaker,D.D., Gachotte,D., Chakraburtty,K., Simon,J., Bard,M. and Friend,S.H. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

Kerr,M.K., Martin,M. and Churchill,G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837.

Murphy,S.A. and van der Vaart,A.W. (2000) On profile likelihood. J. Amer. Stat. Assoc., 95, 449–465.

Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Tsui,K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol., 8, 37–52.

Ramdas,L., Coombes,K.R., Baggerly,K., Abruzzo,L.V., Highsmith,W.E., Krogmann,T., Hamilton,S.R. and Zhang,W. (2001) Sources of nonlinearity in cDNA microarray expression measurements. Genome Biol., 2, research0047.1–research0047.7.

Rocke,D.M. and Durbin,B. (2001) A model for measurement error for gene expression analysis. J. Comput. Biol., 8, 557–569.

Rousseeuw,P.J. and Leroy,A.M. (1987) Robust Regression and Outlier Detection. Wiley.

Theilhaber,J., Bushnell,S., Jackson,A. and Fuchs,R. (2001) Bayesian estimation of fold-changes in the analysis of gene expression: The PFOLD algorithm. J. Comput. Biol., 8, 585–614.

Tibshirani,R. (1988) Estimating transformations for regression via additivity and variance stabilization. J. Amer. Stat. Assoc., 83, 394–405.

Yang,Y.H., Dudoit,S., Luu,P., Lin,D.M., Peng,V., Ngai,J. and Speed,T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15:1–e15:11.

S104

Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference

Article

Full-text available

May 2024

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthew’s correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows for expanding differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.

Systematic identification of structure-specific protein–protein interactions

Article

May 2024
MOL SYST BIOL

The physical interactome of a protein can be altered upon perturbation, modulating cell physiology and contributing to disease. Identifying interactome differences of normal and disease states of proteins could help understand disease mechanisms, but current methods do not pinpoint structure-specific PPIs and interaction interfaces proteome-wide. We used limited proteolysis–mass spectrometry (LiP–MS) to screen for structure-specific PPIs by probing for protease susceptibility changes of proteins in cellular extracts upon treatment with specific structural states of a protein. We first demonstrated that LiP–MS detects well-characterized PPIs, including antibody–target protein interactions and interactions with membrane proteins, and that it pinpoints interfaces, including epitopes. We then applied the approach to study conformation-specific interactors of the Parkinson’s disease hallmark protein alpha-synuclein (aSyn). We identified known interactors of aSyn monomer and amyloid fibrils and provide a resource of novel putative conformation-specific aSyn interactors for validation in further studies. We also used our approach on GDP- and GTP-bound forms of two Rab GTPases, showing detection of differential candidate interactors of conformationally similar proteins. This approach is applicable to screen for structure-specific interactomes of any protein, including posttranslationally modified and unmodified, or metabolite-bound and unbound protein states.

Longitudinal plasma proteomic analysis of 1117 hospitalized patients with COVID-19 identifies features associated with severity and outcomes

Article

May 2024

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection is characterized by highly heterogeneous manifestations ranging from asymptomatic cases to death for still incompletely understood reasons. As part of the IMmunoPhenotyping Assessment in a COVID-19 Cohort study, we mapped the plasma proteomes of 1117 hospitalized patients with COVID-19 from 15 hospitals across the United States. Up to six samples were collected within ~28 days of hospitalization resulting in one of the largest COVID-19 plasma proteomics cohorts with 2934 samples. Using perchloric acid to deplete the most abundant plasma proteins allowed for detecting 2910 proteins. Our findings show that increased levels of neutrophil extracellular trap and heart damage markers are associated with fatal outcomes. Our analysis also identified prognostic biomarkers for worsening severity and death. Our comprehensive longitudinal plasma proteomics study, involving 1117 participants and 2934 samples, allowed for testing the generalizability of the findings of many previous COVID-19 plasma proteomics studies using much smaller cohorts.

A Review on: Gene Expression Analysis Techniques and its Application

Article

Apr 2024

Kavya S

Functional and multi-omic aging rejuvenation with GLP-1R agonism

Preprint

Full-text available

May 2024

Identifying readily implementable methods that can effectively counteract aging is urgently needed for tackling age-related degenerative disorders. Here, we conducted functional assessments and deep molecular phenotyping in the aging mouse to demonstrate that glucagon-like peptide-1 receptor agonist (GLP-1RA) treatment attenuates body-wide age-related changes. Apart from improvements in physical and cognitive performance, the age-counteracting effects are prominently evident at multiple omic levels. These span the transcriptomes and DNA methylomes of various tissues, organs and circulating white blood cells, as well as the plasma metabolome. Importantly, the beneficial effects are specific to aged mice, not young adults, and are achieved with a low dosage of GLP-1RA which has a negligible impact on food consumption and body weight. The molecular rejuvenation effects exhibit organ-specific characteristics, which are generally heavily dependent on hypothalamic GLP-1R. We benchmarked the GLP-1RA age-counteracting effects against those of mTOR inhibition, a well-established anti-aging intervention, observing a strong resemblance across the two strategies. Our findings have broad implications for understanding the mechanistic basis of the clinically observed pleiotropic effects of GLP-1RAs, the design of intervention trials for age-related diseases, and the development of anti-aging-based therapeutics.

Plasma ALS and Gal-3BP differentiate early from advanced liver fibrosis in MASLD patients

Article

Full-text available

Apr 2024

Background Metabolic dysfunction-associated steatotic liver disease (MASLD) is estimated to affect 30% of the world’s population, and its prevalence is increasing in line with obesity. Liver fibrosis is closely related to mortality, making it the most important clinical parameter for MASLD. It is currently assessed by liver biopsy – an invasive procedure that has some limitations. There is thus an urgent need for a reliable non-invasive means to diagnose earlier MASLD stages. Methods A discovery study was performed on 158 plasma samples from histologically-characterised MASLD patients using mass spectrometry (MS)-based quantitative proteomics. Differentially abundant proteins were selected for verification by ELISA in the same cohort. They were subsequently validated in an independent MASLD cohort (n = 200). Results From the 72 proteins differentially abundant between patients with early (F0-2) and advanced fibrosis (F3-4), we selected Insulin-like growth factor-binding protein complex acid labile subunit (ALS) and Galectin-3-binding protein (Gal-3BP) for further study. In our validation cohort, AUROCs with 95% CIs of 0.744 [0.673 – 0.816] and 0.735 [0.661 – 0.81] were obtained for ALS and Gal-3BP, respectively. Combining ALS and Gal-3BP improved the assessment of advanced liver fibrosis, giving an AUROC of 0.796 [0.731. 0.862]. The {ALS; Gal-3BP} model surpassed classic fibrosis panels in predicting advanced liver fibrosis. Conclusions Further investigations with complementary cohorts will be needed to confirm the usefulness of ALS and Gal-3BP individually and in combination with other biomarkers for diagnosis of liver fibrosis. With the availability of ELISA assays, these findings could be rapidly clinically translated, providing direct benefits for patients. Graphical Abstract

Yeast poly(A)-binding protein (Pab1) controls translation initiation in vivo primarily by blocking mRNA decapping and decay

Preprint

Full-text available

Apr 2024

Poly(A)-binding protein (Pab1 in yeast) is involved in mRNA decay and translation initiation, but its molecular functions are incompletely understood. We found that auxin-induced degradation of Pab1 reduced bulk mRNA and polysome abundance in a manner suppressed by deleting the catalytic subunit of decapping enzyme (dcp2Δ), demonstrating that enhanced decapping/degradation is the major driver of reduced mRNA abundance and protein synthesis at limiting Pab1 levels. An increased median poly(A) tail length conferred by Pab1 depletion was also nullified by dcp2Δ, suggesting that mRNA isoforms with shorter tails are preferentially decapped/degraded at limiting Pab1. In contrast to findings on mammalian cells, the translational efficiencies (TEs) of many mRNAs were altered by Pab1 depletion; however, these changes were broadly diminished by dcp2Δ, suggesting that reduced mRNA abundance is a major driver of translational reprogramming at limiting Pab1. Thus, assembly of the closed-loop mRNP via PABP-eIF4G interaction appears to be dispensable for normal translation of most yeast mRNAs in vivo. Interestingly, histone mRNAs and proteins are preferentially diminished on Pab1 depletion dependent on Dcp2, accompanied by activation of internal cryptic promoters in the manner expected for reduced nucleosome occupancies, revealing a new layer of post-transcriptional control of histone gene expression.

Increase of Cardiac Autoantibodies Against Beta-2-adrenergic Receptor During Acute Cellular Heart Transplant Rejection

Article

May 2024
TRANSPLANTATION

Background Acute cellular rejection (ACR) in heart transplant (HTx) recipients may be accompanied by cardiac cell damage with subsequent exposure to cardiac autoantigens and the production of cardiac autoantibodies (aABs). This study aimed to evaluate a peptide array screening approach for cardiac aABs in HTx recipients during ACR (ACR-HTx). Methods In this retrospective single-center observational study, sera from 37 HTx recipients, as well as age and sex-matched healthy subjects were screened for a total of 130 cardiac aABs of partially overlapping peptide sequences directed against structural proteins using a peptide array approach. Results In ACR-HTx, troponin I (TnI) serum levels were found to be elevated. Here, we could identify aABs against beta-2-adrenergic receptor (β-2AR: EAINCYANETCCDFFTNQAY) to be upregulated in ACR-HTx (intensities: 0.80 versus 1.31, P = 0.0413). Likewise, patients positive for β-2AR aABs showed higher TnI serum levels during ACR compared with aAB negative patients (10.0 versus 30.0 ng/L, P = 0.0375). Surprisingly, aABs against a sequence of troponin I (TnI: QKIFDLRGKFKRPTLRRV) were found to be downregulated in ACR-HTx (intensities: 3.49 versus 1.13, P = 0.0025). A comparison in healthy subjects showed the same TnI sequence to be upregulated in non-ACR-HTx (intensities: 2.19 versus 3.49, P = 0.0205), whereas the majority of aABs were suppressed in non-ACR-HTx. Conclusions Our study served as a feasibility analysis for a peptide array screening approach in HTx recipients during ACR and identified 2 different regulated aABs in ACR-HTx. Hence, further multicenter studies are needed to evaluate the prognostic implications of aAB testing and diagnostic or therapeutic consequences.

TurboID mapping reveals the exportome of secreted intrinsically disordered proteins in the transforming parasite Theileria annulata

Article

May 2024

Theileria annulata is a tick-transmitted apicomplexan parasite that gained the unique ability among parasitic eukaryotes to transform its host cell, inducing a fatal cancer-like disease in cattle. Understanding the mechanistic interplay between the host cell and malignant Theileria species that drives this transformation requires the identification of responsible parasite effector proteins. In this study, we used TurboID-based proximity labeling, which unbiasedly identified secreted parasite proteins within host cell compartments. By fusing TurboID to nuclear export or localization signals, we biotinylated proteins in the vicinity of the ligase enzyme in the nucleus or cytoplasm of infected macrophages, followed by mass spectrometry analysis. Our approach revealed with high confidence nine nuclear and four cytosolic candidate parasite proteins within the host cell compartments, eight of which had no orthologs in non-transforming T. orientalis . Strikingly, all eight of these proteins are predicted to be highly intrinsically disordered proteins. We discovered a novel tandem arrayed protein family, nuclear intrinsically disordered proteins (NIDP) 1–4, featuring diverse functions predicted by conserved protein domains. Particularly, NIDP2 exhibited a biphasic host cell-cycle-dependent localization, interacting with the EB1/CD2AP/CLASP1 parasite membrane complex at the schizont surface and the tumor suppressor stromal antigen 2 (STAG2), a cohesion complex subunit, in the host nucleus. In addition to STAG2, numerous NIDP2-associated host nuclear proteins implicated in various cancers were identified, shedding light on the potential role of the T. annulata exported protein family NIDP in host cell transformation and cancer-related pathways. 
IMPORTANCE TurboID proximity labeling was used to identify secreted proteins of Theileria annulata , an apicomplexan parasite responsible for a fatal, proliferative disorder in cattle that represents a significant socio-economic burden in North Africa, central Asia, and India. Our investigation has provided important insights into the unique host-parasite interaction, revealing secreted parasite proteins characterized by intrinsically disordered protein structures. Remarkably, these proteins are conspicuously absent in non-transforming Theileria species, strongly suggesting their central role in the transformative processes within host cells. Our study identified a novel tandem arrayed protein family, with nuclear intrinsically disordered protein 2 emerging as a central player interacting with established tumor genes. Significantly, this work represents the first unbiased screening for exported proteins in Theileria and contributes essential insights into the molecular intricacies behind the malignant transformation of immune cells.

FlexStat: combinatory differentially expressed protein extraction

Article

Apr 2024

Motivation Mass spectrometry-based system proteomics allows identification of dysregulated protein hubs and associated disease-related features. Obtaining differentially expressed proteins (DEPs) is the most important step of downstream bioinformatics analysis. However, the extraction of statistically significant DEPs from datasets with multiple experimental conditions or disease types through currently available tools remains a laborious task. More often such an analysis requires considerable bioinformatics expertise, making it inaccessible to researchers with limited computational analytics experience. Results To uncover the differences among the many conditions within the data in a user-friendly manner, here we introduce FlexStat, a web-based interface that extracts DEPs through combinatory analysis. This tool accepts a protein expression matrix as input and systematically generates DEP results for every conceivable combination of various experimental conditions or disease types. FlexStat includes a suite of robust statistical tools for data preprocessing, in addition to DEP extraction, and publication-ready visualization, which are built on established R scientific libraries in an automated manner. This analytics suite was validated in diverse public proteomic datasets to showcase its high performance of rapid and simultaneous pairwise comparisons of comprehensive datasets. Availability and implementation FlexStat is implemented in R and is freely available at https://jglab.shinyapps.io/flexstatv1-pipeline-only/. The source code is accessible at https://github.com/kts-desilva/FlexStat/tree/main.

Robust Regression & Outlier Detection

Book

Full-text available

Sep 1987

This is a book, not a paper.

Molecular classification of cancer: class discovery and class prediction by gene monitoring

Article

Full-text available

Nov 1999

Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

Processing and quality control of DNA array hybridization data

Article

Full-text available

Dec 2000

The technology of hybridization to DNA arrays is used to obtain the expression levels of many different genes simultaneously. It enables searching for genes that are expressed specifically under certain conditions. However, the technology produces large amounts of data demanding computational methods for their analysis. It is necessary to find ways to compare data from different experiments and to consider the quality and reproducibility of the data. Data analyzed in this paper have been generated by hybridization of radioactively labeled targets to DNA arrays spotted on nylon membranes. We introduce methods to compare the intensity values of several hybridization experiments. This is essential to find differentially expressed genes or to do pattern analysis. We also discuss possibilities for quality control of the acquired data. http://www.dkfz.de/tbi M.Vingron@dkfz-heidelberg.de

On Differential Variability of Expression Ratios: Improving Statistical Inference About Gene Expression Changes From Microarray Data

Article

Full-text available

Feb 2001

We consider the problem of inferring fold changes in gene expression from cDNA microarray data. Standard procedures focus on the ratio of measured fluorescent intensities at each spot on the microarray, but to do so is to ignore the fact that the variation of such ratios is not constant. Estimates of gene expression changes are derived within a simple hierarchical model that accounts for measurement error and fluctuations in absolute gene expression levels. Significant gene expression changes are identified by deriving the posterior odds of change within a similar model. The methods are tested via simulation and are applied to a panel of Escherichia coli microarrays.

Analysis of Variance for Gene Expression Microarray Data

Article

Full-text available

Feb 2000

Spotted cDNA microarrays are emerging as a powerful and cost-effective tool for large-scale analysis of gene expression. Microarrays can be used to measure the relative quantities of specific mRNAs in two or more tissue samples for thousands of genes simultaneously. While the power of this technology has been recognized, many open questions remain about appropriate analysis of microarray data. One question is how to make valid estimates of the relative expression for genes that are not biased by ancillary sources of variation. Recognizing that there is inherent "noise" in microarray data, how does one estimate the error variation associated with an estimated change in expression, i.e., how does one construct the error bars? We demonstrate that ANOVA methods can be used to normalize microarray data and provide estimates of changes in gene expression that are corrected for potential confounding effects. This approach establishes a framework for the general analysis and interpretation of microarray data.

Identification and Classification of Differentially Expressed Genes in Renal Cell Carcinoma by Expression Profiling on a Global Human 31,500-Element cDNA Array

Article

Full-text available

Dec 2001
GENOME RES

We investigated the changes in gene expression accompanying the development and progression of kidney cancer by use of 31,500-element complementary DNA arrays. We measured expression profiles for paired neoplastic and noncancerous renal epithelium samples from 37 individuals. Using an experimental design optimized for factoring out technological and biological noise, and an adapted statistical test, we found 1738 differentially expressed cDNAs with an expected number of six false positives. Functional annotation of these genes provided views of the changes in the activities of specific biological pathways in renal cancer. Cell adhesion, signal transduction, and nucleotide metabolism were among the biological processes with a large proportion of genes overexpressed in renal cell carcinoma. Down-regulated pathways in the kidney tumor cells included small molecule transport, ion homeostasis, and oxygen and radical metabolism. Our expression profiling data uncovered gene expression changes shared with other epithelial tumors, as well as a unique signature for renal cell carcinoma. [Expression data for the differentially expressed cDNAs are available as a Web supplement at http://www.dkfz-heidelberg.de/abt0840/whuber/rcc . The array data have been submitted to the GEO data repository under accession no. GSE3.]

Estimating Transformations for Regression Via Additivity and Variance Stabilization

Article

Jun 1988

Robert Tibshirani

I propose a method for the nonparametric estimation of transformations for regression. It is much more flexible than the familiar Box-Cox procedure, allowing general smooth transformations of the variables, and is similar to the ACE (alternating conditional expectation) algorithm of Breiman and Friedman (1985). The ACE procedure uses scatterplot smoothers in an iterative fashion to find the maximally correlated transformations of the variables. Like ACE, my proposal can incorporate continuous, categorical, or periodic variables, or any mixture of these types. The method differs from ACE in that it uses a (nonparametric) variance-stabilizing transformation for the response variable. The technique seems to alleviate many of the anomalies that ACE suffers with regression data, including the inability to reproduce model transformations and sensitivity to the marginal distribution of the predictors. I provide several examples, including an analysis of the “brain and body weight” data and some data on telephone-call load. I also discuss the relationship of the proposed technique to the Box-Cox and ACE procedures. Efron's work on transformations provides some of the theoretical basis for the methodology.

On Profile Likelihood

Article

Jun 2000

We show that semiparametric profile likelihoods, where the nuisance parameter has been profiled out, behave like ordinary likelihoods in that they have a quadratic expansion. In this expansion the score function and the Fisher information are replaced by the efficient score function and efficient Fisher information. The expansion may be used, among others, to prove the asymptotic normality of the maximum likelihood estimator, to derive the asymptotic chi-squared distribution of the log-likelihood ratio statistic, and to prove the consistency of the observed information as an estimator of the inverse of the asymptotic variance.

Ratio-Based Decisions and the Quantitative Analysis of cDNA Microarray Images

Article

Oct 1997
J BIOMED OPT

Gene expression can be quantitatively analyzed by hybridizing fluor-tagged mRNA to targets on a cDNA microarray. Comparison of gene expression levels arising from cohybridized samples is achieved by taking ratios of average expression levels for individual genes. A novel method of image segmentation is provided to identify cDNA target sites, and a hypothesis test and confidence interval are developed to quantify the significance of observed differences in expression ratios. In particular, the probability density of the ratio and the maximum-likelihood estimator for the distribution are derived, and an iterative procedure for signal calibration is developed.

Functional Discovery via a Compendium of Expression Profiles

Article

Aug 2000
CELL

Ascertaining the impact of uncharacterized perturbations on the cell is a fundamental problem in biology. Here, we describe how a single assay can be used to monitor hundreds of different cellular functions simultaneously. We constructed a reference database or "compendium" of expression profiles corresponding to 300 diverse mutations and chemical treatments in S. cerevisiae, and we show that the cellular pathways affected can be determined by pattern matching, even among very subtle profiles. The utility of this approach is validated by examining profiles caused by deletions of uncharacterized genes: we identify and experimentally confirm that eight uncharacterized open reading frames encode proteins required for sterol metabolism, cell wall function, mitochondrial respiration, or protein synthesis. We also show that the compendium can be used to characterize pharmacological perturbations by identifying a novel target of the commonly used drug dyclonine.
