Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression

BIOINFORMATICS
Vol. 18 Suppl. 1 2002
Pages S96–S104
© Oxford University Press 2002
Wolfgang Huber (1), Anja von Heydebreck (2), Holger Sültmann (1), Annemarie Poustka (1) and Martin Vingron (2)
(1) Department of Molecular Genome Analysis, German Cancer Research Center, INF 280, Heidelberg, 69120, Germany and
(2) Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Dahlem, Berlin, 14195, Germany
Received on January 24, 2002; revised and accepted on March 30, 2002
ABSTRACT
We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x) = arsinh(a + bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-colour cDNA arrays and a series of Affymetrix oligonucleotide arrays.
Availability: Software is freely available for academic use as an R package at http://www.dkfz.de/abt0840/whuber
Contact: w.huber@dkfz.de
INTRODUCTION
Microarrays simultaneously measure transcript abun-
dances for thousands of genes in a cell population or
tissue sample. The measurement is performed by quan-
titating the fluorescence intensities from labeled sample
cDNA that has hybridized to the probes on the array.
Multiple samples of interest are processed either by label-
ing them with different dyes, and letting them hybridize
simultaneously against a single array, or by labeling them
with the same dye, and letting them hybridize separately
against multiple arrays. In each case, statements about the
relative abundance of a gene transcript in these samples
can be made by comparing the corresponding fluores-
cence intensities. Due to variations in sample treatment,
labeling, dye efficiency and detection, the fluorescence
intensities can in general not be compared directly, but
only after appropriate calibration, which is sometimes also
called ‘normalization’. One way of quantifying relative
transcript abundance is the fold-change, that is the ratio
of calibrated intensities. As the intensities are associated
with measurement error, the usefulness of the fold-change
or of any other measure of relative abundance depends on
knowing its error distribution: one needs to know whether,
for example, a calculated ratio of 1.5 is noteworthy, or
whether it is most likely just a chance fluctuation. To
understand the error distribution, it is necessary to first
consider that of the original spot intensities.
The analysis of replicate microarray data typically
shows that the variance of the measured spot intensities
increases with their mean. For high intensities, the co-
efficient of variation is approximately constant, that is,
the standard deviation increases roughly linearly with the
mean. In a pioneering paper Chen et al. (1997) built a
model based on the assumption of a constant coefficient
of variation, and derived the distribution of the ratios
of intensities. The distribution has one parameter, the
coefficient of variation, and according to the model is
the same for all probes on the array. To fit their model
to the intensity data from a two colour cDNA array, they
used a multiplicative calibration, which is estimated along
with the coefficient of variation in an iterative algorithm.
The model of Chen et al. (1997) motivates the use of
logarithm-transformed intensities: ratios in the original
data correspond to differences in the transformed data,
the calibration amounts to a simple shift, and the constant
coefficient of variation in the original data corresponds
to an approximately constant standard deviation in the
transformed data.
These concepts have been widely used in microarray
data analysis. However, it has also become clear that for
many data sets that are encountered in practice they are
insufficient (e.g. Beißbarth et al. (2000); Hughes et al.
(2000); Rocke and Durbin (2001); Newton et al. (2001);
Baldi and Long (2001); Baggerly et al. (2001); Theilhaber
et al. (2001)). The limitations mostly affect the data from
weakly expressed genes. The significance of a ratio of, say,
1.5, is higher when it is observed in the high intensity
range, than when it is observed in the low intensity
range. Furthermore, many image quantization methods
produce a certain fraction of non-positive intensities,
for which ratios make no sense and the (real-valued)
logarithm is not defined. Often, measurements below a
threshold are dismissed, but it is unclear where to set the
threshold and what to do with the missing values in the
subsequent analysis. At the root of these problems lies
the fact that with real microarray data the relationship
between variance and mean typically is of a different
form than that assumed by the model of Chen et al.
Another limitation is that Chen et al. consider only
linear calibration transformations. With this, the data
should lie along a straight line in the scatterplot of the
log-transformed data. In many data sets, however, one
observes deviations from the straight line, resulting in, for
example, banana-shaped scatter plots.
In order to overcome these limitations, we generalize
the approach of Chen et al. (1997). A major component
is a model for the distribution of measurement error that
has been proposed by Rocke and Durbin (2001), which
leads to a quadratic variance-versus-mean dependence.
Based on this, we derive a parametric family of transfor-
mations of the measured intensities, such that the variance
of the transformed intensities becomes approximately
independent of the mean. Together with the calibration
transformations, these are incorporated into a statistical
model which allows for maximum likelihood estimation
of its parameters. Moreover, this generalized model
is formulated for an arbitrary number d of replicates,
extending the setup of Chen et al. (1997), who considered
the case of d = 2, with the two-colour cDNA array
technology in mind. The case of d > 2 is relevant for the
analysis of series of one-colour arrays, such as Affymetrix
arrays, or cDNA membranes, and could also be useful for
multi-colour slides, should this technology emerge.
The utility of the model is twofold: First, it allows the
construction of a difference statistic Δh whose variance
does not depend on the mean intensity, and whose value
is a measure of differential expression. Δh may be viewed
as a generalization of the log-ratio, and the two coincide
for the highly expressed genes. Second, our approach
provides the calibration as part of the model fitting. This
offers a model-based, interpretable solution to the problem
of normalization and may be preferable to commonly used
ad hoc procedures.
The estimation of the model parameters uses replicate
data from either the different colour channels of one
array, or from a series of one-colour arrays. The different
samples need not be exact biological replicates. Rather,
the samples should be biologically related closely enough
that the expression of most genes does not change. We
use a robust estimation technique, which seeks to ignore
the differentially expressed genes, and fits the model only
to that subset of data points (typically, 50 to 90%) that is
closest to the model mean.
We validate our approach on experimental data. First we provide evidence for the claimed form of the variance-versus-mean dependence. After this, we look at the distribution of the proposed difference statistic Δh as a function of the mean spot intensity. We find that it is centered around zero, and has constant width along the whole intensity range. Finally, we evaluate our approach with respect to the identification of differentially expressed genes. This is accomplished by comparing how the power of standard statistical tests depends on the method used for calibration and quantification of differential expression.
THE MODEL
A microarray data set may be pictured as a rectangular table $y_{ki}$ of real numbers. The rows $k$ correspond to the probes on the arrays, representing genes, and the columns $i$ to the samples. The number $n$ of probes may range from a few hundred to tens of thousands. The number of columns is $d = 2$ for the two-colour glass chip technology, and may range up to dozens or a few hundred for series of one-colour arrays. The values $y_{ki}$, with $k = 1,\ldots,n$ and $i = 1,\ldots,d$, are the intensity data as produced by the image quantization software. Many programs estimate local background intensities, which may be subtracted.
Due to variations in experimental factors such as amount of sample mRNA, or labeling and hybridization efficiencies, the values $y_{ki}$ cannot directly be compared. We assume that the different columns (samples) can be brought on the same scale through affine-linear mappings, parametrized by the $2d - 2$ real-valued parameters $o_2,\ldots,o_d$ and $s_2,\ldots,s_d > 0$:
$$ y_{ki} \mapsto \tilde{y}_{ki} = o_i + s_i y_{ki} \tag{1} $$
where $i = 1,\ldots,d$, and $o_1 = 0$, $s_1 = 1$ without loss of generality. After this, one can calculate measures of differential expression, quantifying how much the intensity of a certain probe is different in one sample from another. For example, one may consider the difference between calibrated intensities, or the ratio. We use the general term difference statistic for such measures.
Fig. 1. Graph of the variance stabilizing transformation (4) (solid
line), and of the logarithm function (dashed line). The histogram
shows the intensity distribution of one colour channel on an 8400-
element cDNA microarray. The parameters of the transformation (4)
were estimated from the comparison with the intensities from the
other colour channel.
For a non-differentially expressed gene, the values $\tilde{y}_{ki}$ for $i = 1,\ldots,d$ scatter around the true value according to the distribution of the measurement error of probe $k$. We may thus regard $\tilde{y}_{ki}$ as realizations of the random variables $Y_k$ with mean $E(Y_k) = u_k$ and variance $\mathrm{Var}(Y_k) = v_k$. We assume that $v_k$ only depends on $k$ through a quadratic function of the mean $u_k$ of the following form:
$$ v_k = v(u_k) = (c_1 u_k + c_2)^2 + c_3, \quad \text{with } c_3 > 0. \tag{2} $$
We will discuss the motivations for this assumption in the next section. The method of variance stabilization can be used to derive a transformation $h$ such that the variance $\mathrm{Var}(h(Y_k))$ is approximately independent of the mean $E(h(Y_k))$. An expression for $h$ is given by (Tibshirani, 1988)
$$ h(y) = \int^{y} \frac{1}{\sqrt{v(u)}}\, du, \tag{3} $$
and results from a linear approximation of $h(Y_k)$ around $h(u_k)$ (Delta method). Inserting (2) into (3) yields
$$ h(y) = \gamma\, \mathrm{arsinh}(a + b y), \tag{4} $$
where the parameters of $h$ are related to those of (2) through $\gamma = c_1^{-1}$, $a = c_2/\sqrt{c_3}$, and $b = c_1/\sqrt{c_3}$.
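The defining property behind Equations (3) and (4) is that the derivative of h equals 1/sqrt(v). The following is a minimal numerical sketch of that check, not part of the original software; the coefficient values `C1`, `C2`, `C3` and the function names are made up for illustration.

```python
import math

def variance(u, c1, c2, c3):
    """Quadratic variance-versus-mean function v(u) of Equation (2)."""
    return (c1 * u + c2) ** 2 + c3

def h(y, c1, c2, c3):
    """Transformation of Equation (4), with gamma, a, b expressed through c1, c2, c3."""
    gamma = 1.0 / c1
    a = c2 / math.sqrt(c3)
    b = c1 / math.sqrt(c3)
    return gamma * math.asinh(a + b * y)

def h_derivative(y, c1, c2, c3, eps=1e-4):
    """Central finite-difference approximation of h'(y)."""
    return (h(y + eps, c1, c2, c3) - h(y - eps, c1, c2, c3)) / (2.0 * eps)

# Hypothetical coefficients, chosen only for illustration (c3 > 0 as required).
C1, C2, C3 = 0.2, -5.0, 400.0

for y in (-50.0, 0.0, 100.0, 5000.0):
    numeric = h_derivative(y, C1, C2, C3)
    expected = 1.0 / math.sqrt(variance(y, C1, C2, C3))
    print(y, numeric, expected)
```

The two printed columns should agree to several digits at every intensity, including negative ones, which is what makes h a solution of the integral in Equation (3).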
A graph of the arsinh function is shown by the solid line in Figure 1. Relationships between the arsinh function and the logarithm are given by
$$ \mathrm{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right), \qquad \lim_{x\to\infty} \left(\mathrm{arsinh}\, x - \log x - \log 2\right) = 0. \tag{5} $$
Hence, for large intensities the transformation (4) becomes
equivalent to the usual logarithmic transformation. How-
ever, unlike the logarithm, it does not have a singularity
at zero, and continues to be smooth and real-valued in the
range of small or negative intensities.
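The two relations in Equation (5) can be checked numerically. This is a small illustrative sketch using only Python's standard `math` module; the helper names are hypothetical.

```python
import math

def arsinh_via_log(x):
    """Closed form arsinh(x) = log(x + sqrt(x^2 + 1)) from Equation (5)."""
    return math.log(x + math.sqrt(x * x + 1.0))

def gap_to_log(x):
    """arsinh(x) - log(x) - log(2), which should shrink towards 0 as x grows."""
    return math.asinh(x) - math.log(x) - math.log(2.0)

for x in (1.0, 10.0, 100.0, 1000.0):
    print(x, gap_to_log(x))

# Unlike the logarithm, arsinh is defined at zero and for negative arguments.
print(math.asinh(0.0), math.asinh(-3.0))
```

The gap decreases rapidly with x, so on the scale of typical high intensities the two transformations differ only by the constant log 2.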
Now we apply the variance stabilizing transformation (4) to the calibrated data $\tilde{y}_{ki}$ from Equation (1) to obtain the transformation $y_{ki} \mapsto h(\tilde{y}_{ki}) = \mathrm{arsinh}(a + b(o_i + s_i y_{ki}))$. The parameter $\gamma$ may be omitted since it is merely an overall scaling factor. Setting $a_i = a + b o_i$, $b_i = b s_i$, and
$$ h_i(y_{ki}) = \mathrm{arsinh}(a_i + b_i y_{ki}), \tag{6} $$
we can incorporate the calibration transformation (1), as well as the variance-versus-mean dependence (2) of the $y_{ki}$, together in the following statistical model:
$$ h_i(Y_{ki}) = \mu_k + \varepsilon_{ki}, \quad k \in K. \tag{7} $$
Here, $K$ denotes the set of probes representing not differentially expressed genes, $\mu_k = E(h_i(Y_{ki}))$ is the mean, and the variance of the error term is constant,
$$ E(\varepsilon_{ki}) = 0, \quad \mathrm{Var}(\varepsilon_{ki}) = \sigma^2. \tag{8} $$
The condition $E(\varepsilon_{ki}) = 0$ reflects the goal of calibration, whereas the common variance $\sigma^2$ of the error term is aimed at by variance stabilization. We fix the higher moments by assuming that the $\varepsilon_{ki}$ are i.i.d. normal. In Section Parameter Estimation we will provide a robust variant of the maximum likelihood estimator for the $2d$ parameters $a_i$ and $b_i$. Using the estimated transformations $\hat{h}_i$, the difference statistic that quantifies the change in expression between samples $i$ and $j$ of a gene represented by probe $k$ is
$$ \Delta h_{k;ij} = \hat{h}_i(y_{ki}) - \hat{h}_j(y_{kj}), \quad k = 1,\ldots,n. \tag{9} $$
One may express Equation (9) in terms of the arsinh function:
$$ \Delta h_{k;ij} = \mathrm{arsinh}(\hat{z}_{ki}) - \mathrm{arsinh}(\hat{z}_{kj}) = \log \frac{\hat{z}_{ki} + \sqrt{\hat{z}_{ki}^2 + 1}}{\hat{z}_{kj} + \sqrt{\hat{z}_{kj}^2 + 1}}, $$
where $\hat{z}_{ki} = \hat{a}_i + \hat{b}_i y_{ki}$ and $\hat{z}_{kj} = \hat{a}_j + \hat{b}_j y_{kj}$ are the calibrated intensities. This shows that in the limit of large intensities, $\Delta h_{k;ij}$ coincides with the log-ratio, whereas for near-zero intensities, it approaches the difference $\hat{z}_{ki} - \hat{z}_{kj}$.
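The two limiting regimes of the difference statistic can be illustrated with a few numbers. The sketch below assumes the calibrated intensities are already available; the function name `delta_h` and the input values are chosen purely for illustration.

```python
import math

def delta_h(z_i, z_j):
    """Difference statistic of Equation (9) for calibrated intensities z_i, z_j."""
    return math.asinh(z_i) - math.asinh(z_j)

# Large intensities: delta_h is close to the log-ratio log(z_i / z_j).
print(delta_h(3000.0, 2000.0), math.log(3000.0 / 2000.0))

# Near-zero intensities: delta_h is close to the plain difference z_i - z_j.
print(delta_h(0.02, -0.01), 0.02 - (-0.01))

# A non-positive intensity poses no problem, whereas a log-ratio would be undefined.
print(delta_h(5.0, -1.0))
```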
THE VARIANCE-VERSUS-MEAN
DEPENDENCE
Fig. 2. Variance-versus-mean dependence $v(u)$ in microarray data. Shown is the data from one mRNA sample, labeled both in red and green and hybridized against an 8400-element cDNA slide. The plots show the variance versus the mean (left), and the standard deviation versus the mean (right). The dots correspond to single-spot estimates $\hat{v}_k = (y_{1k} - y_{2k})^2/2$, $\hat{u}_k = (y_{1k} + y_{2k})/2$, the solid lines show a moving average. The axis units are arbitrary.
The basic assumption underlying the results of the previous section is that the variance $v_k$ depends on $k$ as in Equation (2). First, this assumption implies that $v_k$ depends on $k$ mainly through the mean intensity $u_k$, and that other factors such as sequence-specific effects, or effects associated with array geometry or the production process may be neglected. This would have to be verified from case to case, but appears to be plausible in many experiments. Second, we make a particular parametric ansatz, namely a quadratic function of the form (2). There are several motivations for this. One is provided by the following model for the measurement error of gene expression arrays (Rocke and Durbin, 2001):
$$ Y = \alpha + \beta e^{\eta} + \nu, \tag{10} $$
where $\beta$ is the expression level in arbitrary units, $\alpha$ is an offset, and $\nu$ and $\eta$ are additive and multiplicative error terms, respectively. $\nu$ and $\eta$ are assumed to be independent and normally distributed with mean zero. This leads to
$$ E(Y) = \alpha + m_\eta \beta \tag{11} $$
$$ \mathrm{Var}(Y) = s_\eta^2 \beta^2 + s_\nu^2 \tag{12} $$
where $m_\eta$ and $s_\eta^2$ are mean and variance of $e^{\eta}$, and $s_\nu^2$ is the variance of $\nu$. Inserting Equation (11) into Equation (12) yields a quadratic expression of the form of Equation (2), and the relation between the parameters of model (10) and those of the variance stabilizing transformation (4) is given by $a = -\alpha s_\eta/(m_\eta s_\nu)$, $b = s_\eta/(m_\eta s_\nu)$, $\gamma = m_\eta/s_\eta$.
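A small simulation can make the link between model (10) and the transformation (4) concrete. The sketch below draws data from Equation (10), derives a, b and γ from the relations just stated, and compares the spread before and after transformation. All numerical settings (`ALPHA`, `SD_ETA`, `SD_NU`, the values of β, the sample size) are hypothetical, and the log-normal moment formulas are the standard ones for a normally distributed η with mean zero.

```python
import math
import random
import statistics

def lognormal_moments(sd_eta):
    """Mean m_eta and variance s_eta^2 of exp(eta) for eta ~ N(0, sd_eta^2)."""
    m_eta = math.exp(sd_eta ** 2 / 2.0)
    s2_eta = math.exp(sd_eta ** 2) * (math.exp(sd_eta ** 2) - 1.0)
    return m_eta, s2_eta

def transformation_parameters(alpha, sd_eta, sd_nu):
    """Parameters (a, b, gamma) of h from the relations given after Equation (12)."""
    m_eta, s2_eta = lognormal_moments(sd_eta)
    s_eta = math.sqrt(s2_eta)
    a = -alpha * s_eta / (m_eta * sd_nu)
    b = s_eta / (m_eta * sd_nu)
    gamma = m_eta / s_eta
    return a, b, gamma

def simulate(beta, alpha, sd_eta, sd_nu, n, rng):
    """Draw n values of Y = alpha + beta * exp(eta) + nu, Equation (10)."""
    return [alpha + beta * math.exp(rng.gauss(0.0, sd_eta)) + rng.gauss(0.0, sd_nu)
            for _ in range(n)]

# Hypothetical settings, chosen only for illustration.
ALPHA, SD_ETA, SD_NU = 50.0, 0.1, 20.0
a, b, gamma = transformation_parameters(ALPHA, SD_ETA, SD_NU)
rng = random.Random(0)

raw_sd = {}
stabilized_sd = {}
for beta in (0.0, 1000.0, 100000.0):
    y = simulate(beta, ALPHA, SD_ETA, SD_NU, 20000, rng)
    transformed = [gamma * math.asinh(a + b * value) for value in y]
    raw_sd[beta] = statistics.stdev(y)
    stabilized_sd[beta] = statistics.stdev(transformed)
    print(beta, round(raw_sd[beta], 1), round(stabilized_sd[beta], 3))
```

Under these assumptions the raw standard deviation grows by orders of magnitude with β, while the standard deviation of the transformed values stays close to 1 at every expression level, as the delta-method argument suggests.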
A further motivation for the quadratic ansatz (2) is
provided by estimating v(u) directly from microarray
data. A typical example is shown in Figure 2. The right
plot shows how the assumption of constant coefficient
of variation breaks down in the low intensity range: the
curve has a non-zero intercept, that is, $v(0) > 0$, and
its convexity is in agreement with the assumption that
$c_3 > 0$ in Equation (2). Similar curves have been observed
for many slides, and also for other levels of replication,
e.g. with data from replicate spots on one array, or from
replicate arrays. The essential features of these curves may
be captured by parametrizing v(u) as a quadratic function
of the form (2).
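The single-spot estimates and the moving average described in the caption of Figure 2 can be sketched as follows. The window size, the function names and the four duplicate measurements are invented for illustration; the smoothing used for the published figure may differ.

```python
def single_spot_estimates(pairs):
    """Per-probe (mean, variance) estimates from duplicate measurements (y1, y2)."""
    return [((y1 + y2) / 2.0, (y1 - y2) ** 2 / 2.0) for y1, y2 in pairs]

def moving_average(points, window):
    """Average mean and variance over a sliding window of probes sorted by mean."""
    ordered = sorted(points)
    curve = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        u = sum(p[0] for p in chunk) / window
        v = sum(p[1] for p in chunk) / window
        curve.append((u, v))
    return curve

# Made-up duplicate intensities for four probes (one slightly negative value included).
pairs = [(10.0, 14.0), (100.0, 90.0), (3.0, -1.0), (250.0, 270.0)]
points = single_spot_estimates(pairs)
curve = moving_average(points, window=2)
print(points)
print(curve)
```

With real data the window would span many probes; plotting the second coordinate of `curve` (or its square root) against the first gives curves of the kind shown in Figure 2.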
PARAMETER ESTIMATION
The parameters of the model (7) are estimated from data
with a robust variant of maximum likelihood estimation.
The detailed derivation, as well as results on convergence and identifiability, are described in (Huber et al., 2002). Given the data $(y_{ki})$, $k \in K$, $i = 1,\ldots,d$, the profile log-likelihood (Murphy and van der Vaart, 2000) for the parameters $a_1, b_1, \ldots, a_d, b_d$ is
$$ -\frac{|K|\, d}{2} \log\left( \sum_{k \in K} \sum_{i=1}^{d} \left(h_i(y_{ki}) - \hat{\mu}_k\right)^2 \right) + \sum_{k \in K} \sum_{i=1}^{d} \log h_i'(y_{ki}), \tag{13} $$
with $h_i$ as in Equation (6) and $h_i'$ its derivative. For a fixed set of probes $K$, we maximize (13) numerically under the constraints $b_i > 0$. The set of probes $K$ is determined iteratively by a version of least trimmed sum of squares (LTS) regression (Rousseeuw and Leroy, 1987). Briefly, $K$ consists of those probes for which $r_k = \sum_{i=1}^{d} (\hat{h}_i(y_{ki}) - \hat{\mu}_k)^2$ is smaller than an appropriate quantile of the $r_k$. The LTS regression addresses the fact that the data distribution of $y_{ki}$ is produced by a mixture from genes that are differentially expressed, and ones that are not.
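The two ingredients of the estimation, the profile log-likelihood (13) and the residual-based selection of the probe set K, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it reads the last term of (13) as the log-derivative (Jacobian) of the transformation, takes the probe mean as the row mean of the transformed values, uses a plain "keep the smallest fraction" rule for the quantile, and leaves out the numerical maximization itself. The data table and parameter values are invented.

```python
import math

def h(y, a, b):
    """Transformation of Equation (6)."""
    return math.asinh(a + b * y)

def log_h_prime(y, a, b):
    """Logarithm of dh/dy = b / sqrt(1 + (a + b*y)^2)."""
    return math.log(b) - 0.5 * math.log1p((a + b * y) ** 2)

def residual_sums(data, params):
    """r_k = sum_i (h_i(y_ki) - mu_k)^2, with mu_k the row mean of transformed values."""
    sums = []
    for row in data:
        transformed = [h(y, a, b) for y, (a, b) in zip(row, params)]
        mu = sum(transformed) / len(transformed)
        sums.append(sum((t - mu) ** 2 for t in transformed))
    return sums

def profile_loglik(data, params, subset):
    """Profile log-likelihood of the form (13), restricted to the probes in subset."""
    r = residual_sums(data, params)
    d = len(params)
    rss = sum(r[k] for k in subset)
    jacobian = sum(log_h_prime(data[k][i], *params[i]) for k in subset for i in range(d))
    return -0.5 * len(subset) * d * math.log(rss) + jacobian

def lts_subset(data, params, fraction=0.5):
    """Indices of the probes with the smallest residual sums (assumed trimming rule)."""
    r = residual_sums(data, params)
    keep = max(1, int(fraction * len(r)))
    order = sorted(range(len(r)), key=lambda k: r[k])
    return sorted(order[:keep])

# Made-up two-channel table; rows 2 and 5 mimic differentially expressed genes.
data = [[100.0, 110.0], [200.0, 190.0], [50.0, 400.0],
        [1000.0, 1050.0], [10.0, 12.0], [300.0, 20.0]]
params = [(0.0, 0.01), (0.0, 0.01)]  # hypothetical (a_i, b_i) per channel

subset = lts_subset(data, params, fraction=0.5)
print(subset, profile_loglik(data, params, subset))
```

In a full procedure one would alternate between maximizing the likelihood over the parameters for the current subset and re-selecting the subset from the updated residuals until the selection stabilizes.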
VALIDATION
In this section we investigate how the variance stabiliza-
tion and calibration work on real data, and how useful the
resulting difference statistic Δh is to quantify differential
gene expression.
To visualize the variance-versus-mean dependence, we
plot the difference between the gene expression data for a
pair of samples against the rank of their mean. Plotting
against the rank distributes the data evenly along the
x-axis and thus facilitates the visualization of variance
heterogeneity. We look at data from a cDNA microarray
experiment where samples from closely neighbouring
parts of a kidney tumor were labeled with green and
red fluorescent dyes, respectively. The expression levels
of almost all of the genes in the two samples are
expected to be unchanged, so that observed differences
should represent the distribution of Δh in the absence of
differential expression.
Figure 3 shows such plots for six different types
of data transformation. First, Figure 3a corresponds to
applying no transformation at all. Clearly, the width
of the difference distribution increases with the signal
average. After applying the logarithm transformation, the
data looks as in Figure 3b. Here, all intensities below
1 have been replaced by 1 before taking the logarithm.
This results in the two bands of points on the left
side, corresponding to probes where one of two values
was below 1, and one was not. If instead we dismiss
the non-positive measurements, we obtain Figure 3c. In
addition, a non-linear calibration (Yang et al., 2001) has
been applied. It has been argued that the problem of
small, or non-positive intensity values is an artifact of the
image analysis local background estimation, and hence
one might consider using the spot intensities without
local background subtraction. The result is depicted in
Figure 3d. In fact, over a wide range the difference
distribution now happens to have a practically constant
width. However, the distribution no longer follows a
horizontal line. Instead of the logarithm transformation,
another plausible choice is the rank transformation. This is
shown in Figure 3e. Finally, Figure 3f shows the difference
statistic Δh, as defined in Equation (9). The distribution
is centered around the x-axis, and its width is constant
along the whole range. Similarly good results have been
observed with microarray expression data from many
different sources.
We now turn to the question of how well the values of
$\Delta h_{k;ij}$, for a given probe $k$ and measured over many pairs
of samples $i$ and $j$, reflect potential differential expression
of the gene represented by that probe. We compare this
to other commonly used difference statistics: the log-ratio
together with different normalization methods, and the
difference of ranks. As test data, we consider two data
sets that contain highly replicated expression data. Both
data sets compare two biological conditions, the first one
clear cell renal cell cancer with non-cancerous kidney
cortex tissue, and the second one acute myeloid with acute
lymphoblastic leukemia (Golub et al., 1999). Given that
there is a large number of genes differentially expressed
between the two conditions, we determined the number of
those that were detected by a statistical test on the values
of $\Delta h_{k;ij}$ and, in comparison, on the other difference
statistics. Since the permutation test we used allows us to
control the type I error, the number of genes detected
indicates how well the various difference statistics do in
fact represent differential expression.
The first data set was produced at the Department of Molecular Genome Analysis at the German Cancer Research Center, using cDNA slides with about 4200 clones spotted in duplicate. Paired cancerous and non-cancerous tissue samples from 19 patients were used, and each tissue pair was hybridized against two slides, with the dyes swapped between repetitions, resulting in a total of 38 slides. From this data, we calculated (i) the difference statistic $\Delta h_{k;ij}$, as well as log-ratios. For the latter, in order to deal with the negative intensity values produced by subtracting the image analysis software's background estimates, four different rules were tried: (ii) ignore the background estimates, (iii) replace the negative values by 1 before taking the logarithm, (iv) subtract the 5%-quantile, then replace the remaining negative values by 1, and (v) flag them as missing values. For (ii)–(iv), a multiplicative calibration was estimated by the midpoint of the shorth of the uncalibrated log-ratios. The shorth of a univariate distribution is defined as the shortest interval containing half of the values, and for a unimodal distribution, its midpoint is a robust estimator of its mode. For (v), we used the local regression proposed by Yang et al. (2001), using the implementation in the R package sma (http://www.r-project.org) with default parameters. Finally, we calculated (vi) the rank differences. Each of the difference statistics (i)–(vi) was averaged over the two replicate spots, and over the two replicate arrays, resulting in one value per gene per patient, and hence in a matrix with 4200 rows, for the clones, and 19 columns, for the patients. The mean of each row was compared against its permutation distribution, obtained from performing random column-wise sign flips. We counted the number of genes that were at the extremes of their respective permutation distribution, as a function of the quantile α.
The result is shown in Figures 4a and b. The test based on the difference statistic Δh uniformly had the highest power.
Fig. 3. The difference between the two colour channels of a cDNA microarray versus the rank of their average. Panels: a) y, b) log(y), c) log(y), loess, d) log(y_fg), e) rank(y), f) h(y). Plot a) shows the untransformed intensity data, plots b-f) show the effect of five different transformations (see text). The y-axes of plots b-d) correspond to the usual log ratio, the y-axis of plot f) to the difference statistic Δh as proposed in this article.
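The shorth-based calibration used for the log-ratio methods (ii)–(iv) can be sketched as below. The convention for "half of the values" (here floor(n/2) + 1 points) is an assumption, as are the function name and the example log-ratios.

```python
def shorth_midpoint(values):
    """Midpoint of the shortest interval containing about half of the values."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two values")
    m = n // 2 + 1  # assumed convention for the number of points in the interval
    start = min(range(n - m + 1), key=lambda i: x[i + m - 1] - x[i])
    return 0.5 * (x[start] + x[start + m - 1])

# Made-up uncalibrated log-ratios: a cluster near 1 plus a few outliers.
log_ratios = [0.9, 0.97, 0.99, 1.0, 1.01, 1.03, 1.2, 5.0, -3.0, 7.0]
mode_estimate = shorth_midpoint(log_ratios)

# On the log scale, a multiplicative calibration amounts to subtracting this shift.
calibrated = [value - mode_estimate for value in log_ratios]
print(mode_estimate)
```

Because only the densest half of the data determines the estimate, the outlying values (which would stand for differentially expressed genes) do not pull the calibration away from the bulk.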
Figures 4a and b correspond to two one-sided tests,
testing the row mean of the expression matrix against the
hypothesis that it is less or equal to zero, or greater or equal
to zero, respectively. We chose this procedure in order
to make the comparison insensitive to potential subtle
biases in the estimation of the calibration parameters. Such
biases could be caused by a difference in the number of
up- and down-regulated genes, and could consequently
lead to biases in any of the difference statistics (i)-(vi).
However, they would have opposite effects on the number
of detected genes in the two tests. The fact that the
difference statistic Δh detects more genes in both one-sided
tests verifies that its better performance is not related
to such potential calibration errors.
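A one-sided sign-flip permutation test for a single row of the gene-by-patient matrix can be sketched as follows. For simplicity this sketch draws independent flips for the row at hand rather than one shared set of column-wise flips for the whole matrix, and it uses the common (count + 1)/(permutations + 1) convention for the p-value; both are assumptions, as are the function name and the example values.

```python
import random

def sign_flip_pvalue(row, n_perm, rng, alternative="greater"):
    """One-sided p-value for the row mean under random sign flips of the entries."""
    observed = sum(row) / len(row)
    count = 0
    for _ in range(n_perm):
        flipped = [value if rng.random() < 0.5 else -value for value in row]
        mean = sum(flipped) / len(flipped)
        if alternative == "greater" and mean >= observed:
            count += 1
        elif alternative == "less" and mean <= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Made-up difference statistics for one gene across eight patients, all positive.
row = [0.5, 0.6, 0.4, 0.7, 0.55, 0.45, 0.65, 0.5]
rng = random.Random(1)
p_up = sign_flip_pvalue(row, 2000, rng, alternative="greater")
p_down = sign_flip_pvalue(row, 2000, rng, alternative="less")
print(p_up, p_down)
```

Running both alternatives separately mirrors the two one-sided tests described above: a consistently up-regulated gene gets a small p-value in one test and a p-value near 1 in the other.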
To evaluate our method with data from a different
technological platform and experimental design, we
used an expression data set measured on Affymetrix
oligonucleotide arrays. It comprises 47 samples of
acute myeloid leukemia and 25 samples of acute lym-
phoblastic leukemia (Golub et al., 1999). From the
data matrix provided at Golub et al.'s (1999) website
(http://www-genome.wi.mit.edu/mpr) we calculated calibrated
and transformed data $h_i(y_{ki})$, with $k = 1,\ldots,7129$
and $i = 1,\ldots,72$. We used the data as is, with no further
selection or thresholding, and ignored the A/M/P-flags
that the Affymetrix software associated with each value.
The simultaneous estimation of the 2d = 144 parameters
posed no particular problem. In contrast, Golub et al.
(1999) used a calibration method based on a linear
regression, which in a pairwise fashion referenced arrays
2 ...38 to array 1, and arrays 40 ...72 to array 39. We
used a two-sample permutation t-test to detect genes
differentially expressed between AML and ALL. The
result is shown in Figures 4c and d. Again, the test based on Δh has higher power.
Finally, an example for how the difference statistic Δh leads to more easily interpretable data displays is depicted in Figure 5. Since the distribution of Δh is independent of the mean intensity, observed values can directly be compared to the marginal empirical distribution, shown in the histogram to the right. A scale on the Δh axis may be defined through a robust measure of width $\sigma_{\Delta h}$ of the empirical distribution, as indicated in Figure 5. Note, however, that in general the null distribution of Δh is not known, and in the presence of an unknown subset of differentially expressed genes, it is also not easy to estimate it.
Fig. 4. Sensitivity and specificity of different methods for the quantification of differential expression. Panels: a) test for upregulation (kidney data), b) test for downregulation (kidney data), c) test for upregulation (Golub data), d) test for downregulation (Golub data). Curves in a) and b): h(y), log loess, rank(y), log(y − Q5), log(y_fg), log(y); curves in c) and d): h(y), Golub orig. Top row: comparison of Δh to 4 methods based on log-ratios, and to the rank difference, on two-colour cDNA glass chip data. Bottom row: comparison of Δh to the procedure used in (Golub et al., 1999) on the AML/ALL data. The plots show the number of genes selected by permutation tests against the significance level α. The test based on the difference statistic Δh uniformly has the best power.
DISCUSSION
A long-standing problem in the analysis of microarray
gene expression experiments is how to take into account
the dependence of the standard deviation of a spot
intensity on its mean. In a seminal paper by Chen et al.
(1997), this relationship has been modelled as a linear
function. A main consequence is the use of logarithmic
ratios as a measure of differential expression. Here, we
have shown that their approach, although alleviating the
problem, does not solve it entirely. The main limitation
of the log-ratio as a measure of differential expression
is the dependence of its variability on the intensity. To
address this fact, we propose the general approach of
applying a variance stabilizing transformation in order to
achieve a constant signal-to-noise ratio. This results in
the difference statistic Δh, which displays approximately
constant variance independent of the spot intensity, and
replaces the log-ratio as a measure of differential gene
expression.
Fig. 5. Display of the data from a two-colour cDNA slide, taken from the kidney data set. Analogous to Figure 3, the difference statistic Δh is plotted along the y-axis, and the rank of the average spot intensity along the x-axis. The variance of the measurement error is constant over the whole intensity range, and horizontal dotted lines are plotted at multiples of its estimated standard deviation. Note that the figure shows the complete intensity data from the slide, without any thresholding or truncation. The circles and triangles represent genes that have been found up- and down-regulated, respectively, in renal cell cancer in a previous study (Boer et al., 2001). Most of these are verified by the present slide, while for some, possibly due to biological or experimental variation, the value of Δh is close to zero.
Alternative approaches to this problem (e.g. Hughes
et al. (2000); Baggerly et al. (2001); Theilhaber et al.
(2001)) have also put forward quantitative models for
this intensity-dependence, and propose to augment log-ratio
values by significance values calculated from such
models.
Additionally, however, our approach takes into account
the problem of calibration. Due to differential behaviour of
dyes, or variations between samples and arrays, intensity
measurements need to be brought on a common scale
before they can be compared. In alternative approaches,
the estimation of the calibration parameters is complicated
by the non-constant variances. Furthermore, the resulting
ratios may sensitively depend on the calibration. These
problems are overcome in our approach: the estimation
of the calibration parameters is simplified through the
use of a variance stabilizing transformation, and is an
integral part of the overall model fitting. Furthermore,
our approach is parsimonious with parameters. No non-parametric
curve estimation is required, which helps to
provide robustness and avoid overfitting. Finally, since the
difference statistic Δh is simply obtained as the difference
between the transformed data of the individual samples,
our approach not only allows for the comparison of two
samples, but without any further effort can also be used for
multivariate analyses comparing more than two samples.
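The last point can be illustrated with a minimal sketch. It assumes that per-sample parameters (a_i, b_i) have already been estimated; the numerical values and the function name `transform_samples` are made up for illustration and are not taken from the authors' R package.

```python
import numpy as np

def transform_samples(y, a, b):
    """Apply h_i(y) = arsinh(a_i + b_i * y) to column i of an n x d intensity matrix."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Broadcasting applies the per-sample parameters column by column.
    return np.arcsinh(a + b * y)

# Hypothetical intensities: n = 4 probes (rows), d = 3 samples (columns).
y = np.array([[  50.0,   80.0,   40.0],
              [ 500.0,  900.0,  450.0],
              [5000.0, 4000.0, 5200.0],
              [  20.0,   25.0,   15.0]])
a = np.array([0.0, -0.1, 0.05])      # illustrative offsets, one per sample
b = np.array([0.01, 0.008, 0.012])   # illustrative scale factors, one per sample

h = transform_samples(y, a, b)

# Difference statistic between any two samples is a plain column difference.
delta_h_10 = h[:, 1] - h[:, 0]
delta_h_20 = h[:, 2] - h[:, 0]
```

Because every column of `h` is on the same calibrated, variance-stabilized scale, the same matrix could be passed unchanged to a clustering or other multivariate routine.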
In our model, the variance stabilizing transformation h
turns out to be an arsinh function. This generalizes earlier
results, as follows. Using a quadratic ansatz, the
variance-versus-mean dependence at the basis of our approach has
three parameters, which may be related to those of the
model of Rocke and Durbin (2001). If in their model the
additive noise component vanishes, the resulting limiting
case turns out to be the logarithmic transformation with
pseudocounts, $h_{\mathrm{pc}}(y) = \log(y + y_0)$, which has been
used by various authors to overcome limitations of the
logarithmic transformation (e.g. Beißbarth et al. (2000);
Newton et al. (2001)). Furthermore, if both the constant
and the linear term in the quadratic function vanish, our
model turns into that of Chen et al. (1997).
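These limiting cases can be made explicit with a short calculation. The parametrization of the quadratic variance function by $c_1$, $c_2$, $c_3$ used here is an assumption made for illustration (a second-order polynomial with negative discriminant and positive leading coefficient can always be written in this form); $u$ denotes the mean intensity and $v(u)$ the variance. The transformation is obtained from the usual delta-method condition $h'(u)^2\, v(u) = 1$.

```latex
% Assumed parametrization of the quadratic variance-versus-mean dependence
% (u: mean intensity, v(u): variance; c_1, c_2, c_3 are illustrative names).
\begin{align*}
v(u) &= (c_1 u + c_2)^2 + c_3, \qquad c_1 > 0,\ c_3 > 0, \\[4pt]
h(y) &= \int^{y} \frac{\mathrm{d}u}{\sqrt{v(u)}}
      = \frac{1}{c_1}\,\operatorname{arsinh}\!\left(\frac{c_1 y + c_2}{\sqrt{c_3}}\right),
      \qquad a = \frac{c_2}{\sqrt{c_3}},\quad b = \frac{c_1}{\sqrt{c_3}}. \\[4pt]
% Limit c_3 -> 0, using arsinh(z) = log(z + sqrt(z^2 + 1)) ~ log(2z) for large z:
c_1\, h(y) &\approx \log(c_1 y + c_2) + \text{const}
      = \log(y + y_0) + \text{const}', \qquad y_0 = \frac{c_2}{c_1}. \\[4pt]
% Case c_2 = c_3 = 0: constant coefficient of variation c_1.
v(u) &= c_1^2 u^2, \qquad h(y) = \frac{1}{c_1}\,\log y .
\end{align*}
```

The additive constants diverge as $c_3 \to 0$ but cancel in any difference of transformed values, so only the form $\log(y + y_0)$ matters. In the last case the standard deviation is proportional to the mean, and the plain logarithm stabilizes the variance.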
Our approach is based on the following main assumptions:
First, the variance of the measurements on a probe
mainly depends on the mean intensity, and the relationship
may be described by a second order polynomial with
negative discriminant. This is grounded in the analysis of a
large number of experiments and is in agreement with the
model of Rocke and Durbin (2001). Second, we assume
that the relationship of measurements between samples is
captured by an affine-linear transformation. While
non-linear behaviour may be observed under certain conditions,
it has been demonstrated (e.g. Ramdas et al. (2001))
that current-day microarray technology has a large,
practically accessible working range in which intensities
increase linearly with mRNA concentrations. It appears to us
that in many cases apparent non-linearities that have been
observed in the logarithmic plot (for an example, see
Figure 3d) are an artifact of the logarithmic transformation,
and disappear when using the appropriate affine-linear
calibration. However, the general approach we proposed can
easily be modified to incorporate a different class of
calibration transformations or a different form of the
variance-versus-mean dependence.
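The first assumption can be illustrated by a small simulation. The sketch below assumes an additive-multiplicative error model in the spirit of Rocke and Durbin (2001), y = mu * exp(eta) + eps. All parameter values are made up, and the transformation parameters are set from the known simulation parameters rather than estimated from data, so this is not the fitting procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) error-model parameters.
sigma_eps = 20.0   # sd of the additive noise
sigma_eta = 0.2    # sd of the multiplicative noise on the log scale
n_rep = 20000      # simulated replicate measurements per signal level

def simulate(mu):
    """Draw replicates y = mu * exp(eta) + eps for a true signal mu."""
    eta = rng.normal(0.0, sigma_eta, n_rep)
    eps = rng.normal(0.0, sigma_eps, n_rep)
    return mu * np.exp(eta) + eps

def h(y, a, b):
    """Variance stabilizing transformation h(y) = arsinh(a + b * y)."""
    return np.arcsinh(a + b * y)

# Under this model Var(y) = sigma_eps^2 + c1^2 * E[y]^2, a quadratic in the mean
# with negative discriminant; c1 is the high-intensity coefficient of variation.
c1 = np.sqrt(np.exp(sigma_eta**2) - 1.0)
a, b = 0.0, c1 / sigma_eps

mus = [0.0, 50.0, 100.0, 500.0, 2000.0, 10000.0]
sd_raw = np.array([simulate(mu).std() for mu in mus])
sd_h = np.array([h(simulate(mu), a, b).std() for mu in mus])

for mu, s_raw, s_h in zip(mus, sd_raw, sd_h):
    print(f"mu={mu:8.0f}  sd(y)={s_raw:9.2f}  sd(h(y))={s_h:.3f}")
```

In such a run the standard deviation of the raw intensities should grow strongly with the signal level, while that of the transformed values should stay close to c1 over the whole range.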
A third assumption concerns the statistical distribution
of the intensity measurements. The variance-stabilized
intensities per spot are assumed to be normally distributed.
The parameter estimation draws on this assumption in
particular near the center of the distribution, but because
of our use of a robust regression procedure, it should not
be affected by possible deviations from normality in the
tails of the distribution.
A crucial point in modelling and parameter estimation
from data is identifiability. The transformations $h_i$ have
2d parameters (cf. Equations (6) and (13)), which we
need to determine from nd data points. Here, d ranges from
d = 2 up to a few dozen or hundred, while n is
typically in the order of several thousands. Given this
generally favourable relation between the amount of
data and number of parameters, and according to our
experience with simulations and jackknife sampling, the
transformations are well identifiable.
One might see a drawback of our method in the fact that
it measures expression differences in terms of a function
h with two estimated, experiment-specific parameters a
and b, while the log-ratio can be calculated directly
from the calibrated data, with no further parameters,
and is easily interpreted as a fold-change. However, for
large intensities, the values of Δh and of the log-ratio
coincide (cf. Equation (5)), irrespective of the values of
the experiment-specific parameters, and hence Δh may
just as well be interpreted as the logarithm of a fold
change. For small intensities that are near the detection
limit of the experiment, the values of Δh are contracted
towards 0 in comparison to those of the log-ratio. The
onset and magnitude of this contraction are parametrized
by the parameters a and b, which in this way encode
the intensity dependent measurement error distribution
of the experiment. We note that corresponding
intensity-dependent thresholds are also used in the analysis of
log-ratios, albeit usually in a less systematic manner.
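A small numerical sketch of this behaviour follows. It assumes the form h(x) = arsinh(a + bx) and uses made-up values for a and b; natural logarithms are used, and the function name `delta_h` is introduced only for illustration.

```python
import math

def delta_h(x1, x2, a, b):
    """Difference statistic: arsinh(a + b*x1) - arsinh(a + b*x2)."""
    return math.asinh(a + b * x1) - math.asinh(a + b * x2)

a, b = 0.0, 0.01  # illustrative parameters, not estimated from data

log_ratio = math.log(2.0)                  # log of a two-fold change
large = delta_h(20000.0, 10000.0, a, b)    # two-fold change at high intensity
small = delta_h(20.0, 10.0, a, b)          # two-fold change near the noise level

print(f"log-ratio          : {log_ratio:.4f}")
print(f"delta-h, high range: {large:.4f}")
print(f"delta-h, low range : {small:.4f}")
```

At high intensity the two statistics agree to several decimal places, whereas for the same nominal two-fold change near the noise level the difference statistic is pulled towards zero. Changing a and b shifts the intensity at which this contraction sets in.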
Finally, and perhaps most importantly, our method also
proves successful in the application to real data. It can
typically be used off-the-shelf, without any particular
tuning, and has been applied to different platforms, such
as two-colour slides, Affymetrix chips, and radioactive
membranes. As in the ANOVA approach by Kerr et al.
(2000), calibration is done not necessarily for pairs of
samples but simultaneously for a whole set. The simple
error distribution of the transformed intensities $h_i$ makes
them particularly suitable as input for clustering or other
multivariate analysis methods. Software is provided as an
R package, which is freely available for academic use.
ACKNOWLEDGEMENTS
We thank Günther Sawitzki and Dirk Buschmann for
fruitful discussions, and Rainer Spang for critical reading
of the manuscript. Bernd Korn has supplied the RZPD
Oncoset of clones for the slides used in the kidney
experiment, and Stephanie Süß provided expert technical
assistance.
REFERENCES
Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N.,
Abruzzo,L.V. and Zhang,W. (2001) Identifying differentially
expressed genes in cDNA microarray experiments. J.
Comput. Biol., 8, 639–659.
Baldi,P. and Long,A.D. (2001) A Bayesian framework for the
analysis of microarray expression data: regularized t-test and
statistical inferences of gene changes. Bioinformatics, 17,
509–519.
Beißbarth,T., Fellenberg,K., Brors,B., Arribas-Prat,R., Boer,J.M.,
Hauser,N.C., Scheideler,M., Hoheisel,J.D., Schütz,G.,
Poustka,A. and Vingron,M. (2000) Processing and quality control
of DNA array hybridization data. Bioinformatics, 16,
1014–1022.
Boer,J.M., Huber,W., Sültmann,H., Wilmer,F., von Heydebreck,A.,
Haas,S., Korn,B., Gunawan,B., Vente,A., Füzesi,L., Vingron,M.
and Poustka,A. (2001) Identification and classification of
differentially expressed genes in renal cell carcinoma by
expression profiling on a global human 31 500-element cDNA array.
Genome Res., 11, 1861–1870.
Chen,Y., Dougherty,E.R. and Bittner,M.L. (1997) Ratio-based
decisions and the quantitative analysis of cDNA microarray images.
J. Biomed. Opt., 2, 364–374.
Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M.,
Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R.,
Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molecular
classification of cancer: Class discovery and class prediction
by gene expression monitoring. Science, 286, 531–537.
Huber,W., von Heydebreck,A., Sültmann,H., Poustka,A. and
Vingron,M. A model for the calibration of microarray data and
the quantification of differential expression. Preprint available
on request.
Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J.,
Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H.,
He,Y.D., Kidd,M.J., King,A.M., Meyer,M.R., Slade,D.,
Lum,P.Y.Y., Stepaniants,S.B., Shoemaker,D.D., Gachotte,D.,
Chakraburtty,K., Simon,J., Bard,M. and Friend,S.H. (2000)
Functional discovery via a compendium of expression profiles.
Cell, 102, 109–126.
Kerr,K.M., Martin,M. and Churchill,G.A. (2000) Analysis of
variance for gene expression microarray data. J. Comput. Biol., 7,
819–837.
Murphy,S.A. and van der Vaart,A.W. (2000) On profile likelihood.
J. Amer. Stat. Assoc., 95, 449–465.
Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and
Tsui,K.W. (2001) On differential variability of expression
ratios: improving statistical inference about gene expression
changes from microarray data. J. Comput. Biol., 8, 37–52.
Ramdas,L., Coombes,K.R., Baggerly,K., Abruzzo,L.V.,
Highsmith,W.E., Krogmann,T., Hamilton,S.R. and Zhang,W. (2001)
Sources of nonlinearity in cDNA microarray expression
measurements. Genome Biol., 2, research0047.1–research0047.7.
Rocke,D.M. and Durbin,B. (2001) A model for measurement error
for gene expression analysis. J. Comput. Biol., 8, 557–569.
Rousseeuw,P.J. and Leroy,A.M. (1987) Robust Regression and
Outlier Detection. Wiley.
Theilhaber,J., Bushnell,S., Jackson,A. and Fuchs,R. (2001)
Bayesian estimation of fold-changes in the analysis of gene
expression: The PFOLD algorithm. J. Comput. Biol., 8, 585–614.
Tibshirani,R. (1988) Estimating transformations for regression via
additivity and variance stabilization. J. Amer. Stat. Assoc., 83,
394–405.
Yang,Y.H., Dudoit,S., Luu,P., Lin,D.M., Peng,V., Ngai,J. and
Speed,T.P. (2002) Normalization for cDNA microarray data: a
robust composite method addressing single and multiple slide
systematic variation. Nucleic Acids Res., 30, e15:1–e15:11.
S104
... d Normalization of raw expression matrix to reduce systematic bias, e.g., by center.mean or by vsn 64 . e Imputation of missing values, which includes methods such as missForest 35 . ...
... "div.median", "quantiles", "quantiles.robust" and "vsn"(variance stabilization) 64 . The two regression-based normalization methods "lossf" 36,37 and "Rlr" 45 , the total ion current normalization (TIC) and the Mean/Median-balanced quantile (MBQN) 65 normalization methods were also evaluated (we conclude them in Fig. 1d). ...
Article
Full-text available
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthew’s correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows for expanding differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
... For aSyn and Rab GTPases datasets, both proteotypic and non-proteotypic peptides were covered; therefore, non-proteotypic peptides, which are reported, should be taken with caution. Raw peptide abundances were normalized using variance stabilizing normalization with the vsn package (Huber et al, 2002) (version 3.64.0). We used an outlier detection method based on the interquartile range (IQR) to define boundaries outside of the first (Q1) and third (Q3) quartile for peptide abundances within each condition per peptide precursor. ...
... R Core Team 2021). Raw abundances of proteotypic RSVF peptides were normalized using variance stabilizing normalization with the vsn package(Huber et al, 2002) (version 3.64.0). The normalized log 2transformed peptide abundances from treated samples with sitespecific antibodies against RSVF were compared to control, i.e., anti-Human IgG1 kappa antibody with RSVF. ...
Article
The physical interactome of a protein can be altered upon perturbation, modulating cell physiology and contributing to disease. Identifying interactome differences of normal and disease states of proteins could help understand disease mechanisms, but current methods do not pinpoint structure-specific PPIs and interaction interfaces proteome-wide. We used limited proteolysis–mass spectrometry (LiP–MS) to screen for structure-specific PPIs by probing for protease susceptibility changes of proteins in cellular extracts upon treatment with specific structural states of a protein. We first demonstrated that LiP–MS detects well-characterized PPIs, including antibody–target protein interactions and interactions with membrane proteins, and that it pinpoints interfaces, including epitopes. We then applied the approach to study conformation-specific interactors of the Parkinson’s disease hallmark protein alpha-synuclein (aSyn). We identified known interactors of aSyn monomer and amyloid fibrils and provide a resource of novel putative conformation-specific aSyn interactors for validation in further studies. We also used our approach on GDP- and GTP-bound forms of two Rab GTPases, showing detection of differential candidate interactors of conformationally similar proteins. This approach is applicable to screen for structure-specific interactomes of any protein, including posttranslationally modified and unmodified, or metabolite-bound and unbound protein states.
... Statistical analysis was performed using R studio. Protein intensities were normalized using VSN (97) and log2 transformed for further analysis. ClusterProfiler (49) was used for the KEGG pathway enrichment analysis and ReactomePA (98) for Reactome. ...
Article
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection is characterized by highly heterogeneous manifestations ranging from asymptomatic cases to death for still incompletely understood reasons. As part of the IMmunoPhenotyping Assessment in a COVID-19 Cohort study, we mapped the plasma proteomes of 1117 hospitalized patients with COVID-19 from 15 hospitals across the United States. Up to six samples were collected within ~28 days of hospitalization resulting in one of the largest COVID-19 plasma proteomics cohorts with 2934 samples. Using perchloric acid to deplete the most abundant plasma proteins allowed for detecting 2910 proteins. Our findings show that increased levels of neutrophil extracellular trap and heart damage markers are associated with fatal outcomes. Our analysis also identified prognostic biomarkers for worsening severity and death. Our comprehensive longitudinal plasma proteomics study, involving 1117 participants and 2934 samples, allowed for testing the generalizability of the findings of many previous COVID-19 plasma proteomics studies using much smaller cohorts.
... Depending on the distribution of the repeated values, one can either average them or pick the median to collapse them into a single value. The literature on the analysis of gene expression data also suggests that the classification algorithms may be able to more easily and accurately model the underlying structure in the training data by normalizing and transforming the data using variance stabilizing transforms like logarithm and cubic-root [2][3][4][5][6]. Transcription Factors (TFs) are specialized proteins that attach to DNA promoter regions to interfere with the rate of protein synthesis. ...
... The normalized intensities further underwent variance-stabilizing transformation (VST) to account for systematic bias 89,90 . After mean variance stabilization, differential analysis was performed using linear models as implemented within the limma package (v3.54.2) in R. Spearman correlation coefficients and P-values were calculated by correlating the log fold changes per metabolite in aged drug-treated vs. aged vehicle-treated mice and in aged vehicle-treated vs. young vehicle-treated mice. ...
Preprint
Full-text available
Identifying readily implementable methods that can effectively counteract aging is urgently needed for tackling age-related degenerative disorders. Here, we conducted functional assessments and deep molecular phenotyping in the aging mouse to demonstrate that glucagon-like peptide-1 receptor agonist (GLP-1RA) treatment attenuates body-wide age-related changes. Apart from improvements in physical and cognitive performance, the age-counteracting effects are prominently evident at multiple omic levels. These span the transcriptomes and DNA methylomes of various tissues, organs and circulating white blood cells, as well as the plasma metabolome. Importantly, the beneficial effects are specific to aged mice, not young adults, and are achieved with a low dosage of GLP-1RA which has a negligible impact on food consumption and body weight. The molecular rejuvenation effects exhibit organ-specific characteristics, which are generally heavily dependent on hypothalamic GLP-1R. We benchmarked the GLP-1RA age-counteracting effects against those of mTOR inhibition, a well-established anti-aging intervention, observing a strong resemblance across the two strategies. Our findings have broad implications for understanding the mechanistic basis of the clinically observed pleiotropic effects of GLP-1RAs, the design of intervention trials for age-related diseases, and the development of anti-aging-based therapeutics.
... Proteins identified in the contaminant database were discarded. After log2 transformation, protein abundances were normalised by applying the variance stabilising normalisation method [31]. Missing values were imputed using the Structured Least Square Adaptative (SLSA) method [32]. ...
Article
Full-text available
Background Metabolic dysfunction-associated steatotic liver disease (MASLD) is estimated to affect 30% of the world’s population, and its prevalence is increasing in line with obesity. Liver fibrosis is closely related to mortality, making it the most important clinical parameter for MASLD. It is currently assessed by liver biopsy – an invasive procedure that has some limitations. There is thus an urgent need for a reliable non-invasive means to diagnose earlier MASLD stages. Methods A discovery study was performed on 158 plasma samples from histologically-characterised MASLD patients using mass spectrometry (MS)-based quantitative proteomics. Differentially abundant proteins were selected for verification by ELISA in the same cohort. They were subsequently validated in an independent MASLD cohort (n = 200). Results From the 72 proteins differentially abundant between patients with early (F0-2) and advanced fibrosis (F3-4), we selected Insulin-like growth factor-binding protein complex acid labile subunit (ALS) and Galectin-3-binding protein (Gal-3BP) for further study. In our validation cohort, AUROCs with 95% CIs of 0.744 [0.673 – 0.816] and 0.735 [0.661 – 0.81] were obtained for ALS and Gal-3BP, respectively. Combining ALS and Gal-3BP improved the assessment of advanced liver fibrosis, giving an AUROC of 0.796 [0.731. 0.862]. The {ALS; Gal-3BP} model surpassed classic fibrosis panels in predicting advanced liver fibrosis. Conclusions Further investigations with complementary cohorts will be needed to confirm the usefulness of ALS and Gal-3BP individually and in combination with other biomarkers for diagnosis of liver fibrosis. With the availability of ELISA assays, these findings could be rapidly clinically translated, providing direct benefits for patients. Graphical Abstract
... 10.1021/acsomega.0c02564). The data were normalized using VSN (38) and statistical analysis was performed using Linear Models for Microarray Data (limma) with empirical Bayes (eBayes) smoothing to the standard errors (39). Proteins with an FDR adjusted p-value < 0.05 and fold-change > 2 were considered to differ significantly between two conditions under comparison. ...
Preprint
Full-text available
Poly(A)-binding protein (Pab1 in yeast) is involved in mRNA decay and translation initiation, but its molecular functions are incompletely understood. We found that auxin-induced degradation of Pab1 reduced bulk mRNA and polysome abundance in a manner suppressed by deleting the catalytic subunit of decapping enzyme (dcp2Δ), demonstrating that enhanced decapping/degradation is the major driver of reduced mRNA abundance and protein synthesis at limiting Pab1 levels. An increased median poly(A) tail length conferred by Pab1 depletion was also nullified by dcp2Δ, suggesting that mRNA isoforms with shorter tails are preferentially decapped/degraded at limiting Pab1. In contrast to findings on mammalian cells, the translational efficiencies (TEs) of many mRNAs were altered by Pab1 depletion; however, these changes were broadly diminished by dcp2Δ, suggesting that reduced mRNA abundance is a major driver of translational reprogramming at limiting Pab1. Thus, assembly of the closed-loop mRNP via PABP-eIF4G interaction appears to be dispensable for normal translation of most yeast mRNAs in vivo. Interestingly, histone mRNAs and proteins are preferentially diminished on Pab1 depletion dependent on Dcp2, accompanied by activation of internal cryptic promoters in the manner expected for reduced nucleosome occupancies, revealing a new layer of post-transcriptional control of histone gene expression.
Article
Background Acute cellular rejection (ACR) in heart transplant (HTx) recipients may be accompanied by cardiac cell damage with subsequent exposure to cardiac autoantigens and the production of cardiac autoantibodies (aABs). This study aimed to evaluate a peptide array screening approach for cardiac aABs in HTx recipients during ACR (ACR-HTx). Methods In this retrospective single-center observational study, sera from 37 HTx recipients, as well as age and sex-matched healthy subjects were screened for a total of 130 cardiac aABs of partially overlapping peptide sequences directed against structural proteins using a peptide array approach. Results In ACR-HTx, troponin I (TnI) serum levels were found to be elevated. Here, we could identify aABs against beta-2-adrenergic receptor (β-2AR: EAINCYANETCCDFFTNQAY) to be upregulated in ACR-HTx (intensities: 0.80 versus 1.31, P = 0.0413). Likewise, patients positive for β-2AR aABs showed higher TnI serum levels during ACR compared with aAB negative patients (10.0 versus 30.0 ng/L, P = 0.0375). Surprisingly, aABs against a sequence of troponin I (TnI: QKIFDLRGKFKRPTLRRV) were found to be downregulated in ACR-HTx (intensities: 3.49 versus 1.13, P = 0.0025). A comparison in healthy subjects showed the same TnI sequence to be upregulated in non-ACR-HTx (intensities: 2.19 versus 3.49, P = 0.0205), whereas the majority of aABs were suppressed in non-ACR-HTx. Conclusions Our study served as a feasibility analysis for a peptide array screening approach in HTx recipients during ACR and identified 2 different regulated aABs in ACR-HTx. Hence, further multicenter studies are needed to evaluate the prognostic implications of aAB testing and diagnostic or therapeutic consequences.
Article
Theileria annulata is a tick-transmitted apicomplexan parasite that gained the unique ability among parasitic eukaryotes to transform its host cell, inducing a fatal cancer-like disease in cattle. Understanding the mechanistic interplay between the host cell and malignant Theileria species that drives this transformation requires the identification of responsible parasite effector proteins. In this study, we used TurboID-based proximity labeling, which unbiasedly identified secreted parasite proteins within host cell compartments. By fusing TurboID to nuclear export or localization signals, we biotinylated proteins in the vicinity of the ligase enzyme in the nucleus or cytoplasm of infected macrophages, followed by mass spectrometry analysis. Our approach revealed with high confidence nine nuclear and four cytosolic candidate parasite proteins within the host cell compartments, eight of which had no orthologs in non-transforming T. orientalis . Strikingly, all eight of these proteins are predicted to be highly intrinsically disordered proteins. We discovered a novel tandem arrayed protein family, nuclear intrinsically disordered proteins (NIDP) 1–4, featuring diverse functions predicted by conserved protein domains. Particularly, NIDP2 exhibited a biphasic host cell-cycle-dependent localization, interacting with the EB1/CD2AP/CLASP1 parasite membrane complex at the schizont surface and the tumor suppressor stromal antigen 2 (STAG2), a cohesion complex subunit, in the host nucleus. In addition to STAG2, numerous NIDP2-associated host nuclear proteins implicated in various cancers were identified, shedding light on the potential role of the T. annulata exported protein family NIDP in host cell transformation and cancer-related pathways. 
IMPORTANCE TurboID proximity labeling was used to identify secreted proteins of Theileria annulata , an apicomplexan parasite responsible for a fatal, proliferative disorder in cattle that represents a significant socio-economic burden in North Africa, central Asia, and India. Our investigation has provided important insights into the unique host-parasite interaction, revealing secreted parasite proteins characterized by intrinsically disordered protein structures. Remarkably, these proteins are conspicuously absent in non-transforming Theileria species, strongly suggesting their central role in the transformative processes within host cells. Our study identified a novel tandem arrayed protein family, with nuclear intrinsically disordered protein 2 emerging as a central player interacting with established tumor genes. Significantly, this work represents the first unbiased screening for exported proteins in Theileria and contributes essential insights into the molecular intricacies behind the malignant transformation of immune cells.
Article
Motivation Mass spectrometry-based system proteomics allows identification of dysregulated protein hubs and associated disease-related features. Obtaining differentially expressed proteins (DEPs) is the most important step of downstream bioinformatics analysis. However, the extraction of statistically significant DEPs from datasets with multiple experimental conditions or disease types through currently available tools remains a laborious task. More often such an analysis requires considerable bioinformatics expertise, making it inaccessible to researchers with limited computational analytics experience. Results To uncover the differences among the many conditions within the data in a user-friendly manner, here we introduce FlexStat, a web-based interface that extracts DEPs through combinatory analysis. This tool accepts a protein expression matrix as input and systematically generates DEP results for every conceivable combination of various experimental conditions or disease types. FlexStat includes a suite of robust statistical tools for data preprocessing, in addition to DEP extraction, and publication-ready visualization, which are built on established R scientific libraries in an automated manner. This analytics suite was validated in diverse public proteomic datasets to showcase its high performance of rapid and simultaneous pairwise comparisons of comprehensive datasets. Availability and implementation FlexStat is implemented in R and is freely available at https://jglab.shinyapps.io/flexstatv1-pipeline-only/. The source code is accessible at https://github.com/kts-desilva/FlexStat/tree/main.
Book
Full-text available
This is a book, not a paper.
Article
Full-text available
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Article
Full-text available
The technology of hybridization to DNA arrays is used to obtain the expression levels of many different genes simultaneously. It enables searching for genes that are expressed specifically under certain conditions. However, the technology produces large amounts of data demanding computational methods for their analysis. It is necessary to find ways to compare data from different experiments and to consider the quality and reproducibility of the data. Data analyzed in this paper have been generated by hybridization of radioactively labeled targets to DNA arrays spotted on nylon membranes. We introduce methods to compare the intensity values of several hybridization experiments. This is essential to find differentially expressed genes or to do pattern analysis. We also discuss possibilities for quality control of the acquired data. http://www.dkfz.de/tbi M.Vingron@dkfz-heidelberg.de
Article
Full-text available
We consider the problem of inferring fold changes in gene expression from cDNA microarray data. Standard procedures focus on the ratio of measured fluorescent intensities at each spot on the microarray, but to do so is to ignore the fact that the variation of such ratios is not constant. Estimates of gene expression changes are derived within a simple hierarchical model that accounts for measurement error and fluctuations in absolute gene expression levels. Significant gene expression changes are identified by deriving the posterior odds of change within a similar model. The methods are tested via simulation and are applied to a panel of Escherichia coli microarrays.
Article
Full-text available
Spotted cDNA microarrays are emerging as a powerful and cost-effective tool for large-scale analysis of gene expression. Microarrays can be used to measure the relative quantities of specific mRNAs in two or more tissue samples for thousands of genes simultaneously. While the power of this technology has been recognized, many open questions remain about appropriate analysis of microarray data. One question is how to make valid estimates of the relative expression for genes that are not biased by ancillary sources of variation. Recognizing that there is inherent "noise" in microarray data, how does one estimate the error variation associated with an estimated change in expression, i.e., how does one construct the error bars? We demonstrate that ANOVA methods can be used to normalize microarray data and provide estimates of changes in gene expression that are corrected for potential confounding effects. This approach establishes a framework for the general analysis and interpretation of microarray data.
Article
Full-text available
We investigated the changes in gene expression accompanying the development and progression of kidney cancer by use of 31,500-element complementary DNA arrays. We measured expression profiles for paired neoplastic and noncancerous renal epithelium samples from 37 individuals. Using an experimental design optimized for factoring out technological and biological noise, and an adapted statistical test, we found 1738 differentially expressed cDNAs with an expected number of six false positives. Functional annotation of these genes provided views of the changes in the activities of specific biological pathways in renal cancer. Cell adhesion, signal transduction, and nucleotide metabolism were among the biological processes with a large proportion of genes overexpressed in renal cell carcinoma. Down-regulated pathways in the kidney tumor cells included small molecule transport, ion homeostasis, and oxygen and radical metabolism. Our expression profiling data uncovered gene expression changes shared with other epithelial tumors, as well as a unique signature for renal cell carcinoma. [Expression data for the differentially expressed cDNAs are available as a Web supplement at http://www.dkfz-heidelberg.de/abt0840/whuber/rcc . The array data have been submitted to the GEO data repository under accession no. GSE3.]
I propose a method for the nonparametric estimation of transformations for regression. It is much more flexible than the familiar Box-Cox procedure, allowing general smooth transformations of the variables, and is similar to the ACE (alternating conditional expectation) algorithm of Breiman and Friedman (1985). The ACE procedure uses scatterplot smoothers in an iterative fashion to find the maximally correlated transformations of the variables. Like ACE, my proposal can incorporate continuous, categorical, or periodic variables, or any mixture of these types. The method differs from ACE in that it uses a (nonparametric) variance-stabilizing transformation for the response variable. The technique seems to alleviate many of the anomalies that ACE suffers with regression data, including the inability to reproduce model transformations and sensitivity to the marginal distribution of the predictors. I provide several examples, including an analysis of the “brain and body weight” data and some data on telephone-call load. I also discuss the relationship of the proposed technique to the Box-Cox and ACE procedures. Efron's work on transformations provides some of the theoretical basis for the methodology.
We show that semiparametric profile likelihoods, where the nuisance parameter has been profiled out, behave like ordinary likelihoods in that they have a quadratic expansion. In this expansion the score function and the Fisher information are replaced by the efficient score function and efficient Fisher information. The expansion may be used, among other things, to prove the asymptotic normality of the maximum likelihood estimator, to derive the asymptotic chi-squared distribution of the log-likelihood ratio statistic, and to prove the consistency of the observed information as an estimator of the inverse of the asymptotic variance.
Gene expression can be quantitatively analyzed by hybridizing fluor-tagged mRNA to targets on a cDNA microarray. Comparison of gene expression levels arising from cohybridized samples is achieved by taking ratios of average expression levels for individual genes. A novel method of image segmentation is provided to identify cDNA target sites, and a hypothesis test and confidence interval are developed to quantify the significance of observed differences in expression ratios. In particular, the probability density of the ratio and the maximum-likelihood estimator for the distribution are derived, and an iterative procedure for signal calibration is developed.
Ascertaining the impact of uncharacterized perturbations on the cell is a fundamental problem in biology. Here, we describe how a single assay can be used to monitor hundreds of different cellular functions simultaneously. We constructed a reference database or "compendium" of expression profiles corresponding to 300 diverse mutations and chemical treatments in S. cerevisiae, and we show that the cellular pathways affected can be determined by pattern matching, even among very subtle profiles. The utility of this approach is validated by examining profiles caused by deletions of uncharacterized genes: we identify and experimentally confirm that eight uncharacterized open reading frames encode proteins required for sterol metabolism, cell wall function, mitochondrial respiration, or protein synthesis. We also show that the compendium can be used to characterize pharmacological perturbations by identifying a novel target of the commonly used drug dyclonine.