Statistical Applications in Genetics

and Molecular Biology

Volume 5, Issue 1 2006

Article 7

A New Type of Stochastic Dependence

Revealed in Gene Expression Data

Lev Klebanov∗

Craig Jordan†

Andrei Yakovlev‡

∗Department of Probability and Statistics, Charles University, levkleb@yahoo.com

†University of Rochester, craig_jordan@urmc.rochester.edu

‡University of Rochester, Rochester, NY, andrei_yakovlev@urmc.rochester.edu

Copyright © 2006 The Berkeley Electronic Press. All rights reserved.

A New Type of Stochastic Dependence

Revealed in Gene Expression Data∗

Lev Klebanov, Craig Jordan, and Andrei Yakovlev

Abstract

Modern methods of microarray data analysis are biased towards selecting those genes that dis-

play the most pronounced differential expression. The magnitude of differential expression does

not necessarily indicate biological significance and other criteria are needed to supplement the

information on differential expression. Three large sets of microarray data on childhood leukemia

were analyzed by an original method introduced in this paper. A new type of stochastic dependence between expression levels in gene pairs was deciphered by our analysis. This modulation-like unidirectional dependence between expression signals arises when the expression of a “gene-modulator” is stochastically proportional to that of a “gene-driver”. A total of more than 35% of all pairs formed from 12550 genes were conservatively estimated to belong to this type, termed Type A. There are

genes that tend to form Type A relationships with the overwhelming majority of genes. However,

this picture is not static: the composition of Type A gene pairs may undergo dramatic changes

when comparing two phenotypes. The ability to identify genes that act as “modulators” provides

a potential strategy of prioritizing candidate genes.

KEYWORDS: gene expression, microarray analysis, stochastic dependence

∗This research is supported by NIH grants GM075299 (Yakovlev) and CA90446 (Jordan), and

Czech Ministry of Education Grant MSM 113200008 (Klebanov).

Erratum

In formula (5) on page 7, the minimum is taken over k while the index i is redundant. Please

refer to the supplemental erratum pdf for the corrected equation.

1. Introduction

It has become common practice to use microarray technology for finding “in-

teresting” genes by comparing two or more different phenotypes. Modern

methods of microarray data analysis typically employ two-sample statistical

tests for testing differential expression of genes combined with multiple testing

procedures to guard against Type 1 errors (Dudoit et al. 2003). Such meth-

ods are biased towards selecting those genes that display the most pronounced

differential expression. Once the list of genes showing statistically significant

differential expression has been generated, these genes are often ranked using

purely statistical criteria and this ranking is thought to reflect their relative

importance. Quite typically, a certain number of genes with the smallest p-

values is finally selected from the list of all “significant” genes. While most

biologists recognize that the magnitude of differential expression does not nec-

essarily indicate biological significance, in the absence of better methods, this

remains the dominant means to initially prioritize candidate genes.

From a biological perspective, the above-described paradigm falls far short

of being a perfectly valid approach. Even a very small change in expression of

a particular gene may have dramatic physiological consequences if the protein

encoded by this gene plays a catalytic role in a specific cell function. Many

other downstream genes may amplify the signal produced by this truly inter-

esting gene, thereby increasing their chance to be selected by formal statistical

methods. For a regulatory gene, however, the chance of being selected by such

methods may diminish as one keeps hunting for downstream genes which tend

to show much bigger changes in their expression. As a result, the initial list of

candidates may be enriched with many effector genes that do little to elucidate

more fundamental mechanisms of biological processes.

There are two natural ways to remedy the situation. One is to use bioin-

formatics tools that utilize prior biological knowledge, such as partially known

Klebanov et al.: Dependence Between Gene Expressions

Published by The Berkeley Electronic Press, 2006

pathways, for prioritization of candidate genes. This approach is now routinely

used in biological studies. Another way is to extract additional information on

relationships between different genes from microarray data by pertinent statis-

tical methods. Both approaches enrich and supplement each other when used

in combination. It is noteworthy that recent years have seen a growing interest

in correlations between gene expression levels in statistical methodologies for

microarray analysis (Jaeger et al. 2003; Xiao et al. 2004; Goerman et al. 2004;

Dettling et al. 2005; Efron 2005; Lu et al. 2005; Qiu et al. 2005a, 2005b). As

larger sets of microarray gene expression data become readily available, quan-

titative insights into dependencies between gene expression levels are gaining

in importance.

Using three publicly available sets of data (see Section 2.1) we studied the

expression profiles of all pairs of genes formed from a total of 12550 genes

(probe sets). It became clear that many genes develop a very special type of

relationship with each other. We term this relationship the Type A dependence

and define it as follows. Let X and Y be random variables (r.v.’s) representing

the expression levels of genes $g_x$ and $g_y$. A pair of genes ($g_x$, $g_y$) is said to be

Type A if X and Y satisfy the condition:

Y = XZ, (1)

where Z is a positive r.v. which is stochastically independent of X. The r.v.’s

Y,X and Z are defined on the same probability space. By log-transforming

expression (1) one obtains

y = x + z, (2)

where x = log X, y = log Y, z = log Z. A general necessary condition for Type

A dependence is

Var(x) = Cov(x,y).

This condition is also sufficient under joint normality of (x,y).
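As an illustration of definition (1) and of the necessary condition Var(x) = Cov(x,y), the following sketch simulates a Type A pair and compares the two quantities on the log scale. The lognormal choices for X and Z, their parameters, the sample size, and all function names are assumptions made only for this example.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance (divisor n - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def simulate_type_a(n, seed=0):
    """Draw n pairs (X, Y) with Y = X * Z and Z independent of X.

    X and Z are taken lognormal purely for illustration.
    """
    rng = random.Random(seed)
    X = [rng.lognormvariate(2.0, 0.5) for _ in range(n)]
    Z = [rng.lognormvariate(0.3, 0.4) for _ in range(n)]
    Y = [xi * zi for xi, zi in zip(X, Z)]
    return X, Y


X, Y = simulate_type_a(20000)
x = [math.log(v) for v in X]
y = [math.log(v) for v in Y]
var_x = cov(x, x)
cov_xy = cov(x, y)
# Under Type A dependence the two quantities agree up to sampling error.
print(round(var_x, 3), round(cov_xy, 3))
```

The difference between the two printed numbers is the sample covariance of x and z, which is why it shrinks as the number of simulated arrays grows.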

Statistical Applications in Genetics and Molecular Biology, Vol. 5 [2006], Iss. 1, Art. 7

http://www.bepress.com/sagmb/vol5/iss1/art7

It follows from the above definition that the r.v.’s X and Y/X are inde-

pendent in a Type A gene pair. At the same time, the r.v.’s X and Y are

stochastically dependent so that the regular correlation between them can be

quite strong. Note that the roles of $g_x$ and $g_y$ in such a pair are not symmetrical because, given (1) is true, the r.v.’s Y and 1/Z in the relationship X = Y/Z are, in general, no longer independent. In this special sense, the dependence (1) is unidirectional, leading us to term $g_x$ the “driver” (DR) and $g_y$ the “modulator” (MOD). Those gene pairs that do not display the Type A dependence are classified as Type B pairs. An important subclass of Type B pairs is represented by those with small correlation coefficients (Section 3). Gene $g_y$ may play the role of a driver in some other gene pair of Type A. Likewise, gene $g_x$

may be a modulator in another Type A pair. Both genes may form Type B

relationships with some other genes as well. Gene pairs of Type A can form

long chains involving hundreds and even thousands of genes.
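The asymmetry of the driver and modulator roles can be made explicit on the log scale. The short derivation below is a worked illustration; it assumes only relation (2), independence of x and z, and finite second moments.

```latex
\begin{align*}
\operatorname{Cov}(x, y) &= \operatorname{Cov}(x, x + z) = \operatorname{Var}(x), \\
\operatorname{Var}(y) &= \operatorname{Var}(x) + \operatorname{Var}(z), \\
\operatorname{Cov}(y, x - y) &= -\operatorname{Cov}(x + z, z) = -\operatorname{Var}(z).
\end{align*}
```

Hence, unless Var(z) = 0, the r.v. y is correlated with x − y = −z, and the reversed pair fails the necessary condition Var(y) = Cov(y,x), because Cov(y,x) = Var(x) is strictly smaller than Var(y).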

Formula (1) implies an amplification-type relationship between X and Y ,

which is understood in a broad sense with the amplification (modulation) be-

ing positive if the mean value of Z is greater than 1 and negative otherwise.

If the expression measures reported from microarray data were strictly linear

functions of the concentrations of mRNA transcripts, then relationship (1)

would imply that gene $g_y$ modulates gene $g_x$ in such a way that the moving equilibrium concentration (resulting from synthesis and degradation) of the transcripts produced by $g_y$ is (stochastically) proportional to that of the transcripts produced by gene $g_x$. Since the true relationship between transcript

concentrations and Affymetrix expression measures is unknown and may well

be nonlinear, especially at low and high mRNA concentrations, this interpre-

tation holds only approximately, much like the interpretation of microarray

gene expression measures in general.

In this paper, we provide evidence that the above-described phenomenon is

real and not just a random pattern seen in a particular data set. In doing so,

we estimate the abundance of Type A pairs, assess the stability of their occurrence,

and provide specific examples of their patterns in microarray data.

2. Methods

2.1. The data

We analyzed three subsets of data identified through the St. Jude Children’s

Research Hospital (SJCRH) Database on childhood leukemia (see the website:

http://www.stjuderesearch.org/data/ALL1). The SJCRH Database contains

gene expression data on 335 subjects, each represented by a separate array

(Affymetrix, Santa Clara, CA) reporting measurements on the same set of

m = 12550 genes. We selected the following three groups of patients: Group 1

was represented by 19 arrays obtained from patients with acute lymphoblastic

leukemia characterized by a normal cytogenetic phenotype (NORMAL), Group

2 included 45 patients with T-cell acute lymphoblastic leukemia (TALL), and

Group 3 was represented by 88 patients with hyperdiploid (HYPERDIP) acute

lymphoblastic leukemia.

2.2. Testing for differentially expressed genes

Since the available sample size is sufficient for the use of distribution-free

methods (Lee et al., 2005; Klebanov et al., 2005), we applied the Cramér-von Mises two-sample test with Bonferroni adjustment to compare Group 1

(NORMAL) and Group 2 (TALL) of the patients with childhood leukemia.

This test resulted in 342 differentially expressed genes when controlling the

family-wise error rate at the 0.05 level.
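The gene-by-gene procedure of this subsection can be sketched as follows. The text does not state how the null distribution of the Cramér-von Mises statistic was obtained, so the permutation p-value used here is a stand-in chosen for simplicity, and all function names are hypothetical. With m = 12550 genes the Bonferroni threshold is roughly 4 × 10⁻⁶, which a permutation scheme can reach only with a very large number of permutations.

```python
import random


def cvm_statistic(a, b):
    """Two-sample Cramér-von Mises statistic computed from the empirical CDFs."""
    a, b = list(a), list(b)
    n, m = len(a), len(b)
    sa, sb = sorted(a), sorted(b)
    total, i, j = 0.0, 0, 0
    for t in sorted(a + b):
        while i < n and sa[i] <= t:
            i += 1
        while j < m and sb[j] <= t:
            j += 1
        total += (i / n - j / m) ** 2
    return n * m / (n + m) ** 2 * total


def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Permutation p-value (an assumption of this sketch, not the paper's method)."""
    rng = random.Random(seed)
    observed = cvm_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if cvm_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def bonferroni(p_values, alpha=0.05):
    """Flag hypotheses whose Bonferroni-adjusted p-value is at most alpha."""
    m = len(p_values)
    return [min(1.0, p * m) <= alpha for p in p_values]
```

In use, one p-value per gene would be computed from the NORMAL and TALL expression values of that gene, and the resulting list passed to `bonferroni`.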

2.3. Testing for Type A dependence

Recalling formula (2), notice that testing the hypothesis H0: the gene pair

under consideration is a Type A pair is equivalent to testing the hypothesis:

x and z = y −x are stochastically independent. Testing independence is more

cumbersome and time-consuming than testing the absence of correlation and

this is why we confine ourselves to the latter approach. However, our sample

analyses have shown that the results of the two tests are in good agreement.

Presented below is a statistical test for the hypothesis $H_0^{(1)}: \rho(x,z) = 0$, where $\rho(x,z)$ is the correlation coefficient between x and z = y − x. A test statistic for this hypothesis is given by Fisher’s transformation

$$r(x,z) = \frac{1}{2}\log\frac{1 + w(x,z)}{1 - w(x,z)}, \qquad (3)$$

where w is the Pearson sample correlation coefficient. Under the null hypothesis, the statistic r has an asymptotic normal distribution with mean 0 and standard deviation $1/\sqrt{n - 3}$.
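The test based on statistic (3) can be sketched in a few lines. The function names are hypothetical, and the two-sided p-value from the normal approximation is an assumption of this example; the inputs are taken to be log-expression values of the two genes across n arrays.

```python
import math


def pearson(a, b):
    """Pearson sample correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def fisher_test_type_a(x, y):
    """Test rho(x, z) = 0 for z = y - x using statistic (3).

    Returns (r, two-sided p-value) from the N(0, 1/(n - 3)) approximation.
    """
    n = len(x)
    z = [yi - xi for xi, yi in zip(x, y)]
    w = pearson(x, z)
    r = 0.5 * math.log((1 + w) / (1 - w))
    p_value = math.erfc(abs(r) * math.sqrt(n - 3) / math.sqrt(2))
    return r, p_value


# Toy data in which x and z = y - x are exactly uncorrelated.
r, p = fisher_test_type_a([1, 2, 3, 4, 5], [2, 0, 5, 2, 6])
print(r, p)
```

A small p-value speaks against Type A dependence for the pair; a large one means only that the pair is compatible with it.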

However, the situation is more complicated if there is a multiplicative array-specific technological noise in the data because the correlation structure of the

vector (x,z) becomes non-identifiable. The most widely accepted noise model

assumes that the same multiplicative measurement error is shared by all genes

on each array, but the level of this error varies randomly from array to array.

In the presence of this random effect denoted by A, we observe (x + a,y + a),

where a = logA and A is a positive r.v. independent of the pair (X,Y ). Now

we deal with the pair of r.v.s (x + a,z), where z = (y + a) − (x + a). It is

interesting that the hypothesis of independence for x + a and z is equivalent

to the hypothesis of independence for x and z. For simplicity we formulate

this result in terms of the original r.v.’s A, X, Y and Z; it obviously remains

valid for their logarithms as well.

Proposition. Let (X, Y) be a random vector with non-negative components and let A be a positive r.v. independent of (X, Y). Suppose that there exists δ > 0 such that $\mathbb{E}X^s < \infty$, $\mathbb{E}Y^s < \infty$, $\mathbb{E}A^s < \infty$ for every |s| < δ. The r.v.’s AX and Z = Y/X are independent if and only if X and Z are independent.

Proof. From the well-known properties of the Mellin transform, it follows that the r.v.’s X and Z are independent if and only if

$$\mathbb{E}(X^s Z^t) = \mathbb{E}(X^s)\,\mathbb{E}(Z^t)$$

for all s, t < δ. Since A is independent of the pair (X, Y), we can write

$$\mathbb{E}\{(AX)^s Z^t\} = \mathbb{E}\{A^s Y^t X^{s-t}\} = \mathbb{E}(A^s)\,\mathbb{E}\{X^s Z^t\}.$$

Then the following implications are obvious:

$$\mathbb{E}\{(AX)^s Z^t\} = \mathbb{E}(A^s X^s)\,\mathbb{E}(Z^t) \iff \mathbb{E}(A^s)\,\mathbb{E}(X^s Z^t) = \mathbb{E}(A^s)\,\mathbb{E}(X^s)\,\mathbb{E}(Z^t) \iff \mathbb{E}(X^s Z^t) = \mathbb{E}(X^s)\,\mathbb{E}(Z^t),$$

given $\mathbb{E}(A^s) > 0$. This proves our proposition.
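A small simulation illustrates the proposition on the log scale: a shared array effect cancels in the difference z, and the noisy driver signal remains uncorrelated with z. The normal distributions, their parameters, and the variable names are assumptions made for this sketch only.

```python
import math
import random


def pearson(a, b):
    """Pearson sample correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(1)
n = 20000
x = [rng.gauss(5.0, 1.0) for _ in range(n)]  # log driver expression
z = [rng.gauss(0.5, 0.7) for _ in range(n)]  # log modulation factor, independent of x
a = [rng.gauss(0.0, 0.8) for _ in range(n)]  # shared array-specific log noise
y = [xi + zi for xi, zi in zip(x, z)]

xi_obs = [xi + ai for xi, ai in zip(x, a)]   # observed x + a
eta_obs = [yi + ai for yi, ai in zip(y, a)]  # observed y + a
z_obs = [e - s for e, s in zip(eta_obs, xi_obs)]

# The shared noise cancels in the difference, so z is recovered up to rounding.
max_diff = max(abs(u - v) for u, v in zip(z_obs, z))
# The noisy driver signal stays uncorrelated with z, as the proposition predicts.
rho_noisy = pearson(xi_obs, z_obs)
print(max_diff, round(rho_noisy, 4))
```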

Because of this property, the test statistic (3) can be applied to noisy data in order to test the hypothesis $H_0^{(1)}$. However, the noise adds variability to the data and this may affect the error properties of the test. The latter effect is impossible to assess even by simulations because the noise is unobservable.

Since our main goal is to demonstrate the extent of the phenomenon under

study by conservatively estimating the abundance of Type A pairs, we are

much more concerned with Type 2 rather than Type 1 errors. With this in

mind, we reformulate the problem of hypothesis testing as follows.

Let the random vector u = (u1,...,um) represent logarithms of the true

expression levels of m genes and let a = (a,...,a) be the corresponding m-

dimensional noise vector with identical components. The observed expression

levels are represented as v = u + a. It is assumed that all the r.v.’s under

consideration have finite second moment. Let us choose two components of

the vector u and denote them by x and y, respectively. The corresponding

components of the vector v are given by

$$\xi = x + a \quad\text{and}\quad \eta = y + a,$$

where the r.v. a is unobservable, of course. For definiteness we always assume that $\sigma^2(\xi) \le \sigma^2(\eta)$. Introduce the class $U = U(u')$ of all random m-dimensional vectors $u'$ such that $u' + a' = v$, where $a' = (a', \ldots, a')$ and $a'$ is a random variable (on the log-scale) independent of $u'$. For the pair $(\xi, \eta)$ we test the hypothesis

$$H_0^{(2)}: \sup_{u' \in U} |\rho(\xi', \eta' - \xi')| = 0,$$

where $\xi'$ and $\eta'$ are any two components of the vector $u'$. It is easy to show that

$$\rho(\xi', \eta' - \xi') = \rho(\xi, \eta - \xi)\left[1 + \frac{\sigma^2(a')}{\sigma^2(\xi')}\right]^{1/2},$$

where $\xi'$ and $a'$ are independent. We know that $\sigma^2(a) \le \sigma^2(a) + \sigma^2(u_k) = \sigma^2(v_k)$ for any component $v_k$ of the vector v, k = 1,...,m. Therefore,

$$\sigma^2(a) \le \sigma^2 = \min_{1 \le k \le m} \sigma^2(v_k), \qquad (5)$$

where $\sigma^2(v_k)$ is the variance of the observed (noisy) expression of the kth gene and the minimum in (5) is taken over all the genes. Notice also that $\sigma^2(u_k) \ge \sigma^2(v_k) - \sigma^2$ for any k = 1,...,m. Since $\sigma^2(\xi') \ge \sigma^2$, the following inequality holds

$$\sup_{u' \in U} |\rho(\xi', \eta' - \xi')| \le |\rho(\xi, \eta - \xi)|\left[1 + \frac{\sigma^2}{\sigma^2(\xi) - \sigma^2}\right]^{1/2}, \qquad (6)$$

where all the characteristics involved can be estimated for all genes except the one for which the minimum in (5) is attained. The latter gene is excluded from the analysis.
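The correlation identity and the bound (6) can be retraced in a few lines. The derivation below is a sketch: it writes primed symbols for the components of an arbitrary decomposition of the observed vector v into a signal and a shared noise term within the class U (an assumed notation), and it uses only the independence of the noise from the signal.

```latex
\begin{align*}
\eta - \xi &= \eta' - \xi', \\
\operatorname{Cov}(\xi, \eta - \xi)
  &= \operatorname{Cov}(\xi' + a', \eta' - \xi')
   = \operatorname{Cov}(\xi', \eta' - \xi'), \\
\sigma^2(\xi) &= \sigma^2(\xi') + \sigma^2(a'), \\
\rho(\xi', \eta' - \xi')
  &= \rho(\xi, \eta - \xi)\,\frac{\sigma(\xi)}{\sigma(\xi')}
   = \rho(\xi, \eta - \xi)\left[1 + \frac{\sigma^2(a')}{\sigma^2(\xi')}\right]^{1/2}.
\end{align*}
```

Writing the bracket as $1 + \sigma^2(a')/\{\sigma^2(\xi) - \sigma^2(a')\}$ shows that it increases with $\sigma^2(a')$. Because $\sigma^2(a')$ cannot exceed the smallest observed gene variance in (5), the bracket is largest when $\sigma^2(a')$ equals that minimum, which gives the right-hand side of (6).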

By replacing all the parameters in the right-hand side of inequality (6)

with their empirical estimators and transforming the resultant expression in

accordance with (3), we obtain a test statistic for rejecting $H_0^{(2)}$. Let us denote this statistic by $r^*$. The upper bound in (6) is sharp as it can be shown that the bound is attained if all components of the vector v are normally distributed. In the latter case, the test based on $r^*$ controls a given nominal significance level for testing the hypothesis $H_0^{(2)}$. Otherwise, the test thus designed may have

a higher actual significance level than a given nominal level, which is of little

concern when estimating the abundance of Type A pairs (Section 3) because

we make the estimate more conservative by falsely rejecting more true null

hypotheses. However, this circumstance should be kept in mind when using

the test to select individual Type B pairs in combination with multiple testing

procedures. Another important caveat is that the above test cannot be used

for selecting individual Type A pairs. As is the case with the statistic r given

by formula (3), the asymptotic distribution of $r^*$ is normal but with a larger variance. Therefore, we make our test only more rejective (at a sacrifice in the Type 1 error rate) when constructing $r^*$ from the noisy data but using quantiles

of the sampling distribution for the statistic r. This strategy is consistent with

our goal to provide strong evidence for the existence of Type A dependence in

Section 3.
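A minimal sketch of how the noise-adjusted statistic might be computed for one gene pair is given below. The function names, the use of the unbiased sample variance, and the clipping of bounds that reach 1 are assumptions of this example and are not specified in the text; the smallest observed gene variance is passed in as `sigma2_bar`.

```python
import math


def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)


def pearson(a, b):
    """Pearson sample correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def r_star(xi, eta, sigma2_bar):
    """Statistic built from the right-hand side of (6) and transformation (3).

    xi, eta    : observed log-expression of the two genes, Var(xi) <= Var(eta)
    sigma2_bar : smallest observed gene variance, the bound in (5)
    Returns (r_star, two-sided p-value from the N(0, 1/(n - 3)) quantiles of r).
    """
    n = len(xi)
    var_xi = variance(xi)
    if var_xi <= sigma2_bar:
        raise ValueError("the gene attaining the minimum in (5) must be excluded")
    diff = [e - x for x, e in zip(xi, eta)]
    inflation = math.sqrt(1.0 + sigma2_bar / (var_xi - sigma2_bar))
    bound = abs(pearson(xi, diff)) * inflation
    # Assumption of this sketch: clip bounds >= 1 so the transform stays finite.
    bound = min(bound, 1.0 - 1e-12)
    r = 0.5 * math.log((1.0 + bound) / (1.0 - bound))
    p_value = math.erfc(r * math.sqrt(n - 3) / math.sqrt(2.0))
    return r, p_value
```

Because the bound inflates the observed correlation, the resulting p-values are never larger than those obtained from (3) on the same data, in line with the deliberately rejective design described above.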

3. Results and Discussion

The procedure given in Section 2.3 is designed to reject the null hypothesis

H0: the gene pair under consideration is a Type A pair in the presence of a

multiplicative random noise in the data. To determine the abundance of Type

A pairs we estimate the expected proportion $\pi_A$ of true null hypotheses in the

data at hand. In doing so, we do not resort to any multiple testing adjustment

because our objective is to estimate the total number of Type A pairs and not

to select individual pairs. Furthermore, we want to obtain a lower bound for

their proportion; the more hypotheses rejected the lower our estimate.

In accordance with the method by Storey and Tibshirani (2003), $\pi_A$ can be estimated by a limiting value of

$$\hat{\pi}_A(\lambda) = \frac{\#\{p_i > \lambda;\ i = 1, \ldots, m\}}{m(1 - \lambda)}, \qquad (7)$$

as the tuning parameter λ tends to 1. In formula (7), $p_i$, i = 1,...,m, are the observed p-values calculated from quantiles of the test described in Section 2.3, and m is the total number of the hypotheses tested. The estimator $\hat{\pi}_A$ tends to be biased up, i.e. $\mathbb{E}\{\hat{\pi}_A(\lambda)\} \ge \pi_A$ (Storey et al. 2004), and its accuracy is unknown. In addition, $\hat{\pi}_A$ is known to be highly variable whenever p-values are heavily correlated (Qiu et al. 2005b). Both difficulties can be remedied by resorting to resampling techniques. We used a subsampling variant of the delete-d-jackknife method (Efron and Tibshirani 1993, Shao and Tu 1995) to reduce the bias and estimate the variance of $\hat{\pi}_A$. Unlike the bootstrap, the delete-d-jackknife is asymptotically valid without any smoothness requirements on the statistic $\hat{\pi}_A$ whose sampling distribution we want to estimate (Politis and Romano 1994).
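Estimator (7) is simple to compute for a fixed value of the tuning parameter. The sketch below uses a hypothetical function name and a made-up list of p-values; how the limit as λ tends to 1 is approximated in practice is left open here.

```python
def storey_pi0(p_values, lam):
    """Estimator (7): proportion of true nulls from the p-values above lam."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    return sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))


# Illustrative p-values only; real input would be one p-value per gene pair.
p = [0.01, 0.2, 0.6, 0.7, 0.9, 0.95]
for lam in (0.5, 0.8):
    print(lam, storey_pi0(p, lam))
```

The raw estimate is not constrained to the unit interval, so values above 1 can occur for small m or for λ close to 1.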

We formed all pairs from 1500 randomly selected genes and applied this method (with 200 subsamples) to the collection of arrays for Group 2 (d = 6) and Group 3 (d = 8) of patients with childhood leukemia. For each subsample of arrays, the estimate $\hat{\pi}_A$ was obtained and then the sample mean $E(\hat{\pi}_A)$, median $M(\hat{\pi}_A)$, and standard deviation $S(\hat{\pi}_A)$ were computed from the 200 subsamples. The latter characteristic was estimated by a resampling counterpart of the jackknife sample standard deviation (Shao and Tu 1995). The resultant estimates were: $E(\hat{\pi}_A) = 46\%$, $M(\hat{\pi}_A) = 46\%$, $S(\hat{\pi}_A) = 5\%$ for Group 2, and $E(\hat{\pi}_A) = 43\%$, $M(\hat{\pi}_A) = 43\%$, and $S(\hat{\pi}_A) = 6\%$ for Group 3. When

both datasets were pooled together, thereby increasing the power of our test

for $H_0$, the proportion of Type A pairs remained substantial: $E(\hat{\pi}_A) = 35\%$, $M(\hat{\pi}_A) = 34\%$, $S(\hat{\pi}_A) = 4.5\%$. When the presence of the technological noise was ignored in testing the hypothesis $H_0^{(1)}$, the estimated proportion of Type

A pairs was about 60%. The stability of the proposed procedure was assessed

by the same resampling procedure. The results of this assessment for 500

randomly selected genes are shown in Fig. 1. One can see that the selection

stability is remarkably high, indicating that we have a stable pattern in the

data under study.
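The subsampling scheme described above can be sketched as follows. The function name and arguments are hypothetical, and the scaling factor (n − d)/d applied to the spread of the subsample values is the usual delete-d jackknife convention, taken here as an assumption because the exact variant is not spelled out in the text. The `statistic` argument stands for any function of a set of arrays, for instance the estimate of the proportion of Type A pairs.

```python
import math
import random
import statistics


def delete_d_subsampling(arrays, statistic, d, n_sub=200, seed=0):
    """Evaluate `statistic` on n_sub random subsets that omit d arrays.

    arrays    : list of per-array records (one entry per patient)
    statistic : callable mapping a list of arrays to a number
    Returns (mean, median, jackknife-scaled standard deviation).
    """
    rng = random.Random(seed)
    n = len(arrays)
    keep = n - d
    values = [statistic(rng.sample(arrays, keep)) for _ in range(n_sub)]
    centre = statistics.fmean(values)
    spread = sum((v - centre) ** 2 for v in values) / n_sub
    # Delete-d jackknife scaling (n - d) / d; an assumption of this sketch.
    sd = math.sqrt(keep / d * spread)
    return centre, statistics.median(values), sd


# Toy usage: 45 "arrays" holding one number each, statistic = sample mean.
toy = [float(v) for v in range(45)]
print(delete_d_subsampling(toy, statistics.fmean, d=6))
```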

To better appreciate the difference between the A and B types of gene

pairs, one can resort to the correlation diagram shown in Fig. 2. We first

compute the Pearson correlation coefficients in all pairs formed by a given

gene gx. The solid line in Fig. 2 represents these coefficients in increasing

order. In each pair (gx,gy), we also compute correlations between X and Y/X

and between Y and X/Y, respectively, and take the smaller of the two. The

resultant correlation coefficients for all pairs formed by the chosen gene are

also plotted in increasing order (dashed line). Shown in Fig. 2 is the diagram

for gene TIE1 derived from the HYPERDIP data. From the joint behavior of

the two curves, it is clear that TIE1 tends to form Type A relationships with

the overwhelming majority of genes. Another gene (EIF2AK2), however, is

likely to form Type B relationships with the majority of genes in this set of

data (Fig. 3).

The prevalence of Type A over Type B relationships, or vice versa, provides additional information on the behavior of a given gene in two different

phenotypes. For example, the correlation diagram in Fig. 4 shows that the

pattern displayed by gene TCL1A in NORMAL is reversed in TALL. One can

also compute correlations under the assumption of Type A dependence and

then compare them with their actual values. Given the dependence is of Type

A, the correlation coefficient between x and y is uniquely defined by $\sigma_x$ and $\sigma_y$, i.e. $\rho(x,y) = \sigma_x/\sigma_y$. Shown in Fig. 5 are 12550 correlation coefficients

in the derived Type A pairs formed by gene MMP10 and their actual values

estimated from the HYPERDIP data.
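The comparison between the correlation implied by Type A dependence and the actual one can be sketched for a single pair as follows. The function name is hypothetical, and the toy data are constructed so that x and z = y − x are exactly uncorrelated, in which case the two values coincide.

```python
import math


def implied_and_actual_correlation(x, y):
    """Return (sigma_x / sigma_y, Pearson correlation) for log-expression x, y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    implied = math.sqrt(sxx / syy)
    actual = sxy / math.sqrt(sxx * syy)
    return implied, actual


implied, actual = implied_and_actual_correlation([1, 2, 3, 4, 5], [2, 0, 5, 2, 6])
print(implied, actual)
```

A large gap between the two values for a given pair indicates a departure from Type A dependence.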

From the foregoing it appears that the modulation-like dependence, manifesting itself in Type A gene pairs, is a stable mass phenomenon. Such pairs

can form long chains that change their structure (membership) under different

conditions. The asymmetric character of this dependence suggests that a chain

of Type A gene pairs may represent an induced biochemical pathway such as a

signal transduction pathway. However, the Type A dependence does not nec-

essarily imply a direct causal effect (mediated by gene products) of one gene on

another but can equally readily be thought of as an association between them

arising from the assignment of operations to different genes by a molecular

mechanism residing outside the chain. In other words, a chain of Type A pairs

can be thought of as a “supply chain” that provides certain proportions of gene

products to fulfill a specific cell function. This speculation implies that genes

involved in Type A pairs act as effector rather than regulatory genes. It seems

impossible to put forward more specific molecular hypotheses at this point

in our research. However, the phenomenon of Type A dependence definitely

deserves a closer look regardless of its biological interpretation.

As discussed earlier, Type B pairs frequently show a weak correlation be-

tween their members. There are certain genes whose expression appears to be

uncorrelated with that of numerous (but probably not all) other genes (Fig.

3). However, such genes should not be perceived as “isolates” that remain un-

engaged in regulatory mechanisms. The gene pairs of Type B can be involved

in relaying feedback signals because the absence of correlation is readily consis-

tent with the well-known modes of regulation such as induction (promotion)

and repression. The latter nonlinear (presumably deterministic) functional

relationships become fuzzy in microarray “snapshots” because of various random effects, including their asynchronous occurrences in different cells, and are amenable to statistical analysis only in specially designed experiments. This

is the reason why it is natural to focus on stochastic dependence when making

inferences on gene interactions from “static” microarray data.

The information on the two types of dependencies can be utilized to further characterize a set of differentially expressed genes resulting from two-sample comparisons. When comparing Group 1 and Group 2 of patients with

leukemia, a total of 342 genes were declared differentially expressed. We gen-

erated 1000 subsamples to select only stable (with 100% selection frequency)

Type A pairs and discarded all genes that were classified as MODs by our

method. This ad hoc procedure left us with as few as 49 finally selected genes.

The names of these genes are listed in the Appendix. We note among the 49

genes selected that specific classes are preferentially enriched. For example,

genes annotated as transcription factors or adhesion molecules show frequency increases of approximately 50% and 100%, respectively. Similarly, the frequency of genes regulating various aspects of metabolism also increases by 100% in the set of 49 genes.

It is important to note that our approach did not select the genes with

the largest differences in expression levels. Indeed, there were only 3 out of

the 49 genes found among the 49 most differentially expressed genes with

the smallest adjusted p-values. One of them (TCL1A) was the 6th top gene

in terms of differential expression while the other two (GALNAC4S-6ST and

ENG) were the 40th and 42nd, respectively. When looking at annotations of

the 49 most differentially expressed genes, we found that this set, comprised

mostly of modulators, is enriched with signal transduction factors (comprising

24.5% of the set membership) while the frequencies of transcription factors

and adhesion molecules are significantly reduced in comparison with the set

of 49 genes selected by our criterion.
