Page 1

On the assessment of statistical significance of

three-dimensional colocalization of sets of

genomic elements

Daniela M. Witten1,* and William Stafford Noble2,3,*

1Department of Biostatistics,2Department of Genome Sciences and3Department of Computer Science and

Engineering, University of Washington, Seattle, WA 98109

Received November 15, 2011; Revised December 23, 2011; Accepted December 26, 2011

ABSTRACT

A growing body of experimental evidence supports

the hypothesis that the 3D structure of chromatin

in the nucleus is closely linked to important func-

tional processes, including DNA replication and

gene regulation. In support of this hypothesis,

several research groups have examined sets of

functionally associated genomic loci, with the aim

of determining whether those loci are statistically

significantly colocalized. This work presents a

critical assessment of two previously reported

analyses,bothofwhich

DNA–DNA interaction data from the yeast Saccha-

romyces cerevisiae, and both of which rely upon a

simple notion of the statistical significance of

colocalization.Weshow

analyses rely upon a faulty assumption, and we

proposeacorrectnon-parametric

approach to the same problem. Applying this

approach to the same data set does not support

the hypothesis that transcriptionally coregulated

genes tend to colocalize, but strongly supports the

colocalization of centromeres, and provides some

evidence of colocalization of origins of early DNA

replication, chromosomal breakpoints and transfer

RNAs.

usedgenome-wide

thattheseprevious

resampling

INTRODUCTION

Recently, three published studies have used generaliza-

tions of chromosome conformation capture (3C) (1) to

obtain genome-wide DNA–DNA interaction data for

the genomes of human (2), budding yeast (3) and fission

yeast (4). Such methods, coupled with complementary

experimentalassayssuchasfluorescenceinsitu

hybridization (FISH) (5), DNA adenine methyltransferase

identification (DamID) (6) and chromatin interaction

analysis by paired end tag sequencing (ChIA-PET) (7),

promise to provide an increasingly detailed picture of

the 3D structure of chromatin in vivo.

Ultimately, the widespread and growing interest in the

experimental characterization of chromatin structure is

driven by the underlying hypothesis that the structure of

DNA in the nucleus is tightly related to DNA function.

Experimental evidence supports the existence of a variety

of well-defined nuclear substructures, including the

nuclear lamina, nucleoli, PML and Cajal bodies and

nuclear speckles (8). Furthermore, in some genomes, ex-

tensive evidence suggests the existence of relatively

well-defined chromosome territories, as well as the system-

atic orientation of gene-poor, heterochromatic regions

near the nuclear periphery and gene-dense, euchromatic

regions in the nuclear interior (9). Strikingly, the overall

pattern of nuclear architecture varies systematically

among cell types yet shows evidence of evolutionary con-

servation (10). Finally, increasing evidence couples the

dynamic repositioning of genomic regions with the regu-

lation of gene expression [reviewed in (8)].

In this article, we do not argue against the hypothesis

that chromatin structure is coupled with genome function.

However, we do present a cautionary tale illustrating a

potential statistical pitfall in the search for connections

between gene function and genome structure. In particu-

lar, we investigate two recent claims about nuclear

colocalization of functional elements in the budding

yeast Saccharomyces cerevisiae. The first article, published

in Nature and coauthored by one of us (Noble), claims

that there are extensive interchromosomal interactions

between transfer RNA genes, centromeres, chromosomal

breakpoints, origins of early DNA replication and sites

where chromosomal breakpoints occur (3). The second,

published in Nucleic Acids Research, claims that many

transcription factors regulate genes that are colocalized

*To whom correspondence should be addressed. Tel: +1 206 543 8930; Fax: +1 206 685 7301; Email: noble@gs.washington.edu

Correspondence may also be addressed to Daniela M. Witten. Tel: +206 616 7182; Fax: +206 543 3286; Email: dwitten@u.washington.edu

Published online 20 January 2012Nucleic Acids Research, 2012, Vol. 40, No. 93849–3855

doi:10.1093/nar/gks012

? The Author(s) 2012. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/

by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 2

in the nucleus (11). We show here that the statistical test

employed by both sets of authors rests upon a faulty as-

sumption, and we illustrate the effect of this faulty as-

sumption via simulations and via reanalysis of the yeast

data. Finally, we propose a correct resampling approach

to the same problem, and we apply this non-parametric

procedure to the same data. Our reanalysis moderately

impacts the conclusions from Duan et al. (3): contrary

to the initial analysis, we do not observe evidence of

telomere colocalization; however, we do observe strong

evidence for the colocalization of centromeres, as well as

statistical support for the colocalization of chromosomal

breakpoints, transfer RNAs and origins of early DNA

replication. In contrast, the resampling analysis does not

provide support for the central claim in the Dai and Dai

(11) paper, namely, that transcriptionally coregulated

genes tend to colocalize. The statistical analysis of

three-dimensional genome structure data sets must be per-

formed with care in order to avoid being misled by the

inherent structure of such data.

MATERIALS AND METHODS

Yeast interaction data and functional elements

We obtained from a recent study (3) a list of yeast

interchromosomal interactions observed at a false discov-

ery rate below 0.01, obtained by measuring interactions

among 3991 segments (chromosomal loci flanked by

pre-defined restriction enzyme sites) distributed through-

out the yeast genome. We also obtained the genomic co-

ordinates of centromeres, telomeres, transfer RNAs,

chromosomal breakpoints and origins of early DNA rep-

lication, all of which were studied in Duan et al. (3). In

addition, we obtained 174 gene sets, each of which

contains at least 20 yeast genes coregulated by a single

transcription factor,that

colocalization using the yeast interaction data (11).

wererecently testedfor

The hypergeometric approach for assessing gene set

colocalization

Duan et al. (3) and Dai and Dai (11) take the following

approach for assessing the extent to which various

genomic functional groups (e.g. centromeres, telomeres,

genes coregulated by a single transcription factor)

colocalize in the nucleus. For simplicity, we will refer to

the elements of a genomic functional group as ‘genes’,

though this need not be the case. Suppose that there are

a total of N genes, of which n belong to the gene set of

interest. Let M denote the number of all possible

interchromosomal interactions between the N genes, and

let K denote the actual number of experimentally observed

interchromosomal interactions between the N genes. Let

m denote the number of all possible interchromosomal

interactions between the n genes in the gene set of

interest, and let k denote the actual number of experimen-

tally observed interchromosomal interactions between the

n genes in the gene set of interest. Then, the authors claim

that the probability of observing k interchromosomal

interactions among the genes in the gene set is derived

from a hypergeometric distribution; that is, the probabil-

ity is

? ?M?m

K

m

kK?k

M

? ?

??

:

ð1Þ

Hence, they conclude that the probability associated with

observing at least k interchromosomal interactions among

the genes in the gene set is

? ?M?m

K

1 ?

X

k?1

x¼0

m

xK?x

M

? ?

??

:

ð2Þ

Applying Equation 2 to a candidate set of n genes yields a

P-value indicating whether the n genes colocalize in the

nucleus.

Though both Duan et al. (3) and Dai and Dai (11) used

a hypergeometric test to assess colocalization of genomic

elements and gene sets, there is a slight discrepancy in the

way that the two sets of authors defined the concept of an

‘interchromosomal interaction’. In order to illustrate the

difference we discuss the definition of K, the actual

number ofobserved interchromosomal

Duan et al. (3) computed K by summing, for each pair

of genomic elements that lie on different chromosomes,

the number of segments in the first genomic element that

interacted with a segment in the second genomic element

at a false discovery threshold below 0.01. On the other

hand, Dai and Dai (11) computed K by counting the

number of pairs of genes on different chromosomes for

which at least one segment in the first gene interacted with

at least one segment in the second gene at a false discovery

threshold below 0.01. In reassessing the evidence for

colocalization of the genomic elements and gene sets

studied by the two sets of authors, we defined the

concept of interchromosomal interactions as did each set

of authors.

interactions.

A resampling method for assessing gene set colocalization

Let n1,...,nIdenote the number of genes in the gene set

of interest that belong to each of the I chromosomes.

Note that

i¼1ni¼ n. We propose the following non-

parametric resampling approach for assessing gene set

colocalization:

PI

(1) Compute k, the number of experimentally observed

interchromosomal interactions among the genes in

the gene set of interest.

(2) Computem,the number

chromosomal interactions among the genes in the

gene set of interest.

(3) For b=1,..., B, where B is a large integer, such as

1000:

(a) For the i-th chromosome, draw nigenes uniform-

ly at random, without replacement, from the

genes on this chromosome. Repeat for each

chromosome, so that

been drawn. This is a ‘random gene set’.

(b) Compute k*b, the number of experimentally

observed interchromosomal interactions among

the genes in the random gene set.

of possibleinter-

PI

i¼1ni¼ n genes have

3850Nucleic Acids Research, 2012,Vol.40, No. 9

Page 3

(c) Compute m*b, the number of possible inter-

chromosomal interactions among the genes in

the random gene set.

(4) The P-value for the gene set of interest is given by

1

B

X

B

b¼1

1

k?b

m?b?k

m

??;

ð3Þ

where 1k?b=m?b?k=m

ð

equals 1 if k?b=m?b? k=m, and 0 otherwise.

Note that in Step 3(a), we ensure that the number of

genes on each chromosome in our random gene set is the

same as the number of genes on each chromosome in the

gene set of interest. Essentially, our resampling approach

computes a P-value by comparing the k/m ratio observed

for the gene set of interest to the ratios obtained on arbi-

trary sets of genes. This is a P-value for the null hypothesis

that the given gene set of interest shows no more

colocalization than a randomly-chosen set of genes. We

note that this approach for P-value calculation is not in-

herently novel, and indeed a similar approach was taken in

a recent paper (4). The use of resampling approaches for

hypothesis testing is discussed in more general terms in

Efron and Tibshirani (12).

Þis an indicator variable that

Generation of random gene sets in yeast interaction data

To assess the characteristics of the hypergeometric and

resampling-based P-values on data generated under the

null hypothesis of no gene set colocalization, we generated

1000 random gene sets. Each random gene set was

obtained by selecting one of the 174 gene sets from the

Dai and Dai paper and drawing that number of genes,

without replacement, from the full set of genes. That is,

each random gene set contained the same number of genes

as one of the real gene sets known to be coregulated by a

single transcription factor.

Generation of random interaction data

We repeated the following experiment 100 times, in order

to generate 100 random interaction data sets. We

generated the 3D positions of 1000 ‘genes’ independently

from the uniform distribution in the unit cube. We

computed a 1000?1000 interaction matrix between

these genes, where interactions were declared between

each pair of genes whose Euclidean distance was among

the smallest 10% of observed Euclidean distances. We

then created 250 random gene sets, each of which was

obtained by drawing 100 genes without replacement

from the full set of 1000 genes.

RESULTS

Hypergeometric P-values are inappropriate for assessing

gene set colocalization

If the hypergeometric P-values derived from Equation 2

are valid, then the P-value associated with an arbitrary

selection of n genes should be drawn from a uniform

distribution. We used the interaction data described in

Duan et al. (3) to assess whether P-values obtained in

this way are indeed uniform. We generated 1000 arbitrary

gene sets (details in ‘Materials and Methods’ section) and

computed hypergeometric P-values using Equation 2. A

histogram of these P-values is displayed in Figure 1(a).

These P-values are decidedly non-uniform—there are far

too many extreme P-values.

To further investigate

hypergeometric P-values, we generated a simple simulated

data set consisting of 1000 randomly generated observa-

tions in the 3D unit cube, which we used to generate an

interactionmatrix,as described

Methods’ section. Gene sets were selected at random,

and the histogram of hypergeometric P-values computed

according to Equation 2 is given in Figure 1(b). Once

again, the P-values are far from uniform.

What is wrong with using a hypergeometric P-value to

assess colocalization of a given gene set? Such a P-value is

based upon a 2?2 contingency table, shown in Table 1.

The units that contribute to the contingency table are gene

pairs; in Table 1, there are a total of M=a+b+c+d gene

pairs. A fundamental assumption that underlies the use of

the hypergeometric distribution is that each gene pair in

the contingency table is independent from all other gene

pairs. That is, we can think of each gene pair as having

two associated pieces of information: whether or not an

interaction was observed for that gene pair (a binary

variable, xifor the i-th gene pair), and whether or not it

is in the gene set of interest (a binary variable, zifor the

i-th gene pair). For the hypergeometric distribution to be

valid, we need (x1, z1), (x2, z2),...,(xM, zM) to be independ-

ent and identically distributed (13).

But it is not hard to see that the assumption of inde-

pendence is grossly violated in at least two ways. First, to

see that x1,..., xMare not independent, note that if there is

an interaction between the gene pair (i, j), and also

between the gene pair (i, k), then the likelihood that

there also is an interaction between the gene pair (j, k) is

higher than the likelihood of interaction for a randomly

selected pair of genes. This is because if the i-th gene is

located near the j-th gene in 3D space, and the i-th gene is

located near the k-th gene in 3D space, then the j-th and

k-th genes must also be located near each other in 3D

space. Second, to see that z1,...,zMare not independent,

note that if the gene pair (i, j) is in the gene set, and also

the gene pair (i, k) is in the gene set, then it must be the

case that (j, k) is in the gene set. Given that the independ-

ence assumption underlying the hypergeometric distribu-

tion is violated in the context of assessing gene set

colocalization, it should come as no surprise that the

hypergeometric P-values are invalid. It is for this reason

that the P-values observed in Figures 1a and 1b are

non-uniform.

the propertiesofthe

in‘Materials and

A valid, resampling approach for calculating P-values for

gene set colocalization

In order to obtain a valid P-value for the extent of

colocalization of a set of n genes, we can compare the

number of experimentally observed interchromosomal

Nucleic Acids Research,2012, Vol.40, No. 93851

Page 4

interactions among genes in this gene set to the distri-

butionof the numberof

interchromosomal interactions that results from a set

of n genes drawn at random from the full set of N

genes. We cannot, unfortunately, compute the corres-

ponding P-value analytically, as was the case for the

hypergeometric P-value; however, it is straightforward

to calculate the P-value using a resampling approach.

Detailsofour proposed

‘Materials and Methods’ section.

To assess the validity of our approach, we first

computed resampling-based P-values (Equation 3) on

randomly selected gene sets of the yeast interaction

dat of Duan et al. (3), as described in ‘Materials and

experimentallyobserved

procedurearegiven in

Methods’ section. The resulting P-values are shown in

Figure 1d. As expected, because the gene sets were

chosen at random, the resulting P-values have a uniform

distribution. We repeatedly generated such random gene

sets and found that the quantiles of the P-values obtained

verycloselymatched the

distribution.

We next computed resampling-based P-values for

the simulated data set consisting of 1000 ‘genes’ uniformly

distributed in the unit cube (details given in Materials

and Methods). The P-values that result almost per-

fectly match the quantiles of a uniform distribution,

as expected (Figure 1e; the apparent excess of P-values

in the left-most bin is due to the discreteness of the

P-values).

Of course, the fact that the resampling-based P-values

are uniformly distributed under the null hypothesis does

not provide sufficient evidence of their adequacy: it must

also be shown that they are small in the presence of

colocalization, i.e. under the alternative. To assess this,

we generated gene sets under the alternative by choosing

a gene at random, and then selecting the 100 genes nearest

to it in terms of Euclidean distance. Each of the resulting

gene sets had an extremely small P-value, indicating that

the resampling-based P-values have power to reject the

null hypothesis when there is indeed evidence of

colocalization.

quantilesofa uniform

(a)

0.0 0.20.4 0.6 0.8 1.00.0 0.20.4 0.60.8 1.0 0.00.2 0.4 0.60.81.0

0.00.2 0.40.6 0.81.00.0 0.2 0.40.6 0.81.0 0.0 0.20.40.6 0.8 1.0

0

50

100 150 200 250 300

(b)

0

1000

2000

3000

4000

(c)

0

10

20

30

40

50

60

70

(d)

0

10

20

30

40

50

60

(e)

0

200

600

1000

1400

(f)

0

5

10

15

20

25

30

Figure 1. (a)–(c) show histograms of hypergeometric P-values. In panel (a), the P-values are computed for 1000 random gene sets with respect to the

yeast interaction data set of Duan et al. (3). In panel (b) the P-values are computed with respect to a simulated data set for 250 random sets of 100

genes. In (c), the P-values correspond to 174 gene sets regulated by a single transcription factor and studied in (11), computed with respect to the

yeast interaction data set of (3). Panels (d)–(f) are analogous to panels (a)–(c), but the P-values are computed using the resampling approach. In each

case, the resampling-based P-values provide no evidence of colocalization of gene sets.

Table 1. The hypergeometric test is based upon a 2?2 contingency

table of gene pairs

InteractionNo interaction

In gene set

Not in gene set

a

c

b

d

Each element in the contingency table indicates the number of gene

pairs corresponding to the associated row and column. For instance,

there are a gene pairs in the gene set for which an interaction was

observed, and d gene pairs not in the gene set for which no interaction

was observed.

3852 Nucleic Acids Research, 2012,Vol.40, No. 9

Page 5

Reanalysis of the colocalization of transcriptionally

regulated yeast genes

Dai and Dai (11) recently examined the yeast interaction

data of Duan et al. (3) in order to determine the extent to

which sets of genes coregulated by a single transcription

factor tend to colocalize. They identified 174 transcription

factors, each of which regulated at least 20 genes. They

applied the hypergeometric P-values (Equation 2) in order

to assess the colocalization of each set of coregulated

genes, and found that 34 sets of coregulated genes had

P-values below 0.01. We repeated the analysis of Dai

and Dai, and confirmed their finding that a substantial

number of the coregulated gene sets had very small

hypergeometric P-values (Figure 1c). However, on the

basis of resampling-based P-values, there is essentially

no evidence of colocalization of sets of coregulated

genes (Figure 1f). None of the 174 gene sets had a

resampling-based P-value below 0.01 after Bonferroni cor-

rection.Thehypergeometric

P-values displayed in Figures 1(c) and 1(f), as well as the

transcription factors regulating each of the 174 gene sets,

can be found in Supplementary Table S1.

Not surprisingly, because both the hypergeometric and

resampling-based P-values are based upon the number of

experimentally observed interchromosomal interactions in

a given gene set of interest, these two types of P-values are

highly correlated with each other: their Spearman correl-

ation is 0.969. Therefore, the choice of P-value does not

affect the relative ranking of evidence for gene set

colocalization as much as it does the absolute amount of

evidence for gene set colocalization.

and resampling-based

Reanalysis of the colocalization of functional elements in

the yeast genome

Duan et al. (3) assessed the extent to which certain

genomicfunctional groups—centromeres,

transfer RNAs, chromosomal breakpoints and origins

of early DNA replication—tend to colocalize in the

nucleus. Of 14 such functional groups, they found that

10 colocalize in thenucleus,

hypergeometric P< 0.01 after Bonferroni correction. (In

a related analysis,no

colocalization of genesthat

Ontology terms.) We computed the resampling-based

P-values for each of these 14 functional groups, as

described in the ‘Materials and Methods’ section, with

the following approach for drawing random functional

elements: for each of the n functional elements of

interest (e.g. n=32 telomeres) we repeatedly drew a

‘random’ functional element from the corresponding

chromosome, of the same length as the true functional

element of interest, uniformly at random along the

length of the chromosome. (In contrast, in our reanalysis

of the Dai and Dai data (11) we generated ‘random’ genes

by drawing genes, without replacement, from the full set

of genes on a given chromosome. Unfortunately, it is not

possible to take that approach in our reanalysis of the

Duan et al. (3) paper, due to the nature of the functional

elements considered. The approach that we took instead is

a natural alternative.)

telomeres,

asevidencedbya

evidencewasfoundfor

share variousGene

In the Duan et al. (3) study, 10 groups of functional

elements showed significant evidence of colocalization ac-

cording to the Bonferroni adjusted hypergeometric test. In

our reanalysis (Figure 2), three of these groups are no

longersignificant afterBonferonni

complete set of telomeres, one of the two sets of

early-firing origins, and one of the two sets of chromosom-

al breakpoints. Thus, our results suggest that (i) Duan

et al. (3) incorrectly concluded that telomeres exhibit

colocalization, and (ii) the evidence for colocalization of

early-firing origins and for chromosomal breakpoints is

weaker than initially reported.

adjustment:the

DISCUSSION

In two recent papers, Duan et al. (3) and Dai and Dai (11)

assessed the extent to which certain functional genomic

elements colocalize, using P-values derived from a

hypergeometric distribution. We have shown here that

such hypergeometric P-values are flawed. The assump-

tions of the hypergeometric distribution are inappropriate

in this setting, and consequently hypergeometric P-values

computed on random gene sets are far from uniform. We

then presented an alternative, resampling-based P-value

calculation approach that is suitable for this setting.

These resampling-based P-values indicate a complete

lack of evidence that the 174 coregulated gene sets

studied in Dai and Dai (11) colocalize in the nucleus.

However, they do support the hypothesis that centromeres

colocalize, and provide some evidence in support of

colocalization of other functional genomic elements.

In the current study, we reassessed the extent to which

target genes of 174 TFs, considered by Dai and Dai (11),

exhibit colocalization. We did not investigate several other

results in that study that were also based on a

hypergeometric test: that only one TF shows significant

colocalization based on intrachromosomal interactions,

that 5 of 158 TFs measured via ChIP-chip show

evidence of colocalization of their targets, and that

various classes of chromatin regulatory genes—histone

modification regulated genes, genes whose promoters

exhibit high chromatin remodeler occupancy, genes that

show expression changes in response to chromatin re-

modeler perturbation, genes

occupied by nucleosomes, genes containing histone

variant H2A.Z, and genes with high trans effects on

gene expression divergence—are colocalized. We are not

claiming that coregulated gene sets do not colocalize in the

nucleus; we are simply stating that there does not appear

to be evidence in the Duan et al. (3) data set of

colocalization of the 174 gene sets studied by Dai and

Dai (11).

The implications of our reanalysis for the claims made

in the Duan et al. paper are relatively minor. The primary

colocalization claims in that paper—regarding centro-

meres, telomeres, tRNAs, breakpoints and origins of rep-

lication—were based primarily upon the qualitative

assessment of a set of receiver operating characteristic

curves (Figure 4d of that paper). This analysis was aug-

mented by a set of hypergeometric tests, reported in their

whosepromoters are

Nucleic Acids Research,2012, Vol.40, No. 93853