
BIOINFORMATICS ORIGINAL PAPER

Vol. 26 no. 21 2010, pages 2752–2759

doi:10.1093/bioinformatics/btq511

Systems biology

Robust and accurate data enrichment statistics via distribution function of sum of weights

Aleksandar Stojmirović and Yi-Kuo Yu∗

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda,

MD 20894, USA

Associate Editor: Trey Ideker

Advance access publication September 8, 2010

ABSTRACT

Motivation: Term-enrichment analysis facilitates biological interpre-

tation by assigning to experimentally/computationally obtained data

annotation associated with terms from controlled vocabularies. This

process usually involves obtaining statistical significance for each

vocabulary term and using the most significant terms to describe a

given set of biological entities, often associated with weights. Many

existing enrichment methods require selection of an (arbitrary) number of the most significant entities and/or do not account for weights

of entities. Others either mandate extensive simulations to obtain

statistics or assume normal weight distribution. In addition, most

methods have difficulty assigning correct statistical significance to

terms with few entities.

Results: Implementing the well-known Lugannani–Rice formula, we

have developed a novel approach, called SaddleSum, that is free

from all the aforementioned constraints and evaluated it against

several existing methods. With entity weights properly taken into

account, SaddleSum is internally consistent and stable with respect

to the choice of number of most significant entities selected.

Making few assumptions on the input data, the proposed method

is universal and can thus be applied to areas beyond analysis

of microarrays. Employing asymptotic approximation, SaddleSum

provides a term-size-dependent score distribution function that gives

rise to accurate statistical significance even for terms with few

entities. As a consequence, SaddleSum enables researchers to place

confidence in its significance assignments to small terms that are

often biologically most specific.

Availability: Our implementation, which uses Bonferroni correction to account for multiple hypotheses testing, is available at http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/mn/enrich/. Source code for the standalone version can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/SaddleSum/.

Contact: yyu@ncbi.nlm.nih.gov

Supplementary information: Supplementary materials are available at Bioinformatics online.

Received on May 19, 2010; revised on July 30, 2010; accepted on August 31, 2010

∗To whom correspondence should be addressed.

1 INTRODUCTION

A major challenge of contemporary biology is to ascribe interpretation to high-throughput experimental or computational results, where each considered entity (gene or protein) is assigned a value.

Biological information is often summarized through controlled

vocabularies such as Gene Ontology (GO; Ashburner et al., 2000),

where each annotated term includes a list of entities. Let w denote

a collection of values, each associated with an entity. Given w

and a controlled vocabulary, enrichment analysis aims to retrieve

the terms that by statistical inference best describe w, that is, the

terms associated with entities with atypical values. Many enrichment

analysis tools have been developed primarily to process microarray

data (Huang et al., 2009). In terms of biological relevance, the

performance assessment of those tools is generally difficult. It

requires a large, comprehensive ‘gold standard’ vocabulary together

with a collection of w’s processed from experimental data, and

with true/false positive terms corresponding to each w correctly

specified. This invariably introduces some degree of circularity

because the terms often come from curating experimental results.

Before declaring efficacy in biological information retrieval that is

non-trivial to assess, an enrichment method should pass at least the

statistical accuracy and internal consistency test.

In their recent survey, Huang et al. (2009) list 68 distinct

bioinformatic enrichment tools introduced between 2002 and 2008.

Most tools share a similar workflow: given w obtained by suitably

processing experimental data, they sequentially test each vocabulary

term for enrichment to obtain its P-value (the likelihood of a false

positive given the null hypothesis). Since many terms are tested,

a multiple hypothesis correction, such as Bonferroni (Hochberg

and Tamhane, 1987) or false discovery rate (FDR; Benjamini and

Hochberg, 1995), is applied to P-value of each to obtain the final

statistical significance. The results are displayed for the user in a

suitable form outlining the significant terms and possibly relations

between them. Note that the latter steps are largely independent from

the first. To avoid confounding factors, we will focus exclusively on

the original enrichment P-values.

Based on the statistical methods employed, the existing

enrichment tools can generally be divided into two main classes.

The singular enrichment analysis (SEA) class contains numerous

tools that form the majority of published ones (Huang et al., 2009).

By ordering values in w, these tools require users to select a number

of top-ranking entities as input and mostly use hypergeometric

distribution (or equivalently Fisher’s exact test) to obtain the term

P-values. After the selection is made, SEA treats all entities equally,

ignoring their value differences.

The gene set analysis (GSA) class was pioneered by the gene set

enrichment analysis (GSEA) tool (Mootha et al., 2003; Subramanian

et al., 2005). Tools from this class use all values (entire w) to

Published by Oxford University Press on behalf of The US Government 2010.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/

by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


calculate P-values and do not require preselection of entities.

Some approaches (Al-Shahrour et al., 2007; Blom et al., 2007;

Breitling et al., 2004; Eden et al., 2009) in this group apply

hypergeometric tests to all possible selections of top-ranking

entities. The final P-value is computed by combining (in a tool-

specific manner) the P-values from the individual tests. Other

approaches use non-parametric approaches: rank-based statistics

such as Wilcoxon rank-sum (Breslin et al., 2004) or Kolmogorov–

Smirnov like (Backes et al., 2007; Ben-Shaul et al., 2005; Mootha

et al., 2003; Subramanian et al., 2005). When weights are taken

into account, such as in GSEA (Subramanian et al., 2005),

statistical significance must be determined from a sampled (shuffled)

distribution. Unfortunately, limited by the number of shuffles that

can be performed, the smallest obtainable P-value is bounded away

from 0.

The final group of GSA methods computes a score for

each vocabulary term as a sum of the values (henceforth used

interchangeably with weights) of the m entities it annotates. In

general, the score distribution pdfm(S) for the experimental data is

unknown. By the Central Limit Theorem, when m is large, Gaussian

(Kim and Volsky, 2005; Smid and Dorssers, 2004) or Student’s

t-distribution (Boorsma et al., 2005; Luo et al., 2009) can be used to

approximate pdfm(S). Unfortunately, when the weight distributions

are skewed, the required m may be too large for practical use.

Evidently, this undermines the P-value accuracy of small terms

(meaning terms with few entities), which are biologically most

specific.

It is generally found that, given the same vocabulary and w,

different enrichment analysis tools report diverse results. We believe this may be attributed to disagreement in the P-values reported, as well as to the differing degrees of robustness (internal consistency) among methods. Instead of providing a coherent biological understanding, the array of diverse results undermines confidence in the information found. Furthermore, other than microarray datasets,

there exist experimental or computational results such as those from

ChIP-chip (Eden et al., 2007), deep sequencing (Sultan et al., 2008),

quantitative proteomics (Sharma et al., 2009) and in silico network

simulations (Stojmirović and Yu, 2007, 2009), that may benefit from

enrichment analysis. It is thus imperative to have an enrichment

method that reports accurate P-values, preserves internal consistency and allows investigation of a broader range of datasets.

To achieve these goals, we have developed a novel enrichment

tool, called SaddleSum, that is founded on the well-known Lugannani–Rice formula (Lugannani and Rice, 1980) and derives its statistics

from approximating asymptotically the distribution function of the

scores used in the parametric GSA class. This allows us to obtain

accurate statistics even in the cases where the distribution function

generating w is very skewed and for terms containing few entities.

The latter aspect is particularly important for obtaining biologically

specific information.

2 METHODS

2.1 Mathematical foundations for SaddleSum

We distinguish two sets: the set of entities N of size n and the controlled vocabulary V. Each term from V maps to a set M ⊂ N of size m < n. From experimental results, we obtain a set w = {w_j | j ∈ N} and ask how likely it is to randomly pick m entities whose sum of weights exceeds the sum Ŝ = Σ_{j∈M} w_j.

Assume that the weights in w come independently from a continuous probability space W with the density function p such that the moment generating function ρ(t) = ∫_W p(x) e^{tx} dx exists for t in a neighborhood of 0. The density of S, the sum of m weights arbitrarily sampled from w, can be expressed by the Fourier inversion formula

    pdf_m(S) = (1/2π) ∫_{−∞}^{∞} e^{mK(it) − itS} dt,    (1)

where K(t) = ln ρ(t) denotes the cumulant generating function of p. The tail probability or P-value for a score Ŝ is given by

    Prob(S ≥ Ŝ) = ∫_{Ŝ}^{∞} pdf_m(S) dS.    (2)

We propose to use an asymptotic approximation to (2), which improves with increasing m and Ŝ.

Daniels (1954) derived an asymptotic approximation for the density pdf_m through saddlepoint expansion of the integral (1), while the corresponding approximation to the tail probability was obtained by Lugannani and Rice (1980). Let φ(x) = exp(−x²/2)/√(2π) and Φ̄(x) = ∫_x^∞ φ(t) dt denote, respectively, the density and the tail probability of the Gaussian distribution. Let λ̂ be a solution of the equation

    Ŝ = m K′(λ̂).    (3)

Then, the leading term of the Lugannani–Rice approximation to the tail probability takes the form

    Prob(S ≥ Ŝ) = Φ̄(ẑ) + (1/ŷ − 1/ẑ) φ(ẑ) + O(m^{−3/2}),    (4)

where ŷ = λ̂ √(m K″(λ̂)) and ẑ = sgn(λ̂) √(2(λ̂Ŝ − m K(λ̂))). A summary of the derivation of (4) is provided in the Supplementary Materials.

Daniels (1954) has shown that Equation (3) has a unique simple root under most conditions and that λ̂ increases with Ŝ, with λ̂ = 0 for Ŝ = m⟨W⟩, where ⟨W⟩ = ∫_W x p(x) dx is the mean of W. While the approximation (4) is uniformly valid over the whole domain of p, its components need to be rearranged for appropriate numerical computation near the mean. When Ŝ ≫ m⟨W⟩, φ(ẑ)/ŷ dominates and the overall error is O(m^{−1}) (Daniels, 1987).

SaddleSum, our implementation of the Lugannani–Rice approximation for computing enrichment P-values, first solves Equation (3) for λ̂ using Newton's method and then returns the P-value using (4). The derivatives of the cumulant generating function are estimated from w: we approximate the moment generating function by ρ(t) ≈ (1/n) Σ_{j∈N} e^{t w_j}, and then K′(t) = ρ′(t)/ρ(t) and K″(t) = ρ″(t)/ρ(t) − (K′(t))². Since the same w is used to sequentially evaluate P-values of all terms in V, we retain previously computed λ̂ values in a sorted array. This allows us, using binary search, to reject many terms with P-values greater than a given threshold without running Newton's method, or to bracket the root of (3) for faster convergence. More details on the SaddleSum implementation and evaluations of its accuracy against some well-characterized distributions are in Section 2 of the Supplementary Materials. When run as a term-enrichment tool, SaddleSum reports an E-value for each significant term by applying Bonferroni correction to the term's P-value.
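The procedure just described can be summarized in a short Python sketch (ours, not the actual SaddleSum code): K and its derivatives are estimated from the empirical weights as above, Equation (3) is solved by Newton's method, and Equation (4) gives the tail probability. The sketch is valid away from the mean Ŝ = m⟨W⟩, where, as noted, the components of (4) would need rearrangement.

```python
import math

def saddlesum_pvalue(weights, score, m, tol=1e-9, max_iter=200):
    """Lugannani-Rice tail probability for the sum of m weights drawn
    from the empirical distribution of `weights` (illustrative sketch;
    valid away from the mean, i.e. score != m * mean(weights))."""
    n = len(weights)

    def moments(t):
        # sums giving rho(t), rho'(t), rho''(t), each up to a factor 1/n
        e = [math.exp(t * w) for w in weights]
        r0 = sum(e)
        r1 = sum(w * ew for w, ew in zip(weights, e))
        r2 = sum(w * w * ew for w, ew in zip(weights, e))
        return r0, r1, r2

    def K(t):  # cumulant generating function ln rho(t)
        return math.log(moments(t)[0] / n)

    def K1K2(t):  # K'(t) and K''(t)
        r0, r1, r2 = moments(t)
        k1 = r1 / r0
        return k1, r2 / r0 - k1 * k1

    # Solve Equation (3), score = m * K'(lam), by Newton's method
    lam = 0.0
    for _ in range(max_iter):
        k1, k2 = K1K2(lam)
        step = (score - m * k1) / (m * k2)
        lam += step
        if abs(step) < tol:
            break

    # Equation (4), leading term
    z = math.copysign(math.sqrt(max(2 * (lam * score - m * K(lam)), 0.0)), lam)
    y = lam * math.sqrt(m * K1K2(lam)[1])
    phi_z = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    gauss_tail = 0.5 * math.erfc(z / math.sqrt(2))  # Gaussian upper tail
    return gauss_tail + phi_z * (1.0 / y - 1.0 / z)
```

For weights sampled from a standard Gaussian, K(t) ≈ t²/2 and the sketch reproduces the exact Gaussian tail, as expected since the correction term in (4) then vanishes.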

2.2 GO

The assignment of human genes to GO terms was taken from the NCBI gene2go file (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz) downloaded on February 7, 2009. After assigning all genes to terms, we removed all redundant terms: if several terms mapped to the same set of genes, we kept only one such term. For our statistical experiments, we kept only the terms with no fewer than five mapped genes within the set of weights considered, and hence the number of processed terms varied for each realization of sampling (see below).


2.3 Information flow in protein networks

ITM Probe (Stojmirović and Yu, 2009) is an implementation of the framework for exploring information flow in interaction networks (Stojmirović and Yu, 2007). Information flow is modeled through discrete-time random walks with damping: at each step the walker has a certain probability of leaving the network. Although ITM Probe offers three modes (emitting, absorbing and channel), we only used the simplest, the emitting mode, to provide examples illustrating issues of significance assignment. The emitting mode takes as input one or more network proteins, called sources, and a damping factor α. For each protein node in the network, the model outputs the expected number of visits to that node by random walks originating from the sources, thus highlighting the network neighborhoods of the sources. The damping factor determines the average number of steps taken by a random walk before termination: α=1 corresponds to no termination, while α=0 leads to no visits apart from the originating node. For our protein–protein interaction network examples, we used the set of all human physical interactions from BioGRID (Breitkreutz et al., 2008), version 2.0.54 (July 2009). The network consists of 7702 proteins and 56400 unique interactions. Each interaction was represented by an undirected link. A link carries weight 2 if its two ends connect to the same protein and 1 otherwise.
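The emitting-mode quantity described above (expected visits under a damped walk) can be illustrated with a minimal sketch. This is our own toy version, not ITM Probe itself; it simply iterates the damped transition step from a single source over a small weighted adjacency matrix.

```python
def expected_visits(adj, source, alpha=0.7, n_steps=1000):
    """Expected number of visits to each node by damped random walks
    started at `source` (illustrative sketch of the emitting mode;
    `adj` is a symmetric matrix of link weights)."""
    n = len(adj)
    deg = [sum(row) for row in adj]          # total link weight per node
    visits = [0.0] * n
    dist = [0.0] * n
    dist[source] = 1.0                       # walker starts at the source
    for _ in range(n_steps):
        for i in range(n):
            visits[i] += dist[i]             # accumulate current occupancy
        new = [0.0] * n
        for i in range(n):
            if dist[i] == 0.0 or deg[i] == 0:
                continue
            for j in range(n):
                if adj[i][j]:
                    # with probability alpha the walker takes another step
                    new[j] += alpha * dist[i] * adj[i][j] / deg[i]
        dist = new
    return visits
```

On a two-node network with α=0.5, the source accumulates the geometric series 1 + 1/4 + 1/16 + … = 4/3 expected visits, matching the damped-walk picture above.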

2.4 Microarrays

From the NCBI Gene Expression Omnibus (GEO; Barrett et al., 2009), we retrieved human microarray datasets with expression log2 ratios (weights) provided, resulting in 34 datasets and 136 samples in total. For each sample, when multiple weights for the same gene were present, we took their mean instead. This resulted in a w where each gene is assigned a unique raw weight. For evaluations, we also used another version of w where negative weights were set to zero. This version facilitated investigation of upregulation while keeping the downregulated genes as part of the statistical background.

2.5 Evaluating accuracy of P-values

By definition, a P-value associated with a score is the probability of that score or better arising purely by chance. We tested the accuracy of P-values reported by enrichment methods via simulations on 'decoy' databases, which contained only terms with random gene assignments. For each term from the decoy database and each set of weights based on network or microarray data, we recorded the reported P-value and thus built an empirical distribution of P-values. If a method reports accurate P-values, the proportion of runs reporting P-values smaller than or equal to a P-value cutoff, which we term the empirical P-value, should be very close to that cutoff. We show the results graphically by plotting on the log–log scale the empirical P-value as a function of the cutoff.

For each given list of entities N, be it from the target gene set of a microarray dataset or the set of participating human proteins in the interaction network, we produced two types of decoy databases. The first type was based on GO. We shuffled gene labels 1000 times. For each shuffle, we associated all terms from GO with the shuffled labels to retain the term dependency. This resulted in a database with ∼5×10^6 terms (1000 shuffles times about 5000 GO terms). In the second type, each term, having the same size m, was obtained by sampling without replacement m genes from N. The databases of this type (one for each term size considered) contained exactly 10^7 terms. The evaluation query set of 100 w's from interaction networks was obtained by randomly sampling 100 proteins out of 7702 and running ITM Probe with each protein as a single source. The weights for source proteins were not considered since they were prescribed, not resulting from simulation. Each run used α=0.7, without excluding any nodes from the network. For microarrays, the set of 136 samples was used. Since both query sets are of size ∼10^2, the total number of w-term matches was ∼10^9.
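The empirical P-value at each cutoff can be computed from the recorded reported P-values with a few lines (a sketch; the function name is ours):

```python
import bisect

def empirical_pvalues(reported, cutoffs):
    """Proportion of decoy queries whose reported P-value is at or
    below each cutoff; for accurate statistics this tracks the cutoff."""
    srt = sorted(reported)
    n = len(srt)
    # bisect_right counts how many sorted values are <= each cutoff
    return [bisect.bisect_right(srt, c) / n for c in cutoffs]
```

For a method with exactly uniform reported P-values, the output equals the cutoffs, i.e. the curve follows the diagonal on the log–log plot.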

2.6 Student's t-test (used by GAGE and T-profiler)

Similar to SaddleSum, t-test approaches are based on a sum-of-weights score, but use the Student's t-distribution to infer P-values. As before, let w_j denote the weight associated with entity j ∈ N, let M denote the set of m entities associated with a term from the vocabulary and let M′ = N \ M. For any set S ⊆ N of size S, let x̄_S = (1/S) Σ_{j∈S} w_j denote the mean weight of entities in S and let s²_S = (1/(S−1)) Σ_{j∈S} (w_j − x̄_S)² be their sample variance.

The GAGE (Luo et al., 2009) enrichment tool uses a two-sample t-test assuming unequal variances and equal sample sizes to compare the means over N and M. The test statistic is

    t = (x̄_M − x̄_N) / √(s²_M/m + s²_N/m),    (5)

and the P-value is obtained from the upper tail of the Student's t-distribution with degrees of freedom

    ν = (m−1)(s²_M + s²_N)² / (s⁴_M + s⁴_N).

T-profiler (Boorsma et al., 2005) compares the means over M and M′ using a two-sample t-test assuming equal variances but unequal sample sizes. The pooled variance estimate is given by

    s² = ((m−1) s²_M + (n−m−1) s²_{M′}) / (n−2),

and the test statistic is

    t = (x̄_M − x̄_{M′}) / (s √(1/m + 1/(n−m))).

The T-profiler P-value is then obtained from the tail of the Student's t-distribution with ν = n−2 degrees of freedom.
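Both statistics can be sketched in a few lines of Python (our illustration, not the GAGE or T-profiler code; the final P-values would come from the Student's t tail, e.g. via a statistics library, and are omitted here):

```python
import math
from statistics import mean, variance

def gage_t(weights_M, weights_N):
    """GAGE statistic (Equation (5)): unequal-variance t-test with
    equal sample sizes m; returns (t, degrees of freedom)."""
    m = len(weights_M)
    vM, vN = variance(weights_M), variance(weights_N)
    t = (mean(weights_M) - mean(weights_N)) / math.sqrt(vM / m + vN / m)
    nu = (m - 1) * (vM + vN) ** 2 / (vM ** 2 + vN ** 2)
    return t, nu

def tprofiler_t(weights_M, weights_rest):
    """T-profiler statistic: pooled-variance two-sample t-test of M
    against its complement; returns (t, degrees of freedom n - 2)."""
    m, r = len(weights_M), len(weights_rest)
    n = m + r
    s2 = ((m - 1) * variance(weights_M)
          + (r - 1) * variance(weights_rest)) / (n - 2)
    t = ((mean(weights_M) - mean(weights_rest))
         / (math.sqrt(s2) * math.sqrt(1.0 / m + 1.0 / (n - m))))
    return t, n - 2
```

Note that `statistics.variance` computes the sample variance with the 1/(S−1) normalization used in the definitions above.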

2.7 Hypergeometric distribution

Methods based on the hypergeometric distribution or, equivalently, Fisher's exact test use only rankings of weights and require selection of 'significant' entities prior to calculation of the P-value. We first rank all entities according to their weights and consider the set C of the c entities with largest weights. The number c can be fixed (say 50), correspond to a fixed percentage of the total number of weights, depend on the values of weights, or be calculated by other means. The score Ŝ for the term M is given by the size of the intersection, C ∩ M, between C and M. This is equivalent to setting Ŝ = Σ_{j∈M} w_j with w_j = 1 for j ∈ C and 0 otherwise. The P-value for score Ŝ is

    Prob(S ≥ Ŝ) = Σ_{i=Ŝ}^{min(c,m)} C(m,i) C(n−m, c−i) / C(n,c),

where C(a,b) denotes the binomial coefficient. Hence, the P-value measures the likelihood of score Ŝ or better over all possible ways of selecting c entities out of N, with m entities associated with the term investigated.

In each of our P-value accuracy experiments, we used two variants of the hypergeometric method, one taking a fixed percentage of nodes and the other taking into account the values of weights. For microarray datasets, the first variant took 1% of available genes (HGEM-PN1), while the second selected genes with fourfold change or more (HGEM-F2). In experiments based on protein networks, we took 3% of available proteins (231 entities) for the first variant (HGEM-PN3) and used the participation ratio formula to determine c in the second (HGEM-PR). The participation ratio (Stojmirović and Yu, 2007) is given by the formula

    c = (Σ_{i∈N} w_i)² / Σ_{j∈N} w_j².

We chose a smaller percentage of weights for microarray-based data (1 versus 3% for data derived from networks) because the microarray datasets generally contained measurements for more genes than the number of proteins in the network.
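The tail sum above translates directly into Python (an illustrative sketch using the standard library's binomial coefficient):

```python
from math import comb

def hypergeom_pvalue(n, m, c, s_hat):
    """Upper-tail hypergeometric P-value: probability that at least
    s_hat of the c top-ranked entities belong to a term of size m,
    out of n entities in total."""
    # comb(a, b) returns 0 when b > a, so infeasible terms vanish
    return sum(comb(m, i) * comb(n - m, c - i)
               for i in range(s_hat, min(c, m) + 1)) / comb(n, c)
```

For example, with n=10, m=5 and c=5, the chance that all five selected entities fall in the term is 1/C(10,5) = 1/252.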


Fig. 1. Empirical P-values versus P-value cutoffs reported for investigated enrichment methods. Methods with accurate statistics have their curves follow

the dotted line closely over the whole range. Each curve was constructed by aggregating the results of ∼10^9 GO-based decoy term queries. Displayed on the

left (right) are results using weights derived from protein network information flow simulations (microarrays). In microarray plots for SaddleSum, T-profiler

and GAGE, full lines indicate the results where negative weights were set to 0, while dashed lines show the results using all weights. The reason that HGEM

curves run below the theoretical line and parallel to it is that every curve is an aggregate of many curves, each of which (i) represents a single sample of

weights determining parameters to be fed into hypergeometric distribution, and (ii) is a step function touching the theoretical line and dropping below it.

Merging curves from many samples produces the effect seen in our plots.

2.8 mHG score

Instead of making a single, arbitrary choice of c and applying the hypergeometric score, the mHG method implemented in the GOrilla package (Eden et al., 2009) considers all possible c's. The mHG score is defined as

    mHG = min_{1≤c≤n} Σ_{i=k}^{min(c,m)} C(m,i) C(n−m, c−i) / C(n,c),

where k is the number of entities annotated by the term M among the c top-ranked entities. The exact P-value for the mHG score is then calculated using a dynamic programming algorithm developed by Eden et al. (2007). For our experiments, we used an implementation in the C programming language that was derived from the Java implementation used by GOrilla. The implementation uses a truncated algorithm that gives an approximate P-value with improved running speed.
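For illustration, the mHG score can be computed by brute force over all prefixes of the ranked list (our sketch; GOrilla instead uses the dynamic programming algorithm mentioned above, and a separate computation converts the score to an exact P-value):

```python
from math import comb

def mhg_score(ranked_membership):
    """mHG score by brute force: the minimum hypergeometric tail over
    all prefix sizes c. `ranked_membership` lists 1 for term members
    and 0 otherwise, in descending weight order."""
    n = len(ranked_membership)
    m = sum(ranked_membership)
    best, k = 1.0, 0
    for c in range(1, n + 1):
        k += ranked_membership[c - 1]        # term members among top c
        tail = sum(comb(m, i) * comb(n - m, c - i)
                   for i in range(k, min(c, m) + 1)) / comb(n, c)
        best = min(best, tail)
    return best
```

When all three members of a three-gene term occupy the top three ranks out of six, the minimum is attained at c=3, giving 1/C(6,3) = 0.05.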

2.9 Retrieval stability with respect to choice of c

To evaluate the consistency of the investigated methods, we compared the sets of significant terms retrieved from GO using different numbers of non-zero weights as input. For each w, we sorted in descending order the weights associated with entities. With each c selected, we kept the c largest weights unchanged and set the remaining ones to 0 to arrive at a modified set of weights w|C. We did not totally exclude the lower weights but kept them under consideration to provide statistical background. We submitted w|C for analysis and obtained from each statistical method a set of enriched terms ordered by their P-value. In Figure 2A and Supplementary Figure S3, we display the actual five most significant terms retrieved with their P-values for selected examples of weight sets. To investigate on a larger scale the retrieval stability with respect to changes in c, we computed for each method the overlap between the sets of top 10 terms from two different c's for the w sets mentioned in 'Evaluating accuracy of P-values' and then took the average (Fig. 2B).
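The average overlap statistic used here can be sketched as follows (function and argument names are ours):

```python
def average_overlap(rankings_a, rankings_b, top=10):
    """Average size of the intersection between the top-`top` terms
    retrieved under two different choices of c, over many queries.
    Each argument is a list of per-query term rankings."""
    overlaps = [len(set(a[:top]) & set(b[:top]))
                for a, b in zip(rankings_a, rankings_b)]
    return sum(overlaps) / len(overlaps)
```

A perfectly stable method yields an average overlap of `top` for every pair of c's; the panels in Figure 2B color-code this quantity.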

3 RESULTS

We compared our SaddleSum approach against the following existing methods: Fisher's exact test (HGEM; Boyle et al., 2004), two-sample Student's t-test with equal (T-profiler; Boorsma et al., 2005) and unequal (GAGE; Luo et al., 2009) variances, and the mHG score (Eden et al., 2007, 2009). Based on data from both microarrays and simulations of information flow in protein networks, the comparison shown here encompassed (in order of importance) evaluation of P-value accuracy, ranking stability and running time. An accurate P-value reflects the likelihood of a false identification and thus allows for comparison between terms retrieved even across experiments. Incorrect P-values therefore render ranking stability and algorithmic speed pointless. Accurate P-values without ranking stability question the robustness of biological interpretation. For pragmatic use of an enrichment method, even with accurate statistics and stability, it is still important to have reasonable speed.

3.1 Accuracy of reported P-values

The term P-value reported by an enrichment analysis method provides the likelihood for that term to be enriched within w. To infer biological significance using statistical analysis, it is essential to have accurate P-values. We analyzed the accuracy of P-values reported by the investigated approaches through simulating ∼10^9 queries and comparing their reported and empirical P-values. Results based on querying databases with fixed term sizes are shown in Supplementary Figures S1 and S2. Shown in Figure 1 are the results for querying GO-based gene-shuffled term databases, which retain the structure of the original GO as a mixture of terms of different sizes organized as a directed acyclic graph where small terms are included in larger ones. The curves for all methods in Figure 1, therefore, resemble a mixture of curves from Supplementary Figures S1 and S2, albeit weighted toward smaller sized terms.

For weights from both network simulations and microarrays, SaddleSum as well as the methods based on Fisher's exact test (HGEM and mHG) report P-values that are acceptable (within one order of magnitude from the theoretical values). For HGEM and mHG, this is not surprising because our experiments involved shuffling entity labels and hence followed the null model of the hypergeometric distribution. On the other hand, the null model


Fig. 2. P-value consistency and retrieval stability. (A) The output of ITM Probe emitting mode with human MLL protein (histone methyltransferase subunit) as the source (top) and the log2 ratios from the human T-cell signaling microarray GSM89756 (bottom) were processed by each of the five investigated statistical methods with varying numbers of weighted entities included for analysis (All and Pos include all entities; All uses raw weights, while Pos sets all negative weights to 0). The P-values for GO terms from the union of the sets of top five hits for each method and different numbers of selected entities are indicated by the colors of the corresponding cells. Red dots show the actual top five hits for the method represented by that column. (B) Degree of overlap between sets of significant GO terms. Each panel corresponds to a single method with different numbers of entities used for analysis, with the results from microarray queries shown in the upper triangle and those based on network flow shown in the lower triangle. The color in each cell indicates the average pairwise overlap between the two sets of top ten terms retrieved. For example, consider the light orange cell (horizontally labeled 100 and vertically labeled 500) in the mHG panel. It indicates that, on average, the top ten terms retrieved by mHG using the top 100 and top 500 network flow proteins share about three common terms.
