NEAT: an efficient Network Enrichment Analysis Test

Mirko Signorelli (1,2), Veronica Vinciotti (3), Ernst C. Wit (1)

(1) Johann Bernoulli Institute, University of Groningen, Netherlands

(2) Department of Statistical Sciences, University of Padova, Italy

(3) Department of Mathematics, Brunel University London, United Kingdom

E-mail for correspondence: m.signorelli@rug.nl

REFERENCE ARTICLE:

After inclusion in the IWSM Conference Proceedings, an extended version of this article was published in BMC Bioinformatics as

Signorelli, M., Vinciotti, V., Wit, E. C. (2016). NEAT: an efficient network enrichment analysis test. BMC Bioinformatics, 17:352. DOI: 10.1186/s12859-016-1203-6.

Please refer to the article in BMC Bioinformatics for citation purposes as well as for a wider overview of NEAT.

Abstract: Network enrichment analysis (NEA) integrates gene enrichment analysis with information on dependences between genes. Existing tests for NEA rely on normality assumptions, can deal only with undirected networks and are computationally slow. We propose NEAT, an alternative test based on the hypergeometric distribution. NEAT can also be applied to directed and mixed networks, and it is faster and more powerful than existing NEA tests.

Keywords: networks; enrichment analysis; gene expression.

1 Introduction

When the first data on gene expression became available, they were analysed considering each gene separately. However, researchers soon realized that genes act in a concerted manner, and that cellular processes are often the result of complex interactions between different genes and molecules. Nowadays, sets of genes that are responsible for many cellular functions have been identified, and are collected in publicly available databases (such as GO and KEGG). These sets of genes, whose function is already known, can be used to characterize and interpret (“enrich”) the results of new experiments. This characterization is typically done by means of gene enrichment analysis (GEA) tests, which compare gene expression levels between two conditions (experimental and control) and detect functional sets of genes that are activated or repressed in the experimental condition. The power of GEA tests is often low, mostly because they consider only the level of overlap between sets of genes, and they ignore associations and dependences that exist between genes.

This paper was published as a part of the proceedings of the 31st International Workshop on Statistical Modelling, INSA Rennes, 4–8 July 2016. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).

Recently, Alexeyenko et al. (2012) and McCormack et al. (2013) have proposed to integrate GEA with information on dependences between genes by making use of gene networks. The idea is that “enrichment” between two sets of genes $A$ and $B$ can be assessed by comparing the number of links connecting nodes in $A$ and $B$, $n_{AB}$, with a reference distribution that assumes that no relation exists between the two sets. Their tests rely on a normal approximation for the reference distribution (which is discrete), they require the computation of many network permutations (an activity that can be highly time consuming), and they are restricted to the analysis of undirected networks.

In what follows we propose NEAT, an alternative Network Enrichment Analysis Test based on the hypergeometric distribution. The assumption that, in the absence of enrichment, $N_{AB}$ follows a hypergeometric distribution arises quite naturally, and enables us to avoid normal approximations and network permutations. We develop NEAT not only for undirected, but also for directed and mixed networks, thus providing a common framework for the analysis of different types of networks.

2 Methods

A graph is a pair $G = (V, E)$, which consists of a set of nodes $V$ connected by a set of directed or undirected edges $E \subseteq V \times V$. In gene regulatory networks each gene is represented as a node of the graph, and an edge between two nodes is drawn to signify dependence between the corresponding genes. In an inferred network, we expect that individual links may be somewhat unstable and noisy. However, we do expect the inferred links to carry a signal of the relationships between functional gene sets. So, if there is a functional relationship (i.e., enrichment) between the functions described by sets $A \subset V$ and $B \subset V$, then we expect the number of links between the two groups to be larger (or smaller) than expected by chance.

2.1 Directed and mixed networks

In directed networks, we assess the presence of enrichment from $A$ to $B$ by considering the number of arrows $n_{AB}$ going from genes in $A$ to genes belonging to $B$. The observed $n_{AB}$ can be thought of as a realization of the random variable $N_{AB}$, with expected value $\mu_{AB}$. We compare $\mu_{AB}$ with the number of arrows $\mu_0$ that we would expect to observe from $A$ to $B$ by chance, and test $H_0: \mu_{AB} = \mu_0$ versus $H_1: \mu_{AB} \neq \mu_0$. We say that there is enrichment from $A$ to $B$ if $\mu_{AB}$ is significantly different from $\mu_0$.
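To make the counting concrete, the following minimal Python sketch tallies the arrows going from A to B, together with the number of arrows leaving A, the number of arrows entering B and the total number of arrows in a directed edge list. The edge list, the two sets and the function name `count_arrows` are hypothetical illustrations; they are not taken from the paper's data or from the neat package.

```python
def count_arrows(edges, A, B):
    """Count arrows in a directed edge list given as (source, target) pairs.

    Returns (n_AB, out_A, in_B, total), where
      n_AB  = arrows going from a node in A to a node in B,
      out_A = arrows leaving nodes in A (outdegree of A),
      in_B  = arrows entering nodes in B (indegree of B),
      total = all arrows in the network (total indegree).
    """
    n_AB = sum(1 for v, w in edges if v in A and w in B)
    out_A = sum(1 for v, _ in edges if v in A)
    in_B = sum(1 for _, w in edges if w in B)
    total = len(edges)
    return n_AB, out_A, in_B, total


# Hypothetical toy network with 5 nodes and 7 arrows (an assumption).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1), (4, 5), (5, 3)]
A = {1, 4}
B = {3, 5}

counts = count_arrows(edges, A, B)
print(counts)
```

Each arrow is counted once per quantity, so an arrow from A to B contributes to all four counts.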

We use the hypergeometric distribution to model the null distribution of $N_{AB}$. The hypergeometric distribution models the number of successes in a random sample drawn without replacement: in our case, we mark arrows that reach genes in $B$ as “successful”, and the remaining ones as “unsuccessful”. If there is no relation between $A$ and $B$, we can view the arrows that go out from genes in $A$ as a random sample without replacement from the population of arrows present in the graph, and $n_{AB}$ as the number of successes in that sample. Thus, the distribution of $N_{AB}$ when $H_0$ is true is

$$N_{AB} \sim \mathrm{hypergeom}(n = o_A, K = i_B, N = i_V), \tag{1}$$

where the sample size $o_A$ is the outdegree of $A$, the number of successes in the population $i_B$ is the indegree of $B$ and the population size $i_V$ is the total indegree of the network. So, we expect $\mu_0 = o_A \, i_B / i_V$ to increase as the outdegree of $A$, or the indegree of $B$, increases. A toy example that explains the rationale behind NEAT is presented in Figure 1.
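The null model in (1) can be evaluated directly from binomial coefficients. The sketch below is a minimal illustration using only the Python standard library; the function names `hypergeom_pmf` and `null_mean` are hypothetical and do not correspond to the neat package, and the numeric inputs are the values reported in the caption of Figure 1.

```python
from math import comb


def hypergeom_pmf(k, n, K, N):
    """P(X = k) when drawing n items without replacement from a
    population of N items, K of which are successes."""
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def null_mean(o_A, i_B, i_V):
    """Expected number of arrows from A to B under H0: o_A * i_B / i_V."""
    return o_A * i_B / i_V


# Values from the Figure 1 caption: o_A = 5, i_B = 4, i_V = 15.
o_A, i_B, i_V = 5, 4, 15
pmf = [hypergeom_pmf(k, o_A, i_B, i_V) for k in range(min(o_A, i_B) + 1)]

for k, prob in enumerate(pmf):
    print(f"P0(N_AB = {k}) = {prob:.4f}")
print(f"mu_0 = {null_mean(o_A, i_B, i_V):.3f}")
```

Because the mean is the product of $o_A$ and $i_B$ divided by $i_V$, doubling either the outdegree of A or the indegree of B doubles the number of arrows expected under the null hypothesis.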

Bearing in mind that, for a discrete test statistic $T$, the usual formula for two-sided p-values, $p_1 = 2 \min[P_0(T \le t), P_0(T \ge t)]$, can exceed 1, we compute the p-value using

$$p = 2 \min[P_0(N_{AB} > n_{AB}), P_0(N_{AB} < n_{AB})] + P_0(N_{AB} = n_{AB}), \tag{2}$$

which differs from $p_1$ by a term equal to $P_0(T = t)$ and never exceeds 1. A p-value close to 0 can be regarded as evidence of enrichment, because it entails that $n_{AB}$ is significantly higher or lower than we would expect it to be under $H_0$. For a given type I error rate $\alpha$, one can then conclude that there is enrichment from $A$ to $B$ if $p < \alpha$.
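Formula (2) can be sketched in a few lines of Python. This is an illustrative implementation under the hypergeometric null model (1), not the code of the neat package; the function names `hypergeom_pmf` and `two_sided_pvalue` are hypothetical, and the inputs in the usage lines are the values reported in the caption of Figure 1.

```python
from math import comb


def hypergeom_pmf(k, n, K, N):
    """P(X = k) for n draws without replacement, K successes among N items."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def two_sided_pvalue(n_AB, n, K, N):
    """p = 2 * min[P0(N_AB > n_AB), P0(N_AB < n_AB)] + P0(N_AB = n_AB)."""
    support = range(max(0, n - (N - K)), min(n, K) + 1)
    p_less = sum(hypergeom_pmf(k, n, K, N) for k in support if k < n_AB)
    p_greater = sum(hypergeom_pmf(k, n, K, N) for k in support if k > n_AB)
    p_equal = hypergeom_pmf(n_AB, n, K, N)
    return 2 * min(p_less, p_greater) + p_equal


# Figure 1 caption values: n_AB = 2, o_A = 5, i_B = 4, i_V = 15.
p = two_sided_pvalue(2, n=5, K=4, N=15)
print(f"p = {p:.2f}")

alpha = 0.05  # illustrative type I error rate (an assumption)
print("enrichment detected" if p < alpha else "no evidence of enrichment")
```

Since the three probabilities sum to 1 and twice the smaller tail cannot exceed the sum of both tails, the value returned is always at most 1, which is the property that motivates (2) over $p_1$.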

A mixed network is a network where both directed and undirected edges are present. It is possible to regard a mixed network as a directed network, where every undirected edge $v \sim w$ stands for two directed arrows, $v \to w$ and $w \to v$. NEAT adopts this convention for the analysis of mixed networks.
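The convention for mixed networks amounts to a simple expansion of the edge list. The sketch below illustrates it in Python; the function name `mixed_to_directed` and the representation of the network as two lists of node pairs are assumptions made for illustration, not the input format of the neat package.

```python
def mixed_to_directed(directed, undirected):
    """Expand a mixed network into a purely directed edge list.

    directed   : list of (source, target) arrows, kept unchanged
    undirected : list of (v, w) edges, each replaced by v -> w and w -> v
    """
    arrows = list(directed)
    for v, w in undirected:
        arrows.append((v, w))
        arrows.append((w, v))
    return arrows


# Hypothetical mixed network: one arrow and two undirected edges.
arrows = mixed_to_directed(directed=[(1, 2)], undirected=[(2, 3), (3, 4)])
print(arrows)
```

This sketch does not remove duplicates, so an arrow that already appears in the directed list and also arises from an undirected edge would be counted twice.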

2.2 Undirected networks

When dealing with undirected networks, the presence of enrichment between $A$ and $B$ depends on the number of links $n_{AB}$ that connect genes in $A$ to genes in $B$. Here, there is no distinction between the indegree and the outdegree of a node, and it only makes sense to consider the degree of a node: thus, assumption (1) needs to be modified accordingly. Define the total degree of a set as the sum of the degrees of the nodes that belong to it: then, the null distribution is $N_{AB} \sim \mathrm{hypergeom}(n = d_A, K = d_B, N = d_V)$, where $d_A$, $d_B$ and $d_V$ are the total degrees of the sets $A$, $B$ and $V$.
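The parameters of the undirected null distribution are sums of node degrees. The sketch below computes them for a small example; the edge list, the sets and the function name `total_degree` are hypothetical, and the snippet only derives the parameters and the null mean, not the full test.

```python
from collections import Counter


def total_degree(edges, nodes):
    """Sum of the degrees of the given nodes in an undirected edge list."""
    degree = Counter()
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
    return sum(degree[v] for v in nodes)


# Hypothetical undirected network with 5 nodes and 5 edges (an assumption).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
V = {1, 2, 3, 4, 5}
A = {1, 2}
B = {3, 4}

d_A = total_degree(edges, A)
d_B = total_degree(edges, B)
d_V = total_degree(edges, V)

# Mean of hypergeom(n = d_A, K = d_B, N = d_V).
mu_0 = d_A * d_B / d_V
print(d_A, d_B, d_V, mu_0)
```

Every edge contributes to the degree of both of its endpoints, so $d_V$ equals twice the number of edges in the network.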

FIGURE 1. A directed network with 8 nodes (A) and its bipartite representation (B). Suppose that one wants to know whether there is enrichment from set $A = \{1, 4\}$ to set $B = \{3, 5, 7\}$. There are 5 arrows going out from $A$, and 2 of them reach $B$. The whole network consists of 15 arrows, of which 4 reach $B$. Thus, $n_{AB} = 2$, $o_A = 5$, $i_B = 4$ and $i_V = 15$. The idea behind NEAT is that, if the 5 arrows that are going out from $A$ are a random sample (without replacement) from the population of 15 arrows that are present in the network, then the proportion of arrows reaching $B$ from $A$ should be close to the proportion of arrows reaching $B$ in the whole network. In this case, it seems that arrows going out from $A$ tend to reach $B$ more frequently (40%) than arrows in the network as a whole do (27% of the 15 arrows in the network reach $B$). However, the computation of the test leads to $p = 0.48$: the observed $n_{AB} = 2$ does not provide enough evidence to reject the null hypothesis that there is no enrichment from $A$ to $B$.
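The p-value quoted in the caption of Figure 1 can be checked by hand. The worked computation below applies the hypergeometric null distribution (1) with $o_A = 5$, $i_B = 4$ and $i_V = 15$, and then formula (2); the intermediate values are rounded to three decimals.

```latex
\begin{align*}
P_0(N_{AB} = k) &= \frac{\binom{4}{k}\binom{11}{5-k}}{\binom{15}{5}},
  \qquad \binom{15}{5} = 3003, \\
P_0(N_{AB} < 2) &= \frac{462 + 1320}{3003} \approx 0.593, \\
P_0(N_{AB} = 2) &= \frac{990}{3003} \approx 0.330, \\
P_0(N_{AB} > 2) &= \frac{220 + 11}{3003} \approx 0.077, \\
p &= 2 \min(0.077,\, 0.593) + 0.330 \approx 0.48 .
\end{align*}
```

The smaller tail is the upper one, which reflects that the observed $n_{AB} = 2$ lies above the null mean $\mu_0 = 5 \cdot 4 / 15 \approx 1.33$, but not far enough above it to be significant.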

2.3 Software

NEAT is implemented in the R package neat, which is available on CRAN (Signorelli et al., 2016). neat allows the user to specify the network in different formats, and it includes a set of data and examples.

3 Simulations

We compare the performance of NEAT with the NEA test of Alexeyenko et al. (2012) and with the LP, LA, LA+S and NP tests of McCormack et al. (2013) by means of two simulations. We simulate two undirected random networks with 1000 nodes, whose degree distributions are a power law in simulation S1, and a mixture of Poisson distributions in simulation S2. We test enrichment between 50 sets of nodes, with cardinality ranging from 50 to 100 nodes. We modify the original networks to introduce enrichments between 100 pairs of these sets, by either increasing or reducing $n_{AB}$ by a proportion ranging uniformly from 10% to 50%. The results (see Table 1) show that the distribution of p-values is uniform in both simulations for NEAT and LA, and in one simulation for LA+S (S1) and NP (S2). NEA and LP, instead, do not produce uniform distributions in either simulation. In both S1 and S2, NEAT turns out to have the highest discriminatory capacity (AUC) and to be by far the fastest method, roughly 20 to 3500 times faster than the alternative tests.

TABLE 1. Results of simulations S1 and S2. Abbreviations: pKS denotes the p-value of the Kolmogorov-Smirnov test for $H_0: X \sim U(0, 1)$; AUC stands for “area under the ROC curve”. Time is expressed in seconds.

         Simulation S1              Simulation S2
Test     pKS     AUC     Time       pKS     AUC     Time
NEAT     0.399   0.920      0.6     0.343   0.925      0.7
NEA      0.001   0.918   2125.4     0.024   0.912   2151.5
LP       0       0.908     28.6     0       0.904     44.7
LA       0.255   0.897     14.4     0.111   0.908     18.0
LA+S     0.409   0.913     21.8     0.024   0.910     27.6
NP       0.037   0.884     12.9     0.323   0.908     15.8

4 Data analysis

After analysing gene expression patterns of the yeast Saccharomyces cerevisiae in response to different stressful stimuli, Gasch et al. (2000) inferred the existence of two sets of genes, collectively called Environmental Stress Response (ESR), that constitute a coordinated, initial reaction to the emergence of any hostile condition in the cell. The original study made use of a GEA test to characterize the two sets. Here, we incorporate into the analysis known associations between genes, as represented in the YeastNet network (Kim et al., 2013). For lack of space, we do not show here the lists of enrichments detected by NEAT for the two ESR sets; however, such lists can be retrieved by running the example in the help page ?yeast of the R package neat (Signorelli et al., 2016). In short, NEAT detects most of the enrichments that were found in the original study for the two ESR sets; in addition, it unveils some further enrichments related to molecular transportation and amino-acid biosynthesis for the set of induced ESR genes, which would be overlooked if functional couplings between genes were ignored.

5 Conclusion

Traditional gene enrichment analysis assesses enrichment between gene sets solely on the basis of the extent of their overlap. Network enrichment analysis is a powerful extension of traditional GEA tests, which makes use of genetic networks to integrate enrichment analyses with information on associations and dependences that exist between genes.

We have developed NEAT, a test for network enrichment analysis that aims to overcome some limitations of the resampling-based tests of Alexeyenko et al. (2012) and McCormack et al. (2013). First of all, we believe that a normal approximation does not do justice to the discrete nature of $N_{AB}$. We have shown that this approximation can be avoided if one models $N_{AB}$ with the hypergeometric distribution. In addition, existing NEA tests require the computation of many network permutations: this operation can be highly time consuming, slowing down computations considerably. NEAT, instead, fully specifies the null distribution of $N_{AB}$ without resorting to permutations, thus speeding up the computation of the test. A further drawback of existing resampling-based tests is that they have been implemented only for undirected networks: we address this problem by proposing two different parametrizations for NEAT, which take into account the different nature of directed and undirected edges.

The test is implemented in the R package neat, which is freely available on CRAN (Signorelli et al., 2016). Our simulations show that NEAT behaves well under the null hypothesis, and that it is more powerful and faster than existing NEA tests. Application to the Environmental Stress Response data shows that NEAT can detect most of the enrichments that were found with GEA methods, and that it unveils further enrichments that would be overlooked if dependences between genes were ignored. We believe that NEAT could constitute a flexible and computationally efficient test for network enrichment analysis. Potential applications of NEAT extend beyond gene regulatory networks, and include social networks, brain networks and other situations where one attempts to understand the relation between groups of vertices in a network.

References

Alexeyenko, A., Lee, W., Pernemalm, M., Guegan, J., Dessen, P. et al. (2012). Network enrichment analysis: extension of gene-set enrichment analysis to gene networks. BMC Bioinformatics, 13:226.

Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B. et al. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11(12), 4241–4257.

Kim, H., Shin, J., Kim, E., Kim, H., Hwang, S. et al. (2013). YeastNet v3: a public database of data-specific and integrated functional gene networks for Saccharomyces cerevisiae. Nucleic Acids Research, 1–13.

McCormack, T., Frings, O., Alexeyenko, A., Sonnhammer, E. L. (2013). Statistical assessment of crosstalk enrichment between gene groups in biological networks. PLoS One, 8(1):e54945.

Signorelli, M., Vinciotti, V., Wit, E. C. (2016). NEAT: efficient Network Enrichment Analysis Test. https://cran.r-project.org/package=neat.