Vol. 23 no. 2 2007, pages 257–258
BIOINFORMATICS APPLICATIONS NOTE
Using GOstats to test gene lists for GO term association
S. Falcon* and R. Gentleman
Fred Hutchinson Cancer Research Center, Program in Computational Biology, 1100 Fairview Avenue North,
P. O. Box 19024, Seattle, WA 98109, USA
Received on October 16, 2006; accepted on November 3, 2006
Advance Access publication November 10, 2006
Associate Editor: Trey Ideker
Motivation: Functional analyses based on the association of Gene
Ontology (GO) terms to genes in a selected gene list are useful
bioinformatic tools and the GOstats package has been widely used
to perform such computations. In this paper we report significant
improvements and extensions such as support for conditional testing.
Results: We discuss the capabilities of GOstats, a Bioconductor
package written in R that allows users to test GO terms for over- or
under-representation using either a classical hypergeometric test or a
conditional hypergeometric test that uses the relationships among GO
terms to decorrelate the results.
Availability: GOstats is available as an R package from the
Bioconductor project: http://bioconductor.org
Version 2.0 of the Bioconductor package GOstats has substantial
improvements for testing the association between Gene Ontology
(GO) terms (GO Consortium, 2000) and a given gene list. We
have implemented a conditional hypergeometric test that uses
the relationships among the GO terms, similar to that presented
in Alexa et al. (2006), to address concerns that arise due to the
hierarchical structure of GO. Many other substantial improvements
have also been made that make the software easier to use and the
results more informative.
In this paper we briefly describe the preprocessing steps required
to construct inputs for the testing function, followed by a presentation
of the algorithms used and the structure of the return value. We
demonstrate the capabilities of the GOstats package using a microarray
dataset (Chiaretti et al., 2004) from a clinical trial in acute
lymphoblastic leukemia (ALL). More details on the analysis of
this dataset are available in the GOstats package vignette.
To perform an analysis using the hypergeometric-based tests, one
needs to define a ‘gene universe’ (usually conceptualized as the
number of balls in an urn) and a list of selected genes from that
universe. While it is clear that the selected gene list determines the
results of the analysis, the fact that the universe has a large effect
on the conclusions is, perhaps, less obvious, and correct specification
of the universe is important.
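
The effect of the universe can be illustrated with a minimal sketch of the
classical hypergeometric (urn) calculation. This is written in Python using
only the standard library and is not the GOstats implementation; the function
name hypergeom_upper_tail and all counts below are hypothetical values chosen
for illustration.

```python
from math import comb


def hypergeom_upper_tail(universe_size, annotated, selected, overlap):
    """Over-representation p-value P(X >= overlap).

    X is the number of annotated genes among `selected` genes drawn
    without replacement from a universe of `universe_size` genes, of
    which `annotated` carry the GO term of interest.
    """
    total = comb(universe_size, selected)
    max_overlap = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe_size - annotated, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: the same gene list (100 genes, 10 of them annotated
# with a term carried by 40 genes) tested against two different universes.
p_small = hypergeom_upper_tail(1000, 40, 100, 10)
p_large = hypergeom_upper_tail(10000, 40, 100, 10)
print(p_small, p_large)
```

With identical selected genes and identical annotation counts, the larger
universe yields the smaller p-value, because fewer annotated genes are
expected in the list by chance.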
For microarray data, one can use the unique gene identifiers
assayed in the experiment as the gene universe. However, some
arrays, such as those from Affymetrix, include multiple probes
corresponding to a single gene. The multiple probe issue should be
resolved so that each gene is represented only once. One might also
want to consider reduction of the universe to exclude genes that are
not expressed, if such a determination can be made, since arguments
can be made against maintaining objects in the universe that cannot
be selected.
The next step is to identify the subset of the universe that is
considered interesting. In many applications, this set is constructed
by finding differentially expressed genes. One might use a t-test,
or a receiver operating characteristic (ROC) curve, or any of a
large number of methods to identify such genes. Other methods for
finding sets of interesting genes can also be used.
2.1 Non-specific filtering
To obtain the universe we often use the following procedure.
First, we estimate the variability across samples using the
interquartile range (IQR) and retain only probes that show
sufficient variability across samples to be informative; probes with
little variability across samples are inherently uninteresting as they
provide no discriminatory power. Next, we remove probes that
either lack Entrez Gene identifiers or do not map to any GO
terms. Finally, we refine the universe to ensure that each Entrez
Gene identifier maps to exactly one probe by selecting the probe
with the largest IQR when two or more probes map to the same
Entrez Gene ID.
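
The filtering steps above can be sketched as follows. This is an
illustrative Python outline using only the standard library, not the code
used by GOstats; the function build_universe, the min_iqr threshold, the
probe names and the toy expression values are all hypothetical.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range of one probe's values across samples."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def build_universe(expression, probe_to_gene, genes_with_go, min_iqr):
    """Return a mapping from Entrez Gene ID to the single probe kept for it."""
    best = {}  # Entrez Gene ID -> (IQR, probe ID)
    for probe, values in expression.items():
        # Step 1: drop probes with little variability across samples.
        spread = iqr(values)
        if spread < min_iqr:
            continue
        # Step 2: drop probes without an Entrez Gene ID or without GO terms.
        gene = probe_to_gene.get(probe)
        if gene is None or gene not in genes_with_go:
            continue
        # Step 3: keep only the probe with the largest IQR for each gene.
        if gene not in best or spread > best[gene][0]:
            best[gene] = (spread, probe)
    return {gene: probe for gene, (_, probe) in best.items()}


# Toy data (hypothetical): five probes measured in eight samples.
expression = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "p2": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    "p3": [5.0, 5.0, 5.1, 5.0, 5.0, 5.1, 5.0, 5.0],
    "p4": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0],
    "p5": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0],
}
probe_to_gene = {"p1": "100", "p2": "100", "p3": "200", "p5": "300"}
genes_with_go = {"100", "200"}

universe = build_universe(expression, probe_to_gene, genes_with_go, min_iqr=0.5)
print(universe)  # {'100': 'p1'}
```

Here p3 is removed for low variability, p4 has no Entrez Gene ID, p5 maps to
a gene without GO terms, and p1 is preferred over p2 because it has the
larger IQR, so the gene with ID 100 is counted exactly once.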
There are many valid approaches to non-specific filtering that
might be quite different from the procedure described above.
However, it is important to avoid double counting genes. In our
approach, a gene is represented by an Entrez Gene ID and so we
must ensure that each Entrez Gene ID is represented by at most
one probe.
Often one wishes to perform many similar analyses using slightly
different sets of parameters. The main interface to the hypergeometric
tests, hyperGTest, facilitates this pattern of use by taking
a single parameter object as its argument. This parameter is an
instance of class GOHyperGParams. Using a parameter class
*To whom correspondence should be addressed.
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and
reproduction in any medium, provided the original work is properly cited.