Molecular signatures database (MSigDB) 3.0.
ABSTRACT Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets.
We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site.
MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.
BIOINFORMATICS APPLICATIONS NOTE
Vol. 27 no. 12 2011, pages 1739–1740
Databases and ontologies
Molecular signatures database (MSigDB) 3.0
Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir,
Pablo Tamayo and Jill P. Mesirov∗
Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
Associate Editor: Alex Bateman
Advance Access publication May 5, 2011
Motivation: Well-annotated gene sets representing the universe of
the biological processes are critical for meaningful and insightful
interpretation of large-scale genomic data. The Molecular Signatures
Database (MSigDB) is one of the most widely used repositories of
such sets.
Results: We report the availability of a new version of the database,
MSigDB 3.0, with over 6700 gene sets, a complete revision of the
collection of canonical pathways and experimental signatures from
publications, enhanced annotations and upgrades to the web site.
Availability and Implementation: MSigDB is freely available for
non-commercial use at http://www.broadinstitute.org/msigdb.
Received on January 13, 2011; revised on April 4, 2011; accepted
on April 12, 2011
Microarrays and other high-throughput genomic technologies
typically produce long lists of potentially interesting genes, which
are not always easily interpreted. Recognizing the importance of
coordinately expressed sets of genes, our seminal paper (Mootha
et al., 2003) introduced Gene Set Enrichment Analysis (GSEA)
to discover metabolic pathways altered in human type 2 diabetes
mellitus. GSEA and other analytical enrichment tools summarize
genomic data in prioritized lists of higher-level biological features.
As underscored by a recent survey of 68 enrichment tools, they
critically depend on ‘backend annotation databases’ (Huang et al.,
2009). Typically, such databases focus on a particular domain of
knowledge or annotation procedure. For example, Gene Ontology
(GO) (Ashburner et al., 2000) represents a hierarchy of controlled
terms to describe individual gene products, while TRANSFAC
(Matys et al., 2006) stores information about transcription factor
binding sites. A growing number of databases obtain sets from
gene expression signatures reported in the literature. These include
SignatureDB (Shaffer et al., 2006), GeneSigDB (Culhane et al.,
2009), CCancer (Dietmann et al., 2010) and L2L and LOLA (Cahan
et al., 2007).
Molecular Signatures Database (MSigDB) differs from these
resources in several distinguishing aspects. (i) MSigDB is explicitly
designed to provide gene sets for enrichment analysis methods.
As such, it is natively and seamlessly integrated with our GSEA
software (Subramanian et al., 2005). (ii) MSigDB covers a
substantially more diverse and wider range of gene set sources
and types. These include signatures extracted from original research
publications, and entire collections of sets derived from specialized
resources such as GO, KEGG (Kanehisa and Goto, 2000),
TRANSFAC and L2L. (iii) MSigDB gene sets are acquired both
through manual curation and by automatic computational means,
whereas other databases emphasize only one of these approaches.
(iv) Finally, MSigDB contains the largest number of gene sets
∗To whom correspondence should be addressed.
The initial MSigDB database, released in 2005 with GSEA
software, contained 1325 sets. In contrast, MSigDB 3.0, released in
September 2010, includes 6769 sets and a richer set of annotations.
Here, we describe the MSigDB 3.0 sets in more detail and the
accompanying online resource.
Gene set collections: gene sets in MSigDB 3.0 are organized into
five collections according to their derivation:
C1: Genes located in the same chromosome or cytogenetic band.
C2: Gene sets representing canonical pathways from pathway
resources [including 430 new sets contributed by Reactome
(Matthews et al., 2009)], and sets corresponding to chemical
and genetic perturbations from 786 scientific publications.
C3: Sets of genes sharing cis-regulatory motifs in their promoter
(transcription factor targets) or 3′ UTR (micro-RNA targets).
C4: Clusters of coexpressed modules defined by computational
analysis of large gene expression compendia.
C5: Gene sets corresponding to GO terms.
Table 1 summarizes the changes in the number of gene sets
since the initial release (see also online Release Notes).
Gene set annotations: each MSigDB gene set is a list of genes
with relevant annotations and links to external resources. MSigDB
focuses on human gene sets. However, we do include sets from
some model organisms and gene set annotations include organism
identification. We use HUGO gene symbols and, as of version 3.0,
human Entrez Gene IDs serve as universal identifiers. These Entrez
IDs are guaranteed to be unique and stable, can easily be mapped
into a variety of other identifiers and are natively integrated with
the GenBank resources of primary nucleic and protein sequences.
We also preserve whatever original identifiers were used in the gene
set source. All sets have unique database identifiers and names, and
include brief and full descriptions. Other annotations depend on
the type of gene set. Annotations linking to external resources are
especially important as they allow researchers to place the sets in
the context of a specific study and facilitate decisions on follow-up
Table 1. MSigDB versions and changes in the number of gene sets
Gene set category    1.0 (2005)    2.5 (2008)    3.0 (2010)
C2: curated (total)
C2: chemical and genetic perturbations
C2: canonical pathways
C3: motifs (total)
C3: transcription factor targets
C3: micro-RNA targets
C5: GO terms
a Decrease in number due to the removal of sets with too few genes to run GSEA.
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com
Gene sets from publications are the most richly annotated. Their
annotations include the PubMed ID of the publication, pointers to
other gene sets from the same publication, and now also details on
the exact table or figure from which the gene set was extracted.
For version 3.0, we updated the names of these gene sets to make
them more descriptive and standardized and the accompanying brief
descriptions to follow a more uniform and consistent format. Other
annotation features introduced with version 3.0 include links to
source datasets in Gene Expression Omnibus (GEO) (Barrett et al.,
sets include links to the pathway at the source web site.
File formats: MSigDB gene set files are available for download
in plain text and XML formats. The plain text files contain simple
listings of gene set membership, while the XML files also include
the annotations. To ensure reproducibility of GSEA results, older
versions of the MSigDB files are always available. Note that users
of our GSEA software do not need to download the MSigDB files
as the tool directly and automatically retrieves the gene sets.
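For readers who work with the plain text files outside of GSEA, the following is a minimal Python sketch of reading gene set membership. It assumes a tab-delimited layout with one gene set per line (set name, a description field, then the member genes); this layout, the function name `parse_gene_sets` and the example set names are assumptions for illustration and are not specified in the text above.

```python
import io
from typing import Dict, List, TextIO, Tuple


def parse_gene_sets(handle: TextIO) -> Dict[str, Tuple[str, List[str]]]:
    """Read gene sets from a tab-delimited text stream.

    Assumed layout (one set per line): name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Returns a mapping of set name -> (description, list of member genes).
    """
    gene_sets: Dict[str, Tuple[str, List[str]]] = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines that carry no gene members.
            continue
        name, description, *genes = fields
        # Drop empty fields left by trailing tabs.
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


# Hypothetical two-set file held in memory; the set names are placeholders.
example = io.StringIO(
    "HYPOTHETICAL_SET_A\tplaceholder description\tTP53\tBRCA2\tMYC\n"
    "HYPOTHETICAL_SET_B\tplaceholder description\tMYC\tEGFR\n"
)
sets = parse_gene_sets(example)
for set_name, (desc, members) in sets.items():
    print(set_name, len(members), members)
```

The same function can be pointed at an opened file instead of the in-memory example. Annotations would have to come from the XML files, which this sketch does not read.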
MSigDB ONLINE RESOURCE
In version 3.0, we updated the MSigDB web site. First introduced
in July 2007, the site allows users to view the annotated gene
sets and perform simple search and analysis tasks. Each gene set
and all of its annotations are presented on a separate web page
external web resources, including PubMed, GEO and ArrayExpress,
PubChem and Entrez Gene.
The MSigDB web site allows users to find gene sets by
searching for keywords in the annotations. The online analysis
tools allow users to: (i) compute overlaps between gene sets; (ii)
view a heat map of a gene set in one of the reference expression
compendia; and (iii) categorize the genes in a set by gene families.
Gene families offer a quick view of a gene set by grouping its
members into a small number of informative categories. We have
updated the gene families and they now include: oncogenes, tumor
suppressors, translocated cancer genes, transcription factors, protein
kinases, homeodomain proteins, cell differentiation markers and
Fig. 1. A typical gene set page on the MSigDB web site. The list of genes
has been abbreviated from 41 to 2 for the purposes of this figure.
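As an illustration of what computing an overlap between gene sets can involve, the Python sketch below counts the genes shared by a query list and a gene set and scores the overlap with a hypergeometric tail probability. The text above does not state which statistic the web site uses, so the hypergeometric test, the function name `overlap_p_value` and the `universe_size` parameter are assumptions made for this example only.

```python
from math import comb
from typing import Set


def overlap_p_value(query: Set[str], gene_set: Set[str], universe_size: int) -> float:
    """Probability of seeing at least the observed overlap by chance.

    Assumes both sets are drawn from a gene universe of `universe_size` genes
    and uses the upper tail of the hypergeometric distribution.
    """
    n = len(query)            # genes drawn (query list size)
    K = len(gene_set)         # "successes" in the universe (gene set size)
    k = len(query & gene_set) # observed overlap
    if universe_size < max(n, K):
        raise ValueError("universe_size must be at least as large as each set")
    total = comb(universe_size, n)
    # Sum P(X = i) for i from the observed overlap up to the largest possible overlap.
    tail = sum(
        comb(K, i) * comb(universe_size - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / total


# Toy example with placeholder gene names and a universe of 10 genes.
query_genes = {"A", "B", "C"}
set_genes = {"B", "C", "D"}
print(sorted(query_genes & set_genes))
print(overlap_p_value(query_genes, set_genes, universe_size=10))
```

In practice the universe size would be the number of genes that could have appeared in either list, and p-values computed against many gene sets would need correction for multiple testing.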
We thank J. Roberston, L. Saunders and L. Kazmierski for gene
set collection; H. Kuehn and J. McLaughlin for documentation; and
M. Wrobel for web site development.
Funding: National Cancer Institute (5R01CA121941).
Conflict of Interest: none declared.
Ashburner,M. et al. (2000) Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat. Genet., 25, 25–29.
Barrett,T. et al. (2009) NCBI GEO: archive for high-throughput functional genomic
data. Nucleic Acids Res., 30, D5–D15.
Cahan,P. et al. (2007) Meta-analysis of microarray results: challenges, opportunities,
and recommendations for standardization. Gene, 401, 12–18.
Culhane,A.C. et al. (2009) GeneSigDB – a curated database of gene expression signatures.
Nucleic Acids Res., 38, D716–D725.
Dietmann,S. et al. (2010) CCancer: a bird’s eye view on gene lists reported in cancer-
related studies. Nucleic Acids Res., 38 (Suppl), W118–W123.
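Huang,da W. et al. (2009) Bioinformatics enrichment tools: paths toward the comprehensive
functional analysis of large gene lists. Nucleic Acids Res., 37, 1–13.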
Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto encyclopedia of genes and genomes.
Nucleic Acids Res., 28, 27–30.
Matthews,L. et al. (2009) Reactome knowledgebase of human biological pathways and
processes. Nucleic Acids Res., 37, D619–D622.
Matys,V. et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene
regulation in eukaryotes. Nucleic Acids Res., 34, D108–D110.
Mootha,V. et al. (2003) PGC-1alpha-responsive genes involved in oxidative
phosphorylation are coordinately downregulated in human diabetes. Nat. Genet.,
Parkinson,H. et al. (2009) ArrayExpress update – from an archive of functional
genomics experiments to the atlas of gene expression. Nucleic Acids Res., 37,
Shaffer,A. et al. (2006) A library of gene expression signatures to illuminate normal
and pathological lymphoid biology. Immunol Rev., 210, 67–85.
Subramanian,A. et al. (2005) Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci.
USA, 102, 15545–15550.