Molecular signatures database (MSigDB) 3.0

Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
Bioinformatics (Impact Factor: 4.98). 06/2011; 27(12):1739-40. DOI: 10.1093/bioinformatics/btr260
Source: PubMed


Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets.
We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site.
MSigDB is freely available for non-commercial use at



Available from: Pablo Tamayo,
    • "Therefore, not only does gene set testing with large collections fail to deliver an improvement in statistical power, but the decline in annotation quality and higher gene set interdependency can also compromise the biological relevance and interpretability of any associations that are discovered. The typical approach for addressing the problem of gene set collection size is either to use pre-existing collection subsets, e.g., standard GO Slims [11] or the MSigDB C5 collection that filters out GO terms with IEA evidence codes [5], or to create custom collection subsets that match a specific use case, e.g., custom GO Slim generation [9]. Although the use of data-independent subsets addresses the issue of collection size and the subsets may closely align with the domain of investigation, the process of selecting a subset is inherently subjective and thus susceptible to researcher bias. "
    ABSTRACT: Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 10/2015; 12(5):1-1. DOI:10.1109/TCBB.2015.2415815 · 1.44 Impact Factor
    • "The application of integrated approaches such as Galahad [46], Expression2Kinases [47], and CellNOptR [48], which use both genomic (or proteomic) profiles and protein-protein interaction (PPI) data, is also gaining attention. In essence, all these prioritization processes involve comparing a molecular profile (e.g., protein-target interaction or gene expression response) associated with a chemical with a database of disease or pathway signatures such as MSigDB [49], GeneSigDB [50], and EnrichR [51]. The comparisons can be performed using a variety of association measures [39] [52], but have limitations such as ignoring the topology of the regulatory networks and the relative rank of the strength of the association. "
    ABSTRACT: Ligand- and structure-based drug design approaches complement phenotypic and target screens, respectively, and are the two major frameworks for guiding early-stage drug discovery efforts. Since the beginning of this century, the advent of the genomic era has presented researchers with a myriad of high throughput biological data (parts lists and their interaction networks) to address efficacy and toxicity, augmenting the traditional ligand- and structure-based approaches. This data rich era has also presented us with challenges related to integrating and analyzing these multi-platform and multi-dimensional datasets and translating them into viable hypotheses. Hence in the present paper, we review these existing approaches to drug discovery research and argue the case for a new systems biology based approach. We present the basic principles and the foundational arguments/underlying assumptions of the systems biology based approaches to drug design. Systems biology data types (key entities, their attributes and their relationships with each other, and data models/representations), software and tools used for both retrospective- and prospective-analysis, and the hypotheses that can be inferred are also discussed. In addition, we summarize some of the existing resources for a systems biology based drug discovery paradigm (open TG-GATEs, DrugMatrix, CMap and LINCs) in terms of their strengths and limitations.
    Current topics in medicinal chemistry 08/2015; 15(999). DOI:10.2174/1568026615666150826114524 · 3.40 Impact Factor
    • "The GO annotation file is downloaded from [22] on Nov. 23rd, 2014. Pathway annotation from MSigDB is downloaded from GSEA [23]. Biological process GO terms and MSigDB pathways are tested for enrichment using fisher.test() in R. The significance threshold was set at 0.01. "
    ABSTRACT: Lung cancer consists of two main subtypes: small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC) that are classified according to their physiological phenotypes. In this study, we have developed a network-based approach to identify molecular biomarkers that can distinguish SCLC from NSCLC. By identifying positive and negative coexpression gene pairs in normal lung tissues, SCLC, or NSCLC samples and using functional association information from the STRING network, we first construct a lung cancer-specific gene association network. From the network, we obtain gene modules in which genes are highly functionally associated with each other and are either positively or negatively coexpressed in the three conditions. Then, we identify gene modules that not only are differentially expressed between cancer and normal samples, but also show distinctive expression patterns between SCLC and NSCLC. Finally, we select genes inside those modules with discriminating coexpression patterns between the two lung cancer subtypes and predict them as candidate biomarkers that are of diagnostic use.
    08/2015; 2015(7):685303. DOI:10.1155/2015/685303
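
The last excerpt above tests GO terms and MSigDB pathways for enrichment with Fisher's exact test at a 0.01 threshold. As a rough illustration of that kind of test, the sketch below computes a one-sided (over-representation) Fisher p-value from the hypergeometric distribution. It is written in Python rather than R and is not the cited study's code. The gene identifiers, set names, universe size, the one-sided alternative, and the absence of multiple-testing correction are all assumptions made for the example.

```python
from math import comb


def enrichment_p_value(hits, set_size, selected, universe):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= hits) where X is hypergeometric: the number of
    gene-set members found when `selected` genes are drawn from a
    universe of `universe` genes containing `set_size` set members.
    """
    max_hits = min(set_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, selected - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical toy data: 20 background genes, 4 genes of interest.
universe_genes = {f"G{i}" for i in range(20)}
selected_genes = {"G0", "G1", "G2", "G3"}

# Hypothetical gene sets standing in for GO terms or MSigDB pathways.
gene_sets = {
    "SET_A": {"G0", "G1", "G2", "G3", "G4"},
    "SET_B": {"G10", "G11", "G12", "G13", "G14"},
}

THRESHOLD = 0.01  # significance threshold quoted in the excerpt

p_values = {}
for name, members in gene_sets.items():
    members = members & universe_genes  # restrict to the background
    hits = len(members & selected_genes)
    p_values[name] = enrichment_p_value(
        hits, len(members), len(selected_genes), len(universe_genes)
    )

significant = sorted(n for n, p in p_values.items() if p < THRESHOLD)
print(p_values, significant)
```

When many gene sets are tested at once, a multiple-testing correction would normally be applied to these p-values; the excerpt does not say whether the cited study did so.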