Genes2WordCloud: A quick way to identify biological themes from gene lists and free text

Department of Pharmacology and Systems Therapeutics, Systems Biology Center New York (SBCNY), Mount Sinai School of Medicine, 1425 Madison Avenue, New York, NY, 10029, USA.
Source Code for Biology and Medicine 10/2011; 6(1):15. DOI: 10.1186/1751-0473-6-15
Source: PubMed


Word-clouds recently emerged on the web as a way to summarize text quickly by displaying the most relevant terms about a specific topic in the minimum amount of space. As biologists face a daunting amount of new research data, commonly presented in textual formats, word-clouds can be used to summarize and represent biological and/or biomedical content for various applications.
Genes2WordCloud is a web application that enables users to quickly identify biological themes from gene lists and research-relevant text by constructing and displaying word-clouds. It offers several options and suggestions for the sources that can be used to generate a word-cloud. Options for rendering and coloring the word-clouds let users quickly generate customized word-clouds of their choice.
Genes2WordCloud is a word-cloud generator and viewer that is based on WordCram and implemented using Java, Processing, AJAX, MySQL, and PHP. Text is fetched from several sources and then processed to extract the most relevant terms, with weights computed from word frequencies. Genes2WordCloud is freely available for use online; it is open source software and is available for installation on any web-site along with supporting documentation at
Genes2WordCloud provides a useful way to summarize and visualize large amounts of textual biological data or to find biological themes from several different sources. The open source availability of the software enables users to implement customized word-clouds on their own web-sites and desktop applications.
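The frequency-based term weighting mentioned in the abstract can be illustrated with a minimal sketch. This is not the Genes2WordCloud implementation (which, per the abstract, is built on WordCram with Java, Processing, and PHP); the function name `term_weights`, the small stop-word list, and the choice to normalize each count by the largest count are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real tool would use a much larger one.
STOP_WORDS = frozenset({
    "the", "a", "an", "and", "of", "in", "to", "is", "for",
    "by", "with", "that", "on", "as", "are",
})


def term_weights(text, top_n=10, stop_words=STOP_WORDS):
    """Return up to top_n (term, weight) pairs, most frequent first.

    Weight is the term's count divided by the highest count, so the most
    frequent term gets 1.0. This normalization is an illustrative choice.
    """
    # Lowercase and split into word-like tokens (letters, digits, hyphens).
    tokens = re.findall(r"[a-z][a-z0-9-]*", text.lower())
    # Drop stop words and very short tokens before counting.
    counts = Counter(t for t in tokens if t not in stop_words and len(t) > 2)
    if not counts:
        return []
    max_count = max(counts.values())
    return [(term, count / max_count) for term, count in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = "kinase kinase kinase receptor receptor the of signaling"
    for term, weight in term_weights(sample, top_n=3):
        print(f"{term}\t{weight:.2f}")
```

In a word-cloud renderer, weights like these would typically be mapped to font sizes, so that the most frequent term is drawn largest.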



Available from: Avi Ma'ayan
  • Source
    • "MeSHOPs are likely to provide a similar level of convenience for summarizing complex topics for accelerated interpretation. The use of word clouds, of course, has been extensive, including for the display of gene annotation [12,13]. The key advantage of MeSHOPs is that they draw upon the expert curation underlying MEDLINE. "
    ABSTRACT: Background: MEDLINE®/PubMed® indexes over 20 million biomedical articles, providing curated annotation of its contents using a controlled vocabulary known as Medical Subject Headings (MeSH). The MeSH vocabulary, developed over 50+ years, provides broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important both to understand the importance of related concepts and to discover new relationships. Results: We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated with a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, they can be compared using standard techniques such as hierarchical clustering. The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms. Conclusions: MeSHOPs allow quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. A web interface is provided for generating and visualizing MeSHOPs.
    BMC Bioinformatics 09/2012; 13(1):249. DOI:10.1186/1471-2105-13-249 · 2.58 Impact Factor
  • Source
    ABSTRACT: The unbiased and reproducible interpretation of high-content gene sets from large-scale genomic experiments is crucial to the understanding of biological themes, validation of experimental data, and the eventual development of plans for future experimentation. To derive biomedically relevant information from simple gene lists, a mathematical association to scientific language and meaningful words or sentences is crucial. Unfortunately, existing software for deriving meaningful and easily appreciable scientific textual 'tokens' from large gene sets either relies on controlled vocabularies (Medical Subject Headings, Gene Ontology, BioCarta) or employs Boolean text searching and co-occurrence models that are incapable of detecting indirect links in the literature. As an improvement to existing web-based informatic tools, we have developed Textrous!, a web-based framework for the extraction of biomedical semantic meaning from a given input gene set of arbitrary length. Textrous! employs natural language processing techniques, including latent semantic indexing (LSI), sentence splitting, word tokenization, parts-of-speech tagging, and noun-phrase chunking, to mine MEDLINE abstracts, PubMed Central articles, articles from the Online Mendelian Inheritance in Man (OMIM), and Mammalian Phenotype annotation obtained from Jackson Laboratories. Textrous! can generate meaningful output data even with very small input datasets, using two different text extraction methodologies (collective and individual) for the selection, ranking, clustering, and visualization of English words obtained from the user data. Textrous!, therefore, is able to facilitate the output of quantitatively significant and easily appreciable semantic words and phrases linked to both individual gene and batch genomic data.
    PLoS ONE 04/2013; 8(4):e62665. DOI:10.1371/journal.pone.0062665 · 3.23 Impact Factor
  • ABSTRACT: Autism spectrum disorders (ASD) are complex heterogeneous neurodevelopmental disorders of an unclear etiology, and no cure currently exists. Prior studies have demonstrated that the black and tan, brachyury (BTBR) T+ Itpr3tf/J mouse strain displays a behavioral phenotype with ASD-like features. BTBR T+ Itpr3tf/J mice (referred to simply as BTBR) display deficits in social functioning, lack of communication ability, and engagement in stereotyped behavior. Despite extensive behavioral phenotypic characterization, little is known about the genes and proteins responsible for the presentation of the ASD-like phenotype in the BTBR mouse model. In this study, we employed bioinformatics techniques to gain a wide-scale understanding of the transcriptomic and proteomic changes associated with the ASD-like phenotype in BTBR mice. We found a number of genes and proteins to be significantly altered in BTBR mice compared to C57BL/6J (B6) control mice, such as BDNF, Shank3, and ERK1, which are highly relevant to prior investigations of ASD. Furthermore, we identified distinct functional pathways altered in BTBR mice compared to B6 controls, including “axon guidance” and “regulation of actin cytoskeleton,” that have previously been shown to be altered in mouse models of ASD and in some human clinical populations, and that have been suggested as a possible etiological mechanism of ASD. In addition, our wide-scale bioinformatics approach discovered several previously unidentified genes and proteins associated with the ASD phenotype in BTBR mice, such as Caskin1, suggesting that bioinformatics could be an avenue by which novel therapeutic targets for ASD are uncovered. As a result, we believe that informed use of synergistic bioinformatics applications represents an invaluable tool for elucidating the etiology of complex disorders like ASD.
    Frontiers in Physiology 11/2015; 6(8). DOI:10.3389/fphys.2015.00324 · 3.53 Impact Factor