ABSTRACT: Bacterial community composition and functional potential change subtly across gradients in the surface ocean. In contrast, while there are significant phylogenetic divergences between communities from freshwater and marine habitats, the mechanisms underlying this phylogenetic structuring remain unknown. We hypothesized that the functional potential of natural bacterial communities is linked to this striking divide between microbiomes. To test this hypothesis, metagenomic sequencing of microbial communities along a 1,800 km transect in the Baltic Sea area, encompassing a continuous natural salinity gradient from limnic to fully marine conditions, was explored. Multivariate statistical analyses showed that salinity is the main determinant of dramatic changes in microbial community composition, but also of large scale changes in core metabolic functions of bacteria. Strikingly, genetically and metabolically different pathways for key metabolic processes, such as respiration, biosynthesis of quinones and isoprenoids, glycolysis and osmolyte transport, were differentially abundant at high and low salinities. These shifts in functional capacities were observed at multiple taxonomic levels and within dominant bacterial phyla, while bacteria, such as SAR11, were able to adapt to the entire salinity gradient. We propose that the large differences in central metabolism required at high and low salinities dictate the striking divide between freshwater and marine microbiomes, and that the ability to inhabit different salinity regimes evolved early during bacterial phylogenetic differentiation. These findings significantly advance our understanding of microbial distributions and stress the need to incorporate salinity in future climate change models that predict increased levels of precipitation and a reduction in salinity.
PLoS ONE 01/2014; 9(2):e89549. · 3.73 Impact Factor
ABSTRACT: The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to enable computational access to molecular-interaction data resources by means of a standard Web Service and query language. Currently providing >150 million binary interaction evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of multiple molecular-interaction information resources using a single query. Here, we present an extension of the PSICQUIC specification (version 1.3), which has been released to be compliant with the enhanced standards in molecular interactions. The new release also includes a new reference implementation of the PSICQUIC server available to the data providers. It offers augmented web service capabilities and improves the user experience. PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data providers to 28 (April 2013) allowing access to 151 310 109 binary interactions. The power of this web service is shown in PSICQUIC View web application, an example of how to simultaneously query, browse and download results from the different PSICQUIC servers. This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
Nucleic Acids Research 05/2013; · 8.28 Impact Factor
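The idea of sending a single query to several PSICQUIC servers can be sketched as follows. This is only an illustration under stated assumptions: the server names and base URLs are placeholders, and the REST path layout and the `firstResult`/`maxResults` parameters are assumed here rather than taken from the abstract. The sketch only builds the request URLs and performs no network access.

```python
from urllib.parse import quote

# Hypothetical registry: names and base URLs are placeholders, not real endpoints.
SERVERS = {
    "server_a": "http://example.org/psicquic-a/webservices",
    "server_b": "http://example.org/psicquic-b/webservices",
}


def build_query_urls(query, servers=SERVERS, first_result=0, max_results=50):
    """Build one REST search URL per server for the same query string.

    The path suffix and the paging parameters are assumptions made for
    this sketch; check each provider's documentation before relying on them.
    """
    encoded = quote(query, safe="")  # percent-encode spaces, colons, etc.
    urls = {}
    for name, base in servers.items():
        urls[name] = (
            f"{base}/current/search/query/{encoded}"
            f"?firstResult={first_result}&maxResults={max_results}"
        )
    return urls


if __name__ == "__main__":
    for name, url in build_query_urls("brca2 AND species:human").items():
        print(name, url)
```

Each URL could then be fetched concurrently and the results merged, which is the pattern the abstract describes for searching multiple resources with one query.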
ABSTRACT: Understanding the microbial content of the air has important scientific, health, and economic implications. While studies have primarily characterized the taxonomic content of air samples by sequencing the 16S or 18S ribosomal RNA gene, direct analysis of the genomic content of airborne microorganisms has not been possible due to the extremely low density of biological material in airborne environments. We developed sampling and amplification methods to enable adequate DNA recovery to allow metagenomic profiling of air samples collected from indoor and outdoor environments. Air samples were collected from a large urban building, a medical center, a house, and a pier. Analyses of metagenomic data generated from these samples reveal airborne communities with a high degree of diversity and different genera abundance profiles. The identities of many of the taxonomic groups and protein families also allow for the identification of the likely sources of the sampled airborne bacteria.
PLoS ONE 01/2013; 8(12):e81862. · 3.73 Impact Factor
ABSTRACT: Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. An implementation of our methodology is available at http://huttenhower.sph.harvard.edu/humann.
This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.
ABSTRACT: The International Molecular Exchange (IMEx) consortium is an international collaboration between major public interaction data providers to share literature-curation efforts and make a nonredundant set of protein interactions available in a single search interface on a common website (http://www.imexconsortium.org/). Common curation rules have been developed, and a central registry is used to manage the selection of articles to enter into the dataset. We discuss the advantages of such a service to the user, our quality-control measures and our data-distribution practices.
ABSTRACT: As metagenomic studies continue to increase in their number, sequence volume and complexity, the scalability of biological analysis frameworks has become a rate-limiting factor to meaningful data interpretation. To address this issue, we have developed JCVI Metagenomics Reports (METAREP) as an open source tool to query, browse, and compare extremely large volumes of metagenomic annotations. Here we present improvements to this software including the implementation of a dynamic weighting of taxonomic and functional annotation, support for distributed searches, advanced clustering routines, and integration of additional annotation input formats. The utility of these improvements to data interpretation is demonstrated through the application of multiple comparative analysis strategies to shotgun metagenomic data produced by the National Institutes of Health Roadmap for Biomedical Research Human Microbiome Project (HMP) (http://nihroadmap.nih.gov). Specifically, the scalability of the dynamic weighting feature is evaluated and established by its application to the analysis of over 400 million weighted gene annotations derived from 14 billion short reads as predicted by the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline. Further, the capacity of METAREP to facilitate the identification and simultaneous comparison of taxonomic and functional annotations including biological pathway and individual enzyme abundances from hundreds of community samples is demonstrated by providing scenarios that describe how these data can be mined to answer biological questions related to the human microbiome. These strategies provide users with a reference of how to conduct similar large-scale metagenomic analyses using METAREP with their own sequence data, while in this study they reveal insights into the nature and extent of variation in taxonomic and functional profiles across body habitats and individuals.
Over one thousand HMP WGS datasets and the latest open source code are available at http://www.jcvi.org/hmp-metarep.
PLoS ONE 01/2012; 7(6):e29044. · 3.73 Impact Factor
ABSTRACT: JCVI Metagenomics Reports (METAREP) is a Web 2.0 application designed to help scientists analyze and compare annotated metagenomics datasets. It utilizes Solr/Lucene, a high-performance scalable search engine, to quickly query large data collections. Furthermore, users can use its SQL-like query syntax to filter and refine datasets. METAREP provides graphical summaries for top taxonomic and functional classifications as well as a GO, NCBI Taxonomy and KEGG Pathway Browser. Users can compare absolute and relative counts of multiple datasets at various functional and taxonomic levels. Advanced comparative features comprise statistical tests as well as multidimensional scaling, heatmap and hierarchical clustering plots. Summaries can be exported as tab-delimited files and as publication-quality plots in PDF format. A data management layer allows collaborative data analysis and result sharing.
AVAILABILITY: Web site http://www.jcvi.org/metarep; source code http://github.com/jcvi/METAREP. CONTACT: firstname.lastname@example.org
Supplementary data are available at Bioinformatics online.
ABSTRACT: With the advance of high-throughput genomics and proteomics technologies, it becomes critical to mine and curate protein-protein interaction (PPI) networks from biological research literature. Several PPI knowledge bases have been curated by domain experts but they are far from comprehensive. Observing that PPI-relevant documents can be obtained from PPI knowledge bases recording literature evidences and also that a large number of unlabeled documents (mostly negative) are freely available, we investigated learning from positive and unlabeled data (LPU) and developed an automated system for the retrieval of PPI-relevant articles aiming at assisting the curation of a bacterial PPI knowledge base, MPIDB. Two different approaches of obtaining unlabeled documents were used: one based on PubMed MeSH term search and the other based on an existing knowledge base, UniProtKB. We found that unlabeled documents obtained from UniProtKB tend to yield better document classifiers for PPI curation purposes. Our study shows that LPU is a possible scenario for the development of an automated system to retrieve PPI-relevant articles, where there is no requirement for extra annotation effort. Selection of machine learning algorithms and that of unlabeled documents would be critical in constructing an effective LPU-based system.
Keywords: document retrieval; learning from positive and unlabeled; protein-protein interaction
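To make the LPU setting concrete, the following is a minimal sketch of one common baseline: treat the unlabeled documents as provisional negatives and train a bag-of-words naive Bayes classifier. This is not claimed to be the method used in the study; the function names and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def train_nb(positive_docs, unlabeled_docs):
    """Fit word counts for two classes: known positives and unlabeled
    documents (treated here as provisional negatives)."""
    counts = {"pos": Counter(), "unl": Counter()}
    for doc in positive_docs:
        counts["pos"].update(doc.lower().split())
    for doc in unlabeled_docs:
        counts["unl"].update(doc.lower().split())
    vocab = set(counts["pos"]) | set(counts["unl"])
    total = len(positive_docs) + len(unlabeled_docs)
    priors = {
        "pos": len(positive_docs) / total,
        "unl": len(unlabeled_docs) / total,
    }
    return counts, vocab, priors


def score(doc, model):
    """Return the log-odds that a document is PPI-relevant (higher = more likely)."""
    counts, vocab, priors = model
    totals = {c: sum(counts[c].values()) for c in counts}
    log_odds = math.log(priors["pos"]) - math.log(priors["unl"])
    for token in doc.lower().split():
        if token not in vocab:
            continue  # ignore words never seen during training
        # Laplace (add-one) smoothing avoids zero probabilities.
        p_pos = (counts["pos"][token] + 1) / (totals["pos"] + len(vocab))
        p_unl = (counts["unl"][token] + 1) / (totals["unl"] + len(vocab))
        log_odds += math.log(p_pos) - math.log(p_unl)
    return log_odds


if __name__ == "__main__":
    # Toy documents invented for this sketch.
    positives = ["protein a binds protein b",
                 "interaction between kinase and substrate detected"]
    unlabeled = ["genome sequence of soil bacterium",
                 "growth rate measured in culture"]
    model = train_nb(positives, unlabeled)
    for text in ["kinase binds substrate", "genome of bacterium"]:
        print(text, round(score(text, model), 3))
```

Because the unlabeled pool is only mostly negative, the choice of that pool directly shapes the classifier, which is consistent with the abstract's observation that the source of unlabeled documents matters.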
ABSTRACT: The JCVI metagenomics analysis pipeline provides for the efficient and consistent annotation of shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. The process can be equally applied to individual sequence reads from traditional Sanger capillary electrophoresis sequences, newer technologies such as 454 pyrosequencing, or sequence assemblies derived from one or more of these data types. It includes the analysis of both coding and non-coding genes, whether full-length or, as is often the case for shotgun metagenomics, fragmentary. The system is designed to provide the best-supported conservative functional annotation based on a combination of trusted homology-based scientific evidence and computational assertions and an annotation value hierarchy established through extensive manual curation. The functional annotation attributes assigned by this system include gene name, gene symbol, GO terms, EC numbers, and JCVI functional role categories.
Standards in Genomic Sciences 01/2010; 2(2):229-37. · 2.01 Impact Factor
ABSTRACT: Generation of syntactically correct and unambiguous names for proteins is a challenging, yet vital task for functional annotation processes. Proteins are often named based on homology to known proteins, many of which have problematic names. To address the need to generate high-quality protein names, and capture our significant experience correcting protein names manually, we have developed the Protein Naming Utility (PNU, http://www.jcvi.org/pn-utility). The PNU is a web-based database for storing and applying naming rules to identify and correct syntactically incorrect protein names, or to replace synonyms with their preferred name. The PNU allows users to generate and manage collections of naming rules, optionally building upon the growing body of rules generated at the J. Craig Venter Institute (JCVI). Since communities often enforce disparate conventions for naming proteins, the PNU supports grouping rules into user-managed collections. Users can check their protein names against a selected PNU rule collection, generating both statistics and corrected names. The PNU can also be used to correct GenBank table files prior to submission to GenBank. Currently, the database features 3080 manual rules that have been entered by JCVI Bioinformatics Analysts as well as 7458 automatically imported names.
Nucleic Acids Research 12/2009; 38(Database issue):D336-9. · 8.28 Impact Factor
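Checking names against a rule collection, as described for the PNU above, might look roughly like the sketch below. The rules, the synonym table and the function names are hypothetical examples written for this illustration; they are not taken from the PNU database or its interface.

```python
import re

# Hypothetical syntactic rules (pattern, replacement), applied in order.
RULES = [
    (re.compile(r"\s+"), " "),                               # collapse runs of whitespace
    (re.compile(r"\bhomolog(ue)?\b", re.IGNORECASE), ""),    # drop homology wording
    (re.compile(r"[\s,;]+$"), ""),                           # strip trailing separators
]

# Hypothetical synonym table: lower-cased name -> preferred name.
SYNONYMS = {
    "unknown protein": "hypothetical protein",
}


def correct_name(name):
    """Apply the syntactic rules in order, then replace a known synonym."""
    for pattern, replacement in RULES:
        name = pattern.sub(replacement, name)
    name = name.strip()
    return SYNONYMS.get(name.lower(), name)


def check_names(names):
    """Return the corrected names plus a count of how many were changed."""
    corrected = [correct_name(n) for n in names]
    changed = sum(1 for old, new in zip(names, corrected) if old != new)
    return corrected, changed


if __name__ == "__main__":
    fixed, n_changed = check_names(
        ["DNA  polymerase III,  ", "recA homolog", "Unknown protein"]
    )
    print(fixed, n_changed)
```

Keeping rules in an ordered list mirrors the idea of user-managed rule collections: a different community convention would simply swap in a different list.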
ABSTRACT: Prokaryotic protein-protein interactions are underrepresented in currently available databases. Here, we describe a 'gold standard' dataset (MPI-LIT) focusing on microbial binary protein-protein interactions and associated experimental evidence that we have manually curated from 813 abstracts and full texts that were selected from an initial set of 36 852 abstracts. The MPI-LIT dataset comprises 1237 experimental descriptions that describe a non-redundant set of 746 interactions of which 659 (88%) are not reported in public databases. To estimate the curation quality, we compared our dataset with a union of microbial interaction data from IntAct, DIP, BIND and MINT. Among common abstracts, we achieve a sensitivity of up to 66% for interactions and 75% for experimental methods. Compared with these other datasets, MPI-LIT has the lowest fraction of interaction experiments per abstract (0.9) and the highest coverage of strains (92) and scientific articles (813). We compared methods that evaluate functional interactions among proteins (such as genomic context or co-expression) which are implemented in the STRING database. Most of these methods discriminate well between functionally relevant protein interactions (MPI-LIT) and high-throughput data. AVAILABILITY: http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-LIT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT: The microbial protein interaction database (MPIDB) aims to collect and provide all known physical microbial interactions. Currently, 22,530 experimentally determined interactions among proteins of 191 bacterial species/strains can be browsed and downloaded. These microbial interactions have been manually curated from the literature or imported from other databases (IntAct, DIP, BIND, MINT) and are linked to 24,060 experimental evidences (PubMed ID, PSI-MI methods). In contrast to these databases, interactions in MPIDB are further supported by 8150 additional evidences based on interaction conservation, co-purification and 3D domain contacts (iPfam, 3did). AVAILABILITY: http://www.jcvi.org/mpidb/