FragGeneScan: Predicting genes in short and error-prone reads

School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA.
Nucleic Acids Research (Impact Factor: 9.11). 11/2010; 38(20):e191. DOI: 10.1093/nar/gkq747
Source: PubMed


Advances in next-generation sequencing technology have facilitated metagenomics research, which attempts to determine directly
the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identifying genes directly
from short reads has become an important yet challenging problem in annotating metagenomes, since an assembly of the metagenome
is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and those recently developed for metagenomic
sequences (e.g. MetaGene) show a significant decrease in performance as sequencing error rates increase or as reads get
shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usage
in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan
was comparable to that of Glimmer and MetaGene for complete genomes. For short reads, however, FragGeneScan consistently outperformed MetaGene
(accuracy improved by ∼62% for reads of 400 bases with 1% sequencing errors, and by ∼18% for short reads of 100 bases that are error
free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes
identified by homology search), as well as many novel genes with no homologs in current protein sequence databases.



Available from: Yuzhen Ye
  • Source
    • "For DNA assembly the following kmer lengths were used: 51, 55 with Velvet and 31 with Meta-Ray. Genes were predicted from the obtained contigs using FragGeneScan with suggested options for contigs (Rho, Tang & Ye, 2010) generating 429,162 cDNA genes and 3,176,262 DNA genes, respectively (Table S1). The predicted gene sequences obtained with the different assemblers and kmer lengths were clustered at 99% similarity using UCLUST (Edgar, 2010). "
    ABSTRACT: Baltic Sea deep water and sediments hold one of the largest anthropogenically induced hypoxic areas in the world. High nutrient input and low water exchange result in eutrophication and oxygen depletion below the halocline. As a consequence, at Landsort Deep, the deepest point of the Baltic Sea, anoxia in the sediments has been a persistent condition over the past decades. Given that microbial communities are drivers of essential ecosystem functions, we investigated the microbial community metabolisms and functions of oxygen-depleted Landsort Deep sediments by metatranscriptomics. Results show substantial expression of genes involved in protein metabolism, demonstrating that the Landsort Deep sediment microbial community is active. Expressed gene suites of metabolic pathways with importance for carbon transformation, including fermentation, dissimilatory sulphate reduction and methanogenesis, were identified. The presence of transcripts for these metabolic processes suggests a potential for heterotrophic-autotrophic community synergism and indicates active mineralisation of the organic matter deposited at the sediment as a consequence of the eutrophication process. Furthermore, cyanobacteria, probably deposited from the water column, are transcriptionally active in the anoxic sediment at this depth. Results also reveal high abundance of transcripts encoding integron integrases. These results provide insight into the activity of the microbial community of the anoxic sediment at the deepest point of the Baltic Sea and its possible role in ecosystem functioning.
    Full-text · Article · Jan 2016 · PeerJ
  • Source
    • "The filtered sequences were assembled using Velvet/MetaVelvet (Namiki et al., 2012; Zerbino and Birney, 2008), and then contigs were generated. The contigs were then converted to putative protein sequences using FragGeneScan (Rho et al., 2010). For protein annotation and analysis, the assembled sequences were uploaded on the Metagenomics Analysis Server (MG-RAST) (Meyer et al., 2008), and subjected to similarity search using the SEED database, keeping 10^-5 as the maximum E-value (Overbeek et al., 2005). "
    ABSTRACT: In this study, we report the distribution and abundance of cold-adaptation proteins in microbial mat communities in the perennially ice-covered Lake Joyce, located in the McMurdo Dry Valleys, Antarctica. We have used MGRAST and R code bioinformatics tools on Illumina HiSeq2000 shotgun metagenomic data and compared the filtering efficacy of these two methods on cold-adaptation proteins. Overall, the abundance of cold-shock DEAD-box protein A (CSDA), antifreeze proteins (AFPs), fatty acid desaturase (FAD), trehalose synthase (TS), and cold-shock family of proteins (CSPs) were present in all mat samples at high, moderate, or low levels, whereas the ice nucleation protein (INP) was present only in the ice and bulbous mat samples at insignificant levels. Considering the near homogeneous temperature profile of Lake Joyce (0.08-0.29 °C), the distribution and abundance of these proteins across various mat samples predictively correlated with known functional attributes necessary for microbial communities to thrive in this ecosystem. The comparison of the MG-RAST and the R code methods showed dissimilar occurrences of the cold-adaptation protein sequences, though with insignificant ANOSIM (R = 0.357; p-value = 0.012), ADONIS (R2 = 0.274; p-value = 0.03) and STAMP (p-values = 0.521-0.984) statistical analyses. Furthermore, filtering targeted sequences using the R code accounted for taxonomic groups by avoiding sequence redundancies, whereas the MG-RAST provided total counts resulting in a higher sequence output. The results from this study revealed for the first time the distribution of cold-adaptation proteins in six different types of microbial mats in Lake Joyce, while suggesting a simpler and more manageable user-defined method of R code, as compared to a web-based MG-RAST pipeline.
    Full-text · Article · Nov 2015 · Journal of microbiological methods
  • Source
    • "for both the microbialite and sediment contigs. Nevertheless, only 0.64 and 1.74% of the raw reads from the sediment and microbialite metagenomes, respectively, assembled into contigs, indicating that both environments had complex microbial communities. FragGeneScan was used to predict and translate contig open reading frames (ORFs) (Rho et al., 2010) and ProPas (Wu and Zhu, 2012) was used to calculate predicted protein isoelectric points (pI)."
    ABSTRACT: Within the subarctic climate of Clinton Creek, Yukon, Canada, lies an abandoned and flooded open-pit asbestos mine that harbors rapidly growing microbialites. To understand their formation we completed a metagenomic community profile of the microbialites and their surrounding sediments. Assembled metagenomic data revealed that bacteria within the phylum Proteobacteria numerically dominated this system, although the relative abundances of taxa within the phylum varied among environments. Bacteria belonging to Alphaproteobacteria and Gammaproteobacteria were dominant in the microbialites and sediments, respectively. The microbialites were also home to many other groups associated with microbialite formation including filamentous cyanobacteria and dissimilatory sulfate-reducing Deltaproteobacteria, consistent with the idea of a shared global microbialite microbiome. Other members were present that are typically not associated with microbialites including Gemmatimonadetes and iron-oxidizing Betaproteobacteria, which participate in carbon metabolism and iron cycling. Compared to the sediments, the microbialite microbiome has significantly more genes associated with photosynthetic processes (e.g., photosystem II reaction centers, carotenoid, and chlorophyll biosynthesis) and carbon fixation (e.g., CO dehydrogenase). The Clinton Creek microbialite communities had strikingly similar functional potentials to non-lithifying microbial mats from the Canadian High Arctic and Antarctica, but are functionally distinct from non-lithifying mats or biofilms from Yellowstone. Clinton Creek microbialites also share metabolic genes (R² < 0.750) with freshwater microbial mats from Cuatro Ciénegas, Mexico, but are more similar to polar Arctic mats (R² > 0.900). These metagenomic profiles from an anthropogenic microbialite-forming ecosystem provide context to microbialite formation on a human-relevant timescale.
    Full-text · Article · Sep 2015 · Frontiers in Microbiology