FragGeneScan: Predicting genes in short and error-prone reads

School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA.
Nucleic Acids Research (Impact Factor: 9.11). 11/2010; 38(20):e191. DOI: 10.1093/nar/gkq747
Source: PubMed


Advances in next-generation sequencing technology have facilitated metagenomics research, which attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identifying genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since an assembly of the metagenome is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and those recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as sequencing error rates increase or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to that of Glimmer and MetaGene for complete genomes. For short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), as well as many novel genes with no homologs in current protein sequence databases.
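
To make the hidden Markov model idea in the abstract more concrete, the following is a minimal sketch of Viterbi decoding over a toy two-state model. It is not FragGeneScan's actual model, which the abstract describes as combining codon usage with sequencing error models. The state names, the probabilities and the function `viterbi` are all hypothetical and chosen only for illustration.

```python
import math

# Toy two-state HMM. All values below are hypothetical illustration
# parameters, not taken from FragGeneScan.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor that maximises the path score into s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("AAAAAAGGGGGGGG")` labels each base with one of the two states. A realistic gene finder would replace the single-nucleotide emissions with codon-level statistics and add states that absorb insertions and deletions, which is the role the abstract assigns to the sequencing error models.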


Available from: Yuzhen Ye
    • "for both the microbialite and sediment contigs. Nevertheless, only 0.64 and 1.74% of the raw reads from the sediment and microbialite metagenomes, respectively, assembled into contigs, indicating that both environments had complex microbial communities. FragGeneScan was used to predict and translate contig open reading frames (ORFs) (Rho et al., 2010) and ProPas (Wu and Zhu, 2012) was used to calculate predicted protein isoelectric points (pI)."
    ABSTRACT: Within the subarctic climate of Clinton Creek, Yukon, Canada, lies an abandoned and flooded open-pit asbestos mine that harbors rapidly growing microbialites. To understand their formation we completed a metagenomic community profile of the microbialites and their surrounding sediments. Assembled metagenomic data revealed that bacteria within the phylum Proteobacteria numerically dominated this system, although the relative abundances of taxa within the phylum varied among environments. Bacteria belonging to Alphaproteobacteria and Gammaproteobacteria were dominant in the microbialites and sediments, respectively. The microbialites were also home to many other groups associated with microbialite formation, including filamentous cyanobacteria and dissimilatory sulfate-reducing Deltaproteobacteria, consistent with the idea of a shared global microbialite microbiome. Other members were present that are typically not associated with microbialites, including Gemmatimonadetes and iron-oxidizing Betaproteobacteria, which participate in carbon metabolism and iron cycling. Compared to the sediments, the microbialite microbiome has significantly more genes associated with photosynthetic processes (e.g., photosystem II reaction centers, carotenoid, and chlorophyll biosynthesis) and carbon fixation (e.g., CO dehydrogenase). The Clinton Creek microbialite communities had strikingly similar functional potentials to non-lithifying microbial mats from the Canadian High Arctic and Antarctica, but are functionally distinct from non-lithifying mats or biofilms from Yellowstone. Clinton Creek microbialites also share metabolic genes (R² < 0.750) with freshwater microbial mats from Cuatro Ciénegas, Mexico, but are more similar to polar Arctic mats (R² > 0.900). These metagenomic profiles from an anthropogenic microbialite-forming ecosystem provide context to microbialite formation on a human-relevant timescale.
    Frontiers in Microbiology 09/2015; 6. DOI:10.3389/fmicb.2015.00966 · 3.99 Impact Factor
    • "The raw reads were trimmed and filtered based on quality and assembled into contigs. Coding regions were subsequently identified from the trimmed reads as well as from the contigs using FragGeneScan v1.19 [9]. All the obtained amino acid sequences were used as a reference database for metaproteomic analyses of all samples."
    ABSTRACT: Two parallel anaerobic digestion lines were designed to match a "bovid-like" digestive structure. Each of the lines consisted of two Continuous Stirred Tank Reactors placed in series and separated by an acidic treatment step. The first line was inoculated with industrial inocula whereas the second was seeded with cow digestive tract contents. After three months of continuous sewage sludge feeding, samples were recovered for shotgun metaproteomic and DNA-based analysis. Strikingly, protein-inferred and 16S rDNA tags-based taxonomic community profiles were not consistent. Principal Component analysis, however, revealed a similar clustering pattern of the samples, suggesting that reproducible methodological and/or biological factors underlie this observation. The performances of the two digestion lines did not differ significantly and the cow-derived inocula did not establish in the reactors. A low-throughput metagenomic dataset (3.4 × 10⁶ reads, 1.1 Gb) was also generated for one of the samples. It allowed a substantial increase of the analysis depth (11 vs 4% of spectral identification rate for the combined samples). Surprisingly, a high proportion of proteins from members of the "Candidatus Competibacter" group, a key microbial player usually found in activated sludge plants, was retrieved in our anaerobic digester samples. Data are available via ProteomeXchange with identifier PXD002420. This article is protected by copyright. All rights reserved.
    Proteomics 08/2015; 15(20). DOI:10.1002/pmic.201500041 · 3.81 Impact Factor
    • "To predict the overall community metabolism metagenome assembly (contigs > 500 bp) was used. Gene calling was performed on the selected contigs using FragGeneScan (Rho et al., 2010) at the parameters -genome -complete=0 -train=sanger_5. Predicted ORFs were annotated against NCBI-nr database (downloaded on September 2012), KEGG (Kanehisa et al., 2004) and COGG (Tatusov et al., 2003) using BLASTP (Altschul et al., 1990) at an E value cut-off of 1 × 10⁻⁵."
    ABSTRACT: Bdellovibrio bacteriovorus are small Deltaproteobacteria that invade, kill, and assimilate their prey. Metagenomic assembly analysis of the microbial mats of an arsenic rich hot spring was performed to describe the genotypes of the predator Bdellovibrio and the ecogenetically adapted taxa Enterobacter. The microbial mats were enriched with Bdellovibrio (1.3%) and several gram negative bacteria including Bordetella (16%), Enterobacter (6.8%), Burkholderia (4.8%), Acinetobacter (2.3%), and Yersinia (1%). A high quality (47 contigs, 25X coverage; 3.5 Mbp) draft genome of Bdellovibrio (strain ArHS; Arsenic Hot-Spring) was reassembled, which lacked the marker gene Bd0108 associated with the usual method of prey interaction and invasion for this genus, while maintaining genes coding for the hydrolytic enzymes necessary for prey assimilation. By filtering microbial mat samples (< 0.45 μm) to enrich for small predatory cell sizes we observed Bdellovibrio-like cells attached side-on to E. coli through electron microscopy. Furthermore, a draft pan-genome of the dominant potential host taxon, Enterobacter cloacae ArHS (4.8 Mb), along with three of its viral genotypes (n = 3; 42, 49, and 50 kb) was assembled. These data were further used to analyse the population level evolutionary dynamics (taxonomical, functional, and evolutionary) of reconstructed genotypes. This article is protected by copyright. All rights reserved.
    Environmental Microbiology Reports 05/2015; DOI:10.1111/1758-2229.12297 · 3.29 Impact Factor