FragGeneScan: Predicting genes in short and error-prone reads

School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA.
Nucleic Acids Research (Impact Factor: 9.11). 11/2010; 38(20):e191. DOI: 10.1093/nar/gkq747
Source: PubMed


Advances in next-generation sequencing technology have facilitated metagenomics research, which attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identifying genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since an assembly of the metagenome is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and those recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as sequencing error rates increase or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to that of Glimmer and MetaGene for complete genomes. For short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), as well as many novel genes with no homologs in current protein sequence databases.
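As a rough illustration of the hidden Markov model idea mentioned in the abstract, the sketch below runs Viterbi decoding on a generic two-state HMM that labels each base as "coding" or "noncoding". It is not FragGeneScan's model: it has no codon states and no sequencing error states, and every probability in it is a hypothetical value chosen only to make the example run.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, chosen only for illustration.
# They are NOT taken from FragGeneScan or from any trained model.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    # scores[s] holds the best log-probability of any path ending in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy parameters, `viterbi("AAAAAAAAGGGGGGGG")` labels the A-rich half as noncoding and the G-rich half as coding. A realistic gene finder would replace the per-base emissions with codon-aware states and, as the abstract describes, add states that account for sequencing errors.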



Available from: Yuzhen Ye, Oct 06, 2015
    • "The raw reads were trimmed and filtered based on quality and assembled into contigs. Coding regions were subsequently identified from the trimmed reads as well as from the contigs using FragGeneScan v1.19 [9]. All the obtained aminoacid sequences were used as a reference database for metaproteomic analyses of all samples. "
    ABSTRACT: Two parallel anaerobic digestion lines were designed to match a "bovid-like" digestive structure. Each of the lines consisted of two Continuous Stirred Tank Reactors placed in series and separated by an acidic treatment step. The first line was inoculated with industrial inocula whereas the second was seeded with cow digestive tract contents. After three months of continuous sewage sludge feeding, samples were recovered for shotgun metaproteomic and DNA-based analysis. Strikingly, protein-inferred and 16S rDNA tags-based taxonomic community profiles were not consistent. Principal component analysis, however, revealed a similar clustering pattern of the samples, suggesting that reproducible methodological and/or biological factors underlie this observation. The performances of the two digestion lines did not differ significantly and the cow-derived inocula did not establish in the reactors. A low throughput metagenomic dataset (3.4×10⁶ reads, 1.1 Gb) was also generated for one of the samples. It allowed a substantial increase of the analysis depth (11 vs 4% of spectral identification rate for the combined samples). Surprisingly, a high proportion of proteins from members of the "Candidatus Competibacter" group, a key microbial player usually found in activated sludge plants, was retrieved in our anaerobic digester samples. Data are available via ProteomeXchange with identifier PXD002420. This article is protected by copyright. All rights reserved.
    Proteomics 08/2015; DOI:10.1002/pmic.201500041 · 3.81 Impact Factor
    • "To predict the overall community metabolism metagenome assembly (contigs > 500 bp) was used. Gene calling was performed on the selected contigs using FragGeneScan (Rho et al., 2010) at the parameters -genome -complete=0 -train=sanger_5. Predicted ORFs were annotated against NCBI-nr database (downloaded on September 2012), KEGG (Kanehisa et al., 2004) and COGG (Tatusov et al., 2003) using BLASTP (Altschul et al., 1990) at an E value cut-off of 1 × 10⁻⁵. "
    ABSTRACT: Bdellovibrio bacteriovorus are small Deltaproteobacteria that invade, kill, and assimilate their prey. Metagenomic assembly analysis of the microbial mats of an arsenic rich hot spring was performed to describe the genotypes of the predator Bdellovibrio and the ecogenetically adapted taxa Enterobacter. The microbial mats were enriched with Bdellovibrio (1.3%) and several gram negative bacteria including Bordetella (16%), Enterobacter (6.8%), Burkholderia (4.8%), Acinetobacter (2.3%), and Yersinia (1%). A high quality (47 contigs, 25X coverage; 3.5 Mbp) draft genome of Bdellovibrio (strain ArHS; Arsenic Hot-Spring) was reassembled, which lacked the marker gene Bd0108 associated with the usual method of prey interaction and invasion for this genus, while maintaining genes coding for the hydrolytic enzymes necessary for prey assimilation. By filtering microbial mat samples (< 0.45 μm) to enrich for small predatory cell sizes we observed Bdellovibrio-like cells attached side-on to E. coli through electron microscopy. Furthermore, a draft pan-genome of the dominant potential host taxon, Enterobacter cloacae ArHS (4.8 Mb), along with three of its viral genotypes (n = 3; 42, 49, and 50 kb) was assembled. These data were further used to analyse the population level evolutionary dynamics (taxonomical, functional, and evolutionary) of reconstructed genotypes. This article is protected by copyright. All rights reserved.
    Environmental Microbiology Reports 05/2015; DOI:10.1111/1758-2229.12297 · 3.29 Impact Factor
    • "The unigenes were annotated by BLASTX search against NCBI non-redundant (nr) database with E-value cutoff of 1e-5. Meanwhile, the unigene sequences were translated to protein and corresponding ORFs were predicted using FragGeneScan (Rho et al., 2010). The deduced protein sequences were also annotated, using JCVI metagenomic annotation pipeline (Tanenbaum et al., 2010). "
    ABSTRACT: The globally increasing trend of harmful algal blooms (HAB) is often attributed to coastal eutrophication and climate change, but the physiological processes at play during a bloom are poorly understood due to lack of in situ measurements of these processes or their corresponding molecular machinery. Here we employed a dinoflagellate spliced leader-based 454-pyrosequencing technique to generate time-serial expressed sequence tags (EST) throughout a diel cycle during an Alexandrium fundyense bloom in Long Island Sound. Assembly of the reads yielded 87,273 dinoflagellate genes. To facilitate mapping of these data to species and comparing of gene expression dynamics between natural bloom and laboratory culture conditions, similar diel EST sets were sequenced for A. fundyense strain CCMP1719, which achieved 31,451 genes. The assembled metatranscriptome reveals a metabolically active A. fundyense population. Relative to the laboratory culture, the natural bloom expressed more abundantly genes related to nitrogen (N)-scavenging, CO2-concentrating and saxitoxin production. Most strikingly, the data showed a versatility to exploit various sources of N (cyanate, urea, nitrate/nitrite, and ammonium), likely conferring competitive advantages in A. fundyense for bloom formation or maintenance. The dataset also led to the first characterization of Ni-containing superoxide dismutases (NiSODs) in dinoflagellates with mitochondrial targeting signal identified, which along with other types of SOD found indicated diverse SODs being expressed in the A. fundyense bloom. Our results also demonstrate that metatranscriptomics is an effective approach to unraveling physiological processes conducive to HAB outbreaks.
    Harmful Algae 02/2015; 42(1). DOI:10.1016/j.hal.2014.12.006 · 3.87 Impact Factor
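
The FragGeneScan parameters quoted in the second excerpt above (genome input, complete = 0, train = sanger_5) can be assembled into a command line roughly as follows. This is a minimal sketch: it assumes the `-key=value` argument form of the `run_FragGeneScan.pl` wrapper script, which should be checked against the README of the installed version, and the input and output paths are hypothetical placeholders. The function only builds the argument list and does not run anything.

```python
def build_fraggenescan_command(
    genome_path,
    out_prefix,
    complete=0,
    train="sanger_5",
    script="run_FragGeneScan.pl",  # assumed wrapper script name
):
    """Build an argument list for a FragGeneScan run (assumed -key=value flags)."""
    return [
        script,
        f"-genome={genome_path}",   # input FASTA (hypothetical path)
        f"-out={out_prefix}",       # output file prefix (hypothetical path)
        f"-complete={complete}",    # 0 = fragments/reads, 1 = complete sequences
        f"-train={train}",          # error-model training set, e.g. sanger_5
    ]


# Hypothetical file names, for illustration only.
command = build_fraggenescan_command("contigs.fa", "predicted_genes")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the script location and flags have been confirmed for the local installation.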