FragGeneScan: Predicting genes in short and error-prone reads

School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA.
Nucleic Acids Research (Impact Factor: 9.11). 11/2010; 38(20):e191. DOI: 10.1093/nar/gkq747
Source: PubMed

ABSTRACT Advances in next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. For complete genomes, the performance of FragGeneScan was comparable to that of Glimmer and MetaGene. For short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence databases.
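
As a rough illustration of the hidden Markov model idea mentioned in the abstract, the sketch below decodes a toy two-state model (coding vs. noncoding) with the Viterbi algorithm. It is not FragGeneScan's model: the two states, every transition and emission probability, and the function name `viterbi` are assumptions made up for this example. According to the abstract, the published method additionally combines sequencing error models and codon usage, neither of which is represented here.

```python
import math

# Toy parameters (assumptions for illustration only, not taken from the paper).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
# Made-up composition bias: "coding" is GC-rich, "noncoding" is AT-rich.
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    if not seq:
        return []
    # Work in log space to avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy numbers, a single off-composition base inside a long run is not enough to flip the state, because switching states twice costs more than one unlikely emission. That is the general reason an HMM can tolerate isolated noise, although the real tool handles sequencing errors with its own dedicated error model.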

Available from: Yuzhen Ye, Aug 30, 2015
  • Source
    • "To predict the overall community metabolism metagenome assembly (contigs > 500 bp) was used. Gene calling was performed on the selected contigs using FragGeneScan (Rho et al., 2010) at the parameters -genome -complete=0 -train=sanger_5. Predicted ORFs were annotated against NCBI-nr database (downloaded on September 2012), KEGG (Kanehisa et al., 2004) and COGG (Tatusov et al., 2003) using BLASTP (Altschul et al., 1990) at an E value cut-off of 1 × 10⁻⁵."
    ABSTRACT: Bdellovibrio bacteriovorus are small Deltaproteobacteria that invade, kill, and assimilate their prey. Metagenomic assembly analysis of the microbial mats of an arsenic rich hot spring was performed to describe the genotypes of the predator Bdellovibrio and the ecogenetically adapted taxa Enterobacter. The microbial mats were enriched with Bdellovibrio (1.3%) and several gram negative bacteria including Bordetella (16%), Enterobacter (6.8%), Burkholderia (4.8%), Acinetobacter (2.3%), and Yersinia (1%). A high quality (47 contigs, 25X coverage; 3.5 Mbp) draft genome of Bdellovibrio (strain ArHS; Arsenic Hot-Spring) was reassembled, which lacked the marker gene Bd0108 associated with the usual method of prey interaction and invasion for this genus, while maintaining genes coding for the hydrolytic enzymes necessary for prey assimilation. By filtering microbial mat samples (< 0.45 μm) to enrich for small predatory cell sizes we observed Bdellovibrio-like cells attached side-on to E. coli through electron microscopy. Furthermore, a draft pan-genome of the dominant potential host taxon, Enterobacter cloacae ArHS (4.8 Mb), along with three of its viral genotypes (n = 3; 42, 49, and 50 kb) was assembled. These data were further used to analyse the population level evolutionary dynamics (taxonomical, functional, and evolutionary) of reconstructed genotypes.
    Environmental Microbiology Reports 05/2015; DOI:10.1111/1758-2229.12297 · 3.26 Impact Factor
  • Source
    • "The unigenes were annotated by BLASTX search against NCBI non-redundant (nr) database with E-value cutoff of 1e-5. Meanwhile, the unigene sequences were translated to protein and corresponding ORFs were predicted using FragGeneScan (Rho et al., 2010). The deduced protein sequences were also annotated, using JCVI metagenomic annotation pipeline (Tanenbaum et al., 2010)."
    ABSTRACT: The globally increasing trend of harmful algal blooms (HAB) is often attributed to coastal eutrophication and climate change, but the physiological processes at play during a bloom are poorly understood due to lack of in situ measurements of these processes or their corresponding molecular machinery. Here we employed a dinoflagellate spliced leader-based 454-pyrosequencing technique to generate time-serial expressed sequence tags (EST) throughout a diel cycle during an Alexandrium fundyense bloom in Long Island Sound. Assembly of the reads yielded 87,273 dinoflagellate genes. To facilitate mapping of these data to species and comparing of gene expression dynamics between natural bloom and laboratory culture conditions, similar diel EST sets were sequenced for A. fundyense strain CCMP1719, which achieved 31,451 genes. The assembled metatranscriptome reveals a metabolically active A. fundyense population. Relative to the laboratory culture, the natural bloom expressed more abundantly genes related to nitrogen (N)-scavenging, CO2-concentrating and saxitoxin production. Most strikingly, the data showed a versatility to exploit various sources of N (cyanate, urea, nitrate/nitrite, and ammonium), likely conferring competitive advantages in A. fundyense for bloom formation or maintenance. The dataset also led to the first characterization of Ni-containing superoxide dismutases (NiSODs) in dinoflagellates with mitochondrial targeting signal identified, which along with other types of SOD found indicated diverse SODs being expressed in the A. fundyense bloom. Our results also demonstrate that metatranscriptomics is an effective approach to unraveling physiological processes conducive to HAB outbreaks.
    Harmful Algae 02/2015; 42. DOI:10.1016/j.hal.2014.12.006 · 3.34 Impact Factor
  • Source
    • "There is a number of available algorithms, adapted for the annotation of full genome sequences with the estimated accuracy of about 95% in the prediction of CDSs (Lukashin & Borodovsky, 1998). There are several tools adjusted to predict the CDSs, such as MetaGeneMark (McHardy et al., 2007), FragGeneScan (Rho et al., 2010) or MetaGene Annotator (Noguchi et al., 2008). All tools listed are based on internal information, such as codon usage, to categorize the sequence fragment as coding or noncoding (Thomas et al., 2012)."
    ABSTRACT: Metagenomics is a powerful tool for better understanding microbial niches, especially those in extreme habitats such as oceans and seas, hot springs or deserts. However, anyone undertaking metagenomic studies should be aware of the challenges that may arise in the course of experiments. This manuscript outlines common problems in function-driven metagenomics, with particular attention to factors that influence gene expression. Codon usage bias, internal cell accumulation, correct protein folding and the presence of proper initiation factors are discussed, and possible ways to overcome these problems are proposed. Finally, the annotation process is described, including possible limitations that should be taken into consideration. In addition, the most popular databases for metagenomic data are mentioned and discussed.
    Acta biochimica Polonica 02/2015; 62(1). DOI:10.18388/abp.2014_917 · 1.39 Impact Factor
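
The gene-calling step quoted in the first citing excerpt above (keep contigs > 500 bp, then run FragGeneScan with `-complete=0` and `-train=sanger_5`) could be prepared along the following lines. This is a minimal sketch, not that study's pipeline: the helper names `read_fasta`, `filter_contigs` and `fraggenescan_command` are hypothetical, as are any file names passed in. The wrapper script name `run_FragGeneScan.pl` and the `-out` option do not appear in the excerpt and are assumptions based on how the tool is commonly distributed, so they should be checked against the usage message of the installed version. The command is only assembled, never executed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_len=500):
    """Keep contigs strictly longer than min_len bases (i.e. contigs > 500 bp)."""
    return [(h, s) for h, s in records if len(s) > min_len]


def fraggenescan_command(fasta_path, out_prefix):
    """Build (but do not run) an argument list for the FragGeneScan wrapper.

    Script name and -out are assumptions; -complete=0 and -train=sanger_5
    are the settings quoted in the excerpt.
    """
    return [
        "run_FragGeneScan.pl",
        f"-genome={fasta_path}",
        f"-out={out_prefix}",
        "-complete=0",
        "-train=sanger_5",
    ]
```

The returned list can be handed to `subprocess.run` once the paths and the script location have been verified locally.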