FragGeneScan: predicting genes in short and error-prone reads.

School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA.
Nucleic Acids Research (Impact Factor: 8.81). 11/2010; 38(20):e191. DOI: 10.1093/nar/gkq747
Source: PubMed

ABSTRACT The advances of next-generation sequencing technology have facilitated metagenomics research, which attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identifying genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since assemblies of metagenomes are often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and those recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as sequencing error rates increase or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usage in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to that of Glimmer and MetaGene for complete genomes. For short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for error-free short reads of 100 bases). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), as well as many novel genes with no homologs in current protein sequence databases.

  • ABSTRACT: Although heterokaryons have been reported in nature, multicellular organisms are generally assumed to be genetically homogeneous. Here, we investigate the case of arbuscular mycorrhizal fungi (AMF) that form symbiosis with plant roots. The growth advantages they confer to their hosts are of great potential benefit to sustainable agricultural practices. However, measuring genetic diversity for these coenocytes is a major challenge: within the same cytoplasm, AMF contain thousands of nuclei and show extremely high levels of genetic variation for some loci. The extent and physical location of polymorphism within and between AMF genomes are unclear. We used two complementary strategies to estimate genetic diversity in AMF, investigating polymorphism both on a genome scale and in putative single copy loci. First, we used data from whole-genome pyrosequencing of four AMF isolates to describe genetic diversity, based on a conservative network-based clustering approach. AMF isolates showed marked differences in genome-wide diversity patterns in comparison to a panel of control fungal genomes. This clustering approach further allowed us to provide conservative estimates of Rhizophagus spp. genome sizes. Second, we designed new putative single copy genomic markers, which we investigated by massively parallel amplicon sequencing for two Rhizophagus irregularis and one Rhizophagus sp. isolates. Most loci showed high polymorphism, with up to 103 alleles per marker. This polymorphism could be distributed within or between nuclei. However, we argue that the Rhizophagus isolates under study might be heterokaryotic, at least for the putative single copy markers we studied. Considering that genetic information is the main resource for identification of AMF, we suggest that special attention is warranted for the study of these ecologically important organisms. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
    Genome Biology and Evolution 01/2015; DOI:10.1093/gbe/evv002 · 4.53 Impact Factor
  • ABSTRACT: Recent studies suggest that gut microbiomes of urban-industrialized societies are different from those of traditional peoples. Here we examine the relationship between lifeways and gut microbiota through taxonomic and functional potential characterization of faecal samples from hunter-gatherer and traditional agriculturalist communities in Peru and an urban-industrialized community from the US. We find that in addition to taxonomic and metabolic differences between urban and traditional lifestyles, hunter-gatherers form a distinct subgroup among traditional peoples. As observed in previous studies, we find that Treponema are characteristic of traditional gut microbiomes. Moreover, through genome reconstruction (2.2–2.5 MB, coverage depth ×26–513) and functional potential characterization, we discover these Treponema are diverse, fall outside of pathogenic clades and are similar to Treponema succinifaciens, a known carbohydrate metabolizer in swine. Gut Treponema are found in non-human primates and all traditional peoples studied to date, suggesting they are symbionts lost in urban-industrialized societies.
    Nature Communications 04/2015 · 10.74 Impact Factor
  • ABSTRACT: Novel gene finding is an emerging field in environmental research. In past decades, research focused mainly on discovering microorganisms capable of degrading a particular compound. Many methods for cultivating and screening these novel microorganisms are available in the literature, and all of them are efficient for screening microbes that can be cultivated in the laboratory. Microorganisms that live in extreme conditions such as hot springs, frozen glaciers and acid mine drainage cannot be cultivated in the laboratory because knowledge of their growth requirements, such as temperature, nutrients and their mutual dependence on each other, is incomplete. The microbes that can be cultivated account for less than 1% of the total microbes present on Earth; the remaining 99%, the uncultivated majority, remain inaccessible. Metagenomics transcends the culture requirements of microbes. In metagenomics, DNA is extracted directly from environmental samples such as soil, seawater and acid mine drainage, followed by construction and screening of a metagenomic library. With ongoing research, a huge amount of metagenomic data is accumulating. Understanding these data is an essential step in extracting novel genes of industrial importance. Various bioinformatics tools have been designed to analyze and annotate the data produced from metagenomes. The bioinformatic requirements of metagenomic data analysis differ in theory and practice. This paper reviews the tools available for metagenomic data analysis and their capabilities: what they can do and their web availability.
    01/2015; DOI:10.1007/s40030-014-0102-y
