Gene prediction in metagenomic fragments: A large scale machine learning approach

Abteilung Bioinformatik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.
BMC Bioinformatics (Impact Factor: 2.67). 02/2008; 9:217. DOI: 10.1186/1471-2105-9-217
Source: PubMed

ABSTRACT Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of about 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.
We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
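The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy weights, the 700 bp length scaling and the tiny two-unit hidden layer are all assumptions made for illustration, and only the monocodon discriminant is shown (the dicodon and translation initiation site discriminants are omitted). Stage one reduces codon frequencies to a single linear discriminant score; stage two feeds that score, the ORF length and the fragment GC-content into a small neural network that outputs a probability.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so feature vectors line up with weight vectors.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def codon_frequencies(orf):
    """Relative frequency of each of the 64 codons, read in frame from position 0."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts[c] for c in CODONS)  # ignores codons with ambiguous bases
    return [counts[c] / total if total else 0.0 for c in CODONS]


def linear_discriminant(features, weights, bias=0.0):
    """Stage 1: a linear score w . x + b over a feature vector."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coding_probability(features, hidden_weights, hidden_biases, out_weights, out_bias):
    """Stage 2: one hidden tanh layer followed by a sigmoid output unit."""
    hidden = [
        math.tanh(linear_discriminant(features, w, b))
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(linear_discriminant(hidden, out_weights, out_bias))


# Toy example: an ORF embedded in a short fragment (hypothetical sequences).
orf = "ATGGCGAAAGGCCTGACCGGCTAA"
fragment = "TTAC" + orf + "GGAT"

# Untrained, made-up weights; a real model would learn these from annotated genomes.
mono_weights = [0.1 if codon[2] in "GC" else -0.1 for codon in CODONS]
mono_score = linear_discriminant(codon_frequencies(orf), mono_weights)

# Assumed feature layout: discriminant score, scaled ORF length, fragment GC-content.
features = [mono_score, len(orf) / 700.0, gc_content(fragment)]
probability = coding_probability(
    features,
    hidden_weights=[[1.0, 0.5, -0.5], [-1.0, 0.5, 0.5]],
    hidden_biases=[0.0, 0.0],
    out_weights=[1.0, -1.0],
    out_bias=0.0,
)
print(f"coding probability (toy weights): {probability:.3f}")
```

Because the weights here are arbitrary, the printed probability carries no biological meaning; the sketch only shows how per-feature linear scores might be combined with length and GC-content in a second-stage network.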

  • Source
    ABSTRACT: Traditionally, microbial genome sequencing has been restricted to the small number of species that can be grown in pure culture [1]. The progressive development of culture-independent methods over the last 15 years now allows researchers to sequence microbial communities directly from environmental samples. This approach is commonly referred to as "metagenomics" or "community genomics". However, the term metagenomics is applied liberally in the literature to describe any culture-independent analysis of microbial communities. Here, we define metagenomics as shotgun ("random") sequencing of the genomic DNA of a sample taken directly from the environment. The metagenome can be thought of as a sampling of the collective genome of the microbial community. We outline the considerations and analyses that should be undertaken to ensure the success of a metagenomic sequencing project, including the choice of sequencing platform and methods for assembly, binning, annotation, and comparative analysis.
    Methods in molecular biology (Clifton, N.J.) 01/2014; 1096:183-201. DOI:10.1007/978-1-62703-712-9_15 · 1.29 Impact Factor
  • Source
    ABSTRACT: The ark shell, Scapharca broughtonii, is a marine bivalve mollusk belonging to the family Arcidae and an important seafood in Korea and Japan; the bays of the southern coast are active sites of ark shell aquaculture. However, the productivity of ark shell in these regions has declined rapidly during the last decade because of mass mortality. The cause of this damage has not yet been identified. To overcome this economic loss, diverse investigations have focused on environmental factors that affect the physiology of S. broughtonii, but microbiological research has been insufficient. Hemoglobin is one of the major blood components of the ark shell and is damaged by some bacterial toxins. We concentrated on this red pigment because hemolysis could be the cause of ark shell mortality. In this study, we analyzed the microbial diversity of underwater sediments in coastal regions and of the body of S. broughtonii. We investigated about 4,200 isolates collected from June to September for the microbial diversity of sediments and ark shell. We screened all culturable microorganisms and identified 25 genera and 118 species, 24 genera and 89 species, 30 genera and 109 species, and 39 genera and 141 species, and selected 140 unique colonies for identification and challenge assay.
    12/2013; 26(3). DOI:10.7847/jfp.2013.26.3.193
  • Source
    ABSTRACT: Giardia lamblia is a protozoan parasite that is found worldwide and has both medical and veterinary importance. We applied the transcription start sequence (TSS-seq) and RNA sequence (RNA-seq) techniques to study the transcriptome of the assemblage A WB strain trophozoite. We identified 8000 transcription regions (TR) with significant transcription. Of these regions, 1881 TRs were more than 500 nucleotides upstream of an annotated ORF. Combining both techniques helped us to identify 24 ORFs that should be re-annotated and 60 new ORFs. From the 8000 TRs, we were able to identify an AT-rich consensus that includes the transcription initiation site. It is possible that transcription that was previously thought to be bidirectional is actually unidirectional.
    PLoS ONE 10/2013; 8(10):e76184. DOI:10.1371/journal.pone.0076184 · 3.53 Impact Factor
