Gene prediction in metagenomic fragments: A large scale machine learning approach

Abteilung Bioinformatik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.
BMC Bioinformatics (Impact Factor: 2.67). 02/2008; 9:217. DOI: 10.1186/1471-2105-9-217
Source: PubMed

ABSTRACT Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.
We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
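
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tanh hidden layer, the division of ORF length by 700, and all weights below are assumptions chosen for illustration, and only the monocodon discriminant is shown (dicodon usage and translation initiation site scores would enter the feature vector in the same way).

```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def codon_frequencies(orf):
    """Relative frequency of each codon (monocodon usage) in an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    total = len(codons)
    return {c: n / total for c, n in Counter(codons).items()} if total else {}


def linear_discriminant(freqs, weights, bias=0.0):
    """Stage 1: reduce a usage vector to a single score w.x + b."""
    return bias + sum(weights.get(codon, 0.0) * f for codon, f in freqs.items())


def coding_probability(features, hidden_w, hidden_b, out_w, out_b):
    """Stage 2: small feed-forward network (tanh hidden layer, sigmoid output)."""
    hidden = [
        math.tanh(b + sum(w * x for w, x in zip(ws, features)))
        for ws, b in zip(hidden_w, hidden_b)
    ]
    z = out_b + sum(w * h for w, h in zip(out_w, hidden))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical, untrained parameters: real weights would come from training
# on annotated genomes, which this sketch does not attempt.
orf = "ATGAAAGCGTAA"
fragment = "CC" + orf + "GG"
mono_weights = {"ATG": 1.0, "AAA": 0.5, "GCG": 0.8, "TAA": -0.2}

mono_score = linear_discriminant(codon_frequencies(orf), mono_weights)
# Feature vector: discriminant score, scaled ORF length, fragment GC-content.
features = [mono_score, len(orf) / 700.0, gc_content(fragment)]
p = coding_probability(
    features,
    hidden_w=[[1.0, 0.5, -0.3], [-0.4, 0.2, 0.9]],
    hidden_b=[0.0, 0.1],
    out_w=[1.2, -0.7],
    out_b=0.0,
)
print(f"coding probability (illustrative only): {p:.3f}")
```

Keeping the first stage linear compresses the high-dimensional codon statistics into a handful of scores, so the second-stage network only has to combine a few inputs per open reading frame.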



Available from: Thomas Lingner, Jun 28, 2015
  • Source
    ABSTRACT: The ark shell, Scapharca broughtonii, is a marine bivalve mollusk belonging to the family Arcidae and an important seafood in Korea and Japan; the bays of the southern coast are active sites of ark shell aquaculture. However, the productivity of ark shell in these regions has declined rapidly during the last decade due to mass mortality. The cause of this damage has not yet been identified. To overcome this economic loss, diverse investigations have focused on environmental factors that affect the physiology of S. broughtonii, but microbiological research has been insufficient. Hemoglobin is one of the major blood components of the ark shell and is damaged by the toxins of some bacterial species. We concentrated on this red pigment because hemolysis could be the cause of ark shell mortality. In this study, we analyzed the microbial diversity of underwater sediments in coastal regions and of the microorganisms present in the body of S. broughtonii. We investigated about 4,200 isolates collected from June to September for the microbial diversity of sediments and ark shell. We screened all culturable microorganisms and identified 25 genera and 118 species, 24 genera and 89 species, 30 genera and 109 species, and 39 genera and 141 species, and selected 140 unique colonies for identification and challenge assay.
    12/2013; 26(3). DOI:10.7847/jfp.2013.26.3.193
  • Source
    ABSTRACT: Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements. The Prodigal software is freely available under the General Public License from
    Bioinformatics 07/2012; 28(17):2223-30. DOI:10.1093/bioinformatics/bts429 · 4.62 Impact Factor
  • Source
    ABSTRACT: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and to identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome. In this abstract we introduce a metagenomics gene caller (MGC) which improves on the state-of-the-art prediction algorithm Orphelia [1]. Orphelia uses a two-stage machine learning approach and computes a model that classifies ORFs extracted from fragmented sequences. We hypothesise that sequences need separate models based on their local GC-content in order to avoid the noise introduced into a single model computed with sequences from the entire GC spectrum. We have also added two amino acid features based on the benefit of amino acid usage shown in our previous research [2]. Direct comparison between our method and the original algorithm supports our hypotheses and sets the ground for further investigation.
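
The idea of routing each sequence to a model trained for its GC range can be sketched as follows. This is an illustration only, not the MGC implementation: the bin boundaries (0.4 and 0.6), the number of bins, and the model labels are hypothetical, and whole-fragment GC-content stands in for the "local" GC-content mentioned in the abstract.

```python
from bisect import bisect_right


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def select_model(seq, boundaries, models):
    """Pick the model whose GC range contains the GC-content of seq.

    boundaries must be sorted ascending, and models must hold exactly
    len(boundaries) + 1 entries (one per GC range).
    """
    if len(models) != len(boundaries) + 1:
        raise ValueError("need one model per GC range")
    return models[bisect_right(boundaries, gc_content(seq))]


# Hypothetical GC ranges: below 0.4, 0.4 up to 0.6, and 0.6 or above.
boundaries = [0.4, 0.6]
models = ["low_gc_model", "mid_gc_model", "high_gc_model"]

for fragment in ("ATATATTA", "ATGCATGC", "GCGCGGCC"):
    print(fragment, "->", select_model(fragment, boundaries, models))
```

In a real pipeline the strings in `models` would be replaced by trained classifiers, each fitted only on training fragments from its own GC range.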