Gene Prediction in Metagenomic Fragments with Orphelia: A Large-Scale Machine Learning Approach

Abteilung Bioinformatik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.
BMC Bioinformatics (Impact Factor: 2.67). 02/2008; 9(1):217. DOI: 10.1186/1471-2105-9-217
Source: PubMed

ABSTRACT Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of most fragments (Sanger sequencing, for example, yields fragments of about 700 bp on average) and their unknown phylogenetic origin require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.
We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that a given open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large-scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
Large-scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
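
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Orphelia implementation: all function names, the weights, the network size and the 700 bp length normalization are assumptions chosen for illustration, the weights are untrained placeholders, and the dicodon and translation initiation site discriminants are represented by fixed placeholder scores.

```python
import math
from itertools import product

# All 64 codons in a fixed order, so feature vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(orf):
    """Relative frequency of each codon in an in-frame ORF (monocodon usage)."""
    usable = len(orf) - len(orf) % 3
    counts = {c: 0 for c in CODONS}
    for i in range(0, usable, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in CODONS]


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def linear_discriminant(features, weights, bias=0.0):
    """Stage 1: reduce a high-dimensional feature vector to one score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def orf_probability(inputs, hidden_w, hidden_b, out_w, out_b):
    """Stage 2: small feed-forward network mapping the inputs to (0, 1)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    z = sum(w * h for w, h in zip(out_w, hidden)) + out_b
    return 1.0 / (1.0 + math.exp(-z))


# Toy fragment and candidate ORF (hypothetical data).
fragment = "CCGATGGCGAAAGGCTTTACCGGCTAAGG"
orf = "ATGGCGAAAGGCTTTACCGGC"

# Stage 1 with placeholder (untrained) weights.
mono_score = linear_discriminant(codon_frequencies(orf), [0.1] * 64)
dicodon_score = 0.0  # placeholder: would come from a dicodon discriminant
tis_score = 0.0      # placeholder: would come from a TIS discriminant

# Stage 2 inputs: discriminant scores plus ORF length and fragment GC-content.
inputs = [mono_score, dicodon_score, tis_score,
          len(orf) / 700.0, gc_content(fragment)]
hidden_w = [[0.5, 0.5, 0.5, 1.0, -1.0], [-0.3, 0.2, 0.4, 0.8, 0.6]]
hidden_b = [0.0, 0.1]
p = orf_probability(inputs, hidden_w, hidden_b, [1.0, -1.0], 0.0)
print(f"probability that the ORF is coding (toy weights): {p:.3f}")
```

In a trained system, the discriminant weights and network parameters would be learned from annotated genomes; with the placeholder values above, the printed probability carries no biological meaning and only demonstrates the data flow from sequence features to a single score.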


Available from: Thomas Lingner, Aug 05, 2015
  • Source
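    • "). However, among these methods, comparing and analyzing strains by the phenotype of the microbial community is known to be inaccurate, because different microorganisms show very similar morphological characteristics (Woo et al., 2003; Kim et al., 2012). Furthermore, although metagenomic analysis is useful in studies that search for or exploit useful genes of microbial origin, in studies such as this one, which aim to identify the strains causing mortality, species-level analysis of microbial diversity is very difficult, and determining the pathogenicity and microbiological characteristics of the causative microorganisms is nearly impossible (Kim et al., 1993; Palleroni et al., 1997; Suzuki et al., 1997; Donachie et al., 2002; Stuart et al., 2003; Hoff et al., 2008; Sorokin et al., 2010 "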
    ABSTRACT: The ark shell, Scapharca broughtonii, is a marine bivalve mollusk belonging to the family Arcidae and an important seafood in Korea and Japan; the bays of the southern coast are active sites of ark shell aquaculture. However, the productivity of ark shell in these regions has declined rapidly over the last decade due to mass mortality. The cause of this damage has not yet been identified. To overcome this economic loss, diverse investigations have focused on environmental factors that affect the physiology of S. broughtonii, but microbiological research has been insufficient. Hemoglobin is one of the major blood components of the ark shell and is damaged by the toxins of some bacterial species. We concentrated on this red pigment because hemolysis could be the cause of ark shell mortality. In this study, we analyzed the microbial diversity of underwater sediments in coastal regions and of the body of S. broughtonii. We investigated about 4,200 isolates collected from June to September for the microbial diversity of sediments and ark shell. We screened all culturable microorganisms and identified 25 genera and 118 species, 24 genera and 89 species, 30 genera and 109 species, and 39 genera and 141 species, and selected 140 unique colonies for identification and challenge assay.
    12/2013; 26(3). DOI:10.7847/jfp.2013.26.3.193
  • Source
    • "Assessing the performance of metagenomic gene prediction tools remains a difficult task, due to the lack of experimentally verified gene sets. Tools such as Metagene Annotator, MetaGeneMark, Orphelia, and FragGeneScan, have compared their predicted results to GenBank annotations (Noguchi et al., 2008) (Zhu et al., 2010) (Hoff et al., 2008) (Rho et al., 2010). "
    ABSTRACT: Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements. The Prodigal software is freely available under the General Public License from
    Bioinformatics 07/2012; 28(17):2223-30. DOI:10.1093/bioinformatics/bts429 · 4.62 Impact Factor
  • Source
    • "Table I shows the sensitivity, specificity and harmonic mean scores of MGC predictions based on models built from 10%, 5% and 2.5% GC ranges respectively, in addition to the predictions from the LAP approach. The harmonic mean score is a composite measure of sensitivity and specificity [8]. Models built from the 10% GC ranges have an average harmonic mean of 87.58% with 7.2% standard deviation. "
    ABSTRACT: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome. In this abstract we introduce a metagenomics gene caller (MGC) which improves over the state-of-the-art prediction algorithm Orphelia [1]. Orphelia uses a two-stage machine learning approach and computes a model that classifies ORFs extracted from fragmented sequences. We hypothesise that sequences need separate models based on their local GC-content in order to avoid the noise introduced into a single model computed with sequences from the entire GC spectrum. We have also added two amino acid features based on the benefit of amino acid usage shown in our previous research [2]. Direct comparison between our method and the original algorithm supports our hypotheses and sets the ground for further investigation.