Gene Prediction in Metagenomic Fragments with Orphelia: A Large-Scale Machine Learning Approach

Abteilung Bioinformatik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.
BMC Bioinformatics (Impact Factor: 2.58). 02/2008; 9(1):217. DOI: 10.1186/1471-2105-9-217
Source: PubMed


Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast, while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. Two properties of the fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species: their short length (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most of them. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.
We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
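The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the Orphelia implementation: only a monocodon discriminant is shown (the dicodon and translation initiation site discriminants would feed in analogously), and all weights, the two-unit hidden layer, the length scale, and the function names are hypothetical placeholders rather than trained values.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so feature vectors line up with weights.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Hypothetical, untrained parameters chosen only so the sketch runs.
DEMO_CODON_WEIGHTS = [0.1 if c[2] in "GC" else -0.1 for c in CODONS]
DEMO_HIDDEN_WEIGHTS = [[1.0, 0.5, -0.5], [-1.0, 0.5, 0.5]]
DEMO_HIDDEN_BIASES = [0.0, 0.0]
DEMO_OUTPUT_WEIGHTS = [1.0, -1.0]
DEMO_OUTPUT_BIAS = 0.0
LENGTH_SCALE = 700.0  # assumed scale, based on the ~700 bp average read length


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def codon_frequencies(orf):
    """Relative frequency of each of the 64 codons, read in frame."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    counts = Counter(orf[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    return [counts[c] / total if total else 0.0 for c in CODONS]


def linear_score(features, weights, bias=0.0):
    """Stage 1: a linear discriminant reduces a feature vector to one score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def network_probability(inputs):
    """Stage 2: a one-hidden-layer network maps the inputs to a probability."""
    hidden = [
        math.tanh(linear_score(inputs, row, b))
        for row, b in zip(DEMO_HIDDEN_WEIGHTS, DEMO_HIDDEN_BIASES)
    ]
    z = linear_score(hidden, DEMO_OUTPUT_WEIGHTS, DEMO_OUTPUT_BIAS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic output in (0, 1)


def score_orf(orf, fragment):
    """Combine the stage-1 score with ORF length and fragment GC-content."""
    codon_score = linear_score(codon_frequencies(orf), DEMO_CODON_WEIGHTS)
    inputs = [codon_score, len(orf) / LENGTH_SCALE, gc_content(fragment)]
    return network_probability(inputs)
```

Calling `score_orf(orf, fragment)` for each candidate open reading frame in a fragment would give one score per candidate; in a real system the weights would be learned from large sets of annotated training fragments rather than fixed by hand.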
Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).



Available from: Thomas Lingner,
  • Source
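    • "). However, among these methods, comparing and analyzing strains by the phenotype of the microbial community is known to be inaccurate, because different microorganisms show very similar morphological characteristics (Woo et al., 2003; Kim et al., 2012). In addition, although metagenomic analysis is useful in studies that search for or exploit useful genes of microbial origin, in studies such as this one, which seek to identify the strains causing mortality, species-level analysis of microbial diversity is very difficult, and determining the pathogenicity and microbiological characteristics of the causative microorganisms is nearly impossible (Kim et al., 1993; Palleroni et al., 1997; Suzuki et al., 1997; Donachie et al., 2002; Stuart et al., 2003; Hoff et al., 2008; Sorokin et al., 2010 "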
    ABSTRACT: The ark shell, Scapharca broughtonii, is a marine bivalve mollusk belonging to the family Arcidae and an important seafood for Koreans and Japanese, and the bays of the southern coast are active sites of ark shell aquaculture. However, the productivity of ark shell in these regions has declined rapidly during the last decade due to mass mortality. The cause of this damage has not yet been identified. To overcome this economic loss, diverse investigations have focused on environmental factors that affect the physiology of S. broughtonii, but microbiological research has been insufficient. Hemoglobin is one of the major blood components of the ark shell and is damaged by some bacterial toxins. We concentrated on this red pigment because hemolysis could be the cause of ark shell mortality. In this study, we analyzed the microbial diversity of underwater sediments in coastal regions and of the microorganisms present in the body of S. broughtonii. We investigated about 4,200 isolates collected from June to September for the microbial diversity of sediments and ark shell. We screened all culturable microorganisms, identified 25 genera and 118 species, 24 genera and 89 species, 30 genera and 109 species, and 39 genera and 141 species, and selected 140 unique colonies for identification and challenge assay.
    12/2013; 26(3). DOI:10.7847/jfp.2013.26.3.193
  • Source
    • "We used Orphelia [29], [30] to check the presence of the ORFs near such TRs. We took 3000 bases of the genome sequence downstream from the start of TRs and examined the presence of ORFs that are 300-nt or more in length. "
    ABSTRACT: Giardia lamblia is a protozoan parasite that is found worldwide and has both medical and veterinary importance. We applied the transcription start sequence (TSS-seq) and RNA sequence (RNA-seq) techniques to study the transcriptome of the assemblage A WB strain trophozoite. We identified 8000 transcription regions (TR) with significant transcription. Of these regions, 1881 TRs were more than 500 nucleotides upstream of an annotated ORF. Combining both techniques helped us to identify 24 ORFs that should be re-annotated and 60 new ORFs. From the 8000 TRs, we were able to identify an AT-rich consensus that includes the transcription initiation site. It is possible that transcription that was previously thought to be bidirectional is actually unidirectional.
    PLoS ONE 10/2013; 8(10):e76184. DOI:10.1371/journal.pone.0076184 · 3.23 Impact Factor
  • Source
    • "Assessing the performance of metagenomic gene prediction tools remains a difficult task, due to the lack of experimentally verified gene sets. Tools such as Metagene Annotator, MetaGeneMark, Orphelia, and FragGeneScan, have compared their predicted results to GenBank annotations (Noguchi et al., 2008) (Zhu et al., 2010) (Hoff et al., 2008) (Rho et al., 2010). "
    ABSTRACT: Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements. The Prodigal software is freely available under the General Public License from
    Bioinformatics 07/2012; 28(17):2223-30. DOI:10.1093/bioinformatics/bts429 · 4.98 Impact Factor