Comparative gene prediction in human and mouse

Grup de Recerca en Informàtica Biomèdica. Institut Municipal d'Investigació Mèdica / Universitat Pompeu Fabra / Centre de Regulació Genòmica, 08003 Barcelona, Catalonia, Spain.
Genome Research (Impact Factor: 13.85). 02/2003; 13(1):108-17. DOI: 10.1101/gr.871403
Source: PubMed

ABSTRACT The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.

Available from: Josep Francesc Abril, Jul 29, 2015
    • "accurate than previous systems even though they require that informant genomes be spaced at evolutionarily appropriate distances [10] [11] [12]. Newly sequenced genomes, however, do not always have an appropriately closely related genome available, reducing the global performances of such comparative methods. "
    ABSTRACT: New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take as evidence intron annotations. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.
    11/2013; 2013:502827. DOI:10.1155/2013/502827
    • "While the amount of genomic sequences in public databases is drastically increasing due to the new high-throughput DNA sequencing technologies, their annotation or biological interpretation still remains a real challenge. Substantial advancements have been made in the last years to improve the accuracy of gene prediction (TWINSCAN, [14]; SGP2, [15]; EvoGene, [16]; N-SCAN, [17]; DOGFISH, [18]; CONTRAST, [19]), which initially used only information contained in the sequences of the genome to be annotated to delimit the structure of genes (ab initio predictors) [20]. Today, programs for gene prediction use the alignment of DNA, RNA or protein sequences from other genomes (homology based predictors) [19] [21], which render them more powerful and more efficient in the identification of genes even for those that have not previously been characterized. "
    ABSTRACT: Dicentrarchus labrax is one of the major marine aquaculture species in the European Union. In this study, we have developed a directed-sequencing strategy to sequence three sea bass chromosomes and compared results with other teleosts. Three BAC DNA pools were created from sea bass BAC clones that mapped to stickleback chromosomes/groups V, XVII and XXI. The pools were sequenced to 17-39x coverage by pyrosequencing. Data assembly was supported by Sanger reads and mate pair data and resulted in superscaffolds of 13.2 Mb, 17.5 Mb and 13.7 Mb respectively. Annotation features of the superscaffolds include 1477 genes. We analyzed size change of exon, intron and intergenic sequence between teleost species and deduced a simple model for the evolution of genome composition in teleost lineage. Combination of second generation sequencing technologies, Sanger sequencing and genome partitioning strategies allows "high-quality draft assemblies" of chromosome-sized superscaffolds, which are crucial for the prediction and annotation of complete genes.
    Genomics 06/2011; 98(3):202-12. DOI:10.1016/j.ygeno.2011.06.004 · 2.79 Impact Factor
    • "The human genome sequence has been publicly available for 10 yr (Lander et al. 2001), but the exact protein-coding gene number is still under debate (Clamp et al. 2007). Automatic annotation systems such as Ensembl (Hubbard et al. 2002; Curwen et al. 2004) have been developed to generate gene sets by exploiting the power of integrating data from various sources, such as ab initio gene predictors (Kulp et al. 1996; Burge and Karlin 1997; Parra et al. 2000; Stanke and Waack 2003), comparative genomics (Roest Crollius et al. 2000; Korf et al. 2001; Miller 2001; Wiehe et al. 2001; Parra et al. 2003), and mapping of transcriptional (cDNA, EST) or translational evidence (protein sequence) to the DNA sequence (Gelfand et al. 1996; Birney and Durbin 1997). However, manual annotation efforts, such as the Vertebrate Genome Annotation (VEGA) project (Ashurst et al. 2005; Wilming et al. 2008) or RefSeq (Pruitt et al. 2000; Pruitt and Maglott 2001), as well as quality assessment efforts (Guigo et al. 2006) still play a significant role in the validation and refinement of predicted gene models. "
    ABSTRACT: Recent advances in proteomic mass spectrometry (MS) offer the chance to marry high-throughput peptide sequencing to transcript models, allowing the validation, refinement, and identification of new protein-coding loci. We present a novel pipeline that integrates highly sensitive and statistically robust peptide spectrum matching with genome-wide protein-coding predictions to perform large-scale gene validation and discovery in the mouse genome for the first time. In searching an excess of 10 million spectra, we have been able to validate 32%, 17%, and 7% of all protein-coding genes, exons, and splice boundaries, respectively. Moreover, we present strong evidence for the identification of multiple alternatively spliced translations from 53 genes and have uncovered 10 entirely novel protein-coding genes, which are not covered in any mouse annotation data sources. One such novel protein-coding gene is a fusion protein that spans the Ins2 and Igf2 loci to produce a transcript encoding the insulin II and the insulin-like growth factor 2-derived peptides. We also report nine processed pseudogenes that have unique peptide hits, demonstrating, for the first time, that they are not just transcribed but are translated and are therefore resurrected into new coding loci. This work not only highlights an important utility for MS data in genome annotation but also provides unique insights into the gene structure and propagation in the mouse genome. All these data have been subsequently used to improve the publicly available mouse annotation available in both the Vega and Ensembl genome browsers (
    Genome Research 04/2011; 21(5):756-67. DOI:10.1101/gr.114272.110 · 13.85 Impact Factor