Comparative Gene Prediction in Human and Mouse

Grup de Recerca en Informàtica Biomèdica. Institut Municipal d'Investigació Medica / Universitat Pompeu Fabra / Centre de Regulació Genòmica 08003 Barcelona, Catalonia, Spain.
Genome Research (Impact Factor: 14.63). 02/2003; 13(1):108-17. DOI: 10.1101/gr.871403
Source: PubMed


The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.

Download full-text


Available from: Josep Francesc Abril,
  • Source
    • "In the first stage, the mapped RAD contigs were screened for sequence similarity to stickleback (Gasterosterus aculeatus) genes using tblastx (E-value < 1e-5). tblastx was chosen as it is more sensitive to protein homologies between distantly related species using sequence data (for example [56,57]), since it translates sequences in all three frames before alignment, thus overcoming problems of detecting open reading frames in the context of frame shifts between species. In the second stage, the linkage group assigned, repeat-masked Salmo salar reference genome contigs were aligned against the stickleback gene sequences. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Genetic linkage maps are useful tools for mapping quantitative trait loci (QTL) influencing variation in traits of interest in a population. Genotyping-by-sequencing approaches such as Restriction-site Associated DNA sequencing (RAD-Seq) now enable the rapid discovery and genotyping of genome-wide SNP markers suitable for the development of dense SNP linkage maps, including in non-model organisms such as Atlantic salmon (Salmo salar). This paper describes the development and characterisation of a high density SNP linkage map based on SbfI RAD-Seq SNP markers from two Atlantic salmon reference families. Approximately 6,000 SNPs were assigned to 29 linkage groups, utilising markers from known genomic locations as anchors. Linkage maps were then constructed for the four mapping parents separately. Overall map lengths were comparable between male and female parents, but the distribution of the SNPs showed sex-specific patterns with a greater degree of clustering of sire-segregating SNPs to single chromosome regions. The maps were integrated with the Atlantic salmon draft reference genome contigs, allowing the unique assignment of ~4,000 contigs to a linkage group. 112 genome contigs mapped to two or more linkage groups, highlighting regions of putative homeology within the salmon genome. A comparative genomics analysis with the stickleback reference genome identified putative genes closely linked to approximately half of the ordered SNPs and demonstrated blocks of orthology between the Atlantic salmon and stickleback genomes. A subset of 47 RAD-Seq SNPs were successfully validated using a high-throughput genotyping assay, with a correspondence of 97% between the two assays. This Atlantic salmon RAD-Seq linkage map is a resource for salmonid genomics research as genotyping-by-sequencing becomes increasingly common. This is aided by the integration of the SbfI RAD-Seq SNPs with existing reference maps and the draft reference genome, as well as the identification of putative genes proximal to the SNPs. Differences in the distribution of recombination events between the sexes is evident, and regions of homeology have been identified which are reflective of the recent salmonid whole genome duplication.
    BMC Genomics 02/2014; 15(1):166. DOI:10.1186/1471-2164-15-166 · 3.99 Impact Factor
  • Source
    • "accurate than previous systems even though they require that informant genomes be spaced at evolutionarily appropriate distances [10] [11] [12]. Newly sequenced genomes, however, do not always have an appropriately closely related genome available , reducing the global performances of such comparative methods. "
    [Show abstract] [Hide abstract]
    ABSTRACT: New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take as evidence intron annotations. The introns output by ASPic CDS predictions is given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.
    11/2013; 2013:502827. DOI:10.1155/2013/502827
  • Source
    • "Eight of these (UCSC, Refseq, CCDS, Vega, Ensembl, Aceview, Gencode, and LincRNAs) are based on empirically-observed EST or RNA-Seq data [24,39-45]. The remaining four annotation tracks (N-Scan, Genscan, SGP, and GeneID) are algorithm-based, generated by scanning the genome for transcription start/stop sites, splice junction signals, etc. [46-50]. Many studies that aim to characterize novel features often do so with only a subset of these annotation tracks [19,20,23,51-53]. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The retina is a complex tissue comprised of multiple cell types that is affected by a diverse set of diseases that are important causes of vision loss. Characterizing the transcripts, both annotated and novel, that are expressed in a given tissue has become vital for understanding the mechanisms underlying the pathology of disease. We sequenced RNA prepared from three normal human retinas and characterized the retinal transcriptome at an unprecedented level due to the increased depth of sampling provided by the RNA-seq approach. We used a non-redundant reference transcriptome from all of the empirically-determined human reference tracks to identify annotated and novel sequences expressed in the retina. We detected 79,915 novel alternative splicing events, including 29,887 novel exons, 21,757 3[prime] and 5[prime] alternate splice sites, and 28,271 exon skipping events. We also identified 116 potential novel genes. These data represent a significant addition to the annotated human transcriptome. For example, the novel exons detected increase the number of identified exons by 3%. Using a high-throughput RNA capture approach to validate 14,696 of these novel transcriptome features we found that 99% of the putative novel events can be reproducibly detected. Further, 15-36% of the novel splicing events maintain an open reading frame, suggesting they produce novel protein products. To our knowledge, this is the first application of RNA capture to perform large-scale validation of novel transcriptome features. In total, these analyses provide extensive detail about a previously uncharacterized level of transcript diversity in the human retina.
    BMC Genomics 07/2013; 14(1):486. DOI:10.1186/1471-2164-14-486 · 3.99 Impact Factor
Show more