A computational and experimental approach to validating annotations and gene predictions in the Drosophila melanogaster genome.
ABSTRACT Five years after the completion of the sequence of the Drosophila melanogaster genome, the number of protein-coding genes it contains remains a matter of debate; the number of computational gene predictions greatly exceeds the number of validated gene annotations. We have assembled a collection of >10,000 gene predictions that do not overlap existing gene annotations and have developed a process for their validation that allows us to efficiently prioritize and experimentally validate predictions from various sources by sequencing RT-PCR products to confirm gene structures. Our data provide experimental evidence for 122 protein-coding genes. Our analyses suggest that the entire collection of predictions contains only approximately 700 additional protein-coding genes. Although we cannot rule out the discovery of genes with unusual features that make them refractory to existing methods, our results suggest that the D. melanogaster genome contains approximately 14,000 protein-coding genes.
Mark Yandell*†‡, Adina M. Bailey†§, Sima Misra†§, ShengQiang Shu§, Colin Wiel§, Martha Evans-Holm§,
Susan E. Celniker¶, and Gerald M. Rubin*§¶
*Howard Hughes Medical Institute and §Department of Molecular and Cell Biology, University of California, Life Sciences Addition, Berkeley, CA
94720-3200; and ¶Department of Genome Sciences, Lawrence Berkeley National Laboratory, One Cyclotron Road, Mailstop 64-121, Berkeley, CA 94720
Contributed by Gerald M. Rubin, December 17, 2004
gene number | validation | genome annotation
The total number of protein-coding genes in the Drosophila
melanogaster genome remains a subject of debate. Whereas
those who curated the D. melanogaster genome concluded that
the annotated 13,659 genes in the 3.1 release likely constitute
95% of all protein-coding genes (1), other researchers have
concluded that many, possibly thousands, of protein-coding
genes remain unannotated (2). Two issues have fueled the
debate surrounding gene number in D. melanogaster: the large
numbers of computational gene predictions located within
intergenic regions and varying standards of experimental evidence
for concluding that a gene prediction corresponds to a real gene.
As of release 3.1, ~50% of the D. melanogaster genome is
intergenic. Running the gene prediction program GENSCAN (3) on
every intergenic region in the D. melanogaster genome results in
10,644 gene predictions spread amongst 62 megabases (Mb) of
but how many? The best way to answer this question is to subject
a representative sample of the gene predictions to some validation
The design and interpretation of experiments intended to assay
expression of genes that have been predicted computationally have
become controversial. One approach is to rely on hybridization to
microarrays or RT-PCR assays for transcript expression (2), with
the detection of a product by agarose gel electrophoresis taken as
confirmation of the corresponding gene prediction. However, as
our results show, unless the diagnostic PCR product includes a
splice junction, amplification of residual genomic DNA and
detection of unprocessed transcripts may lead to false verifications of
and not just the size, of the PCR products (4).
One way to obtain spliced cDNAs for sequencing is to perform
RT-PCR with a 3′-oligo(dT) primer and an upstream PCR primer
located in the prediction’s 5′-most exon. The advantages of this
approach are that it requires only a single, prediction-specific
oligonucleotide (oligo) and that the sequenced PCR product pro-
vides much useful information about transcript structure. A disad-
vantage is that the limited processivity of reverse transcriptase can
make it difficult to obtain products from long transcripts. Another
approach (5–7), the one that we have pursued, involves designing
two PCR primers to a pair of flanking exons. This approach
circumvents the problems associated with obtaining PCR products
from long transcripts, but it provides less information about tran-
One drawback to all of these approaches is that they are labor
intensive. Microarray-based validation assays offer a possible alter-
native in this regard. Hild et al. (2) recently reported the identifi-
cation and validation of 2,636 previously unrecognized D. melano-
gaster genes, based on a microarray-based approach that involved
hybridizing randomly primed cDNA against probes corresponding
scale well when large numbers of gene predictions need to be
verified. As our results show, however, one drawback of this
approach is that it is unable to distinguish “background” transcrip-
We have constructed a pipeline for validation of gene predictions,
no matter what their source (computational gene-finder,
human annotation, etc.), that allows us to subject every potential
gene to the same procedures of oligo design and validation. This
approach produces consistent and standardized results and makes
possible more accurate estimates of how many genes in D. mela-
nogaster remain unannotated.
Because testing every prediction in our collection would be very
prioritize for validation those predictions most likely to test true.
Toward this end, we explored both homology and gene structure as
means to prioritize predictions for validation. Below, we describe
our results to date on a collection of GENSCAN and FGENESH
predictions, existing genome annotations, and gene predictions
recently “confirmed” by a microarray-based validation approach
(2). We conclude that our collection of gene predictions contains
only ~700 additional protein-coding genes.
Materials and Methods
Priority Scores and Primer Design. Our validation strategy required
us to identify the best pair of exons within which to locate primers.
We developed a fuzzy logic (8) algorithm to accomplish this task.
Freely available online through the PNAS open access option.
Abbreviations: Mb, megabases; oligo, oligonucleotide; sjc, splice-junction conserved.
database (accession nos. CX309415–CX309654).
†M.Y., A.M.B., and S.M. contributed equally to this work.
‡To whom correspondence should be addressed at: Department of Molecular and Cell
Biology, University of California, Life Sciences Addition, Room 539, Berkeley, CA 94720-
3200. E-mail: firstname.lastname@example.org.
© 2005 by The National Academy of Sciences of the USA
February 1, 2005 | vol. 102 | no. 5
Fuzzy-logic algorithms have been used with success in a variety of
problems, from antiskid braking systems for Boeing aircraft to
genomic analysis (9). The approach is especially well suited to
situations wherein little training data exists but considerable expert
knowledge is available (see the supporting information, which is
published on the PNAS web site). We used PRIMER3 (10) to pick
Partitioning Subsets of Predictions for Verification. For purposes of
our analyses, we defined intergenic regions as those regions be-
tween two genes on either strand. Introns, even if they contained a
Arabidopsis parameter file. The GENSCAN and FGENESH predictions
were loaded into a release 3.1 D. melanogaster annotation gadfly
database (11) as putative annotations. The Heidelberg predictions
were obtained as gene finding format (GFF) files (M. Hild, per-
sonal communication). Using a BIOPERL script based on
BioModel::IntersectionGraph (12), we identified GENSCAN and
FGENESH predictions that did not overlap any release 3.1 genes or
REPEATMASKER results (www.repeatmasker.org), then identified
Heidelberg predictions that did not overlap any of the GENSCAN or
FGENESH predictions that were being tested or release 3.1 genes or
tified by using CGL, a software library designed to facilitate such
comparisons (unpublished work).
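The overlap screen described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the BIOPERL script used in the study: the `Feature` record, the coordinates, and the function names are all hypothetical, and the sketch assumes that features are compared without regard to strand, in keeping with the definition of intergenic regions given above.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    """A genomic interval with 1-based, inclusive coordinates (hypothetical layout)."""
    name: str
    arm: str      # chromosome arm, e.g. "2L"
    start: int
    end: int

def overlaps(a: Feature, b: Feature) -> bool:
    """True when two features on the same arm share at least one base."""
    return a.arm == b.arm and a.start <= b.end and b.start <= a.end

def non_overlapping(predictions, annotations):
    """Keep the predictions that overlap none of the annotations (strand ignored)."""
    return [p for p in predictions
            if not any(overlaps(p, a) for a in annotations)]

# Illustrative data only; these coordinates are not taken from the paper.
annotated = [Feature("gene1", "2L", 1_000, 5_000),
             Feature("repeat1", "2L", 9_000, 9_500)]
predicted = [
    Feature("pred_a", "2L", 4_500, 6_000),   # overlaps gene1, so it is dropped
    Feature("pred_b", "2L", 6_500, 8_000),   # lies between the two features
    Feature("pred_c", "3R", 1_200, 2_000),   # on a different arm
]
kept = non_overlapping(predicted, annotated)
print([p.name for p in kept])
```

The all-against-all comparison is quadratic and is used here only for clarity; a genome-scale screen would more plausibly sort the features or use an interval index.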
RNA Isolation. RNA was isolated with RNAwiz reagent (Ambion,
Austin, TX) from the following developmental stages from an
subcollections), late third-instar larva (L3), mixed-stage pupa, and
mixed-age adult. Poly(A)+ RNA was selected by using Poly(A)Purist
kits (Ambion).
See the supporting information for RT-PCR and DNA se-
Alignment of Oligos and Sequenced RT-PCR Products to the Genome.
RT-PCR product sequence reads were quality-trimmed as described
in ref. 14. Short (<160 bp) and poorer-quality sequence
traces were read manually. For other low-quality reads, we used
were aligned to the genomic sequence by using SIM4WRAP (11).
Matches were filtered by using the BERKELEY OUTPUT PARSER (11).
The control, GENSCAN, FGENESH, and Heidelberg (2) predictions
database, and each prediction was visualized with aligned RT-PCR
products and oligos by using the APOLLO genome annotation
browser and editor (15). In some cases, poor RT-PCR product
sequence quality required manual National Center for Biotechnol-
ogy Information BLASTN (16) comparison against the release 3
genomic sequence. When the sequence corresponded to a spliced
mRNA transcript, the prediction was scored as “sequence validated”
and checked against release 3.2 of the annotated genome in
in release 3.2, a gene model was curated and communicated to
FlyBase (17). See the supporting information for details of repli-
cation of RT-PCRs described by Hild et al. (2) and analysis of
RT-PCR products of unexpected size.
Results and Discussion
A Collection of >10,000 Nonoverlapping Gene Models Representing
Potentially Unannotated Genes. Because our pipeline for gene
prediction validation is independent of any particular source of
predictions, we assembled a large collection of potentially unan-
notated genes derived from multiple sources. Our primary source
consisted of 10,644 GENSCAN predictions lying within intergenic
regions in the D. melanogaster 3.1 release. These predictions were
obtained by running GENSCAN over the complete set of intergenic
regions whose repeats had been masked by using REPEATMASKER
(www.repeatmasker.org). Excluding all single exon predictions (be-
cause these are unsuitable substrates for our validation procedure)
resulted in a set of 9,811 multiexon GENSCAN predictions. We then
sought additional predictions that did not overlap any of our
GENSCAN predictions from a set of 1,167 FGENESH predictions
produced by using a modified version of the program especially
trained for use on D. melanogaster (18), which gave an additional
325 predictions, for a total of 10,136. To these we added 1,266
multiexon gene models located in intergenic regions reported by
Hild et al. (2) to consist of previously unrecognized protein-coding
genes that had been validated by the microarray-based approach;
however, all but 37 of these overlapped one of the 10,136 predic-
tions generated by GENSCAN or FGENESH.
Experimental Strategy. Our experimental approach required that
we design two exon-specific primers for each gene model, but to
which pair of exons? We reasoned that regardless of the particulars
of the gene-finder that produced the prediction, the longer the
open-reading frame, the less likely it is to occur by chance, and,
that experimentally confirmed exons tend to be longer than GEN-
SCAN-predicted exons in gene-containing regions and much longer
than the GENSCAN-predicted exons in intergenic regions (data not
shown) supports this hypothesis. If the gene model is the product
of human annotation, its longer exons are still the better choice,
regard to sequence complexity and melting temperatures. We also
sought whenever possible to design primers to exon pairs that
flanked introns whose length was as close as possible to the modal
intron length in D. melanogaster, because these exon pairs, we
reasoned, comprise the portion of the prediction most likely to be
of exon pairs flanking introns of the same length, we selected the
3′-most exon pair, which would be relevant if using oligo(dT) to
prime reverse transcription reactions. However, because we used
gene-specific primers for both the RT and PCR steps, this criterion
was not pertinent to the experiments described here. Given these
of how to weigh each criterion when choosing which exons to use
for primer design. To address this problem, we developed a
fuzzy-logic (8) algorithm that treats as ideal two exons, both ~300
nt or longer, that flank an intron of >70 but <200 nt in length. The
algorithm considers each exon pair and scores it relative to this
ideal. We then used PRIMER3 (10) to design a matched pair of
primers to the high-scoring pair of exons. One consequence of this
approach is that each gene model in our collection now had a score
associated with its potential PCR product (see the supporting
information) independent of the original gene prediction program.
correct, then, on average, those gene models whose PCR products
were assigned a high score would be more likely to validate than
would those gene models whose PCR products had received a low
prove useful not only in estimating the number of true positives in
a collection of gene models, but also for prioritizing predictions for
validation; hence, we dubbed them “priority scores.”
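The idea of scoring each exon pair against an ideal can be illustrated with a short Python sketch. It is not the algorithm used in the study, whose details are in the supporting information. The membership functions, the breakpoints at 50, 40, and 1,000 nt, the use of a minimum as the fuzzy AND, and the 0 to 100 scale are assumptions made for illustration; only the ideal itself (exons of about 300 nt or longer, intron between 70 and 200 nt) and the 3′-most tie-break come from the text.

```python
def ramp_up(x, low, high):
    """Fuzzy membership rising linearly from 0 at `low` to 1 at `high` and above."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 0 outside (a, d), 1 on [b, c], linear on the shoulders."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def priority_score(exon1_len, exon2_len, intron_len):
    """Score one exon pair on a 0-100 scale against the stated ideal."""
    # Exons of ~300 nt or longer are ideal; the 50-nt floor is a placeholder.
    exon_fit = min(ramp_up(exon1_len, 50, 300), ramp_up(exon2_len, 50, 300))
    # Introns of roughly 70-200 nt are ideal; 40 and 1000 are placeholders.
    intron_fit = trapezoid(intron_len, 40, 70, 200, 1000)
    # Fuzzy AND: the pair is only as good as its weakest criterion.
    return 100.0 * min(exon_fit, intron_fit)

def best_pair(exon_lens, intron_lens):
    """Return (index, score) of the best adjacent exon pair; ties go 3'-most."""
    scores = [priority_score(exon_lens[i], exon_lens[i + 1], intron_lens[i])
              for i in range(len(intron_lens))]
    index = max(range(len(scores)), key=lambda i: (scores[i], i))
    return index, scores[index]

# Hypothetical four-exon model: exon lengths and the three intervening introns.
print(best_pair([400, 120, 350, 500], [90, 65, 150]))
```

In this hypothetical model the third exon pair wins because both exons exceed 300 nt and the intron falls inside the ideal window; primers would then be designed to that pair.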
Our validation procedure entailed an RT-PCR assay with these
gene-specific primers to amplify a portion of a predicted transcript.
Both poly(A)+ and total RNA isolated from whole animals rep-
the RT-PCRs. Any RT-PCR product whose size corresponded to
structure. In addition, all unique RT-PCR products, regardless of
relation to predicted size, were analyzed at this level. In a pilot
experiment, 105 of the RT-PCRs that produced no band were
sequenced; of these, none verified the tested prediction, so we
decided not to sequence reactions that did not produce a band on
an agarose gel. Only PCR products reflecting transcript splicing
were considered to verify gene predictions. If a reaction did not
yield specific products, that negative result was only included in our
analysis in cases where the corresponding primers were shown to
produce the expected amplicon from genomic DNA in a control
PCR (see the supporting information). We are aware that this
rigorous requirement will result in our discarding a significant
number of true negatives and thus result in an overestimate of the
percentage of gene models that validate, because many of the
primer pairs will fail to amplify genomic DNA simply because
the lengths of intervening introns make the amplicons prohibitively
large. In fact, the median expected size for genomic products for
those reactions that failed this test (2,824 bp) was >5 times that of
those that passed (510 bp).
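The scoring rules in this section amount to a small decision procedure, sketched below in Python. The `Outcome` names and both functions are hypothetical. Treating a band that lacks a splice junction in the same way as a missing band is an assumption of this sketch; the text states only that such products do not verify a prediction.

```python
from enum import Enum

class Outcome(Enum):
    VALIDATED = "validated"   # sequenced product shows a splice junction
    NEGATIVE = "negative"     # no spliced product, primers work on genomic DNA
    EXCLUDED = "excluded"     # no spliced product, genomic control failed

def classify(band_on_gel: bool, spliced: bool, genomic_control_ok: bool) -> Outcome:
    """Apply the scoring rules to a single RT-PCR (names are hypothetical)."""
    if band_on_gel and spliced:
        return Outcome.VALIDATED
    # Assumption: an unspliced band is handled like a missing band.
    return Outcome.NEGATIVE if genomic_control_ok else Outcome.EXCLUDED

def validation_rate(outcomes) -> float:
    """Validated fraction among the reactions that are kept in the analysis."""
    kept = [o for o in outcomes if o is not Outcome.EXCLUDED]
    if not kept:
        raise ValueError("no reactions left after exclusion")
    return sum(o is Outcome.VALIDATED for o in kept) / len(kept)

# Illustrative reactions only; not data from the paper.
reactions = [
    classify(True, True, True),      # validated
    classify(True, False, True),     # unspliced band: negative
    classify(False, False, True),    # no band, control worked: negative
    classify(False, False, False),   # no band, control failed: excluded
]
rate = validation_rate(reactions)
print(rate)
```

In this toy example, dropping the excluded reaction raises the rate from one in four to one in three, which is the direction of bias acknowledged in the text.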
The Collection Was Partitioned into Four Subsets for Validation
priority scores might be of a gene prediction’s likelihood to be
experimentally validated, we subdivided our working prediction
collection on the basis of sequence homology rather than priority
score, hypothesizing that frequency of experimental validation
would also correlate with extent of interspecies sequence conser-
vation (Table 1). To identify genes with sequence conservation, we
first searched each of the 9,811 GENSCAN and 325 FGENESH predictions
in our collection against the Drosophila pseudoobscura
genome by using TBLASTN in an effort to identify those predictions
having at least two conserved exons. We identified 559 predictions
satisfying this criterion.
Next we subjected each of these predictions to a more stringent
analysis: testing for evidence of conserved splice junctions at the
same position in the D. pseudoobscura genome. We termed those
that met this criterion the “sjc set” (“splice-junction conserved”)
because each has at least two adjacent exons, both with homology
to D. pseudoobscura, flanking conserved donor and acceptor sites.
We tested 171 (Table 1). Those predictions with two conserved
exons but without evidence for a conserved splice junction were
homol-2 set, we forced our exon selector algorithm to design oligos
to the best-conserved pair of exons and kept track of their priority
The remaining 9,577 GENSCAN and FGENESH predictions with at
most a single exon with D. pseudoobscura conservation constitute
the “homol-0 set.” We selected predictions such that one-quarter
one-quarter had scores between 50 and 25, and one-quarter had
scores <25. We tested a total of 204 predictions.
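The three homology-based sets defined above can be expressed as a simple classification rule. The Python sketch below assumes that per-exon conservation and per-junction conservation have already been determined (for example from TBLASTN alignments); the function name and the boolean-list input format are hypothetical.

```python
def classify_prediction(exon_conserved, junction_conserved):
    """Assign a multiexon prediction to the sjc, homol-2, or homol-0 set.

    exon_conserved[i]     -- exon i has homology to D. pseudoobscura
    junction_conserved[i] -- the splice junction between exon i and exon i + 1
                             has conserved donor and acceptor sites
    """
    if len(junction_conserved) != len(exon_conserved) - 1:
        raise ValueError("expected one junction per pair of adjacent exons")
    # sjc: two adjacent conserved exons flanking a conserved splice junction.
    for i, junction_ok in enumerate(junction_conserved):
        if junction_ok and exon_conserved[i] and exon_conserved[i + 1]:
            return "sjc"
    # homol-2: at least two conserved exons, but no conserved junction.
    if sum(exon_conserved) >= 2:
        return "homol-2"
    # homol-0: at most a single conserved exon.
    return "homol-0"

# Hypothetical three-exon predictions.
print(classify_prediction([True, True, False], [True, False]))    # sjc
print(classify_prediction([True, False, True], [False, False]))   # homol-2
print(classify_prediction([False, True, False], [False, False]))  # homol-0
```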
multiexon genes reported by Hild et al. (2) to have been verified
based on expression detected by microarrays. We chose those with
the highest priority scores that did not overlap any of the gene
predictions for testing from the sjc, homol-0, or homol-2 datasets;
we termed this the “Heidelberg set.”
which we were particularly certain of the gene structures. All have
at least one published cDNA sequence confirming their annotated
sequence evidence. These genes constituted our control set.
Validation Results. The results in Table 1 show that whereas gene
models with homology in D. pseudoobscura are more likely to test
true, simple sequence conservation (across the evolutionary dis-
tance separating D. melanogaster from D. pseudoobscura) is not as
useful in determining that a gene prediction corresponds to a real
gene as exon–intron structure conservation. Gene predictions
having at least one conserved splice junction in D. pseudoobscura
(sjc set) were >2.5 times as likely to validate as those lacking a
conserved splice junction but having two (partially) conserved
exons (homol-2 set). In this regard, our results accord well with
previous analyses of human and rodent predictions (4), which also
found conservation of gene structure to be the most powerful in
silico diagnostic of how likely a gene prediction was to reflect the
structure of a real gene.
do not believe this low rate of validation reflects limitations of our
experimental approach because we were able to confirm the
existence of nearly all (154/159) of the genome annotations contained
in our control set. Based on EST frequency data (35 of the
159 have no associated ESTs), we know that not all of the genes in
our control set were highly expressed genes, yet they were still
RT-PCR conditions, we replicated the secondary RT-PCR valida-
tion protocol described by Hild et al. (2) for 22 gene models for
which they reported positive RT-PCR results; we also were able to
obtain positive results for 21 of 22 (95%) of these, confirming that
our assay conditions are able to detect previously unrecognized
Drosophila transcripts at a comparable rate, and none of these
produced a band in control reactions lacking reverse transcriptase.
Thus, although we were able to confirm that a high percentage of
the Heidelberg gene models overlap transcribed regions in the
genome, our results suggest that only a small fraction of these
transcribed regions give rise to spliced mRNAs. Therefore, we do
not dispute the data reported by Hild et al. (2), only their inter-
pretation that a high percentage of these gene models represent
previously unrecognized protein-coding genes.
The distributions of priority scores associated with the four
different sets also suggest that the Heidelberg set contains few
previously unannotated protein-coding genes. As can be seen in
Fig. 1, the priority score distribution associated with our control set
is very different from that of the 9,811 multiexon GENSCAN pre-
dictions included in our collection. The mode for the priority score
distribution for the control set was 80, twice that of the GENSCAN
homol-0 set) also tend to have higher priority scores (Fig. 1B). On
set (Fig. 1C) is nearly indistinguishable from that of the 9,811
GENSCAN predictions. Thus, the data shown in Fig. 1 provide
additional support for our conclusion that the Heidelberg set
contains few genes that produce spliced mRNAs. Taken together,
these facts suggest that the microarray-based validation procedure
used by Hild et al. (2) has a high false positive rate and that priority
scores and homology, especially conservation of a splice junction,
Table 1. Sequence validation rates of predictions
Gene prediction set
Total no. of
*See Table 2 for specific numbers from GENSCAN vs. FGENESH predictions.
†Gene predictions were considered validated if the aligned sequence of the
PCR product was consistent with a spliced gene model in the region of the
‡Only 1,266 multiexon predictions from the 2,636 predictions described by
Hild et al. (2) were considered for analysis, and, of these, we tested only the
160 with the highest priority scores that did not overlap any GENSCAN or
FGENESH predictions tested in the other sets.
www.pnas.org/cgi/doi/10.1073/pnas.0409421102
are better indicators that a gene prediction produces a spliced
Overall, homol-0 set gene models had the lowest validation rate
(5.9%). Recall that for testing, we randomly selected predictions
and we tested a total of 204 predictions. None of the predictions
having priority scores <25 validated; 3.3% of the predictions with
priority scores between 25 and 50 validated; 10% of the predictions
was half that of the Heidelberg set. We chose for inclusion in the
scores; thus, the higher percentage of verified genes in the Heidel-
berg set reflects this selection bias, because it is similar to the
validation rate of the highest scoring homol-0 predictions.
Estimating the Number of Protein-Coding Genes That Remain Unannotated.
We have experimentally tested 744 gene models: ~7% of
our total collection. We can estimate the total number of protein-
coding genes contained in our collection by using these results
(Table 1) to estimate what percentage of the 9,373 homol-0 set, 153
homol-2 set, and 26 sjc set predictions, which were not directly
tested, correspond to real genes. These extrapolations suggest that
553 homol-0, 21 homol-2, and 8 sjc gene predictions (a total of 582)
would validate by our approach. Including the 122 experimentally
validated gene models gives an estimate that 705 protein-coding
prediction sets and are thus included in the above estimate.)
As Fig. 1 demonstrates, priority scores do indeed indicate how
likely a gene model is to be validated, regardless of the means used
of unannotated genes, we investigated the use of priority scores to
estimate the number of real genes contained in a collection of
predictions. To do so, we first partitioned the remaining untested
9,552 GENSCAN and FGENESH gene predictions and the remaining
untested 1,106 gene models from Hild et al. (2) into four bins based
on their priority scores: ?25, 25–50, 50–75, and 75–100. We then
multiplied the number of genes in each bin by the observed
frequency of validation for our 204 tested homol-0 genes: 0 for the
first quartile, 0.033 for the second, 0.1 for the third, and 0.102 for
the fourth. This approach gave an estimate of 404 additional real
genes among the untested GENSCAN and FGENESH predictions; 20
additional genes are predicted from the Heidelberg set, but, as
mentioned above, these are likely to be redundant of those in the
GENSCAN and FGENESH predictions. Adding the 122 experimentally
validated gene models places the total number of unannotated
genes at ~572, a 20% decrease compared with the number of 705
derived by extrapolation of the data in Table 1 as described above.
Nevertheless, because we wished to obtain a reasonable upper
bound for the number of genes our collection might contain, we
chose 705 as a working estimate for the additional considerations
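The bin-based extrapolation described above can be sketched in Python as follows. The four validation frequencies are those quoted in the text for the tested homol-0 predictions; the bin boundaries are assumed to be half-open, priority scores are assumed to lie between 0 and 100, and the list of scores is invented purely to show the mechanics, so the result is not an estimate for the real collection.

```python
# Validation frequencies observed for the tested homol-0 predictions.
# Each entry is (exclusive upper bound of the bin, validation frequency).
BINS = [(25, 0.0), (50, 0.033), (75, 0.1), (100, 0.102)]

def bin_rate(score: float) -> float:
    """Look up the validation frequency of the bin that contains `score`."""
    if not 0 <= score <= 100:
        raise ValueError("priority scores are assumed to lie in [0, 100]")
    for upper, rate in BINS:
        if score < upper:
            return rate
    return BINS[-1][1]   # a score of exactly 100 belongs to the top bin

def expected_validations(scores) -> float:
    """Expected number of untested predictions that would validate."""
    return sum(bin_rate(s) for s in scores)

# Hypothetical collection: 1,000 untested predictions in each bin.
hypothetical_scores = [10] * 1000 + [30] * 1000 + [60] * 1000 + [90] * 1000
estimate = expected_validations(hypothetical_scores)
print(round(estimate))
```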
A factor not addressed by any of our analyses is how many
single-exon genes remain unannotated. Because these are not
substrates of our validation procedure, none were tested. Running
exon genes. By definition, these single exon genes fall into the
homol-0 class. Because there is no evidence that single exon gene
predictions are more accurate than multiexon predictions, the
validation rate (5.9%) of the homol-0 genes would give 49 addi-
coding genes in the 3.1 release to ~750.
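The arithmetic behind this adjustment can be retraced from figures given elsewhere in the text. The Python sketch below assumes that the number of single-exon predictions is the difference between the 10,644 GENSCAN predictions and the 9,811 multiexon predictions reported earlier; the variable names are illustrative.

```python
# Figures quoted in the text.
total_genscan = 10_644       # all intergenic GENSCAN predictions
multiexon_genscan = 9_811    # remaining after single-exon predictions are excluded
homol0_rate = 0.059          # validation rate of the homol-0 set
working_estimate = 705       # working estimate for multiexon predictions

# Assumption: single-exon predictions are the difference of the two counts.
single_exon = total_genscan - multiexon_genscan
expected_single_exon = round(single_exon * homol0_rate)
adjusted_total = working_estimate + expected_single_exon

print(single_exon, expected_single_exon, adjusted_total)
```

The adjusted total of 754 is consistent with the rounded figure of about 750 used in the text.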
We also considered the possibility that some additional genes
might reside among the unvalidated sjc predictions, but we do not
believe this to be true. To address this possibility, however, we
searched every sjc prediction’s predicted protein against the GenBank
nonredundant protein database. Ninety percent of the validated
sjc predictions have a significant hit (expect <1e-4; default
WU-BLASTP parameters) to a non-Drosophila protein, whereas only
21% of the unverified sjc predictions do. Thus, the validated sjc
unvalidated remainder; this finding suggests that much of the
homology associated with the unvalidated sjc predictions is non-
coding. Moreover, given the extensive genomic sequence conser-
vation between the two Drosophila species and the large numbers
of false positives in our collection of gene models, we find it
unsurprising that 1% (133/10,136) of the GENSCAN and FGENESH
predictions would have significant, noncoding homology to D.
pseudoobscura, the details of which were consistent with conserved
GT and AG dinucleotides at predicted exon borders. Thus, we
believe that most of the unvalidated sjc predictions are indeed true
negatives, although we cannot rule out the possibility that some
fraction of them may correspond to real genes whose expression is
spatially or temporally very limited.
A final issue to consider is how many of these 750 or so genes
represent previously unrecognized genes as opposed to merely
unannotated 5′ and 3′ exons of existing annotations. It seems
reasonable to assume that some of our validated predictions are, in
missed exons rather than genes, a conservative estimate, considering
that Stolc et al. (13) have concluded that the number may be as
high as 32% (because 369 of 1,155 expressed GENSCAN exons they
tested belonged to existing annotations), would bring the number
of genes in our extrapolation to 675.
Fig. 1. Priority scores have predictive value. (A) The priority score distribution for all GENSCAN predictions in our collection (gray bars) vs. the priority score distribution of the control set (black bars). (B) All GENSCAN predictions (gray bars) vs. all validated GENSCAN predictions (black bars). (C) All GENSCAN predictions (gray bars) vs. the Heidelberg set (black bars).
Priority Scores Provide a Means to Manage a Validation Pipeline in a
Cost-Effective Manner. The large numbers of predictions in our
collection relative to the small number of real genes contained
within it means that identifying which of the remaining 10,658
untested predictions is real is potentially an inefficient and labor
might be used to prioritize predictions for validation, but if the
models are derived from several gene-finders, relating individual
therefore examined the utility of the priority scores assigned by our
primer-design algorithm as a general method to score gene models
positives could be obtained from our collection of 10,136 GENSCAN
and FGENESH predictions in the first 1,000 PCRs by using priority
scores; choosing models randomly would require 5,068 PCRs to
obtain the same number of true positives. Thus, our approach can
be used to prioritize predictions for validation in a cost-effective
manner such that a greater number of previously unrecognized
genes can be identified with fewer PCRs.
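The comparison between random and score-ranked selection can be illustrated with a minimal sketch. The function names and the toy scores are hypothetical and are not taken from our primer-design algorithm. The 0.5 fraction is an assumption as well: it is used only because 5,068 is exactly half of 10,136, the number of random PCRs expected to recover half of the true positives.

```python
def pcrs_needed_random(total_predictions, fraction_of_true_positives):
    """Expected PCRs to recover a given fraction of true positives when
    predictions are tested in random order."""
    return total_predictions * fraction_of_true_positives


def pcrs_needed_ranked(scores, is_real, fraction_of_true_positives):
    """PCRs needed to recover the same fraction when predictions are tested
    from highest to lowest priority score."""
    target = fraction_of_true_positives * sum(is_real)
    if target == 0:
        return 0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = 0
    for n_tested, i in enumerate(order, start=1):
        found += is_real[i]
        if found >= target:
            return n_tested
    return len(scores)


# Figure from the text: 10,136 GENSCAN and FGENESH predictions.
# ASSUMPTION: the target is half of the true positives.
random_cost = pcrs_needed_random(10136, 0.5)
print(f"random selection: {random_cost:.0f} PCRs; ranked selection in the text: 1,000 PCRs")

# Hypothetical toy data, for illustration only (1 = real gene, 0 = false positive).
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
toy_is_real = [1, 0, 1, 0, 0, 0]
print(pcrs_needed_ranked(toy_scores, toy_is_real, 1.0))
```

Under these assumptions, ranking by priority score reaches the same yield with roughly one-fifth of the PCRs that random selection would need.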
Our goal was to obtain an accurate estimate of the number of
predicted protein-coding genes left unannotated in the D. melano-
gaster genome as of release 3.1. Toward this end, we assembled a
large collection of putative unannotated, multiexon protein-coding
375 FGENESH predictions, and 1,266 purportedly previously unrec-
ray-based approach (2). To consider a prediction experimentally
product align specifically to the prediction and reveal evidence of
One drawback to our validation approach is that it is labor
intensive. Our collection of gene predictions was large and, we
suspected, contained few true positives. These issues made highly
desirable any in silico method capable of identifying gene predic-
tions likely to be real, because it would allow us to test these
predictions first. Accordingly, we explored several such measures:
homology, conservation of gene structure, and priority scores.
Overall, those predictions having at least one conserved splice
junction when compared with D. pseudoobscura validated at the
highest rate, 37.4% (Table 1). We found simple sequence conser-
vation to be a much less reliable in silico indicator of gene
predictions’ validation rates than conservation of gene structure.
Only 13.4% of gene models with two partially conserved exons but
lacking evidence of a conserved splice site in D. pseudoobscura
verified. Expression of spliced transcripts was detected for only
11.3% of the predictions represented by the Heidelberg set. It
appears that our requirement that diagnostic RT-PCR products
represent spliced transcripts accounts for the difference in our
validation rate from that of the Heidelberg group. Although those
predictions may overlap transcribed regions of the genome, our
evidence suggests that only ≈11% of those predictions actually
correspond to mRNAs. The generally low priority scores assigned
Overall, 16.4% of the gene predictions tested proved valid. We
models. Our reasons for believing so are twofold. First, the high
rates of validation associated with the sjc and homol-2 sets suggest
that homology, especially conservation of gene structure, is a good
indicator that a gene prediction actually reflects a real gene. We
tested the majority of such predictions; therefore, the 9,552 pre-
positives. Second, the low rate of validation (5.9%) associated with
the homol-0 set independently suggests the same conclusion. Ex-
trapolating from the validation rate of the homol-0 set or using the
more sophisticated method that makes use of priority scores gives
similar results; both suggest that our collection of gene predictions
contains ≈700 protein-coding genes. Analysis of our findings
suggested that 34% of our 122 previously unrecognized genes were
included in this release. Thus, we conclude that our collection of
melanogaster has more than ≈14,000 protein-coding genes.
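A back-of-envelope version of the simpler extrapolation might look like the sketch below. It assumes that every one of the 9,552 remaining predictions validates at the homol-0 rate of 5.9%; that is a simplification, not a reproduction of our calculation, and the priority-score method is not modeled here. The function name is hypothetical.

```python
def expected_true_positives(n_untested, validation_rate):
    """Expected number of real genes among untested predictions, assuming a
    single uniform validation rate applies to all of them."""
    return n_untested * validation_rate


remaining_predictions = 9552   # untested predictions, from the text
homol0_rate = 0.059            # validation rate of the homol-0 set, from the text

# ASSUMPTION: the homol-0 rate applies to every remaining prediction.
additional_genes = expected_true_positives(remaining_predictions, homol0_rate)

# One way to read the totals in the text: about 700 unannotated genes out of about 14,000.
annotated_share = 1 - 700 / 14000

print(f"expected additional genes: {additional_genes:.0f}")
print(f"share already annotated: {annotated_share:.0%}")
```

The result lands in the same range as the figures quoted in the text, which is all this rough check is meant to show.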
We discovered that sequencing was crucial to our verification
procedure. For a small fraction of our predictions that gave an
expected RT-PCR product size on an agarose gel, the sequenced
products aligned not to the predictions but to other regions of the
genome, usually highly expressed, previously annotated genes,
suggesting that these products were due to mispriming on abun-
dantly expressed genes. Another 153 RT-PCRs yielded unique
products of unexpected size (see the supporting information).
Sequencing indicated that ≈40% of these products aligned to a
different part of the genome, and another 12% verified a spliced
the original prediction. Thus, one must use caution when interpret-
by hybridization to microarrays or sizing RT-PCR products. With-
out a sequenced product or sufficient microarray hybridization
specificity, one might interpret a positive signal as a previously
unrecognized gene, when it is actually cross-hybridization to, or
mispriming from, an already annotated gene.
How justified are we in concluding that 95% of all D.
melanogaster protein-coding genes have been identified and at
least provisionally annotated? Certainly, there is no shortage of
untested predictions. However, if our estimates are correct,
experimentally testing all 9,552 remaining gene predictions
would identify only about another 500 genes. Moreover, implicit
(y axis) vs. the number of PCRs performed (x axis) if choosing randomly (black
line) or choosing on the basis of priority score (black diamonds) from the
collection of 10,136 GENSCAN and FGENESH predictions.
www.pnas.org/cgi/doi/10.1073/pnas.0409421102 | Yandell et al.