Article (Literature Review)

A Biologist's View of the Drosophila Genome Annotation Assessment Project

... The alcohol dehydrogenase region of Drosophila is a case in point. Gene-calling programs failed to identify some known genes in this region (Ashburner et al. 1999; Ashburner 2000; Birney and Durbin 2000; Gaasterland et al. 2000; Henikoff and Henikoff 2000; Krogh 2000; Parra et al. 2000; Reese et al. 2000a; Salamov and Solovyev 2000). Expressed sequence tag (EST) analysis is an important tool for identifying transcription units but is also subject to errors (Adams et al. 1991; Okubo et al. 1992; Weinstock et al. 1994; Adams et al. 1995; Hillier et al. 1996; Audic and Claverie 1997; Wolfsberg and Landsman 1997; Rubin et al. 2000). ...
... Guided by gene density estimates from the comprehensive analysis of the Adh region (Ashburner et al. 1999), Adams et al. made a conservative estimate of 13,601 genes. However, in light of the gene density in the annotated genome sequence, Ashburner conceded that their analysis of the Adh region may have been too conservative, which in turn affects the estimate by Adams et al. (Adams et al. 2000; Ashburner et al. 2000). Our study of transcription in the testis clearly indicates the existence of a significant class of undetected genes in the current genome release. ...
Article
Identification and annotation of all the genes in the sequenced Drosophila genome is a work in progress. Wild-type testis function requires many genes and is thus of potentially high value for the identification of transcription units. We therefore undertook a survey of the repertoire of genes expressed in the Drosophila testis by computational and microarray analysis. We generated 3141 high-quality testis expressed sequence tags (ESTs). The testis ESTs were computationally collapsed into a set of 1560 cDNAs used for further analysis. Of those, 11% correspond to named genes, and 33% provide biological evidence for a predicted gene. A surprising 47% fail to align with existing ESTs and 16% with predicted genes in the current genome release. EST frequency and microarray expression profiles indicate that the testis mRNA population is highly complex and shows an extended range of transcript abundance. Furthermore, >80% of the genes expressed in the testis showed onefold overexpression relative to ovaries, or gonadectomized flies. Additionally, >3% showed more than threefold overexpression at p < 0.05. Surprisingly, 22% of the genes most highly overexpressed in testis match Drosophila genomic sequence, but not predicted genes. These data strongly support the idea that sequencing additional cDNA libraries from defined tissues, such as testis, will be an important tool for refined annotation of the Drosophila genome. Additionally, these data suggest that the number of genes in Drosophila will significantly exceed the conservative estimate of 13,601. [The sequence data described in this paper have been submitted to the dbEST data library under accession nos. AI944400–AI947263 and BE661985–BE662262.] [The microarray data described in this paper have been submitted to the GEO data library under accession nos. GPLS, GSM3–GSM10.]
... Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Proteins were annotated with gene ontology (GO) terms from National Center for Biotechnology Information (NCBI) (downloaded Oct 21, 2013) [24]. ...
... Glycerolipid metabolic process (6), lipid homeostasis (4), oxidation reduction (11), single fertilization (4), protein-lipid complex remodeling (3), sexual reproduction (8), protein folding (5), sterol transport (3), response to oxidative stress (5), lipid binding (11), carbohydrate binding (8), glycosaminoglycan binding (5), oxidoreductase activity (3), antioxidase activity (3) Actin filament organization (4), protein folding (5), negative regulation of cellular metabolic process (5), sexual reproduction (7), protein tetramerization (3), oxidation reduction (8), gamete generation (6), cell redox homeostasis (3), protein oligomerization (4), generation of precursor metabolites and energy (5); motor activity (6), enzyme inhibitor activity (6), proteasome regulator activity (2), protein binding (7), nucleoside binding (12), nucleotide binding (15), protein homodimerization activity (5) Sexual reproduction (11), reproductive process in a multicellular organism (10), gamete generation (9), spermatogenesis (8), protein folding (5), glycerophospholipid metabolic process (4), spermatid development/differentiation (3); purine nucleotide binding (24), ATP binding (18), microtubule motor activity (3), sterol transporter activity (2), cAMP dependent protein kinase regulator activity (2), glutathione transferase activity (2) Activated pathways ...
Article
Elevated levels of reactive oxygen species (ROS) are detected in 25% to 80% of infertile men. They are involved in the pathology of male infertility. Understanding the effect of increasing levels of ROS on the differential expression of sperm proteins is important to understand the cellular processes and/or pathways that may be implicated in male infertility. The aim of this study was to examine differentially expressed proteins (DEPs) in spermatozoa from patients with low, medium and high ROS levels. A total of 42 infertile men presenting for infertility and 17 proven fertile men were enrolled in the study. ROS levels were measured by chemiluminescence assay. Infertile men were divided into Low (0 to <93 RLU/s/10^6 sperm) (n = 11), Medium (>93-500 RLU/s/10^6 sperm) (n = 17) and High ROS (>500 RLU/s/10^6 sperm) (n = 14) groups. All fertile men had ROS levels between 4 and 50 RLU/s/10^6 sperm. Samples from 4 subjects in the fertile group and 4 each from the Low, Medium and High ROS groups were pooled. Protein extraction, protein estimation, gel separation of the proteins, in-gel digestion, and analysis on an LTQ-Orbitrap Elite hybrid mass spectrometry system were conducted. The DEPs, their cellular localization and the pathways involved were examined utilizing bioinformatics tools. 1035 proteins were identified in the 3 groups by global proteomic analysis. Of these, 305 were DEPs. 51 were unique to the Low ROS group, 47 to the Medium ROS group and 104 to the High ROS group. 6 DEPs with distinct reproductive functions were identified by Uniprot and DAVID, and they were expressed only in the 3 ROS groups but not in the control. We have for the first time demonstrated the presence of 6 DEPs with distinct reproductive functions only in men with low, medium or high ROS levels. These DEPs can serve as potential biomarkers of oxidative stress induced male infertility.
... Each sequence was analyzed using BLASTX against Swall and Flybase proteins. Pfam [48] domains were identified using ESTwise (Ewan Birney, unpublished) and each Pfam domain was mapped to Interpro annotation. Contaminating T. brucei sequences were removed from the final set of clusters by screening them against all known T. brucei DNA sequences. ...
... Gene Ontology (GO) [48] annotation was transferred to each sequence on the basis of BLASTX hits to Flybase proteins with a significance above E = 1e-10 or, where there was a Pfam domain detected, the corresponding GO terms were transferred on the basis of Interpro to GO mapping. ...
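The annotation-transfer rule quoted above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the data structures, the function name `transfer_go`, the choice of the single best hit, and the default cutoff of 1e-10 (an assumed reading of the quoted threshold) are all hypothetical, and the identifiers in the example are placeholders rather than real accessions.

```python
from typing import Dict, List, NamedTuple, Set


class BlastHit(NamedTuple):
    query: str     # EST cluster identifier (hypothetical)
    subject: str   # matched protein identifier (hypothetical)
    evalue: float  # BLASTX E-value; smaller means more significant


def transfer_go(
    hits: List[BlastHit],
    protein_to_go: Dict[str, Set[str]],
    pfam_domains: Dict[str, Set[str]],
    pfam_to_go: Dict[str, Set[str]],
    max_evalue: float = 1e-10,  # assumed cutoff, see lead-in
) -> Dict[str, Set[str]]:
    """Assign GO terms to each query from its best qualifying hit,
    falling back to domain-derived terms when no hit qualifies."""
    # Keep, per query, the most significant hit that passes the cutoff.
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit

    annotations: Dict[str, Set[str]] = {}
    for query in {h.query for h in hits} | set(pfam_domains):
        terms: Set[str] = set()
        if query in best:
            terms |= protein_to_go.get(best[query].subject, set())
        else:
            # Fallback: map each detected domain to its GO terms.
            for domain in pfam_domains.get(query, set()):
                terms |= pfam_to_go.get(domain, set())
        if terms:
            annotations[query] = terms
    return annotations


# Placeholder data: cluster c1 has a strong hit, c2 only a weak one plus a domain.
example_hits = [
    BlastHit("c1", "P1", 1e-30),
    BlastHit("c1", "P2", 1e-12),
    BlastHit("c2", "P3", 1e-3),
]
example = transfer_go(
    example_hits,
    protein_to_go={"P1": {"GO:x"}, "P2": {"GO:y"}, "P3": {"GO:z"}},
    pfam_domains={"c2": {"PF_a"}},
    pfam_to_go={"PF_a": {"GO:w"}},
)
```

In this toy run, `c1` inherits the terms of its best hit and `c2` receives only the domain-derived term, because its single hit fails the cutoff.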
Article
Tsetse flies transmit African trypanosomiasis leading to half a million cases annually. Trypanosomiasis in animals (nagana) remains a massive brake on African agricultural development. While trypanosome biology is widely studied, knowledge of tsetse flies is very limited, particularly at the molecular level. This is a serious impediment to investigations of tsetse-trypanosome interactions. We have undertaken an expressed sequence tag (EST) project on the adult tsetse midgut, the major organ system for establishment and early development of trypanosomes. Results: A total of 21,427 ESTs were produced from the midgut of adult Glossina morsitans morsitans and grouped into 8,876 clusters or singletons potentially representing unique genes. Putative functions were ascribed to 4,035 of these by homology. Of these, a remarkable 3,884 had their most significant matches in the Drosophila protein database. We selected 68 genes with putative immune-related functions, macroarrayed them and determined their expression profiles following bacterial or trypanosome challenge. In both infections many genes are downregulated, suggesting a malaise response in the midgut. Trypanosome and bacterial challenge result in upregulation of different genes, suggesting that different recognition pathways are involved in the two responses. The most notable block of genes upregulated in response to trypanosome challenge are a series of Toll and Imd genes and a series of genes involved in oxidative stress responses. Conclusions: The project increases the number of known Glossina genes by two orders of magnitude. Identification of putative immunity genes and their preliminary characterization provides a resource for the experimental dissection of tsetse-trypanosome interactions.
... respectively. Gene Ontology (GO) (Ashburner, 2000) annotation was recorded to each sequence on the basis of highest BLAST hit with an E-value ≤ 10^-6. ...
Article
To determine if gene expression of An. gambiae is modulated in response to o'nyong-nyong virus (ONNV) infection, we utilized cDNA microarrays including about 20 000 cDNAs. Gene expression levels of ONNV-infected female mosquitoes were compared to that of the uninfected control females harvested at 14 days postinfection. In response to ONNV infection, expression levels of 18 genes were significantly modulated, being at least two-fold up- or down-regulated. Quantitative real-time PCR analysis (qRT-PCR) further substantiated the differential expression of six of these genes in response to ONNV infection. These genes have similarity to a putative heat shock protein 70, DAN4, agglutinin attachment subunit, elongation factor 1 alpha and ribosomal protein L35. One gene, with sequence similarity to mitochondrial ribosomal protein L7, was down-regulated in infected mosquitoes. The expression levels and annotation of the differentially expressed genes are discussed in the context of host/virus interaction including host translation/replication factors, and intracellular transport pathways.
... Similarly to the HMR195 dataset it has been filtered to exclude anomalous sequences. The Drosophila melanogaster Adh region is 2.9 Mb long and has been extensively studied for the last 20 years (Ashburner, 2000). For the GASP experiment two different annotation sets were used to evaluate the gene-finding programs' predictions: st1 and st3. ...
Article
Despite constant improvements in prediction accuracy, gene-finding programs are still unable to provide automatic gene discovery with desired correctness. The current programs can identify up to 75% of exons correctly and less than 50% of predicted gene structures correspond to actual genes. New approaches to computational gene-finding are clearly needed. In this paper we have explored the benefits of combining predictions from already existing gene prediction programs. We have introduced three novel methods for combining predictions from programs Genscan and HMMgene. The methods primarily aim to improve exon level accuracy of gene-finding by identifying more probable exon boundaries and by eliminating false positive exon predictions. This approach results in improved accuracy at both the nucleotide and exon level, especially the latter, where the average improvement on the newly assembled dataset is 7.9% compared to the best result obtained by Genscan and HMMgene. When tested on a long genomic multi-gene sequence, our method that maintains reading frame consistency improved nucleotide level specificity by 21.0% and exon level specificity by 32.5% compared to the best result obtained by either of the two programs individually. The scripts implementing our methods are available from http://www.cs.ubc.ca/labs/beta/genefinding/
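To make the exon-level measures mentioned above concrete, the following Python sketch computes exon-level sensitivity and specificity and shows the simplest way to combine two programs, keeping only exons that both predict with identical boundaries. This is an assumed baseline for illustration, not one of the three methods introduced in the paper, and the coordinates are invented toy data.

```python
from typing import Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates on one strand


def exon_accuracy(predicted: Set[Exon], annotated: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if both boundaries match exactly.
    Sensitivity = correct / annotated; specificity = correct / predicted.
    """
    correct = len(predicted & annotated)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


def combine_by_intersection(a: Set[Exon], b: Set[Exon]) -> Set[Exon]:
    """Keep only exons predicted identically by both programs."""
    return a & b


# Toy data (not from the paper): four annotated exons and two prediction sets.
annotated = {(100, 200), (300, 400), (500, 600), (700, 800)}
program_a = {(100, 200), (300, 400), (450, 480), (700, 800)}
program_b = {(100, 200), (300, 400), (500, 600), (650, 690), (900, 950)}

combined = combine_by_intersection(program_a, program_b)
acc_a = exon_accuracy(program_a, annotated)        # (0.75, 0.75)
acc_b = exon_accuracy(program_b, annotated)        # (0.75, 0.6)
acc_combined = exon_accuracy(combined, annotated)  # (0.5, 1.0)
```

In the toy data the intersection removes every false-positive exon, so specificity rises to 1.0, but sensitivity drops to 0.5. That trade-off is why a plain intersection is only a starting point for more careful combination schemes.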
... Exact gene structural annotation combined with an understanding of the splicing process is crucial for future biological experiments, such as reverse genetics and microarray experiments. Experimental wetlab detection of genomic structure is notoriously slow and costly, therefore reliable computational methods need to be developed [1]. Construction of a precise, predictive model of RNA splicing/alternative splicing and the ability to predict the splicing pattern of any primary transcript in any tissue have been cited among the ten major challenges facing bioinformatics [2]. ...
Article
Despite substantial recent progress, gene structural prediction remains a challenging problem in bioinformatics. The importance of a detailed understanding of gene splicing can be underlined by noting that ~10-15% of human genetic diseases are caused by mutations that affect splice junctions. We briefly introduce the problem, mention the existing approaches to gene structural annotation and provide an overview of current methods. In particular, this paper explains why homology-based gene structural prediction appears to be more difficult than it might seem. The problem of splice site (SS) sensor design is overviewed with a rigorous comparison of key designs. Finally, a discussion of methods in ab initio gene structural prediction is accompanied by an extensive comparative performance study. We make certain conclusions regarding the current state of the art and try to speculate about future research directions. Applications used to evaluate performance characteristics for various gene structural prediction programs are available online at http://www.wyomingbioinformatics.org/~achurban/.
... The identification of all expressed genes and the structure(s) of their transcripts are prerequisites for many structural and functional genomic studies. Gene-finding programs are valuable tools for identifying gene structure, but they are error-prone and suffer from the inability to predict untranslated regions (UTRs) (Ashburner 2000; Reese et al. 2000). Direct analysis of gene transcripts is the only proven way to establish gene structures with confidence. ...
Article
Collections of full-length nonredundant cDNA clones are critical reagents for functional genomics. The first step toward these resources is the generation and single-pass sequencing of cDNA libraries that contain a high proportion of full-length clones. The first release of the Drosophila Gene Collection (DGCr1) was produced from six libraries representing various tissues, developmental stages, and the cultured S2 cell line. Nearly 80,000 random 5' expressed sequence tags (5' ESTs) from these libraries were collapsed into a nonredundant set of 5849 cDNAs, corresponding to ~40% of the 13,474 predicted genes in Drosophila. To obtain cDNA clones representing the remaining genes, we have generated an additional 157,835 5' ESTs from two previously existing and three new libraries. One new library is derived from adult testis, a tissue we previously did not exploit for gene discovery; two new cap-trapped normalized libraries are derived from 0-22-h embryos and adult heads. Taking advantage of the annotated D. melanogaster genome sequence, we clustered the ESTs by aligning them to the genome. Clusters that overlap genes not already represented by cDNA clones in the DGCr1 were analyzed further, and putative full-length clones were selected for inclusion in the new DGC. This second release of the DGC (DGCr2) contains 5061 additional clones, extending the collection to 10,910 cDNAs representing >70% of the predicted genes in Drosophila.
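Clustering ESTs by their genomic alignments, as described above, can be sketched as a sweep that merges overlapping intervals. The Python example below is only an illustration under stated assumptions: the abstract does not give the clustering criteria, so grouping by chromosome arm and any coordinate overlap (ignoring strand and splicing) is a hypothetical simplification, and the `Alignment` record and example coordinates are invented.

```python
from typing import List, NamedTuple, Optional


class Alignment(NamedTuple):
    est_id: str
    arm: str    # chromosome arm label, e.g. "2L"
    start: int  # genomic start of the alignment span
    end: int    # genomic end of the alignment span (inclusive)


def cluster_by_overlap(alignments: List[Alignment]) -> List[List[str]]:
    """Group ESTs whose alignment spans overlap on the same arm.

    Alignments are sorted by position; an EST joins the current cluster
    if it starts at or before the furthest end seen so far in that cluster.
    """
    clusters: List[List[str]] = []
    current_arm: Optional[str] = None
    current_end = -1
    for aln in sorted(alignments, key=lambda a: (a.arm, a.start, a.end)):
        if aln.arm == current_arm and aln.start <= current_end:
            clusters[-1].append(aln.est_id)
            current_end = max(current_end, aln.end)
        else:
            clusters.append([aln.est_id])
            current_arm, current_end = aln.arm, aln.end
    return clusters


# Invented coordinates: e1 and e2 overlap; e3 and e4 stand alone.
example_alignments = [
    Alignment("e3", "2L", 2000, 2500),
    Alignment("e1", "2L", 100, 500),
    Alignment("e4", "3R", 100, 400),
    Alignment("e2", "2L", 450, 900),
]
example_clusters = cluster_by_overlap(example_alignments)
```

A production pipeline would also need to handle strand, spliced alignments and ESTs that map to multiple locations. The sketch only shows the core idea that genome coordinates, rather than pairwise EST similarity, define the clusters.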
... Interestingly, 11% of the EST sequences characterized in this study failed to align with pre-existing Drosophila ESTs (Andrews et al. 2000; Rubin et al. 2000) or predicted genes (Adams et al. 2000). Mounting evidence from this and other manuscripts clearly indicates that genome annotation is difficult (Ashburner 2000; Gaasterland and Oprea 2001). Whereas none of us would willingly go back to the pre-annotated genome, it is quite clear that genome annotation should be taken with a grain of salt. ...
Article
... Second, genes with some unusual properties (that is, unusual size of exon and intron, use of noncanonical sites) might be difficult to detect. Presently, the general agreement is that these programs detect most of the coding exons but miss most of the splice sites [17][18][19]. Among these programs, Genscan seems to be one of the most efficient at detecting whole genes and Grail2 is the most efficient at detecting internal coding exons [17]. ...
Article
We sequenced a 173-kb region of mouse chromosome 10, telomeric to the Ifng locus, and compared it with the human homologous sequence located on chromosome 12q15 using various sequence analysis programs. This region has a low density of genes: one gene was detected in the mouse and the human sequences and a second gene was detected only in the human sequence. The mouse gene and its human orthologue, which are expressed in the immune system at a low level, produce a noncoding mRNA. Nonexpressed sequences show a higher degree of conservation than exons in this genomic region. At least three of these conserved sequences are also conserved in a third mammalian species (sheep or cow).
... Identifying precisely the 5′ and 3′ boundaries of genes (the transcription unit) in metazoan genomes, as well as the correct sequences of the resulting mRNA ("exon parsing") has been a major challenge of bioinformatics for years. Yet, the current program performances are still totally insufficient for a reliable automated annotation (Claverie 1997; Ashburner 2000). It is interesting to recapitulate quickly the research in this area to illustrate the essential limitation plaguing modern bioinformatics. ...
... In organisms with small genomes, it is relatively straightforward to use direct computational prediction based upon genomic sequence to identify most genes by their long open reading frames (ORFs). However, computational gene prediction from the genomic sequence of organisms with short exons and long introns can be somewhat error-prone (Ashburner 2000; Reese et al. 2000; Lander et al. 2001). Previous efforts to catalogue the human transcriptome were based on expressed sequence tags (ESTs) used for the identification of new genes (Adams et al. 1991; Auffray et al. 1995; Houlgatte et al. 1995), chromosomal assignment of genes (Gieser and Swaroop 1992; Khan et al. 1992; Camargo et al. 2001), prediction of genes (Nomura et al. 1994), and assessment of gene expression (Okubo et al. 1992). ...
Article
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/ ). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
... Similarly, these contiguous sequences, encompassing known and predicted genes, provide a useful experimental sample for assessing existing and new approaches to genomic sequence annotation and analysis. During the course of fly genome sequencing, a 2.9-Mb region encompassing the Adh gene provided "a valuable test of the longer-term strategy of sequencing and annotating the entire genome of this fly" (Ashburner 2000). ...
Article
The current strategy for sequencing the mouse genome involves the combination of a whole-genome shotgun approach with clone-based sequencing. High-resolution physical maps will provide a foundation for assembling contiguous segments of sequence. We have established a bacterial artificial chromosome (BAC)-based map of a 5-Mb region on mouse Chromosome 5, encompassing three gene families: receptor tyrosine kinases (Pdgfra-Kit-Kdr), nonreceptor protein-tyrosine type kinases (Tec-Txk), and type-A receptors for the neurotransmitter GABA (Gabra2, Gabrb1, Gabrg1, and Gabra4). The construction of a BAC contig was initiated by hybridization screening the C57BL/6J (RPCI-23) BAC library, using known genes and sequence tagged sites (STSs). Additional overlapping clones were identified by searching the database of available restriction fingerprints for the RPCI-23 and RPCI-24 libraries. This effort resulted in the selection of >600 BAC clones, 251 kb of BAC-end sequences, and the placement of 40 known and/or predicted genes within this 5-Mb region. We use this high-resolution map to illustrate the integration of the BAC fingerprint map with a radiation-hybrid map via assembled expressed sequence tags (ESTs). From annotation of three representative BAC clones we demonstrate that up to 98% of the draft sequence for each contig could be ordered and oriented using known genes, BAC ends, consensus sequences for transcript assemblies, and comparisons with orthologous human sequence. For functional studies, annotation of sequence fragments as they are assembled into 50-200-kb stretches will be remarkably valuable.
... Inspired by the success of other community-wide experiments, such as GASP (Ashburner, 2000) in genomics, CASP (Moult, 1999) in protein modeling, and PTC (Helma et al., 2001) in computational toxicology, we initiated the first Critical Assessment of Microarray Data Analysis (CAMDA) in year 2000. ...
Article
We initiated the Critical Assessment of Microarray Data Analysis (CAMDA) conference to stimulate and evaluate the development of advanced data analysis techniques for microarrays. A standard data set has been released for this data analysis challenge. The goal of this challenge is to assess the performance of different analytical methods and at the same time to determine how such methods should be evaluated. We hope this effort will catalyze the discussion of microarray data analysis among the research community of biologists, statisticians, mathematicians, and computer scientists. Availability: http://camda.duke.edu.
... Results assesses the individual annotation methods and the Conclusions discusses what the experiment revealed about issues involved in annotating complete genomes. An article by Ashburner (2000) provides a biological perspective on the experiment. ...
Article
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. 
Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
... When considering the problems of FP errors in large genes, and over-predictions in genes of any size, it is not surprising that many biologists are frustrated by the output of the ab initio programs [26,27]. To address these concerns, Ensembl incorporates similarity information into its pipeline to reduce the incidence of FP errors and over-predictions. ...
Article
To find unknown protein-coding genes, annotation pipelines use a combination of ab initio gene prediction and similarity to experimentally confirmed genes or proteins. Here, we show that although the ab initio predictions have an intrinsically high false-positive rate, they also have a consistently low false-negative rate. The incorporation of similarity information is meant to reduce the false-positive rate, but in doing so it increases the false-negative rate. The crucial variable is gene size (including introns): genes of the most extreme sizes, especially very large genes, are most likely to be incorrectly predicted.
... The following attempts at comprehensive evaluations give a partial picture of current gene prediction performances [12,38,59]. ...
... In reality, such in-depth understanding was a hard-won victory. Extensive genetic and molecular analyses carried out over a span of several decades and annotation efforts carried out over a span of two years account for the high confidence level in the gene estimates for the Adh region (Ashburner 2000). ...
... Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Proteins were annotated with gene ontology (GO) terms from National Center for Biotechnology Information (NCBI) (downloaded Oct 21, 2013) [33]. ...
Article
The etiology of varicocele, a common cause of male factor infertility, remains unclear. Proteomic changes responsible for the underlying pathology of unilateral varicocele have not been evaluated. The objective of this prospective study was to employ proteomic techniques and bioinformatic tools to identify and analyze proteins of interest in infertile men with unilateral varicocele. Spermatozoa from infertile men with unilateral varicocele (n = 5) and from fertile men (control; n = 5) were pooled in two groups respectively. Proteins were extracted and separated by 1-D SDS-PAGE. Bands were digested and identified on a LTQ-Orbitrap Elite hybrid mass spectrometer system. Bioinformatic analysis identified the pathways and functions of the differentially expressed proteins (DEP). Sperm concentration, motility and morphology were lower, and reactive oxygen species levels were higher in unilateral varicocele patients compared to healthy controls. The total number of proteins identified were 1055, 1010 and 1042 in the fertile group, and 795, 713 and 763 proteins in the unilateral varicocele group. Of the 369 DEP between both groups, 120 proteins were unique to the fertile group and 38 proteins were unique to the unilateral varicocele group. Compared to the control group, 114 proteins were overexpressed while 97 proteins were underexpressed in the unilateral varicocele group. We have identified 29 proteins of interest that are involved in spermatogenesis and other fundamental reproductive events such as sperm maturation, acquisition of sperm motility, hyperactivation, capacitation, acrosome reaction and fertilization. The major functional pathways of the 359 DEP related to the unilateral varicocele group involve metabolism, disease, immune system, gene expression, signal transduction and apoptosis. Functional annotations showed that unilateral varicocele mostly affected small molecule biochemistry and post-translational modification proteins. 
Proteins expressed uniquely in the unilateral varicocele group were cysteine-rich secretory protein 2 precursor (CRISP2) and arginase-2 (ARG2). The expression of these proteins of interest is altered and possibly functionally compromised in infertile men with unilateral varicocele. If validated, these proteins may lead to potential biomarker(s) and help better understand the mechanism involved in the pathophysiology of unilateral varicocele in infertile men.
... Our results for the SWISS-PROT based system might be valid for proteins from genome sequencing projects: about 64% of all proteins were correctly predicted by the profile-based networks using predicted surface and secondary structure. Although we hope to further improve this level of accuracy, we challenge that the predictions are already good enough to become useful in the context of target selection for structural genomics [33] and to bridge the sequence-annotation gap in entirely-sequenced genomes [82][83][84][85]2,86,32,3 ...
Article
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all the trafficking appears regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics.
... In organisms with small genomes, it is relatively straightforward to use direct computational prediction based upon genomic sequence to identify most genes by their long open reading frames (ORFs). However, computational gene prediction from the genomic sequence of organisms with short exons and long introns can be somewhat error-prone (Ashburner 2000; Reese et al. 2000; Lander et al. 2001). ...
Article
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
... Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Proteins were annotated with gene ontology (GO) terms from National Center for Biotechnology Information (NCBI) (downloaded Oct 21, 2013) [36]. ...
Article
Reactive oxygen species (ROS) plays a major role in the pathology of male infertility. It is an independent biomarker of sperm function. Seminal plasma is a natural reservoir of antioxidants responsible for the nourishment, protection, capacitation, and motility of sperm within the female reproductive tract resulting in successful fertilization and implantation of the embryo. A comparative proteomic analysis of seminal plasma proteins from fertile men and infertile men with varying levels of ROS was carried out to identify signature proteins involved in ROS-mediated reproductive dysfunction. A total of 42 infertile men presenting with infertility and 17 proven fertile donors were enrolled in the study. ROS levels were measured in the seminal ejaculates by chemiluminescence assay. Infertile men were subdivided into Low ROS (0-<93 RLU/s/10^6 sperm; n = 11), Medium ROS (>93-500 RLU/s/10^6 sperm; n = 17) and High ROS (>500 RLU/s/10^6 sperm; n = 14) groups and compared with fertile men (4-50 RLU/s/10^6 sperm). 4 subjects from fertile group and 4 each from the Low, Medium and High ROS were pooled. 1D gel electrophoresis followed by in-gel digestion and LC/MS-MS in a LTQ-Orbitrap Elite hybrid mass spectrometer system was used for proteome analysis. Identification of differentially expressed proteins (DEPs), their cellular localization and involvement in different pathways were examined utilizing bioinformatics tools. The results indicate that proteins involved in biomolecule metabolism, protein folding and protein degradation are differentially modulated in all three infertile patient groups in comparison to fertile controls. Membrane metallo-endopeptidase (MME) was uniformly overexpressed (>2 fold) in all infertile groups. Pathway involving 35 focus proteins in post-translational modification of proteins, protein folding (heat shock proteins, molecular chaperones) and developmental disorder was overexpressed in the High ROS group compared with fertile control group. 
MME was one of the key proteins in the pathway. FAM3D was uniquely expressed in fertile group. We have for the first time demonstrated the presence of 35 DEPs of a single pathway that may lead to impairment of sperm function in men with Low, Medium or High ROS levels by altering protein turn over. MME and FAM3D along with ROS levels in the seminal plasma may serve as good markers for diagnosis of male infertility.
... Enrichment of gene ontology annotations (Ashburner, 2000) and tissue-enriched expression profiles (Chintapalli et al., 2007) were computed as described (Dai et al., 2013), using Fisher's exact test with the Bonferroni correction for multiple hypothesis testing. FlyAtlas (http://flyatlas.org) ...
Article
Grainy head (Grh) is a conserved transcription factor (TF) controlling epithelial differentiation and regeneration. To elucidate Grh functions, we identified embryonic Grh targets by ChIP-seq and gene expression analysis. We show that Grh controls hundreds of target genes. Repression or activation correlates with the distance of Grh binding sites to the transcription start sites of its targets. Analysis of 54 Grh-responsive enhancers during development and upon wounding suggests cooperation with distinct TFs in different contexts. In the airways, Grh repressed genes encode key TFs involved in branching and cell differentiation. Reduction of the POU-domain TF, Vvl, (ventral veins lacking) largely ameliorates the airway morphogenesis defects of grh mutants. Vvl and Grh proteins additionally interact with each other and regulate a set of common enhancers during epithelial morphogenesis. We conclude that Grh and Vvl participate in a regulatory network controlling epithelial maturation.
... To determine the differentially expressed peptides, we compared the abundances of the peptides between resistant and susceptible control mosquito legs, using a minimum of 3.0-fold changes (normalized to median). GO terms and pathways enrichment analysis of the differentially expressed proteins, based on fold change criteria, was performed using Gene Ontology Consortium [47]. In addition, a beta-binomial test, using the ibb library in R, was employed to identify significant changes at single-protein level between the control and resistant samples among the three biological replicates. ...
Article
Malaria incidence has halved since the year 2000, with 80% of the reduction attributable to the use of insecticides. However, insecticide resistance is now widespread, is rapidly increasing in spectrum and intensity across Africa, and may be contributing to the increase of malaria incidence in 2018. The role of detoxification enzymes and target site mutations has been documented in the major malaria vector Anopheles gambiae; however, the emergence of striking resistant phenotypes suggests the occurrence of additional mechanisms. By comparing legs, the most relevant insect tissue for insecticide uptake, we show that resistant mosquitoes largely remodel their leg cuticles via enhanced deposition of cuticular proteins and chitin, corroborating a leg-thickening phenotype. Moreover, we show that resistant female mosquitoes seal their leg cuticles with higher total and different relative amounts of cuticular hydrocarbons, compared with susceptible ones. The structural and functional alterations in Anopheles female mosquito legs are associated with a reduced uptake of insecticides, substantially contributing to the resistance phenotype.
Article
We present a www server for AUGUSTUS, a novel software program for ab initio gene prediction in eukaryotic genomic sequences. Our method is based on a generalized Hidden Markov Model with a new method for modeling the intron length distribution. This method allows approximation of the true intron length distribution more accurately than do existing programs. For genomic sequence data from human and Drosophila melanogaster, the accuracy of AUGUSTUS is superior to existing gene-finding approaches. The advantage of our program becomes apparent especially for larger input sequences containing more than one gene. The server is available at http://augustus.gobics.de.
Article
In the post-genomic era, the new discipline of functional genomics is now facing the challenge of associating a function (as well as estimating its relevance to industrial applications) to about 100,000 microbial, plant or animal genes of known sequence but unknown function. Besides the design of databases, computational methods are increasingly becoming intimately linked with the various experimental approaches. Consequently, bioinformatics is rapidly evolving into independent fields addressing the specific problems of interpreting i) genomic sequences, ii) protein sequences and 3D-structures, as well as iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches that perform best in these various areas. This paper attempts to review the most useful developments of the last 2 years.
Article
The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.
Article
Alternative splicing (AS) is now considered a major actor in transcriptome/proteome diversity and cannot be neglected in the annotation process of a new genome. Despite considerable progress in terms of accuracy in computational gene prediction, the ability to reliably predict AS variants when there is local experimental evidence of it remains an open challenge for gene finders. We have used a new integrative approach that allows AS detection to be incorporated into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs), and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGENE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events, and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those which are consistent with the AS events detected). This allows for multiple annotations of a single gene in such a way that each predicted variant is supported by transcript evidence (but not necessarily with full-length coverage). This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.
Article
Present-day DNA sequencing techniques have evolved considerably from their early beginnings. A modern sequencing project is essentially an assembly-line environment and is therefore improved and accelerated by the degree to which slow and error-prone manual steps can be replaced by reliable and accurate automatic ones. For hardware, this typically means expanding the use of robotics, for example, to execute the multitude of micro-volume fluid transfers that occur for each of the samples processed in a project. Likewise, automated software replaces manual processing and analysis steps for samples wherever possible. In this article, we focus on one particular aspect of software: the automated handling of raw DNA data. Specifically, we discuss a number of critical software algorithms and components and how they have been woven into a framework for largely hands-off processing of Human Genome Project data at the Genome Sequencing Center. These data represent about 25% of the total public human sequencing project.
Article
Rice is considered as a major source of food for most people worldwide. Global rice production is under threat as a result of adverse effects of heat and drought stress. It is therefore necessary to safeguard its sustainable production. This study was carried out to characterize differentially expressed genes (DEGs) under both drought and heat stresses in rice. 1001 DEGs were determined to show up-regulation under heat and drought stresses, while 1690 DEGs were commonly down-regulated. Functional classification analysis generated 22 and 37 related gene groups from the commonly up-regulated and down-regulated genes respectively. Functional characterization revealed 38.1% and 45.2% DEGs annotating for Biological Process, 53.2% and 59.2% DEGs were for Cellular Components and 54.8% and 52.4% DEGs were for Molecular Function in the up and down commonly regulated DEGs respectively. KEGG analysis demonstrated that most of the up-regulated DEGs were particularly enriched in metabolic pathways and in the biosynthesis of secondary metabolites, while the down regulated genes were mostly enriched in pathways involving ribosomes, and in purine and pyrimidine metabolisms. These results could be helpful in further analysis and understanding of heat and drought stress tolerance in rice.
Article
The initial sequencing of the human genome should be regarded as a milestone in a road that stretches years into the future; the full ramifications of the Human Genome Project are still only being theorized. Researchers will benefit from the catalog of human genes in studies of the genetics of disease susceptibility and the cell biology of gene interactions. Clinicians will increasingly offer genetic or biochemical testing to identify those at highest risk for a number of diseases. Drug discovery will eventually follow newly possible studies of gene expression and protein function. However the Human Genome Project eventually shapes medicine, it is certain that physicians, particularly obstetricians and gynecologists, will need to be well versed in the scientific and ethical issues involved, inasmuch as we will likely be at the center of the most heated debates.
Article
Seven ab initio web-based gene prediction programs (i.e., AUGUSTUS, BGF, Fgenesh, Fgenesh+, GeneID, Genemark.hmm, and HMMgene) were assessed to compare their prediction accuracy using protein-coding sequences of bread wheat. At both the nucleotide and exon levels, Fgenesh+ was the best-performing program, followed by BGF and then Fgenesh. Conversely, at the gene level, Fgenesh performed best, predicting more than 75% of all genes precisely. Programs such as Fgenesh+, BGF, and Fgenesh, which yielded the highest percentage of correctly predicted exons, appear to be better suited to obtaining trustworthy results, whereas with GeneID and HMMgene the percentage of false negatives would be expected to increase. For initial exons, the 3' boundary was, overall, recognized accurately significantly more often than the 5' boundary, and the reverse was true for terminal exons. Lastly, HMMgene and Genemark.hmm were, overall, largely independent of GC content, while the other programs appear to be slightly more sensitive when GC-poor sequences are used. Overall, our results show that gene finders still need further improvement to deliver reliable results.
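The exon-level comparison summarized in this abstract can be illustrated with a minimal sketch. It assumes, as is common in gene-finder benchmarks but is not stated in the abstract, that an exon counts as correct only when both boundaries match exactly, that "specificity" means the fraction of predicted exons that are correct, and that all exons lie on the forward strand so the start coordinate is the 5' boundary. The function name and the toy coordinates are hypothetical.

```python
def exon_level_summary(true_exons, predicted_exons):
    """Summarize exon-level accuracy for exons given as (start, end) tuples.

    Assumption: forward strand, so start is the 5' boundary and end the 3'.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    exact = true_set & pred_set  # exons with both boundaries correct

    # Boundary-level bookkeeping: a true boundary counts as recovered
    # if any predicted exon places a boundary at the same coordinate.
    pred_starts = {start for start, _ in pred_set}
    pred_ends = {end for _, end in pred_set}

    return {
        "sensitivity": len(exact) / len(true_set) if true_set else 0.0,
        "specificity": len(exact) / len(pred_set) if pred_set else 0.0,
        "five_prime_correct": sum(start in pred_starts for start, _ in true_set),
        "three_prime_correct": sum(end in pred_ends for _, end in true_set),
    }


# Toy data: the second exon is predicted with a wrong 5' boundary,
# and one predicted exon has no annotated counterpart.
annotated = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (500, 600), (700, 750)]
summary = exon_level_summary(annotated, predicted)
print(summary)
```

Counting boundaries separately from whole exons is what makes a 5'-versus-3' comparison of the kind reported for initial and terminal exons possible; a stricter variant would pair each predicted exon with the annotated exon it overlaps before comparing boundaries.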
Chapter
Perhaps more than any other organism, the fruit fly Drosophila melanogaster has been the vanguard for molecular genetics and genome mapping. One of the first few metazoan genomes to be sequenced, the fly also offers a set of unparalleled molecular and genetic tools for exploring gene function and genome organization in a complex multicellular animal. In this chapter, we give a brief history of a century of genome biology with the fly, starting with the work in the famous fly room at Columbia University and leading to genome projects and multinational collaborations characteristic of current biological research. We discuss recent technical developments for the analysis of the genome, the proteome, and the interactome of D. melanogaster and related species, finishing with insights emerging from the recent modENCODE project aimed at cataloging all functional elements in the fly genome. Each section concludes with the things to come: “The Times They Are a-Changin'.” Epigraphs: “May You Live in Interesting Times” (Astounding Science Fiction Magazine, 1950); “The Times They Are a-Changin'” (Bob Dylan, 1963).
Chapter
This chapter focuses on a global comparative view of the genomic and environmental contributions to genome-wide transcriptional activity in Escherichia coli. As known, a genome is not simply a collection of genes. Both genetic and external perturbations disturb chromosomal organisation and lead to reorganisation of the expression profile in relation to cell growth physiology. Comparative transcriptome analyses of E. coli strains exponentially growing under regular conditions or under the stress of temperature upshift are described. The genome-wide transcriptional reorganisation in response to short-term temperature upshift, known as the heat shock response, and that to long-term temperature increase, designated as thermal adaptation, are discussed in comparison to the transcriptome's responses to other stresses. In addition, experimental and analytical concerns are introduced as practical references.
Article
To study the major differences in the distribution of spermatozoa proteins in infertile men with varicocele by comparative proteomics and validation of their level of expression. The study-specific estimates for each varicocele outcome were combined to identify the proteins involved in varicocele-associated infertility in men irrespective of stage and laterality of their clinical varicocele. Expression levels of 5 key proteins (PKAR1A, AK7, CCT6B, HSPA2, and ODF2) involved in stress response and sperm function including molecular chaperones were validated by Western blotting. Ninety-nine proteins were differentially expressed in the varicocele group. Over 87% of the DEP involved in major energy metabolism and key sperm functions were underexpressed in the varicocele group. Key protein functions affected in the varicocele group were spermatogenesis, sperm motility, and mitochondrial dysfunction, which were further validated by Western blotting, corroborating the proteomics analysis. Varicocele is essentially a state of energy deprivation, hypoxia, and hyperthermia due to impaired blood supply, which is corroborated by down-regulation of lipid metabolism, mitochondrial electron transport chain, and Krebs cycle enzymes. To corroborate the proteomic analysis, expression of the 5 identified proteins of interest was validated by Western blotting. This study contributes toward establishing a biomarker "fingerprint" to assess sperm quality on the basis of molecular parameters.
Article
Among infertile men, a diagnosis of unilateral varicocele is made in 90% of varicocele cases and bilateral in the remaining varicocele cases. However, there are reports of under-diagnosis of bilateral varicocele among infertile men and that its prevalence is greater than 10%. In this prospective study, we aimed to examine the differentially expressed proteins (DEP) extracted from spermatozoa cells of patients with bilateral varicocele and fertile donors. Subjects consisted of 17 men diagnosed with bilateral varicocele and 10 proven fertile men as healthy controls. Using the LTQ-orbitrap elite hybrid mass spectrometry system, proteomic analysis was done on pooled samples from 3 patients with bilateral varicocele and 5 fertile men. From these samples, 73 DEP were identified of which 58 proteins were differentially expressed, with 7 proteins unique to the bilateral varicocele group and 8 proteins to the fertile control group. Majority of the DEPs were observed to be associated with metabolic processes, stress responses, oxidoreductase activity, enzyme regulation, and immune system processes. Seven DEP were involved in sperm function such as capacitation, motility, and sperm-zona binding. Proteins TEKT3 and TCP11 were validated by Western blot analysis and may serve as potential biomarkers for bilateral varicocele. In this study, we have demonstrated for the first time the presence of DEP and identified proteins with distinct reproductive functions which are altered in infertile men with bilateral varicocele. Functional proteomic profiling provides insight into the mechanistic implications of bilateral varicocele-associated male infertility.
Article
The past year has been a spectacular one for Drosophila research. The sequencing and annotation of the Drosophila melanogaster genome has allowed a comprehensive analysis of the first three eukaryotes to be sequenced (yeast, worm and fly), including an analysis of the fly's influences as a model for the study of human disease. This year has also seen the initiation of a full-length cDNA sequencing project and the first analysis of Drosophila development using high-density DNA microarrays containing several thousand Drosophila genes. For the first time homologous recombination has been demonstrated in flies and targeted gene disruptions may not be far off.
Article
The rapid growth of genomic sequence data for both human and non-human species has made the analysis of these sequences, especially the prediction of genes in them, very important, and it is currently the focus of many research efforts. Besides its scientific interest to the molecular biology and genomics community, gene prediction is of considerable importance in human health and medicine. A variety of gene prediction techniques have been developed for eukaryotes over the past few years. This article reviews and analyzes the application of certain soft computing techniques in gene prediction. First, the problem of gene prediction and its challenges are described. These are followed by different soft computing techniques along with their application to gene prediction. In addition, a comparative analysis of different soft computing techniques for gene prediction is given. Finally, some limitations of the current research activities and future research directions are provided.
Article
Expressed sequence tags (ESTs), which have accumulated considerably, provide a valuable resource for finding new genes and disease-relevant genes, and for recognizing alternative splicing variants, SNP sites, etc. A prerequisite for such research is correctly identifying the ESTs related to a gene sequence. Based on analysis of the alignment results between some known gene sequences and ESTs in public databases, several measures, including Identity Check, Gap Check, Inclusion Check and Length Check, have been introduced to judge whether an EST alignment is related to a gene sequence or not. A computational program, EDSAcl.0, has been developed to identify true EST alignments and exon regions of query gene sequences. When tested with human gene sequences in the standard dataset HMR195 and evaluated with the standard measures of gene prediction performance, EDSAcl.0 can identify protein-coding regions with a specificity of 0.997 and a sensitivity of 0.88 at the nucleotide level, outperforming the counterpart TAP. A web server of EDSAcl.0 is available at http://infosci.hust.edu.cn. Keywords: gene sequence; EST; sequence alignment; true alignment; exon identification
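The nucleotide-level sensitivity and specificity quoted in this abstract can be made concrete with a minimal sketch. It assumes the convention widely used in gene-prediction benchmarks, where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP), i.e. the fraction of predicted coding nucleotides that are truly coding; the abstract does not say whether the program was scored in exactly this way. The function name, the half-open interval representation and the toy coordinates are hypothetical.

```python
def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Exons are half-open (start, end) intervals on a single sequence.
    Assumption: "specificity" follows the gene-prediction convention
    TP / (TP + FP), not the classical TN / (TN + FP).
    """
    true_pos = {p for start, end in true_exons for p in range(start, end)}
    pred_pos = {p for start, end in predicted_exons for p in range(start, end)}

    tp = len(true_pos & pred_pos)   # coding and predicted coding
    fn = len(true_pos - pred_pos)   # coding but missed
    fp = len(pred_pos - true_pos)   # predicted coding but not coding

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Toy data: the first exon is predicted 20 nt too short,
# the second 20 nt too long.
annotated = [(100, 200), (300, 400)]
predicted = [(120, 200), (300, 420)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Expanding intervals into sets of positions keeps the sketch short; for whole chromosomes an interval-intersection routine would be the more economical choice.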
Article
Identification and annotation of all the genes in the sequenced Drosophila genome is a work in progress. Wild-type testis function requires many genes and is thus of potentially high value for the identification of transcription units. We therefore undertook a survey of the repertoire of genes expressed in the Drosophila testis by computational and microarray analysis. We generated 3141 high-quality testis expressed sequence tags (ESTs). Testis ESTs computationally collapsed into 1560 cDNA set used for further analysis. Of those, 11% correspond to named genes, and 33% provide biological evidence for a predicted gene. A surprising 47% fail to align with existing ESTs and 16% with predicted genes in the current genome release. EST frequency and microarray expression profiles indicate that the testis mRNA population is highly complex and shows an extended range of transcript abundance. Furthermore, >80% of the genes expressed in the testis showed onefold overexpression relative to ovaries, or gonadectomized flies. Additionally, >3% showed more than threefold overexpression at p <0.05. Surprisingly, 22% of the genes most highly overexpressed in testis match Drosophila genomic sequence, but not predicted genes. These data strongly support the idea that sequencing additional cDNA libraries from defined tissues, such as testis, will be important tools for refined annotation of the Drosophila genome. Additionally, these data suggest that the number of genes in Drosophila will significantly exceed the conservative estimate of 13,601. [The sequence data described in this paper have been submitted to the dbEST data library under accession nos. AI944400–AI947263 and BE661985–BE662262.] [The microarray data described in this paper have been submitted to the GEO data library under accession nos. GPLS, GSM3–GSM10.]
Article
We present the sequence of a contiguous 2.63 Mb of DNA extending from the tip of the X chromosome of Drosophila melanogaster. Within this sequence, we predict 277 protein coding genes, of which 94 had been sequenced already in the course of studying the biology of their gene products, and examples of 12 different transposable elements. We show that an interval between bands 3A2 and 3C2, believed in the 1970s to show a correlation between the number of bands on the polytene chromosomes and the 20 genes identified by conventional genetics, is predicted to contain 45 genes from its DNA sequence. We have determined the insertion sites of P-elements from 111 mutant lines, about half of which are in a position likely to affect the expression of novel predicted genes, thus representing a resource for subsequent functional genomic analysis. We compare the European Drosophila Genome Project sequence with the corresponding part of the independently assembled and annotated Joint Sequence determined through “shotgun” sequencing. Discounting differences in the distribution of known transposable elements between the strains sequenced in the two projects, we detected three major sequence differences, two of which are probably explained by errors in assembly; the origin of the third major difference is unclear. In addition there are eight sequence gaps within the Joint Sequence. At least six of these eight gaps are likely to be sites of transposable elements; the other two are complex. Of the 275 genes in common to both projects, 60% are identical within 1% of their predicted amino-acid sequence and 31% show minor differences such as in choice of translation initiation or termination codons; the remaining 9% show major differences in interpretation.
Article
In this article we present an in silico method that automatically assigns putative functions to DNA sequences. The annotations are at an increasingly conceptual level, up to identifying general biomedical fields to which the sequences could contribute. This bioinformatics data-mining system makes substantial use of several resources: a locally stored MEDLINE® database; a manually built classification system; the MeSH® taxonomy; relational technology; and bioinformatics methods. Knowledge is generated from various data sources by using well-defined semantics, and by exploiting direct links between them. A two-dimensional “Concept Map™” displays the knowledge graph, which allows causal connections to be followed. The use of this method has been valuable and has saved considerable time in our in-house projects, and can be generally exploited for any sequence-annotation or knowledge-condensation task.
Article
The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the ∼120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes ∼13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.
Article
Motivation: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes. Results: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for the Arabidopsis genomic sequences. Gene modeling could be further improved by combining prediction software.
Article
Single P-element (P[lArB]) insertional mutagenesis of an isogenic strain was used to identify autosomal loci affecting odor-guided behavior of Drosophila melanogaster. The avoidance response to benzaldehyde of 379 homozygous P[lArB] element-containing insert lines was evaluated quantitatively. Fourteen smell impaired (smi) lines were identified in which P[lArB] element insertion caused different degrees of hyposmia in one or both sexes. The smi loci map to different cytological locations and probably are novel olfactory genes. Enhancer trap analysis of the smi lines indicates that expression of at least 10 smi genes is controlled by olfactory tissue-specific promoter/enhancer elements.
Article
A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.
"Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it." (Milne 1926)
Article
The completion of the sequencing of chromosome 3 of the malarial parasite Plasmodium falciparum¹ is a major step forward in our understanding of the Plasmodium genome. We have analysed this chromosome using GlimmerM, a freely available gene-finder developed specifically for the P. falciparum species². GlimmerM was highly effective in finding nearly all the genes that were reported on P. falciparum chromosome 2 (ref. 2), and a newly re-trained version was even more effective on chromosome 3, confirming virtually all reported genes and finding several additional ones.
Article
Lawson et al. reply: Pertea et al. report their interpretation of the published Plasmodium falciparum chromosome 3 sequence using The Institute for Genomic Research's gene-prediction algorithm GlimmerM. We believe, however, that their analysis is flawed, and that it highlights the problem that overreliance on a single algorithm can lead to mistakes in annotation.
Article
Forty-seven lethal mutations and alleles of nine visible loci (including alcohol dehydrogenase) have been mapped by both deficiency mapping and, in most cases, by recombination mapping to a small region (34D-35C) of chromosome arm 2L of Drosophila melanogaster. The lethals fall into approximately 21 complementation groups, and we estimate that the total number of lethal plus visible complementation groups within the 34-band deficiency, Df(2L)64j, is approximately 34, a remarkable numerical coincidence. The possible genetic significance of this coincidence is discussed. Lethals mapping close to the structural gene for alcohol dehydrogenase, both distally and proximally, have been identified and will be used for the construction of selective crosses for the study of exchange within this locus. Despite many abnormal cytological features (e.g., ectopic pairing, weak points) region 35 of chromosome arm 2L does not display any unusual genetic features; indeed, in terms of the amount of recombination per band and the average map distance between adjacent loci, this region is similar to that between zeste and white on the X chromosome.
Article
The position of the structural gene coding for alcohol dehydrogenase (ADH) in Drosophila melanogaster has been shown to be within polytene chromosome bands 35B1 and 35B3, most probably within 35B2. The genetic and cytological properties of twelve deficiencies in polytene chromosome region 34–35 have been characterized, eleven of which include Adh. Also mapped cytogenetically are seven other recessive visible mutant loci. Flies heterozygous for overlapping deficiencies that include both the Adh locus and that for the outspread mutant (osp: a recessive wing phenotype) are homozygous viable and show a complete ADH negative phenotype and strong osp phenotype. These deficiencies probably include two polytene chromosome bands, 35B2 and 35B3.
Article
A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.
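To make the definition of a tandem repeat concrete, the following is a deliberately naive sketch that finds only exact repeats up to a fixed maximum period. It is not the algorithm described in the abstract above, which handles approximate copies and needs no pattern size; the function name, parameters and toy sequence are hypothetical.

```python
def find_exact_tandem_repeats(seq, max_period=6, min_copies=2):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    Naive illustration only: copies must match exactly, and the period
    is bounded by max_period. A repeat of a short unit may also be
    reported at multiples of its period.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + 2 * period <= n:
            # Extend while each base equals the base one period earlier.
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            copies = (j - i) // period
            if copies >= min_copies:
                hits.append((i, seq[i:i + period], copies))
                # Position j - period is the first that cannot start a match.
                i = j - period + 1
            else:
                i += 1
    return hits


# Hypothetical toy sequence containing three exact copies of "CAG".
print(find_exact_tandem_repeats("TTCAGCAGCAGTT", max_period=3, min_copies=3))
```

Real genomic repeats accumulate substitutions and indels, which is why exact matching of this kind misses most of them and a statistical model of adjacent copies is needed.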
Benson, G. 1999. Nucleic Acids Res. 27: 573–580.
Salamov, A.A. and V.V. Solovyev. 2000. Genome Res. (this issue).
Henikoff, J.G. and S. Henikoff. 2000. Genome Res. (this issue).
Ashburner, M., S. Misra, J. Roote, S.E. Lewis, R. Blazej, T. Davis, C. Doyle, R. Galle, R. George, N. Harris et al. 1999. Genetics 153: 179–219.
Adams, M., S.E. Celniker, R.A. Holt, C.A. Evans, J.D. Gocayne, P.G. Amanatides, S. Scherer, P.W. Li, R.F. Galle, R.A. George et al. 2000. Science 287: 2185–2195.
Ohler, U. 2000. Genome Res. (this issue).
Parra, G., E. Blanco, and R. Guigó. 2000. Genome Res. (this issue).
MacKay. 1996. Genetics 143: 293–301.
Pavy, N., S. Rombauts, P. Dehais, C. Mathe, D.V.V. Ramana, P. Leroy, and P. Rouze. 1999. Bioinformatics 15: 887–899.
Birney, E. and R. Durbin. 2000. Genome Res. (this issue).
Krogh, A. 2000. Genome Res. (this issue).
Reese, M.G., D. Kulp, H. Tammana, and D. Haussler. 2000b. Genome Res. (this issue).
Reese, M.G., G. Hartzell, N.L. Harris, U. Ohler, and S.E. Lewis. 2000a. Genome Res. (this issue).
Lawson, D., S. Bowman, and B. Barrell. 2000. Nature 404: 34–35.
Gaasterland, T., A. Sczyrba, E. Thomas, G. Aytekin-Kurban, P. Gordon, and C.W. Sensen. 2000. Genome Res. (this issue).