Brent, M. R. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nature Rev. Genet. 9, 62-73 (2008).

Center for Genome Sciences, Campus BOX 8510, Washington University, 4444 Forest Park Blvd, Saint Louis, Missouri 63108, USA.
Nature Reviews Genetics (Impact Factor: 36.98). 02/2008; 9(1):62-73. DOI: 10.1038/nrg2220
Source: PubMed


The sequencing of large, complex genomes has become routine, but understanding how sequences relate to biological function is less straightforward. Although much attention is focused on how to annotate genomic features such as developmental enhancers and non-coding RNAs, there is still no higher eukaryote for which we know the correct exon-intron structure of at least one ORF for each gene. Despite this uncomfortable truth, genome annotation has made remarkable progress since the first drafts of the human genome were analysed. By combining several computational and experimental methods, we are now closer to producing complete and accurate gene catalogues than ever before.



Available from: Michael R Brent
  • Source
    • "A complete and accurately annotated proteome provides the building blocks for hypothesis-driven research seeking to enhance our understanding of biology. Genome annotation is a complex process involving multiple integrated tools, which have been described in detail [1] [2] [3] [4] [5] and are beyond the scope of this review. Briefly, traditional methods of genome annotation rely on combining various forms of evidence."
    ABSTRACT: One of the objectives of genome science is the discovery and accurate annotation of all protein-coding genes. Proteogenomics has emerged as a methodology that provides orthogonal information to traditional forms of evidence used for genome annotation. By this method, peptides that are identified via tandem mass spectrometry are used to refine protein-coding gene models. Namely, these peptides are used to confirm the translation of predicted protein-coding genes, as evidence of novel genes or for correction of current gene models. Proteogenomics requires deep and broad sampling of the proteome in order to generate sufficient numbers of unique peptides. Therefore, we propose that proteogenomic projects are designed so that the generated peptides can also be used to create a comprehensive protein atlas that quantitatively catalogues protein abundance changes during development and in response to environmental stimulus.
    Full-text · Article · May 2015 · Current Plant Biology
  • Source
    • "Several studies have assessed specific DQ dimensions in different domains. One study explored progress in the accuracy assessments of automated genome curation tasks (Brent, 2008), whereas another examined in an online interactive community, patterns in credibility (Lankes, 2008). Wang and Strong (1996, p 6) provided a definition for quality, describing it as " fitness for use. "
    ABSTRACT: Purpose - The purpose of this paper is to understand genomics scientists' perceptions in data quality assurances based on their domain knowledge. Design/methodology/approach - The study used a survey method to collect responses from 149 genomics scientists grouped by domain knowledge. They ranked the top-five quality criteria based on hypothetical curation scenarios. The results were compared using the χ² test. Findings - Scientists with domain knowledge of biology, bioinformatics, and computational science did not reach a consensus in ranking data quality criteria. Findings showed that biologists cared more about curated data that can be concise and traceable. They were also concerned about skills dealing with information overloading. Computational scientists on the other hand value making curation understandable. They paid more attention to the specific skills for data wrangling. Originality/value - This study takes a new approach in comparing the data quality perceptions for scientists across different domains of knowledge. Few studies have been able to synthesize models to interpret data quality perception across domains. The findings may help develop data quality assurance policies, training seminars, and maximize the efficiency of genome data management.
    Full-text · Article · Jan 2015 · Journal of Documentation
  • Source
    • "Therefore, experimental proof is suggested when reference genes are to be used under a new experimental condition or in a new plant species [34]. Moreover, given the limitations of Sanger sequencing [36]–[37], expressed sequence tag (EST) analysis has identified only five mRNAs, which are all involved in biosynthesis of the saponins. The next-generation sequencing technology, known as RNA-seq, emerged as an effective approach for high-throughput sequence, now facilitates studies of transcriptome in a rapid way and has been used to explore gene structure and expression profiling on traditional Chinese medicine [38]–[39]."
    ABSTRACT: Background The dried root of Polygala tenuifolia, named Radix Polygalae, is a well-known traditional Chinese medicine. Triterpenoid saponins are some of the most important components of Radix Polygalae extracts and are widely studied because of their valuable pharmacological properties. However, the relationship between gene expression and triterpenoid saponin biosynthesis in P. tenuifolia is unclear. Methodology/Findings In this study, ultra-performance liquid chromatography (UPLC) coupled with quadrupole time-of-flight mass spectrometry (Q-TOF MS)-based metabolomic analysis was performed to identify and quantify the different chemical constituents of the roots, stems, leaves, and seeds of P. tenuifolia. A total of 22 marker compounds (VIP>1) were explored, and significant differences in all 7 triterpenoid saponins among the different tissues were found. We also observed an efficient reference gene GAPDH for different tissues in this plant and determined the expression level of some genes in the triterpenoid saponin biosynthetic pathway. Results showed that MVA pathway has more important functions in the triterpenoid saponin biosynthesis of P. tenuifolia. The expression levels of squalene synthase (SQS), squalene monooxygenase (SQE), and beta-amyrin synthase (β-AS) were highly correlated with the peak area intensity of triterpenoid saponins compared with data from UPLC/Q-TOF MS-based metabolomic analysis. Conclusions/Significance This finding suggested that a combination of UPLC/Q-TOF MS-based metabolomics and gene expression analysis can effectively elucidate the mechanism of triterpenoid saponin biosynthesis and can provide useful information on gene discovery. These findings can serve as a reference for using the overexpression of genes encoding for SQS, SQE, and/or β-AS to increase the triterpenoid saponin production of P. tenuifolia.
    Full-text · Article · Aug 2014 · PLoS ONE