ABSTRACT: The alphaherpesvirus Pseudorabies virus (PrV) establishes latency primarily in neurons of trigeminal ganglia when only transcription of the latency-associated transcript (LAT) locus is detected. Eleven microRNAs (miRNAs) cluster within LAT, suggesting a role in establishment and/or maintenance of latency. We generated a mutant PrV (M) deleted of nine miRNA genes which displayed almost identical properties to the parental PrV (WT) during propagation in vitro. Fifteen pigs were experimentally infected with either WT, M or mock infected. Similar levels of virus excretion and host antibody response were observed in all infected animals. At 62 days post infection trigeminal ganglia were excised and profiled by deep sequencing and RT-qPCR. Latency was established in all infected animals without evidence of viral reactivation, demonstrating that miRNAs are not mandatory for this process. Lower levels of Large Latency Transcript (LLT) were found in ganglia infected by M compared to WT PrV. All PrV miRNAs were expressed, with highest expression found for prv-miR-LLT1, prv-miR-LLT2 (in WT-ganglia) and prv-miR-LLT10 (in both WT and M-ganglia). No evidence of differentially expressed porcine miRNAs was found. Fifty-four porcine genes were differentially expressed between WT, M and control ganglia. Both viruses triggered a strong host immune response, but in M-ganglia gene upregulation was prevalent. Pathway analyses indicated that several biofunctions, including those related to cell-mediated immune response and migration of dendritic cells, were impaired in M-ganglia. These findings are consistent with a function of the LAT locus in the modulation of host response for maintaining a latent state.
Journal of Virology 10/2014; 89(1). DOI:10.1128/JVI.02181-14 · 4.65 Impact Factor
ABSTRACT: The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).
ABSTRACT: The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
Nucleic Acids Research 11/2013; 42(D1). DOI:10.1093/nar/gkt1059 · 9.11 Impact Factor
ABSTRACT: The mountains of data thrusting from the new landscape of modern high-throughput biology are irrevocably changing biomedical research and creating a near-insatiable demand for training in data management and manipulation and data mining and analysis. Among life scientists, from clinicians to environmental researchers, a common theme is the need not just to use, and gain familiarity with, bioinformatics tools and resources but also to understand their underlying fundamental theoretical and practical concepts. Providing bioinformatics training to empower life scientists to handle and analyse their data efficiently, and progress their research, is a challenge across the globe. Delivering good training goes beyond traditional lectures and resource-centric demos, using interactivity, problem-solving exercises and cooperative learning to substantially enhance training quality and learning outcomes. In this context, this article discusses various pragmatic criteria for identifying training needs and learning objectives, for selecting suitable trainees and trainers, for developing and maintaining training skills and evaluating training quality. Adherence to these criteria may help not only to guide course organizers and trainers on the path towards bioinformatics training excellence but, importantly, also to improve the training experience for life scientists.
Briefings in Bioinformatics 06/2013; 14(5). DOI:10.1093/bib/bbt043 · 9.62 Impact Factor
ABSTRACT: Summary: We present iAnn, an open source community-driven platform for dissemination of life science events, such as courses, conferences and workshops. iAnn allows automatic visualisation and integration of customised event reports. A central repository lies at the core of the platform: curators add submitted events, and these are subsequently accessed via web services. Thus, once an iAnn widget is incorporated into a website, it permanently shows timely relevant information as if it were native to the remote site. At the same time, announcements submitted to the repository are automatically disseminated to all portals that query the system. To facilitate the visualization of announcements, iAnn provides powerful filtering options and views, integrated in Google Maps and Google Calendar. All iAnn widgets are freely available. Availability:
ABSTRACT: BACKGROUND: The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. RESULTS: The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes.
A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome. CONCLUSIONS: This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig's adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response.
ABSTRACT: Zebrafish have become a popular organism for the study of vertebrate gene function. The virtually transparent embryos of this species, and the ability to accelerate genetic studies by gene knockdown or overexpression, have led to the widespread use of zebrafish in the detailed investigation of vertebrate gene function and increasingly, the study of human genetic disease. However, for effective modelling of human genetic disease it is important to understand the extent to which zebrafish genes and gene structures are related to orthologous human genes. To examine this, we generated a high-quality sequence assembly of the zebrafish genome, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map. Detailed automatic and manual annotation provides evidence of more than 26,000 protein-coding genes, the largest gene set of any vertebrate so far sequenced. Comparison to the human reference genome shows that approximately 70% of human genes have at least one obvious zebrafish orthologue. In addition, the high quality of this genome assembly provides a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.
ABSTRACT: For 10,000 years pigs and humans have shared a close and complex relationship. From domestication to modern breeding practices, humans have shaped the genomes of domestic pigs. Here we present the assembly and analysis of the genome sequence of a female domestic Duroc pig (Sus scrofa) and a comparison with the genomes of wild and domestic pigs from Europe and Asia. Wild pigs emerged in South East Asia and subsequently spread across Eurasia. Our results reveal a deep phylogenetic split between European and Asian wild boars ∼1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.
ABSTRACT: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
ABSTRACT: The Consensus Coding Sequence (CCDS) collaboration involves curators at multiple centers with a goal of producing a conservative set of high quality, protein-coding region annotations for the human and mouse reference genome assemblies. The CCDS data set reflects a ‘gold standard’ definition of best supported protein annotations, and corresponding genes, which pass a standard series of quality assurance checks and are supported by manual curation. This data set supports use of genome annotation information by human and mouse researchers for effective experimental design, analysis and interpretation. The CCDS project consists of analysis of automated whole-genome annotation builds to identify identical CDS annotations, quality assurance testing and manual curation support. Identical CDS annotations are tracked with a CCDS identifier (ID) and any future change to the annotated CDS structure must be agreed upon by the collaborating members. CCDS curation guidelines were developed to address some aspects of curation in order to improve initial annotation consistency and to reduce time spent in discussing proposed annotation updates. Here, we present the current status of the CCDS database and details on our procedures to track and coordinate our efforts. We also present the relevant background and reasoning behind the curation standards that we have developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts. Examples are provided to illustrate the application of these guidelines.
Database URL: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
Database The Journal of Biological Databases and Curation 01/2012; 2012:bas008. DOI:10.1093/database/bas008 · 4.46 Impact Factor
ABSTRACT: Manual annotation of genomic data is extremely valuable to produce an accurate reference gene set but is expensive compared with automatic methods and so has been limited to model organisms. Annotation tools that have been developed at the Wellcome Trust Sanger Institute (WTSI, http://www.sanger.ac.uk/) are being used to fill that gap, as they can be used remotely and so open up viable community annotation collaborations. We introduce the ‘Blessed’ annotator and ‘Gatekeeper’ approach to Community Annotation using the Otterlace/ZMap genome annotation tool. We also describe the strategies adopted for annotation consistency, quality control and viewing of the annotation.
Database URL: http://vega.sanger.ac.uk/index.html
Database The Journal of Biological Databases and Curation 01/2012; 2012:bas009. DOI:10.1093/database/bas009 · 4.46 Impact Factor
ABSTRACT: Funding bodies are increasingly recognizing the need to provide graduates and researchers with access to short intensive courses in a variety of disciplines, in order both to improve the general skills base and to provide solid foundations on which researchers may build their careers. In response to the development of 'high-throughput biology', the need for training in the field of bioinformatics, in particular, is seeing a resurgence: it has been defined as a key priority by many institutions and research programmes and is now an important component of many grant proposals. Nevertheless, when it comes to planning and preparing to meet such training needs, tension arises between the reward structures that predominate in the scientific community, which compel individuals to publish or perish, and the time that must be devoted to the design, delivery and maintenance of high-quality training materials. Conversely, there is much relevant teaching material and training expertise available worldwide that, were it properly organized, could be exploited by anyone who needs to provide training or needs to set up a new course. To do this, however, the materials would have to be centralized in a database and clearly tagged in relation to target audiences, learning objectives, etc. Ideally, they would also be peer reviewed, and easily and efficiently accessible for downloading. Here, we present the Bioinformatics Training Network (BTN), a new enterprise that has been initiated to address these needs, and review it in relation to similar initiatives and collections.
Briefings in Bioinformatics 11/2011; 13(3):383-9. DOI:10.1093/bib/bbr064 · 9.62 Impact Factor
ABSTRACT: The GENCODE consortium is a subgroup of the ENCODE consortium. Its aim is to provide complete annotation of genes in the human genome including protein-coding loci, non-coding loci and pseudogenes, based on experimental evidence. The final aim is for the HAVANA team to manually annotate the complete genome. This is a time-consuming process which will be completed over the course of the ENCODE project. Currently, to provide a set of annotation covering the complete genome, rather than just the regions that have been manually annotated, a merge of manual annotation from HAVANA with automatic annotation from the Ensembl automatically annotated gene set is created. This process also adds unique full-length CDS predictions from the Ensembl protein coding set into manually annotated genes, to provide the most complete up-to-date annotation of the genome possible. Also included in the set are short and long ncRNA genes predicted by the Ensembl prediction pipelines and a consensus set of pseudogene predictions agreed between HAVANA, Yale and UCSC. The CCDS set is also fully represented within the GENCODE set. The GENCODE set is the default annotation available in Ensembl and is also available in the UCSC genome browser. All the annotation is tagged as to whether it is produced by manual annotation alone, automatic annotation alone, or by both approaches. We are currently working to provide confidence levels for annotation, based on depth and type of evidence supporting it.
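The merge-and-tag step described in this abstract can be illustrated with a small sketch. This is only a toy illustration under stated assumptions, not the GENCODE pipeline: the `Transcript` record, the `merge_annotation` function and the rule that two models count as the same transcript only when their gene ID and exon coordinates are identical are all hypothetical simplifications, and the data are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript record: a gene ID plus exon coordinates."""
    gene_id: str
    exons: tuple  # ((start, end), ...) in genomic order


def merge_annotation(manual, automatic):
    """Merge two transcript collections and tag each model by its source.

    Assumption for this sketch: two models are the same transcript only
    when gene ID and all exon coordinates are identical.
    """
    manual_set = set(manual)
    automatic_set = set(automatic)
    merged = {}
    for transcript in manual_set | automatic_set:
        if transcript in manual_set and transcript in automatic_set:
            merged[transcript] = "both"
        elif transcript in manual_set:
            merged[transcript] = "manual"
        else:
            merged[transcript] = "automatic"
    return merged


# Toy data, invented for illustration only.
shared = Transcript("GENE1", ((100, 200), (300, 400)))
manual_only = Transcript("GENE1", ((100, 200), (320, 400)))
auto_only = Transcript("GENE2", ((1000, 1500),))

merged = merge_annotation([shared, manual_only], [shared, auto_only])
for transcript, tag in sorted(merged.items(), key=lambda kv: (kv[0].gene_id, kv[0].exons)):
    print(transcript.gene_id, transcript.exons, tag)
```

In this sketch the automatic-only model for `GENE2` fills a region with no manual annotation, while the model found by both routes is tagged accordingly, mirroring the three source tags the abstract mentions.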
ABSTRACT: Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.
Genome Research 07/2009; 19(7):1316-23. DOI:10.1101/gr.080531.108 · 13.85 Impact Factor
ABSTRACT: The Human and Vertebrate Analysis and Annotation (HAVANA) group at the Wellcome Trust Sanger Institute produced the manually annotated geneset for the Encyclopedia of DNA Elements (ENCODE) pilot project and, as part of the Gencode subgroup, are reprising this role in the scale up to cover the whole human genome. Our manual annotation is checked computationally and validated experimentally. Loci and transcripts predicted to be absent from the initial annotation are identified by comparison with a number of state-of-the-art algorithms for identifying exons, splice sites, transcripts and pseudogenes. Where novel features are confirmed the annotation is updated. Annotated coding transcripts are analysed to assess their coding potential by investigating patterns of conservation within the coding sequence (CDS) and comparing predicted secondary structures of annotated CDSs to similar proteins with solved structures. Annotated coding transcripts are also checked against the current set of human Consensus CDSs (CCDS) to check agreement with other participating centres (EBI, NCBI, & UCSC).
An initial round of annotation and analysis of chromosomes 21 and 22 has shown that while HAVANA annotation is both comprehensive and robust, it has benefitted from computational review. 13 novel non-coding loci, 27 novel splice variants and 6 extensions to existing variants were identified, many of which were found using supporting EST/mRNA sequences that were not present at the time of initial annotation. Fewer than 10 annotated CDSs required reclassification, no CCDS sequences required updating and 26 novel pseudogenes were added. The annotation of human chromosome 2 is complete and we are currently annotating chromosomes 3 and 7. Data from all members of Gencode are distributed via DAS and are now visible in our Zmap annotation interface, allowing assessment of computational predictions contemporaneous with first-pass gene annotation.
ABSTRACT: The zebrafish genome, which consists of 25 linkage groups and is ~1.4Gb in size, is being sequenced, finished and analysed in its entirety at the Wellcome Trust Sanger Institute. The manual annotation is provided by the Human and Vertebrate Analysis and Annotation (HAVANA) group and is released at regular intervals onto the Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) and may be viewed as a DAS source in Ensembl (http://www.ensembl.org/Danio_rerio).
Our annotation is compiled in close collaboration with the Zebrafish Information Network (ZFIN) (http://zfin.org/), which has enabled us to provide an accurate, dynamic and distinct resource for the zebrafish community as a whole.
Annotation is based on the reference genome sequence, which is derived from a minimal tile path assembly composed of clones that have been mapped, sequenced and meticulously finished to a sequence accuracy of over 99.9% per 100Kb. We expect to have 90% of the zebrafish genome to a finished standard by the end of 2009. Our approach to annotation uses two strategies. The first is the generation and annotation of gene lists comprising cDNAs (8995 in total) found in ZFIN that map to our current reference assembly. The second is clone-by-clone annotation, through which we have annotated over 3200 genes, 1100 transcripts and 130 pseudogenes across 11 linkage groups and 3530 clones. As well as our ongoing genome annotation, we welcome external annotation requests for specific genes and regions, which already include the annotation of 93 genes associated with human obesity and the scheduled annotation of the Major Histocompatibility Complex, which will use reference sequence taken from libraries of a doubled haploid fish and complement our previously published work on the human and mouse MHC.
External annotation requests and any feedback or questions can be sent to zfish-help [at] sanger.ac.uk.
ABSTRACT: Manual annotation (the "museum" model of annotation) relies on a small group of specialized curators to catalogue and classify genes according to their functional roles. This is both costly and time consuming and is therefore used only for model organisms with sufficient funding. Smaller research communities often have to rely on other models of annotation, mainly automated annotation (the "factory" model, e.g. Ensembl), and the "jamboree" model (in which a group of leading biologists from the community and bioinformaticians come together for a short intensive annotation workshop). At the Wellcome Trust Sanger Institute (WTSI), the Havana team provides high quality manual annotation of finished vertebrate genome sequences, namely human, mouse and zebrafish. We also perform the curation of specific finished regions such as the MHC in dog, cow and pig, whose whole genomes have been assembled from unfinished BACs or from whole genome shotgun sequences. In addition, we at Havana have hosted annotation jamborees for the cow (Bos taurus) and pig (Sus scrofa) genomes. During those sessions, the research community had the opportunity to annotate their genes of interest under expert guidance using the custom-written, publicly available Otterlace annotation system and the unified manual annotation guidelines. By making use of the tools and skills acquired during the cow and pig jamborees, the delegates can continue annotating their genomes remotely. For the pig genome, a highly contiguous physical map has been generated by an international effort of four laboratories (available in Pre!Ensembl) and is being used as a substrate for the swine genome sequencing project. Upcoming vertebrate genomes will be sequenced to a high depth of coverage with next generation sequencing technologies (e.g. Illumina, 454, SOLiD) but will have the drawback of not being manually finished.
Manual annotation will be more accurate than the automated predictions at coping with any assembly problems derived from these high coverage but unfinished (or automatic pre-finished) genomes. Once these inherent assembly errors are corrected and the gene structures are accurately identified with manual annotation, the curated genes will be incorporated and merged with the predicted gene models in Ensembl to provide a unified view of the landscape of vertebrate genomes. I will present an introduction to our manual annotation system and our experience using it for annotation jamborees at the WTSI.