MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects

Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
BMC Bioinformatics (Impact Factor: 2.58). 12/2011; 12(1):491. DOI: 10.1186/1471-2105-12-491
Source: PubMed


Second-generation sequencing technologies are precipitating major shifts in which kinds of genomes are sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation because, unlike first-generation projects, there are no pre-existing 'gold-standard' gene models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects thus need new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies.
We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality, and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review.
MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.
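The abstract states that annotations can be evaluated and problematic ones prioritized for manual review, but it does not say how. As a rough illustration only, the Python sketch below ranks gene models by how poorly their exon coordinates agree with aligned evidence. The score used (one minus the mean of base-level sensitivity and specificity), the function names, and the toy coordinates are all assumptions made for this example; they are not taken from the abstract and do not reproduce MAKER2's code.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: an exon or evidence alignment as a
# 1-based, inclusive (start, end) pair on a single sequence.
Interval = Tuple[int, int]


def covered_bases(intervals: List[Interval]) -> Set[int]:
    """Return the set of base positions covered by the intervals."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def disagreement(model: List[Interval], evidence: List[Interval]) -> float:
    """Score in [0, 1]: 0 means identical coverage, 1 means no shared bases."""
    model_bases = covered_bases(model)
    evidence_bases = covered_bases(evidence)
    if not model_bases or not evidence_bases:
        return 1.0  # nothing to compare, so treat as fully unsupported
    shared = len(model_bases & evidence_bases)
    sensitivity = shared / len(evidence_bases)  # evidence recovered by the model
    specificity = shared / len(model_bases)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


def rank_for_review(
    models: Dict[str, List[Interval]],
    evidence: Dict[str, List[Interval]],
) -> List[Tuple[str, float]]:
    """Sort gene models so the least evidence-supported ones come first."""
    scores = {
        name: disagreement(exons, evidence.get(name, []))
        for name, exons in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (invented for illustration).
models = {
    "gene_a": [(100, 200), (300, 400)],  # matches its evidence exactly
    "gene_b": [(1000, 1100)],            # partially overlaps its evidence
    "gene_c": [(5000, 5100)],            # has no evidence at all
}
evidence = {
    "gene_a": [(100, 200), (300, 400)],
    "gene_b": [(1050, 1150)],
}

ranking = rank_for_review(models, evidence)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

Storing every covered position in a set keeps the sketch short, but it is only practical for small examples; genome-scale data would call for interval arithmetic instead.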


Available from: PubMed Central · License: CC BY
    • "Assembly was performed with the Celera assembler version 8.1[39]. Annotations for the H. dujardini genome assembly were generated using the automated genome annotation pipeline MAKER2[40]. Our H. dujardini genome sequencing resulted in an assembly of 212.3 Mb with an average estimated coverage of 126X. "
    ABSTRACT: The superphylum Panarthropoda (Arthropoda, Onychophora, and Tardigrada) exhibits a remarkable diversity of segment morphologies, enabling these animals to occupy diverse ecological niches. The molecular identities of these segments are specified by Hox genes and other axis patterning genes during development [1, 2]. Comparisons of molecular segment identities between arthropod and onychophoran species have yielded important insights into the origins and diversification of their body plans [3-9]. However, the relationship of the segments of tardigrades to those of arthropods and onychophorans has remained enigmatic [10, 11], limiting our understanding of early panarthropod body plan diversification. Here, we reveal molecular identities for all of the segments of a tardigrade. Based on our analysis, we conclude that tardigrades have lost a large intermediate region of the body axis (a region corresponding to the entire thorax and most of the abdomen of insects) and that they have lost the Hox genes that originally specified this region. Our data suggest that nearly the entire tardigrade body axis is homologous to just the head region of arthropods. Based on our results, we reconstruct a last common ancestor of Panarthropoda that had a relatively elongate body plan like most arthropods and onychophorans, rather than a compact, tardigrade-like body plan. These results demonstrate that the body plan of an animal phylum can originate by the loss of a large part of the body.
    Article · Jan 2016 · Current biology: CB
    • "Then, we identified genomic scaffolds containing regions of homology to known H11 genes (Smith et al., 1997; Roberts et al., 2013) by mapping (E-value cut-off: 10^-5) these assembled transcripts (n = 1,066) using BLAT (Kent, 2002). We also used h11 genes from the H. contortus draft genome, predicted previously using MAKER2 (Holt and Yandell, 2011; cf. Schwarz et al. 2013). "
    ABSTRACT: Although substantial research has been focused on the ‘hidden antigen’ H11 of Haemonchus contortus as a vaccine against haemonchosis in small ruminants, little is known about this and related aminopeptidases. In the present article, we reviewed genomic and transcriptomic data sets to define, for the first time, the complement of aminopeptidases (designated Hc-AP-1 to Hc-AP-13) of the family M1 with homologs in C. elegans, characterised by zinc-binding (HEXXH) and exo-peptidase (GAMEN) motifs. The three previously published H11 isoforms (accession nos. X94187, FJ481146 and AJ249941) had most sequence similarity to Hc-AP-2 and Hc-AP-8, whereas unpublished isoforms (accession nos. AJ249942 and AJ311316) were both most similar to Hc-AP-3. The aminopeptidases characterised here had homologs in C. elegans. Hc-AP-1 to Hc-AP-8 were most similar in amino acid sequence (28-41%) to C. elegans T07F10.1; Hc-AP-9 and Hc-AP-10 to C. elegans PAM-1 (isoform b) (53-54% similar); Hc-AP-11 and Hc-AP-12 to C. elegans AC3.5 and Y67D8C.9 (26% and 50% similar, respectively); and Hc-AP-13 to C. elegans C42C1.11 and ZC416.6 (50-58% similar). Comparative analysis suggested that Hc-AP-1 to Hc-AP-8 play roles in digestion, metabolite excretion, neuropeptide processing and/or osmotic regulation, with Hc-AP-4 and Hc-AP-7 having male-specific functional roles. The analysis also indicated that Hc-AP-9 and Hc-AP-10 might be involved in the degradation of cyclin (B3) and required to complete meiosis. Hc-AP-11 represents a leucyl/cystinyl aminopeptidase, predicted to have metallopeptidase and zinc ion binding activity, whereas Hc-AP-12 likely encodes an aminopeptidase Q homolog also with these activities and a possible role in gonad function. Finally, Hc-AP-13 is predicted to encode an aminopeptidase AP-1 homolog of C. elegans with hydrolase activity, suggested to operate, possibly synergistically with a PEPT-1 ortholog, as an oligopeptide transporter in the gut for protein uptake and normal development and/or reproduction of the worm. Appraisal of structure-based amino acid sequence alignments revealed that all conceptually translated Hc-AP proteins, with the exception of Hc-AP-12, adopt a topology similar to those observed with the two subgroups of mammalian M1 aminopeptidases which possess either three (I, II and IV) or four (I-IV) domains. In contrast, Hc-AP-12 lacks the N-terminal domain (I), but possesses a substantially expanded domain III. Although further work needs to be done to assess amino acid sequence conservation of the different aminopeptidases among individual worms within and among H. contortus populations, we hope that these insights will support future localisation, structural and functional studies of these molecules in H. contortus as well as facilitate future assessments of a recombinant subunit or cocktail vaccine against haemonchosis.
    Article · Oct 2015 · Biotechnology Advances
    • "de-novo assembled transcripts as evidence, we predicted 28,455 gene encoding loci (encoding 40,068 proteins) using the MAKER2 annotation pipeline (Holt, C. and Yandell, M. 2011). The gene set originated from 13,725 scaffolds that collectively accounted for 796 Mb of the assembly. "
    ABSTRACT: Here we report the draft genome sequence of perennial ryegrass (Lolium perenne), an economically important forage and turf grass species widely cultivated in temperate regions worldwide. It is classified along with wheat, barley, oats and Brachypodium distachyon in the Pooideae sub-family of the grass family (Poaceae). Transcriptome data was used to identify 28,455 gene models, and we utilize macro-co-linearity between perennial ryegrass and barley, and synteny within the grass family to establish a synteny-based linear gene order. The gametophytic self-incompatibility (SI) mechanism enables the pistil of a plant to reject self-pollen and therefore promote outcrossing. We have used the sequence assembly to characterise transcriptional changes in the stigma during pollination with both compatible and incompatible pollen. Characterisation of the pollen transcriptome identified homologs to pollen allergens from a range of species, many of which were expressed to very high levels in mature pollen grains, and potentially involved in the SI mechanism. The genome sequence provides a valuable resource for future breeding efforts based on genomic prediction, and will accelerate the development of varieties for more productive grasslands.
    Article · Sep 2015 · The Plant Journal