MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects

Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
BMC Bioinformatics (Impact Factor: 2.58). 12/2011; 12(1):491. DOI: 10.1186/1471-2105-12-491
Source: PubMed


Second-generation sequencing technologies are precipitating major shifts with regard to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies.
We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review.
MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.

    • "Then, we identified genomic scaffolds containing regions of homology to known H11 genes (Smith et al., 1997; Roberts et al., 2013) by mapping (E-value cut-off: 10^-5) these assembled transcripts (n = 1,066) using BLAT (Kent, 2002). We also used h11 genes from the H. contortus draft genome, predicted previously using MAKER2 (Holt and Yandell, 2011; cf. Schwarz et al. 2013). "
    ABSTRACT: Although substantial research has been focused on the ‘hidden antigen’ H11 of Haemonchus contortus as a vaccine against haemonchosis in small ruminants, little is known about this and related aminopeptidases. In the present article, we reviewed genomic and transcriptomic data sets to define, for the first time, the complement of aminopeptidases (designated Hc-AP-1 to Hc-AP-13) of the family M1 with homologs in C. elegans, characterised by zinc-binding (HEXXH) and exo-peptidase (GAMEN) motifs. The three previously published H11 isoforms (accession nos. X94187, FJ481146 and AJ249941) had most sequence similarity to Hc-AP-2 and Hc-AP-8, whereas unpublished isoforms (accession nos. AJ249942 and AJ311316) were both most similar to Hc-AP-3. The aminopeptidases characterised here had homologs in C. elegans. Hc-AP-1 to Hc-AP-8 were most similar in amino acid sequence (28-41%) to C. elegans T07F10.1; Hc-AP-9 and Hc-AP-10 to C. elegans PAM-1 (isoform b) (53-54% similar); Hc-AP-11 and Hc-AP-12 to C. elegans AC3.5 and Y67D8C.9 (26% and 50% similar, respectively); and Hc-AP-13 to C. elegans C42C1.11 and ZC416.6 (50-58% similar). Comparative analysis suggested that Hc-AP-1 to Hc-AP-8 play roles in digestion, metabolite excretion, neuropeptide processing and/or osmotic regulation, with Hc-AP-4 and Hc-AP-7 having male-specific functional roles. The analysis also indicated that Hc-AP-9 and Hc-AP-10 might be involved in the degradation of cyclin (B3) and required to complete meiosis. Hc-AP-11 represents a leucyl/cystinyl aminopeptidase, predicted to have metallopeptidase and zinc ion binding activity, whereas Hc-AP-12 likely encodes an aminopeptidase Q homolog also with these activities and a possible role in gonad function. Finally, Hc-AP-13 is predicted to encode an aminopeptidase AP-1 homolog of C. elegans with hydrolase activity, suggested to operate, possibly synergistically with a PEPT-1 ortholog, as an oligopeptide transporter in the gut for protein uptake and normal development and/or reproduction of the worm. Appraisal of structure-based amino acid sequence alignments revealed that all conceptually translated Hc-AP proteins, with the exception of Hc-AP-12, adopt a topology similar to those observed with the two subgroups of mammalian M1 aminopeptidases which possess either three (I, II and IV) or four (I-IV) domains. In contrast, Hc-AP-12 lacks the N-terminal domain (I), but possesses a substantially expanded domain III. Although further work needs to be done to assess amino acid sequence conservation of the different aminopeptidases among individual worms within and among H. contortus populations, we hope that these insights will support future localisation, structural and functional studies of these molecules in H. contortus as well as facilitate future assessments of a recombinant subunit or cocktail vaccine against haemonchosis.
    Biotechnology Advances 10/2015; DOI:10.1016/j.biotechadv.2015.10.003 · 9.02 Impact Factor
    • "de-novo assembled transcripts as evidence, we predicted 28,455 gene encoding loci (encoding 40,068 proteins) using the MAKER2 annotation pipeline (Holt, C. and Yandell, M. 2011). The gene set originated from 13,725 scaffolds that collectively accounted for 796 Mb of the assembly. "
    ABSTRACT: Here we report the draft genome sequence of perennial ryegrass (Lolium perenne), an economically important forage and turf grass species widely cultivated in temperate regions worldwide. It is classified along with wheat, barley, oats and Brachypodium distachyon in the Pooideae sub-family of the grass family (Poaceae). Transcriptome data was used to identify 28,455 gene models, and we utilize macro-co-linearity between perennial ryegrass and barley, and synteny within the grass family to establish a synteny-based linear gene order. The gametophytic self-incompatibility (SI) mechanism enables the pistil of a plant to reject self-pollen and therefore promote outcrossing. We have used the sequence assembly to characterise transcriptional changes in the stigma during pollination with both compatible and incompatible pollen. Characterisation of the pollen transcriptome identified homologs to pollen allergens from a range of species, many of which were expressed to very high levels in mature pollen grains, and potentially involved in the SI mechanism. The genome sequence provides a valuable resource for future breeding efforts based on genomic prediction, and will accelerate the development of varieties for more productive grasslands.
    The Plant Journal 09/2015; DOI:10.1111/tpj.13037 · 5.97 Impact Factor
    • "We employed a two-pass, iterative procedure using the MAKER v2.31.6 pipeline (Cantarel et al. 2008; Holt & Yandell 2011) to manage and evaluate the different evidences for gene annotation. For the first pass we predicted genes using SNAP (Korf 2004) with hidden-Markov models developed from the CEGs identified from CEGMA and an ab initio prediction of genes from GENEMARK-ES (Lomsadze et al. 2005). "
    ABSTRACT: The single-humped dromedary (Camelus dromedarius) is the most numerous and widespread of domestic camel species and is a significant source of meat, milk, wool, transportation, and sport for millions of people. Dromedaries are particularly well adapted to hot, desert conditions and harbor a variety of biological and physiological characteristics with evolutionary, economic, and medical importance. To understand the genetic basis of these traits, an extensive resource of genomic variation is required. In this study, we assembled, at 65x coverage, a 2.06 Gb draft genome of a female dromedary whose ancestry can be traced to an isolated population from the Canary Islands. We annotated 21,167 protein-coding genes and estimated ~33.7% of the genome to be repetitive. A comparison with the recently published draft genome of an Arabian dromedary resulted in 1.91 Gb of aligned sequence with a divergence of 0.095%. An evaluation of our genome with the reference revealed that our assembly contains more error-free bases (91.2%) and fewer scaffolding errors. We identified ~1.4 million single nucleotide polymorphisms with a mean density of 0.71 × 10^-3 per base. An analysis of demographic history indicated that changes in effective population size corresponded with recent glacial epochs. Our de novo assembly provides a useful resource of genomic variation for future studies of the camel's adaptations to arid environments and economically important traits. Furthermore, these results suggest that draft genome assemblies constructed with only two differently sized sequencing libraries can be comparable to those sequenced using additional library sizes, highlighting that additional resources might be better placed in technologies alternative to short-read sequencing to physically anchor scaffolds to genome maps.
    Molecular Ecology Resources 07/2015; DOI:10.1111/1755-0998.12443 · 3.71 Impact Factor