MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
BMC Bioinformatics (Impact Factor: 2.58). 12/2011; 12(1):491. DOI: 10.1186/1471-2105-12-491
Source: PubMed


Second-generation sequencing technologies are precipitating major shifts with regard to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation because, unlike in first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies.
We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality, and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review.
MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.

    • "We employed a two-pass, iterative procedure using the MAKER v2.31.6 pipeline (Cantarel et al. 2008; Holt & Yandell 2011) to manage and evaluate the different evidences for gene annotation. For the first pass we predicted genes using SNAP (Korf 2004) with hidden-Markov models developed from the CEGs identified from CEGMA and an ab initio prediction of genes from GENEMARK-ES (Lomsadze et al. 2005). "
    ABSTRACT: The single-humped dromedary (Camelus dromedarius) is the most numerous and widespread of domestic camel species and is a significant source of meat, milk, wool, transportation, and sport for millions of people. Dromedaries are particularly well adapted to hot, desert conditions and harbor a variety of biological and physiological characteristics with evolutionary, economic, and medical importance. To understand the genetic basis of these traits, an extensive resource of genomic variation is required. In this study, we assembled, at 65x coverage, a 2.06 Gb draft genome of a female dromedary whose ancestry can be traced to an isolated population from the Canary Islands. We annotated 21,167 protein-coding genes and estimated ~33.7% of the genome to be repetitive. A comparison with the recently published draft genome of an Arabian dromedary resulted in 1.91 Gb of aligned sequence with a divergence of 0.095%. An evaluation of our genome with the reference revealed that our assembly contains more error-free bases (91.2%) and fewer scaffolding errors. We identified ~1.4 million single nucleotide polymorphisms with a mean density of 0.71 × 10^-3 per base. An analysis of demographic history indicated that changes in effective population size corresponded with recent glacial epochs. Our de novo assembly provides a useful resource of genomic variation for future studies of the camel's adaptations to arid environments and economically important traits. Furthermore, these results suggest that draft genome assemblies constructed with only two differently sized sequencing libraries can be comparable to those sequenced using additional library sizes, highlighting that additional resources might be better placed in technologies alternative to short-read sequencing to physically anchor scaffolds to genome maps. This article is protected by copyright. All rights reserved.
    Molecular Ecology Resources 07/2015; DOI:10.1111/1755-0998.12443 · 3.71 Impact Factor
    • "The retrieved scaffolds were subsequently annotated using a combination of tools, including Maker 2.10 (Holt and Yandell 2011) with RepBase 19.5 (Jurka et al. 2005), SNAP, and Augustus trained on Rhodosporidium toruloides NP11 model (PRJNA169538). "
    ABSTRACT: In most fungi, sexual reproduction is bipolar, that is, two alternate sets of genes at a single mating-type (MAT) locus determine two mating types. However, in the Basidiomycota, a unique (tetrapolar) reproductive system emerged in which sexual identity is governed by two unlinked MAT loci, each of which controls independent mechanisms of self/nonself recognition. Tetrapolar to bipolar transitions have occurred on multiple occasions in the Basidiomycota, resulting for instance, from linkage of the two MAT loci into a single inheritable unit. Nevertheless, owing to the scarcity of molecular data regarding tetrapolar systems in the earliest-branching lineage of the Basidiomycota (subphylum Pucciniomycotina), it is presently unclear if the last common ancestor was tetrapolar or bipolar. Here, we address this question, by investigating the mating system of the Pucciniomycotina yeast Leucosporidium scottii. Using whole-genome sequencing and chromoblot analysis, we discovered that sexual reproduction is gove
    Genetics 07/2015; 201(1). DOI:10.1534/genetics.115.177717 · 5.96 Impact Factor
    • "Assembly also provided a circular DNA molecule of 27,717 bp in size with a G + C content of 21.9% corresponding to the whole MLO genome sequence. Genes were carried out using the Maker gene annotation pipeline [15]. The Maker pipeline was set with the results of ab initio gene prediction algorithms Augustus [16] and SNAP [17], the 6020 protein-coding genes of Blastocystis ST7 [5], ESTs of both Blastocystis ST7 [5] and ST1 [18] and 414 manually-designed genes of the ST4-WR1 isolate. "
    ABSTRACT: The intestinal protistan parasite Blastocystis is characterized by an extensive genetic variability with 17 subtypes (ST1–ST17) described to date. Only the whole genome of a human ST7 isolate was previously sequenced. Here we report the draft genome sequence of Blastocystis ST4-WR1 isolated from a laboratory rodent at Singapore.
    Genomics Data 02/2015; 12. DOI:10.1016/j.gdata.2015.01.009