Anchored Hybrid Enrichment for Massively High-Throughput Phylogenomics

Department of Scientific Computing, Florida State University, Dirac Science Library, Tallahassee, FL 32306-4102, USA.
Systematic Biology (Impact Factor: 14.39). 05/2012; 61(5):727-44. DOI: 10.1093/sysbio/sys049
Source: PubMed


The field of phylogenetics is on the cusp of a major revolution, enabled by new methods of data collection that leverage both genomic resources and recent advances in DNA sequencing. Previous phylogenetic work has required labor-intensive marker development coupled with single-locus polymerase chain reaction and DNA sequencing on a clade-by-clade and locus-by-locus basis. Here, we present a new, cost-efficient, and rapid approach to obtaining data from hundreds of loci for potentially hundreds of individuals for deep and shallow phylogenetic studies. Specifically, we designed probes for target enrichment of >500 loci in highly conserved anchor regions of vertebrate genomes (flanked by less conserved regions) from five model species and tested enrichment efficiency in nonmodel species up to 508 million years divergent from the nearest model. We found that hybrid enrichment using conserved probes (anchored enrichment) can recover a large number of unlinked loci that are useful at a diversity of phylogenetic timescales. This new approach has the potential not only to expedite resolution of deep-scale portions of the Tree of Life but also to greatly accelerate resolution of the large number of shallow clades that remain unresolved. The combination of low cost (~1% of the cost of traditional Sanger sequencing and ~3.5% of the cost of high-throughput amplicon sequencing for projects on the scale of 500 loci × 100 individuals) and rapid data collection (~2 weeks of laboratory time) is expected to make this approach tractable even for researchers working on systems with limited or nonexistent genomic resources.
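The cost figures in the abstract are given only as approximate fractions, so the following is a minimal sketch of the arithmetic they imply, not a cost model from the paper. The function names are hypothetical and chosen for illustration; the only inputs are the ~1% and ~3.5% fractions and the 500 loci × 100 individuals project scale quoted above.

```python
# Approximate cost fractions quoted in the abstract.
ANCHORED_VS_SANGER = 0.01     # anchored enrichment costs ~1% of Sanger sequencing
ANCHORED_VS_AMPLICON = 0.035  # and ~3.5% of high-throughput amplicon sequencing


def project_scale(n_loci, n_individuals):
    """Number of locus-by-individual combinations in a project (hypothetical helper)."""
    return n_loci * n_individuals


def implied_cost_multiple(fraction):
    """How many times more an alternative costs, given that anchored
    enrichment costs `fraction` of that alternative (hypothetical helper)."""
    return 1.0 / fraction


cells = project_scale(500, 100)
sanger_multiple = implied_cost_multiple(ANCHORED_VS_SANGER)
amplicon_multiple = implied_cost_multiple(ANCHORED_VS_AMPLICON)

print(cells)                        # locus-by-individual combinations
print(round(sanger_multiple))       # Sanger cost relative to anchored enrichment
print(round(amplicon_multiple, 1))  # amplicon cost relative to anchored enrichment
```

Read this way, the quoted fractions correspond to Sanger sequencing costing roughly 100 times, and amplicon sequencing roughly 29 times, as much as anchored enrichment for a project of 50,000 locus-by-individual combinations. These multiples are only as precise as the approximate percentages they are derived from.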

    • "Our method represents the first effort to simultaneously capture different mitochondrial DNA from mass samples to recover taxonomic composition. Although previous work has demonstrated some success in cross-taxa sequence capture (Lemmon et al. 2012; Li et al. 2013), it is crucial to understand whether and how the presence of multiple divergent species affect capture success. Our results suggest that probes based on references from a "
    ABSTRACT: Biodiversity analyses based on Next Generation Sequencing (NGS) platforms have developed by leaps and bounds in recent years. A PCR-free strategy, which can alleviate taxonomic bias, has been considered a promising approach to delivering reliable species compositions of targeted environments. The major impediment of such a method is the lack of appropriate mitochondrial DNA enrichment methods. Because mitochondrial genomes (mitogenomes) make up only a small proportion of total DNA, PCR-free methods will inevitably result in a huge excess of data (> 99%). Furthermore, the massive volume of sequence data is highly demanding on computing resources. Here, we present a mitogenome enrichment pipeline via a gene capture chip that was designed by virtue of the mitogenome sequences of the 1000 Insect Transcriptome Evolution project (1KITE, www.1kite.org). A mock sample containing 49 species was used to evaluate the efficiency of the mitogenome capture method. We demonstrate that the proportion of mitochondrial DNA can be increased by ca. 100-fold (from the original 0.47% to 42.52%). Variation in phylogenetic distances of target taxa to the probe set could in principle result in bias in abundance. However, the frequencies of input taxa were largely maintained after capture (R² = 0.81). We suggest that our mitogenome capture approach coupled with PCR-free shotgun sequencing could provide ecological researchers an efficient NGS method to deliver reliable biodiversity assessment.
    Molecular Ecology Resources 10/2015; DOI:10.1111/1755-0998.12472 · 3.71 Impact Factor
    • "These fundamental questions have barely been addressed. Interestingly, most targeted-sequence capture studies have only presented phylogenetic results for data sets including approximately 30% missing data or less (e.g., Crawford et al. 2012; Faircloth et al. 2012, 2013; Lemmon et al. 2012; Leaché et al. 2014; Smith et al. 2014; Xi et al. 2014). "
    ABSTRACT: Targeted sequence capture is becoming a widespread tool for generating large phylogenomic datasets to address difficult phylogenetic problems. However, this methodology often generates datasets in which increasing the number of taxa and loci increases amounts of missing data. Thus, a fundamental (but still unresolved) question is whether sampling should be designed to maximize sampling of taxa or genes, or to minimize the inclusion of missing data cells. Here, we explore this question for an ancient, rapid radiation of lizards, the pleurodont iguanians. Pleurodonts include many well-known clades (e.g. anoles, basilisks, iguanas, spiny lizards) but relationships among families have proven difficult to resolve strongly and consistently using traditional sequencing approaches. We generated up to 4,921 ultraconserved elements with sampling strategies including 16, 29, and 44 taxa, from 1,179 to ~2.4 million characters per matrix and ~30% to 60% total missing data. We then compared mean branch support for interfamilial relationships under these 15 different sampling strategies for both concatenated (maximum likelihood) and species tree (NJst) approaches (after showing that mean branch support appears to be related to accuracy). We found that both approaches had the highest support when including loci with up to 50% missing taxa (matrices with ~40-55% missing data overall). Thus, our results show that simply excluding all missing data may be highly problematic as the primary guiding principle for the inclusion or exclusion of taxa and genes. The optimal strategy was somewhat different for each approach, a pattern that has not been shown previously. For concatenated analyses, branch support was maximized when including many taxa (44) but fewer characters (1.1 million). For species-tree analyses, branch support was maximized with minimal taxon sampling (16) but many loci (4,789 of 4,921). We also show that the choice of these sampling strategies can be critically important for phylogenomic analyses, since some strategies lead to demonstrably incorrect inferences (using the same method) that have strong statistical support. Our preferred estimate provides strong support for most interfamilial relationships in this important but phylogenetically challenging group.
    Systematic Biology 09/2015; DOI:10.1093/sysbio/syv058 · 14.39 Impact Factor
    • "Targeted enrichment, or sequence capture, was originally done with hybridization of sheared whole genomic DNAs to microarrays containing conserved sequences derived from either EST libraries or transcriptomes of species that bracketed the species of interest (Ng et al., 2009). Subsequently, the conserved sequence oligonucleotides, referred to as "probes" or "baits," labeled with biotin for capture in solution, have become more often used, as microarray data production is more costly and less simple to work with and interpret (Cronn et al., 2012; Lemmon et al., 2012; Zhou & Holliday, 2012; Li et al., 2013; Stull et al., 2013; Stephens et al., 2015). For animals, in addition to PCR-generated or cDNA-derived baits (Penalba et al., 2014), noncoding nuclear gene regions flanking Ultra Conserved Elements (UCEs) in the genomes of many phyla allow sequence capture using the same baits across species separated by up to hundreds of millions of years (Faircloth et al., 2012). "
    ABSTRACT: Single and low copy nuclear genes offer a larger number of, and more rapidly evolving, characters than the chloroplast and nuclear ribosomal gene sequences that have dominated plant phylogenetic studies to date. Until recently, only one or a few low copy nuclear gene markers were included in such studies. Now, the rapid adoption of “next generation sequencing” (NGS) techniques offers simpler and cheaper access to hundreds of, and not just tens of, coding and noncoding DNA regions. In this review, we describe the most commonly-used NGS methods available for accessing nuclear genes and discuss many NGS case studies that have been published in the last two to three years. These approaches include whole genome sequencing to target microsatellites, transcriptome sequencing, Exon-Primed Intron-Crossing sequencing (EPIC), targeted enrichment (or sequence capture), RAD sequencing (RAD-Seq, including genotyping-by-sequencing or GBS), and genome skimming. We also discuss some of the challenges to, and posed by, the NGS approaches.
    Journal of Systematics and Evolution 08/2015; 53(5). DOI:10.1111/jse.12174 · 1.49 Impact Factor