Identifying and removing haplotypic duplication in primary genome assemblies

Genome analysis
Dengfeng Guan (1,2), Shane A. McCarthy (2), Jonathan Wood (3), Kerstin Howe (3), Yadong Wang (1,*) and Richard Durbin (2,3,*)
(1) Department of Computer Science and Technology, Center for Bioinformatics, Harbin Institute of Technology, Harbin 150001, China
(2) Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK
(3) Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK
*To whom correspondence should be addressed.
Associate Editor: Alfonso Valencia
Received on August 13, 2019; revised on December 17, 2019; editorial decision on January 7, 2020; accepted on January 19, 2020
Abstract
Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production
of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high
heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in
contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to
resolve this problem. However, they either focus only on removing contained duplicate regions, also known as
haplotigs, or fail to use all the relevant information and hence make errors.
Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically
identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate
that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining
completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into
assembly pipelines.
Availability and implementation: The source code is written in C and is available at https://github.com/dfguan/purge_dups.
Contact: ydwang@hit.edu.cn or rd109@cam.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The superior and increasing throughput of long-read sequencing
technologies, such as those from Pacific Biosciences (PacBio) and Oxford
Nanopore Technologies (ONT), is revolutionizing the sequencing of
genomes for new species (Phillippy, 2017). Long-read assemblers,
such as Falcon (Chin et al., 2016) and Canu (Koren et al., 2017),
typically generate haplotype-fused paths of a diploid genome, with
Falcon-unzip (Chin et al., 2016) further able to separate the initial
assembly into primary contigs and haplotigs. However, when
heterozygosity is high, as in many outbred species (for example, most
insects and marine animals), the allelic relationships between
haplotypic regions can be hard to identify. As a result, haplotigs are
mislabeled as primary contigs, and overlaps are retained among
the primary contigs. The majority of these retained overlaps are
between homologous chromosomes, and the resulting duplication
harms downstream processes, such as scaffolding and gene annotation,
leading to incorrect results.
Tools such as purge_haplotigs (Roach et al., 2018) and
HaploMerger2 (Huang et al., 2017) have been designed to resolve
this problem. Purge_haplotigs makes use of both read depth and
sequence similarity to identify haplotigs. However, it does not identify
heterozygous overlaps, and requires users to specify read-depth
cutoffs manually. HaploMerger2 seeks to identify both haplotigs and
overlaps, but it ignores read depth and relies only on the alignment
of contigs to each other.
Here we describe a novel purging tool, purge_dups, to resolve
the haplotigs and overlaps in a primary assembly, using both
sequence similarity and read depth. Purge_dups is now being used
routinely in the Vertebrate Genomes Project assembly pipeline.
© The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Bioinformatics, 2020, 1–3
doi: 10.1093/bioinformatics/btaa025
Advance Access Publication Date: 23 January 2020
Applications Note
Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa025/5714742 by guest on 24 April 2020
2 Materials and methods
Given a primary assembly and long-read sequencing data, we apply
the following steps to identify haplotigs and overlaps. A more
detailed description of the methods is available in the
Supplementary Material.
1. We use minimap2 (Li, 2016) to map long-read sequencing data
onto the assembly and collect read depth at each base position in
the assembly. The software then uses the read-depth histogram
to select a cutoff to separate haploid from diploid coverage
depths, allowing for scenarios where the total assembly is
dominated by either haploid or diploid sequence.
2. We segment the input draft assembly into contigs by cutting at
blocks of ‘N’s, and use minimap2 to generate an all-by-all
self-alignment.
3. We next recognize and remove haplotigs in essentially the same
way as purge_haplotigs, and remove all matches associated with
haplotigs from the self-alignment set.
4. Finally, we chain consistent matches in the remainder to find
overlaps, then calculate the average coverage of the matching
intervals for each overlap. We mark an unambiguous overlap as
heterozygous when the average coverage on both contigs is less
than the read-depth cutoff found in step 1, and remove the
sequence corresponding to the matching interval in the shorter
contig.
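Steps 2 and 4 above can be illustrated with a minimal Python sketch. This is not the purge_dups implementation (which is written in C): the function names, the per-base depth lists and the 0-based, end-exclusive coordinates are assumptions made for illustration, and the chaining of matches and the check that an overlap is unambiguous are omitted.

```python
import re
from statistics import mean


def split_at_gaps(name, seq):
    """Step 2 (sketch): cut a scaffold into contigs at blocks of 'N'.

    Returns (contig_name, start, end, sequence) tuples, with 0-based,
    end-exclusive coordinates on the original scaffold.
    """
    contigs = []
    for i, m in enumerate(re.finditer(r"[^Nn]+", seq), start=1):
        contigs.append((f"{name}_{i}", m.start(), m.end(), m.group()))
    return contigs


def mean_depth(depth, start, end):
    """Average per-base read depth over the interval [start, end)."""
    return mean(depth[start:end])


def is_heterozygous_overlap(depth_a, span_a, depth_b, span_b, cutoff):
    """Step 4 (sketch): both matching intervals must average below the cutoff."""
    return (mean_depth(depth_a, *span_a) < cutoff
            and mean_depth(depth_b, *span_b) < cutoff)


def shorter_contig(len_a, len_b):
    """Pick the contig whose matching interval is removed (ties go to 'b',
    an arbitrary choice for this sketch)."""
    return "a" if len_a < len_b else "b"


# Hypothetical toy scaffold with two gaps.
print(split_at_gaps("scaf1", "ACGTNNNNGGCCNNTT"))
# [('scaf1_1', 0, 4, 'ACGT'), ('scaf1_2', 8, 12, 'GGCC'), ('scaf1_3', 14, 16, 'TT')]
```

In this sketch an overlap whose two matching intervals average 20x and 22x against a cutoff of 30x would be called heterozygous, whereas 20x and 45x would not, since the second interval looks like diploid (collapsed) coverage.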
3 Results and discussion
We evaluated the performance of purge_dups (v1.0.0) on four
Falcon-unzip primary assemblies: Arabidopsis thaliana (At) (Chin
et al., 2016), Anopheles coluzzii (Ac) (Kingan et al., 2019), grape
Vitis vinifera L. cv. Cabernet Sauvignon (Vv) and pinecone
soldierfish Myripristis murdjan (Mm), and compared our results to those
of purge_haplotigs (v1.0.4) and HaploMerger2. The expected genome
sizes and heterozygosities of these genomes calculated by
GenomeScope (Vurture et al., 2017) are given in Supplementary
Table S1, with heterozygosity ranging from 0.6% (Ac) to 1.6%
(Vv).
Fig. 1. K-mer comparison plots for draft and purge_dups Mm assemblies (k = 21).
The horizontal axis represents the copy number of k-mers in short reads from the
same sample, the vertical axis shows the number of distinct k-mers and the colored
lines denote k-mers that occur the given number of times in the assembly. (a)
The purple line shows 209.1 million two-copy k-mers accumulating in the haploid
and diploid areas, which correspond to duplicated haplotigs or overlaps in the
primary assembly. (b) Only 7.6 million two-copy k-mers remain after purging with
purge_dups. (Color version of this figure is available at Bioinformatics online.)
Table 1. BUSCO scores and assembly metrics
            BUSCO scores (%)                     Assembly size (Mb)   Num. contigs
Assembly    C      C(S)   C(D)   F      M
At-orig     98.1   91.9    6.2   0.3    1.6       140                  172
At-PH       97.7   96.0    1.7   0.6    1.7       123                  109
At-PD       97.8   96.7    1.1   0.6    1.6       121                   96
At-HM       96.8   95.6    1.2   0.6    2.6       122                  117
At-HMm      96.8   95.7    1.1   0.6    2.6       121                  102
Ac-orig     98.7   94.7    4.0   0.6    0.7       266                  372
Ac-PH       98.8   96.9    1.9   0.5    0.7       253                  224
Ac-PD       98.9   98.6    0.3   0.6    0.5       246                  192
Ac-HM       98.5   98.2    0.3   0.6    0.9       245                  223
Ac-HMm      98.6   98.4    0.2   0.6    0.8       246                  212
Vv-orig     92.2   79.8   12.4   1.5    6.3       591                  718
Vv-PH       92.1   88.1    4.0   1.6    6.3       457                  259
Vv-PD       91.9   89.9    2.0   1.9    6.2       452                  324
Vv-HM       NA     NA     NA     NA     NA        NA                   NA
Vv-HMm      91.8   89.9    1.9   1.8    6.4       458                  383
Mm-orig     95.8   79.0   16.8   2.0    2.2      1250                 1290
Mm-PH       94.5   89.1    5.4   2.4    3.1       888                  517
Mm-PD       94.4   90.9    3.5   2.7    2.9       838                  563
Mm-HM       94.6   91.3    3.3   2.5    2.9       850                  600
Mm-HMm      94.7   91.6    3.1   2.6    2.7       845                  443
Mm-origS    95.3   70.7   24.6   2.2    2.5      1252                  764
Mm-PHS      94.7   87.5    7.2   2.5    2.8       891                  221
Mm-PDS      94.8   91.2    3.6   2.7    2.5       840                  222
Mm-HMS      94.9   91.3    3.6   2.5    2.6       852                  343
Mm-HMmS     94.8   91.6    3.2   2.5    2.7       848                  365
C, complete genes; C(S), complete single-copy genes; C(D), complete duplicate genes; F, fragmented genes; M, missing genes; orig, Falcon-unzip; PH,
purge_haplotigs; PD, purge_dups; HM, HaploMerger2; HMm, HaploMerger2 with masking; PHS, PDS, HMS, HMmS: purge_haplotigs, purge_dups,
HaploMerger2 without and with repeat masking, respectively, after scaffolding and polishing. Values in bold indicate the best score of each type in each section. The
HaploMerger2 run without masking on Vv did not complete.
K-mer comparison analysis (Mapleson et al., 2017) shows that
purge_dups removes 96.4% of duplicated haploid-unique k-mers in
the Falcon-unzip assembly of Mm (Fig. 1). Comparable figures for
HaploMerger2 and purge_haplotigs are 95.7% and 81.2%,
respectively (Supplementary Fig. S1), and for At are 88.4%, 87.3% and
80.7%, respectively (Supplementary Fig. S2). Supplementary Figures
S3 and S4 show examples of regions where purge_dups removes
both contained and overlapping duplication, whereas
purge_haplotigs only removes fully contained duplication.
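As a quick arithmetic check, the removal percentage can be reproduced from the two-copy k-mer counts quoted in the caption of Fig. 1. The sketch below assumes that those two counts (209.1 million before and 7.6 million after purging) are the quantities behind the 96.4% figure for Mm; the helper name `percent_removed` is hypothetical.

```python
def percent_removed(before, after):
    """Percentage of items present before purging that are gone afterwards."""
    return 100.0 * (before - after) / before


# Two-copy k-mer counts (in millions) quoted in the Fig. 1 caption for Mm.
mm_before, mm_after = 209.1, 7.6
print(f"{percent_removed(mm_before, mm_after):.1f}%")  # prints 96.4%
```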
Table 1 presents assembly statistics for the four assemblies,
using Benchmarking Universal Single-Copy Orthologs
(BUSCOs) (Simão et al., 2015) to assess the consequences of purging
for gene set completeness and duplication. Results are given for the
original assemblies, purge_haplotigs, purge_dups and
HaploMerger2 (with and without repeat masking). All purging
methods remove a substantial amount of sequence from the primary
assembly and decrease BUSCO duplication. No single method
performs uniformly best across all assemblies and all metrics. However,
purge_haplotigs consistently leaves more duplicated sequence and
genes. For all assemblies other than Mm, purge_dups gives the
highest fraction of single-copy complete genes, and the lowest fraction of
missing genes. Although purge_dups has only a limited ability to
explicitly handle repeats, it does not exhibit signs of significant
overpurging.
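To read Table 1, it helps to recall that BUSCO categories are related by C = C(S) + C(D) and C + F + M = 100%. The following sketch checks these identities on a few rows copied from Table 1 and reports the drop in duplicated BUSCOs after purging; the dictionary layout, the helper names and the rounding tolerance are assumptions made for illustration.

```python
# Selected rows of Table 1: (C, C(S), C(D), F, M), all in percent.
busco = {
    "At-orig": (98.1, 91.9, 6.2, 0.3, 1.6),
    "At-PD": (97.8, 96.7, 1.1, 0.6, 1.6),
    "Mm-orig": (95.8, 79.0, 16.8, 2.0, 2.2),
    "Mm-PD": (94.4, 90.9, 3.5, 2.7, 2.9),
}


def is_consistent(c, cs, cd, f, m, tol=0.11):
    """Check C = C(S) + C(D) and C + F + M = 100 within rounding error."""
    return abs(c - (cs + cd)) <= tol and abs(c + f + m - 100.0) <= tol


def duplication_drop(before, after):
    """Reduction in C(D), in percentage points, between two table rows."""
    return round(busco[before][2] - busco[after][2], 1)


for name, row in busco.items():
    print(name, "consistent" if is_consistent(*row) else "inconsistent")
print("Mm C(D) drop after purge_dups:", duplication_drop("Mm-orig", "Mm-PD"))
```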
For Mm, we also had 10X Genomics linked-read data, and used
this for scaffolding using Scaff10X (https://github.com/wtsi-hpag/Scaff10X).
Following this with a round of polishing with Arrow
closed a number of gaps, reducing contig number further and
increasing contig N50. For the purge_haplotigs assembly, this
resulted in 221 scaffolds with N50 8.17 Mb, and the final contig
N50 3.48 Mb, whereas scaffolding the purge_dups assembly
generated 222 scaffolds with N50 23.68 Mb, and contig N50 increased
substantially from 2.63 Mb to 11.98 Mb. The nominal contiguity
was even greater for the scaffolded HaploMerger2 masked assembly
with scaffold N50 34.53 Mb, and contig N50 16.39 Mb. However,
when we further assessed the scaffolds with QUAST (Gurevich
et al., 2013), the purge_dups scaffolds had the highest NGA50
(characteristic length of material correctly aligned to the genome) of
16.73 Mb, while HaploMerger2 scaffolds only had 7.86 Mb
NGA50, with 126 scaffold misassemblies compared to 22 for
purge_dups (Supplementary Table S2).
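For readers less familiar with the contiguity metrics quoted above, the following is a minimal sketch of the standard N50 calculation. The function name and the toy lengths are hypothetical. Passing an expected genome size as `total` gives the NG50 variant; NGA50, as reported by QUAST, applies the same calculation to aligned blocks after contigs are broken at misassemblies, which this sketch does not attempt.

```python
def n50(lengths, total=None):
    """Return the length L such that sequences of length >= L together
    cover at least half of `total`.

    With total=None the sum of the lengths is used (N50). Supplying an
    expected genome size instead gives NG50. Returns 0 if the sequences
    never reach half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths (arbitrary units).
print(n50([10, 8, 6, 4, 2]))            # 8
print(n50([10, 8, 6, 4, 2], total=60))  # 2
```

The second call illustrates why a genome-size-based metric is stricter: the same contigs score lower when they cover only half of the expected genome.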
The improvements that purging makes to contiguity following
scaffolding indicate that divergent heterozygous overlaps can be a
significant barrier to scaffolding, and that it is important to remove
them as well as removing contained haplotigs. To our knowledge,
scaffolders that use long-range information, such as Scaff10X with
linked reads or SALSA with Hi-C data, do not handle heterozygous
overlaps. We therefore recommend applying purge_dups directly
after initial assembly, prior to scaffolding. Although HaploMerger2
can also link adjacent contigs using overlap information after
purging, our tests suggest that it makes false joins, perhaps because it
does not use read depth to distinguish haplotypic duplication from
repeat duplication.
In conclusion, purge_dups can significantly improve genome
assemblies by removing overlaps and haplotigs caused by sequence
divergence in heterozygous regions. This removes false
duplications in primary draft assemblies while retaining completeness and
sequence integrity, and can also improve scaffolding. It runs
autonomously without requiring user specification of cutoff thresholds,
allowing it to be included in an automated assembly pipeline.
Acknowledgements
We thank members of the Vertebrate Genomes Project assembly group for
input and advice, including Arang Rhie, Zemin Ning, William Chow, Ying
Yan, Adam Phillippy and Erich Jarvis. The Mm genome was sequenced at the
Sanger Institute as part of the Vertebrate Genomes Project; we thank
members of the Sanger Institute DNA pipelines group for generating the sequence
data and Byrappa Venkatesh for providing the sample. We also thank Jonas
Korlach, Mara Lawniczak, Haynes Heaton and Christine Lambert for
supplying raw data for Ac.
Funding
This work was supported by the National Key Research and Development
Program of China [2017YFC0907503, 2018YFC0910504 and
2017YFC1201201 to D.G. and Y.W.]; China Scholarship Council to D.G.;
Wellcome Trust [WT207492 to S.A.M. and R.D., and WT206194 to J.W.
and K.H.].
Conflict of Interest: R.D. is a consultant for Dovetail Inc. All other authors
declared no conflict of interest.
References
Chin, C.-S. et al. (2016) Phased diploid genome assembly with single-molecule
real-time sequencing. Nat. Methods, 13, 1050–1054.
Gurevich, A. et al. (2013) QUAST: quality assessment tool for genome
assemblies. Bioinformatics, 29, 1072–1075.
Huang, S. et al. (2017) HaploMerger2: rebuilding both haploid sub-assemblies
from high-heterozygosity diploid genome assembly. Bioinformatics, 33,
2577–2579.
Kingan, S.B. et al. (2019) A high-quality de novo genome assembly from a
single mosquito using PacBio sequencing. Genes, 10, 62.
Koren, S. et al. (2017) Canu: scalable and accurate long-read assembly via
adaptive k-mer weighting and repeat separation. Genome Res., 27,
722–736.
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for
noisy long sequences. Bioinformatics, 32, 2103–2110.
Mapleson, D. et al. (2017) KAT: a K-mer analysis toolkit to quality control
NGS datasets and genome assemblies. Bioinformatics, 33, 574–576.
Phillippy, A.M. (2017) New advances in sequence assembly. Genome Res., 27,
xi–xiii.
Roach, M.J. et al. (2018) Purge haplotigs: allelic contig reassignment
for third-gen diploid genome assemblies. BMC Bioinformatics, 19, 460.
Simão, F.A. et al. (2015) BUSCO: assessing genome assembly and
annotation completeness with single-copy orthologs. Bioinformatics, 31,
3210–3212.
Vurture, G.W. et al. (2017) GenomeScope: fast reference-free genome profiling
from short reads. Bioinformatics, 33, 2202–2204.
... www.nature.com/scientificdata/ −l 10000 -m 300 -f 0.8" 24 . The assembled genome size was approximately 1.16 Gb, with a contig N50 value of 7.28 Mb with Hifiasm. ...
Article
Full-text available
As an important economic and ecological fish, Amur minnow (Phoxinus lagowskii) plays a significant role in food products as well as evolutionary, ecological research. However, a high-quality chromosome-level genome of P. lagowskii is not currently available. In this study, we report a T2T (Telomere-to-telomere) genome for P. lagowskii with chromosome-level. The finally assembled genome size is 1.04 G, with a contig N50 of 41.7 Mb, comprising 25 chromosomes. The transposable elements constituted 512.40 Mb (49.22%) of the assembled P. lagowskii genome, with DNA transposons 25.02% being the predominant repeat type. A total of 2,4610 protein-coding genes were predicted in P. lagowskii genome, with 99.96% of these genes being functionally annotated. The identification of telomeres, BUSCO assessment, mapping coverage, and sequencing depth collectively demonstrated the high quality of the genome assembly. The T2T genomic information serves as an invaluable resource for studies in evolution, comparative genomics, fish breeding applications, and ecological research.
... Those two assemblies were then processed by PURGE_DUPS (Guan et al., 2020) to remove duplicate contigs and then merged by quickmerge (Chakraborty et al., 2016). The merged assembly was polished two times by NextPolish (Hu et al., 2020) with Illumina short-read. ...
Preprint
Recent genomic studies have suggested that hybridization may play a significant role in adaptive radiation, rapid speciation, and convergent evolution. The genus Oxera , a plant taxon thought to have diversified at its beginning through adaptive radiation in New Caledonia, provided an opportunity to investigate these processes. Within the robusta subclade of Oxera , characterized by bird-pollinated yellow-orange flowers, convergent evolution of flower shape is likely to have occurred. We aimed to elucidate the hybridization history of the robusta subclade by whole genome sequencing and MIG-seq data. Our analyses revealed an ancestral introgression from O. coriacea to O. sympatrica , whose flowers are remarkably similar to each other. Among the introgressed genomic regions, we identified several genes potentially involved in flower shape development. O. sympatrica and its sympatric sister species exhibit distinct flower shapes, and pollinator-mediated reproductive isolation presumed to be a major barrier between them. The ancestral introgression uncovered in this study may have driven the convergent evolution of flower shape in the robusta subclade and played a crucial role in the speciation process of O. sympatrica . These finding contribute to our understanding of the interplay between hybridization, adaptive radiation, and speciation process.
Article
Full-text available
Rare species are highly vulnerable to anthropogenic threats due to their unique life‐history traits and specialised adaptations. The Andean condor (Vultur gryphus), the world's largest soaring bird, exemplifies these challenges with exceptional flight efficiency, delayed maturity, long lifespan, extreme sexual dimorphism and a critical scavenging role. The species faces significant threats, including habitat loss, persecution and poisoning. Meanwhile, conservation efforts have been hindered by knowledge gaps, including limited genetic data. Herein, we present the first chromosome‐scale reference genome for the species, a key resource for investigating its evolution and ecology, as well as informing conservation measures. The assembly spans 1.19 Gb with 97.4% completeness, including 29 autosomes and the Z chromosome. High synteny with the California condor (Gymnogyps californianus) genome reflects their close evolutionary relationship. Genomic diversity in Andean condors (~0.65He/Kbp; π: 6.73e‐4) was lower than in California condors (~0.97 He/Kbp; π: 1.09e−3). Runs of Homozygosity (RoH) analyses revealed a smaller genomic proportion (~15%) with shorter elements in Andean condors (> 5 Mb covering 1.43% of the genome). In contrast, California condors showed a higher genomic proportion (~40%), with longer RoH segments (> 5 Mb covering 7.3% of the genome). Analyses of gene family evolution revealed divergent patterns of expansion and contraction between Andean and California condors, including genes linked to detoxification metabolism, high‐altitude adaptation and immune response. Shared genomic trends among avian scavengers highlight convergent evolution in stress response and metabolic pathways. This study provides a key genomic resource for advancing avian research and guiding conservation strategies for threatened vultures.
Article
Successfully breeding high-yield, lodging-resistant hybrid rice varieties is critical for ensuring food security. Two-line hybrid rice system plays an essential role in rice breeding, and 1892S, an important two-line sterile line, has contributed significantly to the development of over 100 hybrid rice varieties with superior agronomic traits, including lodging resistance. Despite its importance, a comprehensive understanding of the genomic basis underlying these traits in 1892S has been lacking due to the limitations of short-read sequencing technologies. To address this gap, we utilized advanced telomere-to-telomere (T2T) genome assembly techniques to generate a high-quality, gap-free genome of 1892S—the final genome comprises 12 complete chromosomes with 40,560 protein-coding genes. Comparative genomic analysis identified multiple known lodging resistance genes, including SD1, Sdt97, SBI, OsFBA2, APO1, and OsTB1, with unique allelic variations that may enhance resistance. The pan-genome analysis identified 2347 strain-specific genes in 1892S, further supporting its unique genetic advantages. This study represents the complete T2T genome assembly of a two-line sterile line and provides novel insights into the genetic foundation of lodging resistance in hybrid rice. This study highlights the genetic potential of 1892S in hybrid rice breeding and provides a model for the genomic analysis of other two-line sterile lines, offering valuable insights for improving in hybrid rice, including traits lodging resistance, yield stability, and adaptability, which are crucial for global food security.
Article
Full-text available
The Arunachali yak (Bos grunniens), a high-altitude ruminant endemic to India, is genetically and economically significant for pastoral communities. However, challenges such as population decline, inbreeding, and genetic dilution through crossbreeding threaten its conservation. We generated a high-quality reference genome using PacBio HiFi long-read sequencing, Bionano optical mapping, and Hi-C sequencing to elucidate its genomic features and adaptive mechanisms. The 2.85 Gb genome, assembled with hifiasm, achieved a contig N50 of 70.4 Mb and a scaffold N50 of 102.99 Mb, with high completeness validated by BUSCO analysis. Annotation identified 25,855 protein-coding genes, with 81.5% functionally characterized. Repeat analysis revealed 44.68% transposable elements, with LINEs (28.26%) being the most abundant. This comprehensive genomic resource can help provide crucial insights into high-altitude adaptation, disease resistance, and productivity traits, and could support genomic-assisted breeding, conservation, and sustainable management of the Arunachali yak.
Article
Full-text available
Pear (Pyrus L) is one of the most significant fruit crops globally, recognized for its substantial economic value and potential health benefits. ‘Danxiahong’ is an elite pear cultivar in the north of China, characterized by its flushed fruit skin and excellent inner quality. In this study, we utilized PacBio HiFi long reads, Hi-C reads and second-generation sequencing data to assemble the genome of ‘Danxiahong’. Two telomere-to-telomere gap-free and haplotype-resolved pear genomes were successfully assembled, with the sizes of 495.37 Mb and 501.60 Mb, and contig N50 of 28.97 Mb and 29.32 Mb. Approximately 62.50% and 62.76% repeat sequences were mapped to the 17 chromosomes for each haplotype. Gene annotations analysis identified a total of 39,936 genes in Hap1 and 39,707 genes in Hap2, respectively. The haplotype-resolved genome of ‘Danxiahong’ significantly contributes to the investigation of genes and molecular mechanisms related to fruit quality, while also facilitating the Multi-Omics analysis, such as comparative genomics, transcriptomics, proteomics, and allelic expression research.
Article
Full-text available
The family Scarabaeidae is one of the largest and most ecologically significant groups within the order Coleoptera, comprising over 35,000 described species. However, the limited availability of high-quality genome assemblies has hindered comprehensive studies on their ecology and evolutionary biology. In this study, we present a high-quality, chromosome-level genome assembly of Kibakoganea sinica, generated by integrating PacBio HiFi, Illumina, and Hi-C sequencing data. The final assembly spans 601.44 Mb, comprising 23 scaffolds (scaffold N50: 60.23 Mb) and 70 contigs (contig N50: 24.49 Mb), with 99.57% of the total assembly (598.84 Mb) successfully anchored to 10 chromosomes. BUSCO analysis (n = 1,367) indicates a high level of completeness, with 99.2% of genes detected: 95.2% as single-copy and 4.0% as duplicated. Repetitive elements account for 44.43% (267.24 Mb) of the genome, and a total of 12,940 protein-coding genes were predicted. This chromosome-scale genome assembly provides a foundational resource for future research into the biology and adaptation of Scarabaeidae.
Article
The soft tick family Argasidae contains vectors of medical and veterinary importance, but few molecular resources are available compared to hard ticks (Ixodidae). One example is Ornithodoros turicata, a recognized vector of Borrelia turicatae, causal agent of human relapsing fever, and a putative vector of African swine fever virus. To address the current lack of molecular resources for the Argasidae, we generated a chromosome-level genome assembly for O. turicata using PacBio sequencing in conjunction with an Illumina Hi-C library. The resulting reference genome has a total of 1.1 Gb in length and was assembled into 10 chromosomes and 368 unplaced scaffolds, an N50 of 2.7 Mb, and a QV score of 43.58. The orthology analysis indicated high-quality with a 97.8% BUSCO completeness score and 36,149 annotated genes. 33% of the genome was identified as repeats and masked. The generation of the first soft tick reference genome establishes a basis for future studies to utilize genomic tools to support vector-borne disease management.
Article
Full-text available
A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives, however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. 
This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
Article
Full-text available
Background Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high that haplotype homology is not recognised during assembly. This results in regional duplication rather than consolidation into allelic variants and can cause issues with downstream analysis, for example variant discovery, or haplotype reconstruction using the diploid assembly with unpaired allelic contigs. Results A new pipeline—Purge Haplotigs—was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, read alignments, and repeat annotations to identify allelic variants in the primary assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing, and compared with a similar tool. After processing with Purge Haplotigs, haploid assemblies were less duplicated with minimal impact on genome completeness, and diploid assemblies had more pairings of allelic contigs. Conclusions Purge Haplotigs improves the haploid and diploid representations of third-gen sequencing based genome assemblies by identifying and reassigning allelic contigs. The implementation is fast and scales well with large genomes, and it is less likely to over-purge repetitive or paralogous elements compared to alignment-only based methods. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence. Electronic supplementary material The online version of this article (10.1186/s12859-018-2485-7) contains supplementary material, which is available to authorized users.
Article
Full-text available
Article
Full-text available
De novo assembly is a difficult issue for heterozygous diploid genomes. The advent of high-throughput short-read and long-read sequencing technologies provides both new challenges and potential solutions to the issue. Here, we present HaploMerger2 (HM2), an automated pipeline for rebuilding both haploid sub-assemblies from the polymorphic diploid genome assembly. It is designed to work on pre-existing diploid assemblies, which are typically created by using de novo assemblers. HM2 can process any diploid assemblies, but it is especially suitable for diploid assemblies with high heterozygosity (≥3%), which can be difficult for other tools. This pipeline also implements flexible and sensitive assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method for haploid sub-assemblies. Using HM2, we demonstrate that two haploid sub-assemblies reconstructed from a real, highly-polymorphic diploid assembly show greatly improved continuity. Availability and Implementation: Source code, executables and the testing data set are freely available at https://github.com/mapleforest/HaploMerger2/releases/. Contact:hshengf2@mail.sysu.edu.cn Supplementary information : Supplementary data are available at Bioinformatics online.
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
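The contig NG50 quoted in this abstract is a length-weighted continuity statistic. As a minimal sketch (the function names `nx_length`, `n50` and `ng50` are illustrative and not taken from any of the tools listed here), N50 is the length of the contig at which the cumulative sum of contigs, taken longest first, reaches half of the total assembly length. NG50 uses half of an estimated genome size instead, so it is comparable across assemblies of different total size.

```python
from typing import Iterable


def nx_length(lengths: Iterable[int], target_total: int, fraction: float = 0.5) -> int:
    """Length of the contig at which the cumulative sum of contigs,
    taken longest first, reaches `fraction` of `target_total`.

    Returns 0 if the contigs never reach that threshold.
    """
    threshold = target_total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50: the threshold is half of the assembly's own total length."""
    lengths = list(lengths)
    return nx_length(lengths, sum(lengths))


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """NG50: the threshold is half of an (assumed) estimated genome size."""
    return nx_length(lengths, genome_size)


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # toy contig lengths
    print(n50(contigs), ng50(contigs, genome_size=400))
```

Because retained haplotypic duplication inflates the total assembly length, it can shift N50 without any real change in continuity, which is one reason NG50 against a fixed genome size is often the safer comparison.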
Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilised by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. Availability: KAT is available under the GPLv3 license at: https://github.com/TGAC/KAT. Contact: bernardo.clavijo@earlham.ac.uk. Supplementary information: Supplementary Information (SI) is available at Bioinformatics online. In addition, the software documentation is available online at: http://kat.readthedocs.io/en/latest/.
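The pairwise comparison of k-mers in reads and assembly described in this abstract can be illustrated with a toy sketch. This is not KAT's implementation or interface: the function names are hypothetical, everything is held in memory, and the inputs are plain Python strings. It counts canonical k-mers in the reads and in the assembly, then tallies how many distinct read k-mers fall into each (read frequency, assembly copy number) cell. An excess of k-mers that occur twice in the assembly where one copy was expected is the kind of signal that may point to retained haplotypic duplication.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form (the lexicographically smaller of a k-mer
    and its reverse complement) of every k-mer in seq, skipping k-mers
    that contain characters other than A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        yield min(kmer, revcomp)


def count_kmers(seqs, k):
    """Count canonical k-mers over an iterable of sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(canonical_kmers(seq, k))
    return counts


def spectrum_by_copy_number(reads, assembly, k):
    """Map (frequency in reads, copies in assembly) to the number of
    distinct read k-mers that fall into that cell."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    matrix = Counter()
    for kmer, freq in read_counts.items():
        matrix[(freq, asm_counts[kmer])] += 1  # missing k-mers count as 0
    return matrix


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC"]   # toy reads
    assembly = ["ACGT"]            # toy assembly
    for cell, n in sorted(spectrum_by_copy_number(reads, assembly, 3).items()):
        print(cell, n)
```

Real datasets need a dedicated k-mer counter. The in-memory `Counter` here is only meant to show the shape of the comparison.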
GenomeScope is an open-source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate, and repeat content from unprocessed short reads. These features are essential for studying genome evolution, and help to choose parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets with a wide range in genome sizes, heterozygosity levels, and error rates. Availability and implementation: http://genomescope.org, https://github.com/schatzlab/genomescope.git. Contact: mschatz@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
Motivation: Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore technologies (ONT) produce reads over 10 kbp in length, which have enabled high-quality genome assembly at an affordable cost. However, at present, long reads have an error rate as high as 10–15%. Complex and computationally intensive pipelines are required to assemble such reads. Results: We present a new mapper, minimap, and a de novo assembler, miniasm, for efficiently mapping and assembling SMRT and ONT reads without an error correction stage. They can often assemble a sequencing run of bacterial data into a single contig in a few minutes, and assemble 45-fold C. elegans data in 9 minutes, orders of magnitude faster than the existing pipelines, though the consensus sequence error rate is as high as raw reads. We also introduce a pairwise read mapping format (PAF) and a graphical fragment assembly format (GFA), and demonstrate the interoperability between ours and current tools. Availability and implementation: https://github.com/lh3/minimap and https://github.com/lh3/miniasm. Contact: hengli{at}broadinstitute.org
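The PAF format introduced in this abstract is a tab-separated text format with twelve mandatory columns followed by optional SAM-style `TAG:TYPE:VALUE` fields. The sketch below parses one such line. The `PafRecord` class, the `parse_paf_line` function and the example line are illustrative assumptions, not part of minimap or miniasm.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns plus optional tags.
    Coordinates are 0-based; start is inclusive, end is exclusive."""
    query_name: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    matches: int      # number of matching residues
    block_len: int    # alignment block length
    mapq: int         # mapping quality (255 means missing)
    tags: dict

    @property
    def identity(self) -> float:
        """Matching residues divided by alignment block length."""
        return self.matches / self.block_len if self.block_len else 0.0


def parse_paf_line(line: str) -> PafRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 PAF columns, got {len(fields)}")
    tags = {}
    for field in fields[12:]:
        name, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[name] = int(value)
        elif type_code == "f":
            tags[name] = float(value)
        else:
            tags[name] = value
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )


if __name__ == "__main__":
    # Hypothetical alignment line, made up for illustration.
    example = "read1\t1000\t10\t990\t+\tctg1\t50000\t2000\t2980\t900\t1000\t60\ttp:A:P\tNM:i:100"
    rec = parse_paf_line(example)
    print(rec.query_name, rec.target_name, rec.strand, rec.identity)
```

A purging tool that works from self-alignments of an assembly would typically read records like these and combine alignment coverage with read depth. The thresholds for doing so are tool-specific and are not shown here.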
Genomics has revolutionised biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50. We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO. Software implemented in Python and datasets available for download from http://busco.ezlab.org. Contact: Evgeny.Zdobnov@unige.ch. © The Author (2015). Published by Oxford University Press.