
A chromosome-level genome assembly of the spider mite Tetranychus piercei McGregor

SCIENTIFIC DATA | (2024) 11:340 | https://doi.org/10.1038/s41597-024-03189-0
www.nature.com/scientificdata
Lei Chen1,2, Xin-Yue Yu1,2, Feng Zhang1, Hua-Meng Zhang1, Li-Xue Guo1, Lu Ren1, Xiao-Yue Hong1 & Jing-Tao Sun1 ✉
Despite the rapid advances in sequencing technology, limited genomic resources are currently available
for phytophagous spider mites, which include many important agricultural pests. One of these pests is
Tetranychus piercei (McGregor), a serious banana pest in East Asia exhibiting remarkable tolerance to
high temperature. In this study, we assembled a high-quality genome of T. piercei using a combination
of PacBio long reads and Illumina short reads sequencing. With the assistance of chromatin
conformation capture technology, 99.9% of the contigs were anchored into three pseudochromosomes
with a total size of 86.02 Mb. Repetitive elements, accounting for 14.16% of this genome (12.20 Mb),
are predominantly composed of long-terminal repeats (30.7%). By combining evidence of ab initio
prediction, transcripts, and homologous proteins, we annotated 11,881 protein-coding genes. Both the
genome and proteins have high BUSCO completeness scores (>94%). This high-quality genome, along
with reliable annotation, provides a valuable resource for investigating the high-temperature tolerance
of this species and exploring the genomic basis that underlies the host range evolution of spider mites.
Background & Summary
Phytophagous spider mites in Tetranychidae comprise more than 1,300 species, many of which are serious agricultural pests, such as Tetranychus urticae Koch and Panonychus citri McGregor [1]. Spider mites are notorious for rapidly developing resistance to pesticides, causing significant economic losses since the widespread use of synthetic insecticides and fungicides after World War II [2]. The global acaricide market was valued at up to 400 million dollars annually [3]. In addition to their economic importance, spider mites exhibit diverse host ranges, from monophagous (e.g. Tetranychus lintearius) to extremely polyphagous (e.g. T. urticae) [4], making them an ideal system for exploring the mechanisms underlying the evolution of host range. The two-spotted spider mite (T. urticae) was the first chelicerate species to have its whole genome sequenced; the genome was obtained by Sanger sequencing and assembled at the scaffold level [5]. Subsequently, in 2019, the genome was further refined and anchored into three pseudochromosomes [6]. This genome provides important insights into the host adaptation and pesticide resistance evolution of T. urticae and suggests a possible link between its rapid development of pesticide resistance and its strong adaptive ability to host plants [5,7].
Tetranychus piercei (McGregor) is a major pest on banana (Musa spp.), papaya (Carica papaya) and other crops in East Asia [8–10]. It can also feed on plants such as mulberry (Morus alba), rose (Rosa sp.), peach (Prunus persica), sweetsop (Annona squamosa), cassava (Manihot esculenta) and fig (Ficus carica) [4]. Being the phylogenetically older sister species of T. urticae, T. piercei has a narrower host range and distinct host plant preferences [11,12]. Notably, T. piercei exhibits greater tolerance to high temperature than T. urticae, positioning it as a potential replacement for T. urticae as a major pest in the context of global warming [13]. However, limited information on its genetic resources hinders our understanding of its strong tolerance to high temperature and the evolution of the detoxification system in Tetranychinae.
In this study, we employed a combination of PacBio continuous long reads, accurate Illumina short reads and chromosomal conformation capture (Hi-C) data to assemble a chromosome-level genome of T. piercei, which includes 3 chromosomes and 11,881 protein-coding genes. Synteny analysis revealed dramatic chromosomal rearrangement between T. piercei and T. urticae. This high-quality genome will facilitate in-depth biological studies of T. piercei and enable exploration of the genomic basis underlying the host range evolution of spider mites.
1Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China. 2These authors
contributed equally: Lei Chen, Xin-Yue Yu. e-mail: jtsun@njau.edu.cn
DATA DESCRIPTOR
OPEN
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Methods
Raw material collection. At least 100 wild spider mites, including larvae, nymphs, and adults, were collected from Trachycarpus fortunei in Sanya, Hainan Province (18.29°N, 109.47°E), a tropical region in China. We amplified the nuclear ribosomal internal transcribed spacer (ITS) sequence to confirm the species identification of T. piercei [14]. An isofemale line was constructed by backcrossing a virgin female with her sons to minimize heterozygosity. The isofemale line was reared on potted soybeans at a population size of thousands for at least 20 generations.
DNA and RNA sequencing. Total genomic DNA was extracted from more than 5,000 adult females using the MagAttract® HMW DNA Kit. The PacBio 30 kb SMRTbell library was prepared from more than 5 μg of gDNA using the SMRTbell™ Prep Kit 2.0 (Pacific Biosciences). Sequencing was run in Continuous Long Read (CLR) mode on the Sequel IIe system and generated 6.44 Gbp of raw data (75-fold depth). The Illumina whole-genome library, a 350 bp-insert fragment library (150 bp paired-end), was prepared with the TruSeq DNA PCR-Free Kit and sequenced on an Illumina NovaSeq 6000 platform. High-throughput chromosome conformation capture (Hi-C) library preparation included cross-linking, HindIII restriction enzyme digestion, end repair, DNA cyclization, purification and capture. The Hi-C library, with an insert size of 300–700 bp, was sequenced on the NovaSeq 6000 platform, and 8.45 Gbp of reads were generated to scaffold chromosomes. Total RNA was extracted from 100 adult females feeding on common beans using TRIzol Reagent (Invitrogen, USA) according to the manufacturer’s instructions. The RNA library was constructed using the VAHTS mRNA-seq v2 Library Prep Kit (Vazyme, Nanjing, China) and sequenced on an Illumina NovaSeq 6000 platform. In total, we generated 6.44 Gb (75×) of PacBio long reads, 13.10 Gb (152×) of Illumina short reads, 8.45 Gb (98×) of Hi-C reads, and 12.62 Gb of transcriptome reads for our genome assembly.
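The fold-depth figures quoted above follow from dividing each sequencing yield by the genome size. The short Python sketch below reproduces that arithmetic; it assumes the 86.02 Mb chromosome-level assembly as the denominator, which the text does not state explicitly, and the function and variable names are illustrative only.

```python
# Sequencing yields reported in the text, in gigabases (Gb).
YIELD_GB = {"PacBio CLR": 6.44, "Illumina": 13.10, "Hi-C": 8.45}

# Assumption: depth is computed against the final 86.02 Mb assembly.
GENOME_MB = 86.02


def fold_depth(yield_gb: float, genome_mb: float = GENOME_MB) -> float:
    """Return sequencing depth as total sequenced bases divided by genome size."""
    return (yield_gb * 1000.0) / genome_mb


# Round to whole-fold values, as the depths are quoted in the text.
depths = {name: round(fold_depth(gb)) for name, gb in YIELD_GB.items()}
for name, depth in depths.items():
    print(f"{name}: about {depth}x")
```

Under this assumption the rounded values match the 75×, 152× and 98× given for the PacBio, Illumina and Hi-C data.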
Genome survey. Duplicate and low-quality Illumina raw reads (base quality < Q20, length < 15 bp, polymer A/G/C > 10 bp) were trimmed and removed using the BBTools package v38.82 [15]. The 21-mer depth distribution was counted using the khist.sh script of BBTools v38.82. GenomeScope v2.0 [16] was used to estimate the genome size and heterozygosity of T. piercei with the maximum k-mer coverage set to 1,000×. Based on the distribution of k-mer coverage and frequency, the estimated genome size of T. piercei was 86.45 Mb, with a heterozygosity rate of around 0.001% and a repeat content of approximately 14.6% (Fig. 1).
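To illustrate why a k-mer depth histogram carries genome-size information, the sketch below applies the simplest possible estimator: total k-mer occurrences divided by the depth of the main coverage peak. This is not GenomeScope's method, which fits a mixture model to the whole histogram; the histogram values, the error cutoff and the function name are hypothetical.

```python
def estimate_genome_size(hist: dict[int, int], min_depth: int = 5) -> float:
    """Estimate haploid genome size from a k-mer depth histogram.

    hist maps k-mer depth -> number of distinct k-mers observed at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    # The most populated depth is taken as the single-copy coverage peak.
    peak_depth = max(solid, key=solid.get)
    # Every k-mer occurrence contributes depth * count to the total.
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers, not the T. piercei data):
# error k-mers at depth 1-2, a single-copy peak near 50, two-copy k-mers at 100.
toy_hist = {1: 900, 2: 300, 48: 100, 49: 400, 50: 1000, 51: 400, 52: 100, 100: 50}
size = estimate_genome_size(toy_hist)
print(f"estimated size: {size:.0f} k-mer positions")
```

In the toy data, 2,000 single-copy k-mers plus 50 two-copy k-mers give 2,100 genomic positions, which is what the ratio recovers.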
Fig. 1 Genome survey at 21-mer of T. piercei estimated by GenomeScope. The vertical dotted lines represent the peaks of different coverages for the heterozygous, the homozygous, and the duplicated sequences (the last two peaks) separately.
Chromosome staining. Unfertilized (haploid) and fertilized (diploid) eggs of T. piercei, laid within 8 hours, were collected in a centrifuge tube containing phosphate buffered saline (PBS; 0.85% NaCl, 1.4 mM KH2PO4, 8 mM Na2HPO4, pH 7.1). The PBS was then discarded, and 500 μL of 50% sodium hypochlorite solution was added and left to stand for 2–3 minutes. After discarding the sodium hypochlorite solution, a mixture of hexane and methanol (1:1) was added to the centrifuge tube, and the contents were vigorously shaken for 3 min to remove the chorion. The eggs were rehydrated through an ethanol series (95, 70, 50 and 35%) and then washed in PBT (PBS containing 0.1% Triton X-100) for 15 min. After being washed five times for 1 min each in PBS, the eggs were incubated with a fluorescence quenching agent containing DAPI at room temperature for 5 min. Subsequently, the eggs were mounted on slides and covered with coverslips for microscopic investigation using a Leica TCS SP8 confocal microscope. The egg DNA staining of diploid females and haploid males (Fig. 2a,b) consistently indicates that the genome of T. piercei consists of three chromosomes.
Genome assembly. The CLR reads were used as input to Flye v2.9 and Raven v1.6.0 to assemble continuous long reads [17,18]. The Flye assembly, which had a greater N50 length and completeness, was retained as the primary assembly. One round of built-in long-read polishing was performed by Flye v2.9. Then, two rounds of short-read polishing were performed with NextPolish v1.4.0 [19] to correct errors and fill gaps in the primary assembly. Haplotigs and duplications caused by haplotype divergence were eliminated by Purge_dups v1.2.5 using the alignment program Minimap2 v2.23 [20,21]. Hi-C reads were aligned to the purged genome using BWA v0.7.17 [22] to anchor, order and orient contigs into a chromosomal assembly following the 3D-DNA pipeline [23]. Then, we manually reviewed and corrected assembly errors using Juicebox v1.11.08 [24]. Vector contaminants were checked against the UniVec database using BLAST+ v2.11.0 with the VecScreen parameters [25]. Bacterial and human contaminants were detected by searching the assembly against the Nt database using MMseqs2 v13-45111 [26]. The completeness of the genome assembly was evaluated by BUSCO v5.2.2 [27] using the arachnida_odb10 dataset (creation date 2020-08-05). The reads from whole-genome sequencing were aligned back to the genome assembly to assess the mapping rate.
Aer de novo assembly, polishing and purging, 56 contigs (N50 of 3.33 Mb) with a total length of 86.13 Mb
were generated, accounting for 99.63% of the estimated genome. By combining Hi-C with high-throughput
sequencing and manual adjustment, we anchored these contigs into 4 scaolds, including 3 megascaolds and
1 unplaced scaold (Fig.2c). is unplaced scaold was identied as bacterial contamination by NCBI and
was subsequently excluded. Finally, a total length of 86.02 Mb contigs was assigned to 3 pseudochromosomes
(Fig.3), with scaold N50 of 29.25 Mb (Table1). e GC content of the T. piercei genome was 32.17%, which is
similar to that of T. urticae (32.25%).
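The assembly statistics in this paragraph can be recomputed from the pseudochromosome lengths listed in Table 1. The following Python sketch does so; it considers only the three pseudochromosomes (the excluded unplaced scaffold is ignored), and the helper name n50 is illustrative rather than taken from any assembly tool.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Pseudochromosome lengths of T. piercei in bp, from Table 1.
CHR_BP = {"Chr1": 26_633_653, "Chr2": 30_138_663, "Chr3": 29_245_017}

total_bp = sum(CHR_BP.values())
scaffold_n50 = n50(list(CHR_BP.values()))

# Share of the k-mer based genome size estimate (86.45 Mb)
# covered by the 86.13 Mb of assembled contigs.
fraction_of_estimate = 86.13 / 86.45

print(f"total: {total_bp / 1e6:.2f} Mb")
print(f"scaffold N50: {scaffold_n50 / 1e6:.2f} Mb")
print(f"contigs / estimate: {fraction_of_estimate:.2%}")
```

The three lengths sum to 86.02 Mb, the N50 falls on Chr3 (29.25 Mb), and the contig total corresponds to 99.63% of the estimated genome size, in agreement with the values reported above.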
Fig. 2 Chromosome staining, Hi-C heatmap and synteny. (a) DAPI staining of chromosomes in a diploid female egg of T. piercei. The blue signals represent condensed chromosomal regions, while the red dashed lines in the model panel represent simulated chromosome boundaries. (b) DAPI staining of chromosomes in a haploid male egg. (c) Genome-wide chromosomal interaction heatmap generated in Hi-C interaction analysis with each chromosome in a blue box. The frequency of Hi-C interaction links is represented by the color, which ranges from white (low) to red (high). (d) Synteny dot plot based on homologous proteins between T. piercei and T. urticae.
Genome annotation. The repetitive elements were identified using RepeatModeler v2.0.2, which discovered complete long terminal repeats (LTR) through the integration of LTRharvest and LTR_retriever [28]. RepeatMasker v4.1.2p1 and RMBlast v2.11.0 were used to search against the custom repeat library of Dfam 3.5 and Repbase v20181026 to soft-mask repeats in the genome assembly [29–31]. Two ab initio gene prediction programs, BRAKER2 v2.1.6 and GeMoMa v1.8 [32,33], were used to predict protein-coding gene structures based on the masked genome. Transcriptome alignments produced by HISAT2 v2.2.0 and homologous proteins from five species (Daphnia magna, Dermacentor silvarum, Drosophila melanogaster, T. urticae, and Varroa destructor) were provided to assist gene prediction by BRAKER2 and GeMoMa. Genome-guided transcript assembly was performed by StringTie v2.1.6, and the results were used as mRNA evidence for MAKER2 v3.01.03 [34,35]. The same homologous proteins were also supplied to MAKER2 as protein evidence. Finally, MAKER2 combined ab initio prediction, mRNA and homologous-protein evidence to generate gene models, with direct predictions from transcripts and proteins not allowed. Proteins shorter than 30 aa were discarded. For functional annotation (Table S1), predicted protein sequences were searched against the UniProt, InterProScan and eggNOG databases. Diamond v2.0.11 [36] in ‘very sensitive’ mode was used to assign gene functions from the best hits in the UniProt database. Protein domains, Gene Ontology terms and pathways were annotated from Pfam, SMART, Superfamily and CDD using InterProScan 5 [37]. Complementary annotations of KO terms and KEGG categories were obtained using eggNOG-mapper v2.1.5 against the eggNOG 5.0 database [38,39].
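The length filter mentioned above (proteins shorter than 30 aa were discarded) is simple to express in code. The sketch below is one possible way to apply it to a protein FASTA file using only the Python standard library; it is not the authors' script, the record names are hypothetical, and ignoring a trailing stop symbol when counting residues is an assumption.

```python
from io import StringIO

MIN_AA = 30  # threshold stated in the text


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_proteins(records, min_aa=MIN_AA):
    """Keep proteins with at least min_aa residues.

    Assumption: a trailing '*' stop symbol is not counted as a residue.
    """
    return [(h, s) for h, s in records if len(s.rstrip("*")) >= min_aa]


# Hypothetical two-record example: geneA has 45 residues, geneB has 3.
toy = StringIO(">geneA\n" + "M" * 45 + "\n>geneB\nMKV*\n")
kept = keep_long_proteins(read_fasta(toy))
print([header for header, _ in kept])
```

For a real annotation, the StringIO object would be replaced by an open file handle on the predicted protein set.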
To explore chromosomal synteny between T. piercei and T. urticae [6], we conducted a homology search between their respective protein sequences using Diamond v2.0.11 [36] with default parameters. The resulting homology dot plot was visualized using WGDI v0.6.5 [40] to detect collinearity, inversions and translocations.
Data Records
The raw reads and genome assembly have been deposited in the NCBI databases under BioProject PRJNA833563. The PacBio, Illumina, Hi-C, and transcriptome data are available under identification numbers SRR23622209–SRR23622212 [41]. The final chromosome assembly has been deposited at GenBank under the accession number GCA_036759885.1 [42]. The final chromosome-level genome assembly, annotation, and protein sequences are available at the Figshare database (https://doi.org/10.6084/m9.figshare.22215145) [43].
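For readers who want to fetch the raw reads, the run accession range given above can be expanded into individual identifiers. The helper below is a minimal sketch with a hypothetical function name; it deliberately does not assign data types to accessions, because the text does not say which run holds which library.

```python
def expand_run_range(first: str, last: str) -> list[str]:
    """Expand an inclusive accession range such as SRR0001-SRR0004.

    Both accessions must share the same three-letter prefix and length.
    """
    prefix = first[:3]
    if last[:3] != prefix or len(first) != len(last):
        raise ValueError("accessions must share prefix and length")
    start, end = int(first[3:]), int(last[3:])
    if end < start:
        raise ValueError("range end precedes range start")
    width = len(first) - 3  # keep any zero padding intact
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


runs = expand_run_range("SRR23622209", "SRR23622212")
print(runs)
```

The range covers four runs, consistent with the four data types listed (PacBio, Illumina, Hi-C and transcriptome).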
Technical Validation
Evaluation of the genome assembly. Compared to T. urticae [6], our genome assembly exhibits greater contiguity and completeness, attributable to the use of long reads and Hi-C sequencing. The contig number and contig N50 of T. piercei are much better than those of the two-spotted spider mite (56 vs. 1,996; 3.33 Mb vs. 0.21 Mb; Table 1). We mapped whole-genome resequencing reads to the T. piercei genome and found that 92.5% of PacBio long reads and 96.51% of Illumina short reads could be well aligned. Complete benchmarking universal single-copy orthologs (BUSCOs) under genome mode were used to assess the genome completeness. A total of 94.6% (2,775/2,934) complete BUSCOs were identified, comprising 89.0% (2,612) single-copy and 5.6% (163) duplicated BUSCOs; a further 0.7% (22) of BUSCOs were fragmented and 4.7% (137) were missing.
Fig. 3 Circular karyotype representation of the chromosomes in non-overlapping windows of 100 kb. Tracks from inside to outside are GC content, gene density, Gypsy density, Copia density and DNA transposon density.
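The BUSCO percentages quoted above can be checked directly from the reported counts. The sketch below redoes that arithmetic in Python; the dictionary keys are illustrative labels, not BUSCO output fields.

```python
# BUSCO counts reported in the text (arachnida_odb10, genome mode).
TOTAL = 2934
counts = {"single": 2612, "duplicated": 163, "fragmented": 22, "missing": 137}

# Complete BUSCOs are the single-copy plus the duplicated ones.
complete = counts["single"] + counts["duplicated"]

# The four categories must partition the full BUSCO set.
assert sum(counts.values()) == TOTAL

pct = {name: round(100 * n / TOTAL, 1) for name, n in counts.items()}
pct["complete"] = round(100 * complete / TOTAL, 1)

for name, value in pct.items():
    print(f"{name}: {value}%")
```

The counts partition the 2,934 BUSCOs exactly, and the rounded percentages agree with the figures given in the text.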
Repeat elements and protein-coding genes. A total of 12.20 Mb of repetitive elements were identified, accounting for 14.16% of the genome (Fig. 3). Besides unclassified repeats (4.67%), long-terminal repeats (LTR) represented the most common repeat element (4.36%), followed by simple repeats (2.04%), DNA transposons (1.67%), long-interspersed elements (LINE, 0.68%), and others (0.75%).
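The repeat composition above lends itself to a quick consistency check. The sketch below sums the per-class percentages, expresses LTRs as a share of all repeats, and converts the 14.16% total into megabases; the choice of the 86.13 Mb contig length as the denominator is an assumption made here because it reproduces the reported 12.20 Mb.

```python
# Genome-wide percentage of each repeat class, as listed in the text.
REPEAT_PCT = {
    "unclassified": 4.67,
    "LTR": 4.36,
    "simple repeats": 2.04,
    "DNA transposons": 1.67,
    "LINE": 0.68,
    "others": 0.75,
}
REPORTED_TOTAL_PCT = 14.16

# Assumption: percentages are relative to the 86.13 Mb of assembled contigs.
CONTIG_LENGTH_MB = 86.13

class_sum = sum(REPEAT_PCT.values())
ltr_share = 100 * REPEAT_PCT["LTR"] / REPORTED_TOTAL_PCT
repeat_mb = REPORTED_TOTAL_PCT / 100 * CONTIG_LENGTH_MB

print(f"sum of classes: {class_sum:.2f}% of the genome")
print(f"LTR share of all repeats: {ltr_share:.1f}%")
print(f"repeat content: {repeat_mb:.2f} Mb")
```

The classes add up to about 14.17%, within rounding of the reported 14.16%, and the LTR share of roughly 30.8% is close to the 30.7% quoted in the abstract; the small differences are what one would expect when working from rounded percentages.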
Using multiple lines of evidence, we annotated 11,881 protein-coding genes for T. piercei (Table 1). Most genes (>97%) had ‘annotation edit distance’ (AED) scores smaller than 0.5, indicating strong support from transcript and homologous protein evidence. Under protein mode, the complete BUSCOs for our genome annotation were 2,773 (94.5%). Although T. urticae had more coding genes than T. piercei (19,102 vs. 11,881), the BUSCO results suggested that the T. piercei annotation had higher completeness (94.5% vs. 90.4%; Table 1). Based on the synteny of homologous proteins, we found that the chromosomes of the two species underwent dramatic inversion and translocation events (Fig. 2d). More than half of Chr3 in T. piercei is syntenic to fragments of Chr1 and Chr2 in T. urticae.
Code availability
All commands and pipelines were executed following the manuals and protocols of the corresponding bioinformatic software. The versions and parameters of the software have been detailed in the Methods section.
Received: 6 December 2023; Accepted: 25 March 2024;
Published: xx xx xxxx
References
1. Helle, W. & Sabelis, M. W. Spider Mites: Their Biology, Natural Enemies and Control. 1A (Elsevier, Amsterdam, 1985).
2. Walter, D. E. & Proctor, H. C. Mites: Ecology, Evolution and Behaviour. (Springer, Dordrecht, 2013).
3. Van Leeuwen, T., Tirry, L., Yamamoto, A., Nauen, R. & Dermauw, W. The economic importance of acaricides in the control of phytophagous mites and an update on recent acaricide mode of action research. Pestic. Biochem. Physiol. 121, 12–21 (2015).
4. Migeon, A., Nouguier, E. & Dorkeld, F. Spider Mites Web: a comprehensive database for the Tetranychidae. Trends in Acarology 557–560 (2010).
5. Grbić, M. et al. The genome of Tetranychus urticae reveals herbivorous pest adaptations. Nature 479, 487–492 (2011).
6. Wybouw, N. et al. Long-Term Population Studies Uncover the Genome Structure and Genetic Basis of Xenobiotic and Host Plant Adaptation in the Herbivore Tetranychus urticae. Genetics 211, 1409–1427 (2019).
7. Dermauw, W. et al. A link between host plant adaptation and pesticide resistance in the polyphagous spider mite Tetranychus urticae. Proc. Natl. Acad. Sci. USA 110, E113–E122 (2013).
8. Fu, Y., Zhang, F., Peng, Z., Liu, K. & Jin, Q. The effects of temperature on the development and reproduction of Tetranychus piercei McGregor (Acari: Tetranychidae) in banana. Syst. Appl. Acarol. 7, 69 (2002).
9. Ullah, M. S., Gotoh, T. & Lim, U. T. Life history parameters of three phytophagous spider mites, Tetranychus piercei, T. truncatus and T. bambusae (Acari: Tetranychidae). J. Asia-Pac. Entomol. 17, 767–773 (2014).
10. Ohno, S. et al. Non-crop host plants of Tetranychus spider mites (Acari: Tetranychidae) in the field in Okinawa, Japan: Determination of possible sources of pest species and inference on the cause of peculiar mite fauna on crops. Appl. Entomol. Zool. 45, 465–475 (2010).
Table 1. Statistics for the assembly and annotation of Tetranychus genomes.
Metric                        Tetranychus piercei    Tetranychus urticae
Assembly
  Length of contigs           86.13 Mb               90.83 Mb
  Length of chromosomes       86.02 Mb               85.75 Mb
  Contig N50                  3.33 Mb                0.21 Mb
  Scaffold N50                29.25 Mb               29.22 Mb
  Contigs number              56                     1,996
  Scaffolds number            4                      601
  Pseudochromosomes number    3                      3
  Length of Chr1              26,633,653 bp          32,654,540 bp
  Length of Chr2              30,138,663 bp          29,215,314 bp
  Length of Chr3              29,245,017 bp          23,884,949 bp
  GC%                         32.17                  32.25
  BUSCO completeness          94.6%                  94.6%
Annotation
  No. of protein genes        11,881                 19,102
  Mean protein length         494.7 aa               363 aa
  Exons per gene              3.35                   3.53
  Mean exon length            459.6 bp               362.5 bp
  Introns per gene            2.46                   2.87
  Mean intron length          603.0 bp               501.7 bp
  BUSCO completeness          94.5%                  90.4%
11. Hu, Q.-Q. et al. Phylogenetic-Related Divergence in Perceiving Suitable Host Plants among Five Spider Mites Species (Acari: Tetranychidae). Insects 13, 705 (2022).
12. Matsuda, T., Kozaki, T., Ishii, K. & Gotoh, T. Phylogeny of the spider mite sub-family Tetranychinae (Acari: Tetranychidae) inferred from RNA-Seq data. PLoS ONE 13, e0203136 (2018).
13. Gotoh, T., Moriya, D. & Nachman, G. Development and reproduction of five Tetranychus species (Acari: Tetranychidae): Do they all have the potential to become major pests? Exp. Appl. Acarol. 66, 453–479 (2015).
14. Ge, C., Ding, X.-L., Zhang, J.-P. & Hong, X.-Y. Tetranychus urticae (green form) on Gossypium hirsutum in China: two records confirmed by aedeagus morphology and RFLP analysis. Syst. Appl. Acarol. 18, 239–245 (2013).
15. Bushnell, B., Rood, J. & Singer, E. BBMerge – Accurate paired shotgun read merging via overlap. PLoS ONE 12, e0185056 (2017).
16. Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204 (2017).
17. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
18. Vaser, R. & Šikić, M. Time- and memory-efficient genome assembly with Raven. Nat. Comput. Sci. 1, 332–336 (2021).
19. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
20. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
21. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
22. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
23. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
24. Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 3, 99–101 (2016).
25. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
26. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
27. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
28. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. 117, 9451–9457 (2020).
29. Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Curr. Protoc. Bioinforma. 25, (2009).
30. Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA 12, 2 (2021).
31. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
32. Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics Bioinforma. 3, lqaa108 (2021).
33. Keilwagen, J., Hartung, F., Paulini, M., Twardziok, S. O. & Grau, J. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 19, 189 (2018).
34. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
35. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
36. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
37. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
38. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
39. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
40. Sun, P. et al. WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant 15, 1841–1851 (2022).
41. NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRP424604 (2023).
42. Sun, J.-T. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_036759885.1 (2024).
43. Chen, L., Zhang, F. & Sun, J.-T. The genome assembly and annotation of Tetranychus piercei, figshare, https://doi.org/10.6084/m9.figshare.22215145.v5 (2023).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. U2003112,
32020103011, 32202290) and the Natural Science Foundation of Jiangsu Province (Grant No. BK20221003).
Author contributions
J.T.S., F.Z. and L.C. designed the research. X.Y.Y. contributed to the sampling and sequencing. F.Z. assembled and annotated the genome. L.C., X.Y.Y., H.M.Z. and L.X.G. analyzed the data and drew the figures. L.R. performed chromosome staining. L.C. and J.T.S. wrote the draft manuscript, and X.Y.H. improved the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41597-024-03189-0.
Correspondence and requests for materials should be addressed to J.-T.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2024
Content courtesy of Springer Nature, terms of use apply. Rights reserved
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
Preprint
Technological advances have propelled DNA sequencing of non-model organisms, making sequencing more accessible and cost effective, which has also increased the availability of raw data in public repositories. However, contamination is a significant concern, and the use and reuse of sequencing data requires quality control and curation. A reference genome for the Australian native rainforest tree Rhodamnia argentea Benth. (malletwood) was assembled from Oxford Nanopore Technologies (ONT) long-reads, 10x Genomics Chromium linked-reads, and Hi-C data (N50 = 32.3 Mbp and BUSCO completeness 98.0%) with 99.0% of the 347 Mbp assembly anchored to 11 chromosomes (2n = 22). The R. argentea genome will inform conservation efforts for Myrtaceae species threatened by the global spread of the fungal disease myrtle rust. We observed contamination in the sequencing data and further investigation revealed an arthropod source. Here, we demonstrate the feasibility of assembling a high-quality gapless telomere-to-telomere mite genome using contaminated host plant sequencing data. The mite exhibits genome streamlining and has a 35 Mbp genome (68.6% BUSCO completeness) on two chromosomes, capped with a novel TTTGG telomere sequence. Phylogenomic analysis suggests that it is a previously unsequenced eriophyoid mite. Despite its unknown identity, this complete nuclear genome provides a valuable resource to investigate invertebrate genome reduction. This study emphasises the importance of checking sequencing data for contamination, especially when working with non-model organisms. It also enhances our understanding of two species, including a tree that faces substantial conservation challenges, contributing to broader biodiversity initiatives.
Significance: The genomes of Rhodamnia argentea and an associated eriophyoid mite, which contaminated the tree raw sequencing data, were assembled for the first time. We generated valuable chromosome-level genomic resources for the conservation of myrtle rust impacted tree species, pest genomics, and understanding genome streamlining. The research underscores the growing prevalence of sequencing experiments in non-model organisms while emphasising the importance of quality control and curation of sequencing data.
Article
Evidence of whole-genome duplications (WGDs) and subsequent karyotype changes has been detected in most major lineages of life on Earth. To clarify the complex resulting multiple-layered patterns of gene collinearity in genome analyses there is a need for convenient and accurate toolkits. To meet this need, we introduce here WGDI (Whole-Genome Duplication Integrated analysis), a Python-based command-line tool that facilitates comprehensive analysis of recursive polyploidization events and cross-species genome alignments. WGDI supports three main workflows (polyploid inference, hierarchical inference of genomic homology, and ancestral chromosome karyotyping) that can improve detection of WGD and characterization of related events based on high-quality chromosome-level genomes. It can extract complete synteny blocks and facilitates reconstruction of the detailed karyotypic evolution. This toolkit is freely available at GitHub (https://github.com/SunPengChuan/wgdi). For an illustrative example of its application, WGDI convincingly clarified karyotype evolution in Aquilegia coerulea and Vitis vinifera following WGDs and rejected the hypothesis that Aquilegia contributed as a parental lineage to the allopolyploid origin of core dicots.
Article
Simple Summary: Many spider mites are important agricultural pests in both fields and greenhouses worldwide and are diversified in their host plant range. How spider mites perceive their suitable host plants remains not completely clear. In this study, we found that spider mites cannot locate suitable host plants by volatile odours from a long distance, but they can use olfactory sensation in combination with gustatory sensation to make a precise selection for suitable host plants at a short distance. Highly polyphagous species showed strong sensitivity in sensing suitable host plants rather than the expected lowered sensitivity. We also found that the similarity among the five spider mite species in their performance in perceiving suitable host plants was highly correlated with their relative phylogenetic relationship.
Abstract: Spider mites belonging to the genus Tetranychus infest many important agricultural crops in both fields and greenhouses worldwide and are diversified in their host plant range. How spider mites perceive their suitable host plants remains not completely clear. Here, through two-host-choice designs (bean vs. tomato, and bean vs. eggplant), we tested the efficacies of the olfactory and gustatory systems of five spider mite species (T. urticae, T. truncatus, T. pueraricola, T. piercei, and T. evansi), which differ in host plant range in sensing their suitable host plant, by Y-tube olfactometer and two-choice disc experiments. We found that spider mites cannot locate their suitable host plants by volatile odours from a long distance, but they can use olfactory sensation in combination with gustatory sensation to select suitable host plants at a short distance. Highly polyphagous species displayed strong sensitivity in sensing suitable host plants rather than the lowered sensitivity we expected. Intriguingly, our principal component analyses (PCAs) showed that the similarity among five spider mite species in the performance of perceiving suitable host plants was highly correlated with their relative phylogenetic relationships, suggesting a close relationship between the chemosensing system and the speciation of spider mites. Our results highlight the necessity of further work on the chemosensing system in relation to host plant range and speciation of spider mites.
Article
Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: (i) de novo gene prediction from raw contigs, (ii) built-in pairwise orthology prediction, (iii) fast protein domain discovery, and (iv) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.
Article
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate dataset for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly-introduced genome workflow increases the efficiency and runtimes especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.
Article
Whole genome sequencing technologies are unable to invariably read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods for the improvement of de novo genome assembly from erroneous long reads incorporated into a tool called Raven. Raven maintains similar performance for various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.
Article
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP. An updated version of DIAMOND uses improved algorithmic procedures and a customized high-performance computing framework to make seemingly prohibitive large-scale protein sequence alignments feasible.
Article
Dfam is an open access database of repetitive DNA families, sequence models, and genome annotations. The 3.0–3.3 releases of Dfam (https://dfam.org) represent an evolution from a proof-of-principle collection of transposable element families in model organisms into a community resource for a broad range of species, and for both curated and uncurated datasets. In addition, releases since Dfam 3.0 provide auxiliary consensus sequence models, transposable element protein alignments, and a formalized classification system to support the growing diversity of organisms represented in the resource. The latest release includes 266,740 new de novo generated transposable element families from 336 species contributed by the EBI. This expansion demonstrates the utility of many of Dfam’s new features and provides insight into the long term challenges ahead for improving de novo generated transposable element datasets.
Article
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was a generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It is favorably compared under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.
Article
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
Article
Motivation: Rapid development in long read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either only focus on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.
Results: Here we present a novel tool "purge_dups" that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.
Availability: The source code is written in C and is available at https://github.com/dfguan/purge_dups.
Supplementary information: Supplementary data are available at Bioinformatics online.