Abstract

Eriophyoidea is a highly diverse superfamily of herbivorous mites in the Acariformes, with over 5,000 named species distributed worldwide. However, the lack of a chromosome-level genome has limited our understanding of evolution in this group. Here, we report the first chromosome-level genome assembly of Setoptus koraiensis, generated using Illumina, PacBio, and Hi-C sequencing technologies. The assembled genome is 47 Mb in size with a scaffold N50 of 24.53 Mb and is anchored to two chromosomes. The assembly has a BUSCO completeness of 89%. We identified 5,954 protein-coding genes, of which 4,770 could be functionally annotated. This genome provides a resource for further understanding the genetics and evolution of eriophyoid mites.
SCIENTIFIC DATA | (2025) 12:446 | https://doi.org/10.1038/s41597-025-04814-2
www.nature.com/scientificdata
A chromosome-level genome
assembly of eriophyoid mite
Setoptus koraiensis
Zi-Kai Shao, Lei Chen, Jing-Tao Sun & Xiao-Feng Xue ✉

Background & Summary
Eriophyoid mites (Acariformes, Eriophyoidea) form one of the largest superfamilies in the Arachnida, comprising over 5,000 named species [1,2] and exhibiting a worldwide distribution [3]. These tiny (~200 μm in length, among the smallest arthropods), vermiform to fusiform mites have only two pairs of legs and are strictly phytophagous, showing high host-plant specificity [4,5]; some of them can cause massive economic losses in agriculture and forestry [6].
Despite the need to understand the ecology and evolution of eriophyoid mites, no chromosome-level genome assembly is yet available for the group. A near-chromosome-level genome assembly has been published for the tomato russet mite Aculops lycopersici [7], but the lack of high-quality chromosome-level genome resources has limited further comparative genomic analyses among eriophyoid mites.
In this study, we assembled a chromosome-level genome for Setoptus koraiensis (Eriophyoidea, Phytoptidae) using PacBio long-read sequencing, Illumina short-read sequencing, and high-throughput chromatin conformation capture (Hi-C) sequencing. The assembly has a genome size of 47 Mb across two chromosomes, with a scaffold N50 of 24.53 Mb (Table 1). This is the first chromosome-level genome for an eriophyoid mite and provides a significant new data resource for understanding the Eriophyoidea.

Methods
At least 100,000 wild S. koraiensis individuals, including eggs, juveniles and adults, were collected from Pinus koraiensis Siebold & Zucc. (Pinaceae) in Lishui, Nanjing city, Jiangsu province, China (31.3921°N, 118.5417°E). Samples were identified by morphological characteristics, supported by molecular evidence (mitochondrial COI). Vouchers were deposited in the Arthropod/Mite Collection of the Department of Entomology, Nanjing Agricultural University, Jiangsu Province, China.
Genomic DNA was extracted from more than 100,000 individuals using the MagAttract HMW DNA Kit. A PacBio 30 kb SMRTbell library was prepared from more than 5 μg of gDNA using the SMRTbell™ Prep Kit 2.0 (Pacific Biosciences) and sequenced in Continuous Long Read (CLR) mode on the Sequel II platform. For Illumina whole-genome sequencing, a 350 bp insert library was prepared with the TruSeq DNA PCR-Free Kit and sequenced (150 bp paired-end) on an Illumina NovaSeq 6000 platform. High-throughput chromosome conformation capture (Hi-C) library preparation included cross-linking, HindIII restriction enzyme digestion, end repair, DNA cyclization, purification and capture. The Hi-C library, with an insert size of 300–700 bp, was sequenced on the NovaSeq 6000 platform. In total, we generated 24.25 Gb (~496×) of PacBio long reads, 9.5 Gb (~194×) of Illumina short reads, and 9 Gb (~184×) of Hi-C reads for genome assembly.
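As a rough check on the depths quoted above, fold coverage is the number of sequenced bases divided by the genome size. The following is a minimal sketch only: the function name is hypothetical, and the 48.9 Mb denominator is an assumption, since the text does not state which genome size was used for the depth figures. Under that assumption the reported values are reproduced.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return the average sequencing depth (fold coverage)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Data yields reported in the text, converted to bases.
YIELDS = {"PacBio": 24.25e9, "Illumina": 9.5e9, "Hi-C": 9e9}

# ASSUMPTION: 48.9 Mb is not stated in the source; it is a value
# chosen here because it is consistent with the reported depths.
ASSUMED_GENOME_SIZE = 48.9e6

depths = {
    name: round(sequencing_depth(bases, ASSUMED_GENOME_SIZE))
    for name, bases in YIELDS.items()
}
print(depths)
```

Using the final 47 Mb assembly size instead would give slightly higher depths, so the figures should be read as approximate.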
Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China. e-mail: xfxue@njau.edu.cn


Duplicate and low-quality Illumina raw reads (base quality < Q20, length < 15 bp, polymer A/G/C > 10 bp) were trimmed and removed using the BBtools package v38.82 [8]. The 21-mer depth distribution was counted using the BBtools script ‘khist.sh’. GenomeScope v2.0 [9] was used to estimate the genome size and heterozygosity of S. koraiensis with the maximum k-mer coverage set to 1,000×. Based on the distribution of k-mer coverage and frequency, the estimated genome size of S. koraiensis was 45.72 Mb, with a heterozygosity rate of around 1.13% and a repeat content of approximately 3.3% (Fig. 1).
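The idea behind a k-mer-based genome size estimate can be illustrated with a first-order calculation: divide the total number of k-mers by the depth of the main coverage peak. The sketch below is not GenomeScope's method, which fits a mixture model that accounts for heterozygosity and repeats; the function name, the error cut-off and the toy histogram are all assumptions made for illustration.

```python
def estimate_genome_size(hist: dict, min_depth: int = 5) -> float:
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    hist maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    usable = {d: n for d, n in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers is taken as the main peak.
    peak_depth = max(usable, key=usable.get)
    # Each distinct k-mer at depth d contributes d k-mer occurrences.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (invented values): an error tail at depths 1-2,
# a main peak at depth 50, and a few repeat k-mers at depth 100.
toy_hist = {1: 1000, 2: 200, 48: 10, 49: 40, 50: 100, 51: 40, 52: 10, 100: 5}
toy_size = estimate_genome_size(toy_hist)
print(toy_size)
```

In a heterozygous sample the tallest peak may be the heterozygous one, which is one reason a model-based tool is preferable for real data.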
The CLR reads were assembled with Flye v2.6 [10], including one round of Flye's built-in long-read polishing. The primary assembly was then polished, and its gaps filled, with two rounds of short-read polishing using NextPolish v1.4.1 [11]. Haplotigs and duplications caused by haplotype divergence were removed with Purge_dups v1.2.5 [12], using Minimap2 v2.28 [13] for alignment. Hi-C reads were aligned to the purged genome using BWA v0.7.18 [14] and Juicer v1.6 [15], and contigs were anchored, ordered and oriented into a chromosomal assembly following the 3D-DNA [16] pipeline. We then manually reviewed and corrected assembly errors using Juicebox v2.17 [17]. Contaminants were identified by searching against the UniVec and NCBI nucleotide databases using BLAST+ v2.11.0 [18] and MMseqs2 v16 [19], and were removed. The completeness of the genome assembly was evaluated with BUSCO version 5.2.2 [20] using the eukaryota_odb10 dataset (creation date 2020-09-10). The whole-genome sequencing reads were aligned back to the genome assembly to assess the mapping rate. After de novo assembly, polishing and contaminant removal, the S. koraiensis assembly was 49.9 Mb in 565 scaffolds, with an N50 of 24.53 Mb; 94.2% of the assembly was anchored to two chromosomes (Fig. 2), giving a final genome size of 47 Mb (Table 1).
The repetitive elements were identified using RepeatModeler v2.0.5 [21], with the ‘-LTRstruct’ option enabled to discover complete long terminal repeat (LTR) elements. RepeatMasker v4.1.6 [22] was then used to search the genome against the custom repeat library of Dfam 3.8 [23] and Repbase v20181026 [24] with the options ‘-no_is -norna -xsmall -q’ to soft-mask repeats in the genome assembly.
Table 1. Statistics of the Setoptus koraiensis genome assembly.
Characteristics                    Setoptus koraiensis
Genome size (Mb)                   47
Number of contigs                  266
Number of chromosomes              2
Scaffold N50 length (Mb)           24.53
BUSCO completeness (%)             89
Repetitive elements size (Mb)      6.42 (13.82%)

Fig. 1 GenomeScope genome size estimates for Setoptus koraiensis.

For gene structure annotation, we used a pipeline integrating ab initio and homology-based methods. BRAKER v2.1.5 [25] was used to obtain ab initio gene predictions, employing GeneMark-ES/ET/EP v4.33 [26] and Augustus v3.4.0 [27] with reference proteins from the OrthoDB v11 database [28]. GeMoMa v1.9 [29] was used for homology-based prediction with the parameters “GeMoMa.c = 0.4 GeMoMa.p = 10”, and the protein sequences of six species (Aculops lycopersici (GCA_015350385.1), Tetranychus urticae (GCA_039701765.1), Tetranychus piercei (GCA_036759885.1), Panonychus citri (GCA_014898815.1), Pyemotes zhonghuajia (GCA_025170145.1), Blomia tropicalis (GCA_029204025.1)) were provided to assist gene prediction. The results obtained from BRAKER and GeMoMa were combined and provided to MAKER v3.01.03 [30]. For functional annotation, the predicted protein sequences were searched against the UniProt, InterProScan and eggNOG databases. Diamond v2.1.10 [31] was used in ‘very sensitive’ mode to assign gene functions from the best hits in the UniProt database. Gene Ontology (GO) and pathway (KEGG) annotations were assigned using InterProScan v5.72 [32] and eggnog-mapper v2.1.12 [33] against the Pfam [34], SMART [35], Superfamily [36], CDD [37] and EggNOG 5.0.2 [38] databases.
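The best-hit assignment mentioned for the UniProt search can be sketched as a simple reduction over tabular alignment output. The example assumes the common 12-column BLAST-style tabular layout with the bit score in the last column; the text does not state which output format was used, and the function name, gene identifiers and protein identifiers below are invented.

```python
import csv
import io


def best_hits(tabular_handle):
    """Keep the highest-bit-score hit per query from 12-column tabular output.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best = {}
    for row in csv.reader(tabular_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Toy alignment table (invented): gene1 has two hits, gene2 has one.
toy_hits = io.StringIO(
    "gene1\tP001\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "gene1\tP002\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t200.0\n"
    "gene2\tP003\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60.0\n"
)
assigned = best_hits(toy_hits)
print(assigned)
```

Ranking by bit score is one reasonable choice; ranking by e-value would usually give the same result but handles ties differently.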
Data Records
The raw reads and genome assembly have been deposited in the NCBI databases under BioProject PRJNA1196018. The PacBio, Illumina, and Hi-C data are available under accession numbers SRR32458739-SRR32458741 [39]. The final chromosome assembly has been deposited at GenBank under the accession number GCA_048013815.1 [40]. The mitochondrial COI sequence has been deposited at GenBank under the accession number PV163833 [41]. The genome assembly and annotation files are available in Figshare [42].
Fig. 2 Genome-wide chromosomal heatmap of Setoptus koraiensis; the blue boxes show super scaffolds.
Technical Validation
We mapped the Illumina sequencing data to the final assembly with BWA v0.7.18, and the mapping rate was 92.9%. We assessed the completeness of the genome assembly using BUSCO v5.4.2 with the ‘eukaryota_odb10’ database: 89% of BUSCOs were complete (83.9% single-copy and 5.1% duplicated), 5.5% were fragmented and 5.5% were missing, a completeness higher than that of A. lycopersici (86.3%). We masked 13.82% (6.42 Mb) of the S. koraiensis genome as repetitive regions. Among them, 0.2% of repeat sequences were short interspersed elements (SINEs), 1.29% were long interspersed elements (LINEs), 0.92% were long terminal repeats (LTRs), 1.61% were DNA transposons, and 5.14% were unclassified (Fig. 3). We identified 5,954 protein-coding genes, of which 4,770 could be functionally annotated. The BUSCO completeness of the protein sequences was 77.3% (71.4% single-copy and 5.9% duplicated), with 3.9% fragmented and 18.8% missing, using the ‘eukaryota_odb10’ database. Together, these results support the completeness and accuracy of the S. koraiensis genome assembly.
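The BUSCO percentages above can be checked for internal consistency: complete BUSCOs are the sum of the single-copy and duplicated categories, and all four categories should sum to 100%. The sketch below performs that arithmetic on the reported values; the function name is hypothetical and the annotated-gene percentage is derived here, not quoted from the text.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return (complete %, total %) from BUSCO category percentages."""
    complete = round(single + duplicated, 1)
    total = round(complete + fragmented + missing, 1)
    return complete, total


# Genome-mode and protein-mode percentages as reported in the text.
genome_busco = busco_summary(83.9, 5.1, 5.5, 5.5)
protein_busco = busco_summary(71.4, 5.9, 3.9, 18.8)

# Share of predicted genes with a functional annotation (derived value).
annotated_percent = round(100 * 4770 / 5954, 1)

print(genome_busco, protein_busco, annotated_percent)
```

Both category sets sum to 100%, and about four in five predicted genes carry a functional annotation.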
Code availability
No custom scripts or code were used in this study.
Received: 27 December 2024; Accepted: 12 March 2025;
Published: xx xx xxxx
References
1. Zhang, Z.-Q. Eriophyoidea and allies: where do they belong? Syst. Appl. Acarol. 22, 1091–1095 (2017).
2. Zhang, Z.-Q. Phylum Arthropoda von Siebold, 1848. in Animal biodiversity: An Outline of Higher-Level Classification and Survey of Taxonomic Richness (ed. Zhang, Z.-Q.) 99–103 (Magnolia Press, 2011).
3. Li, N., Sun, J.-T., Yin, Y., Hong, X.-Y. & Xue, X.-F. Global patterns and drivers of herbivorous eriophyoid mite species diversity. J. Biogeogr. 50, 330–340 (2022).
4. Skoracka, A., Smith, L., Oldfield, G., Cristofaro, M. & Amrine, J. W. Host-plant specificity and specialization in eriophyoid mites and their importance for the use of eriophyoid mites as biocontrol agents of weeds. Exp. Appl. Acarol. 51, 93–113 (2010).
5. Yin, Y. et al. DNA barcoding uncovers cryptic diversity in minute herbivorous mites (Acari, Eriophyoidea). Mol. Ecol. Resour. 22, 1986–1998 (2022).
6. de Lillo, E., Pozzebon, A., Valenzano, D. & Duso, C. An intimate relationship between eriophyoid mites and their host plants–a review. Front. Plant. Sci. 9, 1786 (2018).
7. Greenhalgh, R. et al. Genome streamlining in a minute herbivore that manipulates its host plant. Elife 9 (2020).
8. Bushnell, B. BBtools. Available online: https://sourceforge.net/projects/bbmap/ (accessed on 1 October 2024) (2014).
9. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
10. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
11. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
12. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
13. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
14. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Fig. 3 Circular karyotype representation of the chromosomes of Setoptus koraiensis. Tracks from inside to
outside are GC content (GC), density of protein-coding genes (GENE), DNA transposons (DNA), LTR/LINE/
SINE retrotransposons (LTR, LINE, SINE), and simple repeats (Simple).
15. Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell. Syst. 3, 95–98 (2016).
16. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
17. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell. Syst. 3, 99–101 (2016).
18. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
19. Steinegger, M. & Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
20. Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
21. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA 117, 9451–9457 (2020).
22. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. Available online: http://www.repeatmasker.org (accessed on 1 October 2024) (2013–2015).
23. Hubley, R. et al. The Dfam database of repetitive DNA families. Nucleic Acids Res. 44, D81–89 (2016).
24. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6, 11 (2015).
25. Bruna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom. Bioinform. 3, lqaa108 (2021).
26. Bruna, T., Lomsadze, A. & Borodovsky, M. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom. Bioinform. 2, lqaa026 (2020).
27. Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–312 (2004).
28. Kuznetsov, D. et al. OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Res. 51, D445–D451 (2023).
29. Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 44, e89 (2016).
30. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
31. Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
32. Finn, R. D. et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199 (2017).
33. Cantalapiedra, C. P., Hernandez-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
34. El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
35. Letunic, I. & Bork, P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 46, D493–D496 (2018).
36. Wilson, D. et al. SUPERFAMILY - sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–386 (2009).
37. Marchler-Bauer, A. et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 45, D200–D203 (2017).
38. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
39. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP565774 (2024).
40. Shao, Z.-K. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048013815.1 (2025).
41. Shao, Z.-K. GenBank https://identifiers.org/ncbi/insdc:PV163833 (2025).
42. Shao, Z.-K., Chen, L., Sun, J.-T. & Xue, X.-F. A chromosome-level genome assembly of eriophyoid mite Setoptus koraiensis. figshare https://doi.org/10.6084/m9.figshare.28087958 (2025).

Acknowledgements
This research was funded by the National Natural Science Foundation of China (32170466). This work was also supported by the high-performance computing platform of the Bioinformatics Center, Nanjing Agricultural University.
Author contributions
Z.-K.S. and X.-F.X. conceived and designed the study. Z.-K.S. analyzed the data. X.-F.X., L.C. and J.-T.S. contributed substantially to the interpretation of the data and to the writing and review of the final manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to X.-F.X.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
© The Author(s) 2025
Content courtesy of Springer Nature, terms of use apply. Rights reserved
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Aim Environmental drivers and host richness play key roles in affecting herbivore diversity. However, the relative effects of these factors and their effects on lineages characterized by high host specificity are not well known. In this study, we explored the extent to which contemporary climate, Quaternary climate change, habitat heterogeneity and host plants determine the species richness and endemism patterns of herbivorous eriophyoid mites. Location Global. Taxon Eriophyoid mites (Acari: Eriophyoidea). Methods We compiled a dataset comprising 4278 eriophyoid mite species from 22,973 occurrence sites based on a comprehensive search of the published literature and the Global Biodiversity Information Facility (GBIF) as a basis for predicting their global distribution patterns. We measured the association of environmental variables and host plant richness with species richness and endemism of eriophyoid mites through multiple regression analyses using a simultaneous autoregressive (SAR) model, an ordinary least squares (OLS) model and a random forest model. We examined the direct and indirect effects of these environmental variables and the host plant richness on eriophyoid mite diversity using structural equation models (SEMs). Results The species richness and endemism patterns of eriophyoid mites are concentrated in temperate regions. Contemporary climate, Quaternary climate change, habitat heterogeneity and host plants all significantly affected eriophyoid mite richness, while Quaternary climate change, habitat heterogeneity and host plants contributed to the eriophyoid mite endemism. Abiotic factors indirectly influenced the species richness and endemism of eriophyoid mites, via biotic factors—host plants. Main Conclusions The species richness and endemism of eriophyoid mites peak in temperate regions, opposite to the patterns of plants and some other organisms. 
Complex interactions among biotic and abiotic factors shape the current eriophyoid mite species diversity.
Article
Full-text available
OrthoDB provides evolutionary and functional annotations of genes in a diverse sampling of eukaryotes, prokaryotes, and viruses. Genomics continues to accelerate our exploration of gene diversity and orthology is the most precise way of bridging gene functional knowledge with the rapidly expanding universe of genomic sequences. OrthoDB samples the most diverse organisms with the best quality genomics data to provide the leading coverage of species diversity. This update of the underlying data to over 18 000 prokaryotes and almost 2000 eukaryotes with over 100 million genes propels the coverage to another level. This achievement also demonstrates the scalability of the underlying OrthoLoger software for delineation of orthologs, freely available from https://orthologer.ezlab.org. In addition to the ab-initio computations of gene orthology used for the OrthoDB release, the OrthoLoger software allows mapping of novel gene sets to precomputed orthologs and thereby links to their annotations. The LEMMI-style benchmarking of OrthoLoger ensures its state-of-the-art performance and is available from https://lemortho.ezlab.org. The OrthoDB web interface has been further developed to include a pairwise orthology view from any gene to any other sampled species. OrthoDB-computed evolutionary annotations as well as extensively collated functional annotations can be accessed via REST API or SPARQL/RDF, downloaded or browsed online from https://www.orthodb.org.
Article
Full-text available
Eriophyoid mites (Acari: Eriophyoidea) are among the smallest of terrestrial arthropods and the most species‐rich group of herbivorous mites with a high host specificity. However, knowledge of their species diversity has been impeded by the difficulty of their morphological differentiation. This study assembles a DNA barcode reference library that includes 1850 mitochondrial COI sequences which provides coverage for 45% of the 930 species of eriophyoid mites known from China, and for 37 North American species. Sequence analysis showed a clear barcode gap in nearly all species, reflecting the fact that intraspecific divergences averaged 0.97% versus a mean of 18.51% for interspecific divergences (minimum nearest‐neighbour distances) in taxa belonging to three families. Based on these results, we used DNA barcoding to explore the species diversity of eriophyoid mites as well as their host interactions. The 1850 sequences were assigned to 531 barcode index numbers (BINs). Analyses examining the correspondence between these BINs and species identifications based on morphology revealed that members of 45 species were assigned to two or more BINs, resulting in 1.16 times more BINs than morphospecies. Richness projections suggest that over 2345 BINs occurred at the sampled locations. Host plant analysis showed that 89% of these mites (BINs) attack only one or two congeneric host species, but the others have several hosts. Furthermore, host‐mite network analyses demonstrate that eriophyoid mites are high host‐specific, and modularity is high in plant‐mite networks. By creating a highly effective identification system for eriophyoid mites in the Barcode of Life Data Systems database (BOLD), DNA barcoding will advance our understanding of the diversity of eriophyoid mites and their host interactions.
Article
Full-text available
Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: (i) de novo gene prediction from raw contigs, (ii) built-in pairwise orthology prediction, (iii) fast protein domain discovery, and (iv) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.
Article
Full-text available
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate dataset for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly-introduced genome workflow increases the efficiency and runtimes especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.
Article
Full-text available
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP. An updated version of DIAMOND uses improved algorithmic procedures and a customized high-performance computing framework to make seemingly prohibitive large-scale protein sequence alignments feasible.
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.
The tomato russet mite, Aculops lycopersici, is among the smallest animals on earth. It is a worldwide pest on tomato and can potently suppress the host's natural resistance. We sequenced its genome, the first of an eriophyoid, and explored whether there are genomic features associated with the mite's minute size and lifestyle. At only 32.5 Mb, the genome is the smallest yet reported for any arthropod and, reminiscent of microbial eukaryotes, exceptionally streamlined. It has few transposable elements, tiny intergenic regions, and is remarkably intron-poor, as more than 80% of coding genes are intronless. Furthermore, in accordance with ecological specialization theory, this defense-suppressing herbivore has extremely reduced environmental response gene families such as those involved in chemoreception and detoxification. Other losses associate with this species' highly derived body plan. Our findings accelerate the understanding of evolutionary forces underpinning metazoan life at the limits of small physical and genome size.
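A statistic such as "more than 80% of coding genes are intronless" can be approximated from a GFF3 annotation by counting exon features per transcript. The sketch below is one simple way to do this and is not the method used by the authors. The function name `intronless_fraction` is hypothetical, and the sketch assumes that exon rows carry a `Parent` attribute and that a transcript with exactly one exon is intronless. It reports the fraction per transcript, not per gene.

```python
from collections import Counter


def intronless_fraction(gff3_lines, feature="exon"):
    """Fraction of transcripts that have exactly one `feature` row.

    Assumption: each exon row lists its transcript(s) in the GFF3
    `Parent` attribute; rows without one are ignored.
    """
    exons_per_transcript = Counter()
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may be shared by several comma-separated parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    if not exons_per_transcript:
        raise ValueError("no matching features with a Parent attribute")
    single_exon = sum(1 for n in exons_per_transcript.values() if n == 1)
    return single_exon / len(exons_per_transcript)
```

Called on an open annotation file, for example `intronless_fraction(open(path))` with a hypothetical `path`, the function streams the file line by line, so memory use depends on the number of transcripts and not on file size.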
We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while the GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes.
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).