Chromosome-level Genome Assembly of Theretra japonica (Lepidoptera: Sphingidae)

Theretra japonica is an important pollinator and agricultural pest in the family Sphingidae with a wide range of host plants. High-quality genomic resources facilitate investigations into behavioral ecology, morphological and physiological adaptations, and the evolution of genomic architecture. However, a chromosome-level genome of T. japonica is still lacking. Here we sequenced and assembled a high-quality genome of T. japonica by combining PacBio long reads, Illumina short reads, and Hi-C data. The genome was contained in 95 scaffolds with an accumulated length of 409.55 Mb (BUSCO calculated a genome completeness of 99.2%). The 29 pseudochromosomes had a combined length of 403.77 Mb, accounting for 98.59% of the assembly. The genomic characterisation of T. japonica will contribute to further studies of Sphingidae and Lepidoptera.
SCIENTIFIC DATA | (2024) 11:770 | https://doi.org/10.1038/s41597-024-03500-z
www.nature.com/scientificdata
Ming Yan, Bao-Shan Su, Yi-Xin Huang, Zhen-Bang Xu, Zhuo-Heng Jiang & Xu Wang ✉
Background & Summary
Sphingidae, commonly known as hawkmoths, is a family of Lepidoptera with over 1,460 recorded species worldwide [1,2]. They are medium to large, heavy-bodied insects with bullet-shaped bodies and long, blade-like wings. Hawkmoths are known for their powerful flight, which can reach speeds of 40–50 kilometers per hour. Numerous hawkmoths are globally recognized as agricultural and forestry pests, including Clanis undulosa Moore, 1879, Theretra oldenlandiae (Fabricius, 1775), and Ampelophaga rubiginosa Bremer & Grey, 1853 [3]. The larvae of hawkmoths, which can inflict substantial economic harm on crops, feed predominantly on the foliage of trees and vegetables.
The adult Theretra japonica (Boisduval, 1869) acts as an important pollinator for a wide range of plants (Fig. 1). During its larval stage, however, it is a destructive pest of agricultural crops. It is a prevalent pest in Korea, Japan, Russia, and China, preferentially damaging Cissus, Colocasia, Hydrangea, Parthenocissus, Ampelopsis, Ipomoea batatas, Cayratia japonica and Vitis (https://tpittaway.tripod.com/china/china.htm). In China, T. japonica occurs almost everywhere and usually damages crops from June to October every year. Severe damage by this pest can result in complete destruction of the leaf tissue, leaving only the leaf veins and twigs, and in extreme cases the entire plant may die. Such an infestation can significantly impair the growth and development of the affected plants.
Genomic resources containing high-quality reference genomes and transcriptomes facilitate comparisons between populations and species to answer questions ranging from broad chromosomal evolution to the genetic basis of important adaptations [4]. A comprehensive understanding of the T. japonica genome is therefore needed to promote more innovative management strategies for this destructive pest. However, genomic information for Sphingidae remains scarce. To date, only 11 chromosome-level genomes have been published for species of Sphingidae (Sphinx pinastri, Hyles euphorbiae, Hemaris fuciformis, Mimas tiliae, Laothoe populi, Deilephila porcellus, Deilephila elpenor, Lapara coniferarum, Amorpha juglandis, Hyles vespertilio and Manduca sexta) (as of the submission date, October 25, 2023).
1 Anhui Provincial Key Laboratory of the Conservation and Exploitation of Biological Resources, College of Life Sciences, Anhui Normal University, Wuhu, Anhui, 241000, China.
2 Collaborative Innovation Center of Recovery and Reconstruction of Degraded Ecosystem in Wanjiang Basin Co-founded by Anhui Province and Ministry of Education, School of Ecology and Environment, Anhui Normal University, Wuhu, Anhui, 241000, China.
3 Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, 100101, China.
4 Guangxi Institute of Botany, Chinese Academy of Sciences, Guilin, Guangxi, 541006, China.
5 School of Life Science, Westlake University, Hangzhou, Zhejiang, 310023, China.
e-mail: wangxu0322@ahnu.edu.cn


Here we present a high-quality chromosome-level genome assembly of T. japonica, the first reference genome assembly for a member of the genus Theretra. We annotated repeat sequences, non-coding RNA (ncRNA), and protein-coding genes. This study has significant implications for the development of modern pest control strategies and serves as a reference for future comparative genomic research in Lepidoptera and other insects.
Methods
Sample collection and sequencing. The specimens used in this study were all collected from Baishaguan Wood Inspection Station, Shangrao City, Jiangxi Province, China, on September 5, 2022. We used five individuals for PacBio, genome survey, Hi-C and transcriptome sequencing: one male specimen for Hi-C, one male specimen for second-generation transcriptome sequencing, one male specimen for third-generation full-length transcriptome sequencing, and one female and one male specimen for second-generation and third-generation whole-genome sequencing. We removed the intestinal tract of the samples to minimise contamination by gut microorganisms and stored the samples in liquid nitrogen at −80 °C before delivering them to the sequencing company (Berry Genomics, Beijing, China).
Third-generation PacBio HiFi sequencing was performed using the SMRTbell® Express Template Prep Kit 2.0 to generate PacBio HiFi 15 kb libraries. After passing quality control, the DNA was sheared to a size of 15 kb using the Megaruptor instrument (Diagenode B06010001, Liege, Belgium) and concentrated using AMPure® PB Beads. The SMRTbell library was then constructed with the Template Prep Kit 2.0. Finally, fragments were size-selected using the SageELF system. For second-generation whole-genome sequencing, the Agencourt AMPure XP-Medium kit was used to construct BGISEQ-500 libraries with insert sizes ranging from 200 to 400 bp. For conventional second-generation transcriptome sequencing, RNA was extracted using TRIzol™ Reagent, followed by library construction using the VAHTS mRNA-seq v2 Library Prep Kit. The Hi-C library was constructed through the following steps: crosslinking cells with formaldehyde, digesting DNA with MboI, filling in the ends and marking them with biotin, ligating the resulting blunt-end fragments, and purifying and randomly shearing the DNA into 300–500 bp fragments. After quality control of the libraries using Qubit 2.0, an Agilent 2100 instrument (Agilent Technologies, CA, USA) and q-PCR, 150 bp paired-end sequencing of the Hi-C library was performed on the Illumina NovaSeq 6000 platform by Berry Genomics (http://www.berrygenomics.com/, Beijing, China). In total, we obtained 161.49 Gb of sequencing data, comprising 66.05 Gb (161.28×) of WGS data, 40.81 Gb (99.64×) of HiFi data, 34.68 Gb (84.69×) of Hi-C data, and 19.95 Gb of transcriptome data (Table 1).
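As a rough check on the yields and depths quoted above, sequencing depth can be recomputed as sequenced bases divided by genome size. The sketch below is illustrative only: it assumes the depths were calculated against the final assembly size of 409.55 Mb (the text does not state the denominator), and the variable and function names are ours.

```python
# Sequencing yields in gigabases (Gb), as reported in the text.
yields_gb = {"WGS": 66.05, "HiFi": 40.81, "Hi-C": 34.68, "transcriptome": 19.95}

# Assumption: depth is computed against the final assembly size (409.55 Mb).
GENOME_SIZE_MB = 409.55


def coverage(yield_gb: float, genome_size_mb: float = GENOME_SIZE_MB) -> float:
    """Return sequencing depth (x) = sequenced bases / genome size."""
    return yield_gb * 1000.0 / genome_size_mb


# Total yield across all four libraries.
total_gb = round(sum(yields_gb.values()), 2)

# Depth is only meaningful for the genomic libraries, not the transcriptome.
depths = {
    name: round(coverage(gb), 2)
    for name, gb in yields_gb.items()
    if name != "transcriptome"
}

print(total_gb)
print(depths)
```

Under this assumption the recomputed depths agree with the reported values to within about 0.01×, which is consistent with rounding of the inputs.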
Genome survey and assembly. The main purpose of genome survey analysis is to predict genome size, heterozygosity, and the proportion of repetitive sequences, in order to guide the selection of appropriate genome assembly tools and the adjustment of their parameters. First, the second-generation BGI reads were quality-controlled with fastp v0.23.01 (‘-q 20 -D -g -x -u 10 -5 -r -c’): low-quality positions were trimmed at a base-quality threshold of 20, duplicate sequences were removed, poly-G/X tails were trimmed, reads in which more than 10% of bases were unqualified were discarded, and bases in overlapping regions were corrected [5]. The k-mer frequency distribution was then computed with BBTools v38.82 (https://sourceforge.net/projects/bbmap/), with the k-mer length set to 21. Genome characterization was performed using GenomeScope v2.0 [6] with the parameters ‘-k 21 -p 2 -m 100000’, where ‘-m’ sets the maximum k-mer coverage.
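The idea behind a k-mer based genome survey can be illustrated with a deliberately simplified sketch. It counts 21-mers in a set of reads, builds the depth histogram, and estimates genome size as the total number of k-mers divided by the depth of the main peak. This is not the model fitted by GenomeScope, which additionally accounts for sequencing errors, heterozygosity and repeats; the toy genome, the tiling read layout and all function names are assumptions made for illustration.

```python
import random
from collections import Counter


def kmer_counts(reads, k=21):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(counts):
    """Map k-mer depth -> number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


def estimate_genome_size(hist):
    """Estimate genome size as total k-mers / depth of the main peak."""
    peak_depth = max(hist, key=hist.get)
    total_kmers = sum(depth * n for depth, n in hist.items())
    return total_kmers / peak_depth, peak_depth


# Toy data: a random 2,000 bp "genome" tiled by error-free 100 bp reads
# starting every 5 bp (hypothetical, for illustration only).
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(2000))
reads = [genome[s:s + 100] for s in range(0, len(genome) - 100 + 1, 5)]

hist = depth_histogram(kmer_counts(reads))
size, peak = estimate_genome_size(hist)
print(peak, round(size))
```

On this toy input the estimate falls slightly below the true length because k-mers near the ends of the sequence are covered by fewer reads; real data also need an error and heterozygosity model, which is what GenomeScope provides.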
Fig. 1 Photograph of an adult specimen of Theretra japonica (Photo by Zhuo-Heng Jiang).
Genomic libraries          WGS      HiFi       Hi-C     RNA-sr
Sequencing data (Gbp)      66.05    40.81      34.68    9.59
Average length (bp)        150      18327.50   150      150
Sequencing coverage (x)    161.28   99.64      84.69    -
Table 1. Sequencing data for genome assembly.
High-quality HiFi reads were generated using pbccs v6.4.0, and the first-round assembly was produced with hifiasm v0.16.1 [7] using default parameters. We did not polish the assembled bases, as their QV values were above 60. We used Purge_dups v1.2.5 [8] to remove redundancy from the assembly based on contig similarity and sequencing depth. Minimap2 v2.4 [9,10] was used to align the HiFi reads to the genome (‘-x map-hifi’) and the genome to itself (‘-xasm5 -DP’). Purge_dups was run with default parameters (‘-2 -a 70’). The genome survey analysis predicted a genome size of about 392 Mb. Because second-generation sequencing was performed on female samples, in which the sex chromosomes are sequenced at half the depth of the autosomes, this 392 Mb estimate probably reflects only the autosomal length. The k-mer frequency distribution indicates that the genome has a low repeat content and that the potential for contamination of the data is negligible.
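To make the "QV above 60" criterion concrete: a Phred-scaled quality value relates to the per-base error probability p by QV = -10 log10(p), so QV 60 corresponds to about one error per million bases. The short sketch below applies this standard definition; scaling it to the assembly size of 409,552,430 bp is our own illustrative step, not a figure from the paper.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


ASSEMBLY_BP = 409_552_430  # assembly size reported in Table 2

# Upper bound on the expected number of erroneous bases at exactly QV 60.
expected_errors = qv_to_error_rate(60) * ASSEMBLY_BP
print(expected_errors)
```

At QV 60 this gives roughly 410 expected erroneous bases over the whole assembly, which helps explain why an additional polishing step was judged unnecessary.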
We used Hi-C data and 3D-DNA v180922 [11] for chromosome anchoring and assembly of contigs. Hi-C data were first quality controlled using Juicer v1.6.2 [12], followed by two rounds of assembly using 3D-DNA v180922. After the first round of anchoring, the assembly was manually corrected in Juicebox v1.11.08 [12] before the second, final round of anchoring. Finally, we assessed the sequencing depth of each pseudochromosome using bamtocov v2.7.0 [13], where the input BAM alignment was generated by minimap2 from the HiFi reads (‘-ax map-hifi’). The quality of the chromosome assembly was extremely high, resulting in 29 chromosome-level scaffolds, of which only 3 were not gap-free (Fig. 2).
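Per-chromosome depth is useful because, in a sample from the heterogametic sex, a sex chromosome is expected at roughly half the autosomal depth, as noted above for the survey data. The sketch below shows one simple way such chromosomes could be flagged from a table of mean depths. The depth values, the chromosome that ends up flagged, and the 0.4 to 0.6 ratio window are all hypothetical and are not results from this study.

```python
from statistics import median


def flag_half_depth(depths, lo=0.4, hi=0.6):
    """Return chromosomes whose mean depth is about half the median depth."""
    med = median(depths.values())
    return [name for name, d in depths.items() if lo <= d / med <= hi]


# Hypothetical mean depths per pseudochromosome (made-up numbers).
mean_depth = {
    "Chr01": 98.0,
    "Chr02": 101.5,
    "Chr03": 49.7,
    "Chr04": 100.2,
    "Chr05": 99.1,
}

candidates = flag_half_depth(mean_depth)
print(candidates)
```

Using the median rather than the mean as the reference keeps the autosomal baseline stable even when one or two chromosomes deviate strongly.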
Fig. 2 Hi-C interaction heatmaps, with each chromosome and contig framed in blue and green, respectively.
Genome completeness was assessed using BUSCO v5.2.2 [14] based on the insecta_odb10 reference dataset, which contains 1,367 single-copy orthologous genes. In addition, both genomic and second-generation transcriptomic raw sequences were mapped back to the genome assembly to assess the utilization of the original data and the integrity of the assembly. Minimap2 was used as the mapping tool, and mapping rates were calculated using SAMtools v1.10 [15]. Possible contamination during the assembly process was investigated using MMseqs2 v13 [16], performing a blastn-like search against two databases: NCBI nt and UniVec. The final size of the assembled genome was 409.55 Mb (Table 2 and S1), which was essentially consistent with the results of the genome survey analysis. The numbers of scaffolds and contigs were 95 and 101, respectively, and the GC content was 37.16%. The 29 pseudochromosomes (Fig. 3) totaled 403.77 Mb, giving an anchoring rate of 98.59%. BUSCO evaluation of the genome assembly gave a completeness of 99.2% with a duplication rate of only 0.1%. The mapping rates of the second-generation survey, second-generation RNA, third-generation RNA, and third-generation HiFi data were 96.84%, 99.97%, 89.21%, and 74.43%, respectively. Together, these indicators show that the assembly has reached a very high standard in terms of both continuity and completeness.
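The 98.59% figure can be reproduced directly from the base-pair counts in Table 2, and the scaffold and contig counts also imply the number of gaps left in the assembly. The sketch below is a simple arithmetic check; it assumes every contig belongs to exactly one scaffold, so that a scaffold built from n contigs contains n - 1 gaps.

```python
ASSEMBLY_BP = 409_552_430   # total assembly size (Table 2)
ANCHORED_BP = 403_774_580   # combined size of the 29 pseudochromosomes
N_SCAFFOLDS = 95
N_CONTIGS = 101

# Share of the assembly placed on pseudochromosomes.
anchoring_rate = round(100 * ANCHORED_BP / ASSEMBLY_BP, 2)

# Sequence left in unplaced scaffolds.
unplaced_bp = ASSEMBLY_BP - ANCHORED_BP

# Assumption: each contig sits in exactly one scaffold, so the number of
# joins (gaps) equals contigs minus scaffolds.
n_gaps = N_CONTIGS - N_SCAFFOLDS

print(anchoring_rate, unplaced_bp, n_gaps)
```

Six gaps in total is compatible with the statement that only three pseudochromosomes are not gap-free.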
Genome annotation primarily encompasses the identification and labeling of repeat sequences, non-coding RNA (ncRNA) and protein-coding genes, and the assignment of gene functions. A de novo repeat library was built with RepeatModeler v2.0.3 [17], including an additional LTR search (‘-LTRStruct’) based on repeat-specific structural features; this library was combined with Dfam 3.5 [18] and RepBase 20181026 [19] to form the final repeat reference database. Repeat sequences were then identified by running RepeatMasker v4.1.2p1 (http://www.repeatmasker.org) against this final database. In total, 913,482 repetitive sequences (131,570,928 bp) were identified, accounting for 32.13% of the genome. The five categories of repetitive sequences with the highest percentages were SINEs (10.77%), Unknown (8.11%), LINEs (6.85%), DNA elements (1.70%) and simple repeats (1.54%), while the percentage of LTRs was very low, at only 0.81%.
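The repeat statistics can be cross-checked against the assembly size, and the listed classes can be summed to see how much of the repeat content falls into categories not itemised in the text. The sketch below uses only numbers quoted above; the "other" share is our own derived quantity and is not reported in the paper.

```python
ASSEMBLY_BP = 409_552_430
REPEAT_BP = 131_570_928

# Percentages of the genome per repeat class, as listed in the text.
class_pct = {
    "SINE": 10.77,
    "Unknown": 8.11,
    "LINE": 6.85,
    "DNA": 1.70,
    "Simple repeats": 1.54,
    "LTR": 0.81,
}

# Overall repeat content recomputed from base-pair counts.
repeat_pct = round(100 * REPEAT_BP / ASSEMBLY_BP, 2)

# Share covered by the six itemised classes, and the remainder.
listed_pct = round(sum(class_pct.values()), 2)
other_pct = round(repeat_pct - listed_pct, 2)

print(repeat_pct, listed_pct, other_pct)
```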
Two strategies were used to annotate non-coding RNA. Infernal v1.1.4 [20] was used to annotate rRNA, snRNA, and miRNA by aligning the genomic sequences against a database of known non-coding RNAs. tRNA sequences were predicted with tRNAscan-SE v2.0.9 [21], and low-confidence tRNAs were filtered out using the software’s built-in script (‘EukHighConfidenceFilter’). We obtained a total of 1,872 genomic ncRNA annotations, including 299 rRNAs, 435 miRNAs, 117 snRNAs, 953 tRNAs, 3 ribozymes, and 2 lncRNAs. The snRNAs included 83 spliceosomal RNAs (U1, U2, U4, U5, and U6), 23 C/D box snoRNAs and 5 H/ACA box snoRNAs.
To identify the structures of protein-coding genes, we combined three lines of evidence: ab initio gene prediction, alignment of transcript sequences to the genome, and comparison with known protein sequences from related species. MAKER v3.01.03 [22] was then used to integrate these three types of evidence into the final gene models.
To expand the range of candidate coding genes, we ran BRAKER v2.1.6 [23] and GeMoMa v1.8 [24], each integrating transcriptome and protein evidence, and merged their predictions into a single ab initio input file for MAKER (ab.gff3). We used MAKER to automatically train two ab initio prediction tools, Augustus v3.3.4 [25] and GeneMark-ES/ET/EP v4.68 [26], and integrated arthropod protein sequences from the OrthoDB10 v1 database [27] together with transcriptome data to improve prediction accuracy. GeMoMa (GeMoMa.c=0.4 GeMoMa.p=10) uses protein homology and intron position information to predict genes. We downloaded protein sequences from NCBI for six species with high-quality assemblies and annotations, namely Bombyx mori (Bombycoidea), Drosophila melanogaster (Diptera), Spodoptera frugiperda (Noctuoidea), Pieris rapae (Papilionoidea), Manduca sexta (Bombycoidea) and Chilo suppressalis (Pyraloidea). Transcript evidence was generated by aligning the second-generation RNA-sr transcriptome data to the genome with HISAT2 v2.2.0 [28] to produce BAM alignment files. We then used StringTie v2.2.0 [29] to perform a mixed short- and long-read assembly (‘--mix’) based on the second- and third-generation transcriptomes. Finally, we performed homology comparisons with the protein sequences of the related species downloaded from NCBI. In total, 14,614 protein-coding genes were predicted by the MAKER pipeline, with an average gene length of 9,076.6 bp. Each gene contained an average of 7.4 exons (average length 307.7 bp), 6.3 introns (average length 1,146.0 bp) and 7.2 coding sequences (CDS; average length 225.4 bp). The predicted protein sequences were subjected to BUSCO completeness assessment, and the results were C: 99.4% [S: 69.3%, D: 30.1%], F: 0.1%, M: 0.5% (n: 1367), slightly higher than the 99.2% obtained for the genome.
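BUSCO summaries follow a fixed short notation in which complete (C) equals single-copy (S) plus duplicated (D), and C, fragmented (F) and missing (M) together sum to 100%. The sketch below parses such a string and checks these identities within rounding. The parser is a hypothetical helper written for this illustration, not part of BUSCO, and the genome-level string is assembled by us from the percentages reported in the Technical Validation section.

```python
import re


def parse_busco(summary: str) -> dict:
    """Extract C/S/D/F/M percentages and n from a BUSCO short summary."""
    scores = {
        key: float(val)
        for key, val in re.findall(r"([CSDFM]):\s*([\d.]+)%", summary)
    }
    n = re.search(r"n:\s*(\d+)", summary)
    scores["n"] = int(n.group(1)) if n else None
    return scores


def is_consistent(s: dict, tol: float = 0.11) -> bool:
    """Check S + D = C and C + F + M = 100, allowing for 0.1% rounding."""
    return (
        abs(s["S"] + s["D"] - s["C"]) <= tol
        and abs(s["C"] + s["F"] + s["M"] - 100.0) <= tol
    )


protein = parse_busco("C: 99.4% [S: 69.3%, D: 30.1%], F: 0.1%, M: 0.5% (n:1367)")
genome = parse_busco("C:99.2%[S:99.1%,D:0.1%],F:0.2%,M:0.6%,n:1367")

print(protein, is_consistent(protein))
print(genome, is_consistent(genome))
```

The much higher duplicated fraction in the protein set than in the genome is the kind of difference this notation makes easy to spot.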
Genome assembly                          Number
Size (bp)                                409,552,430
Number of scaffolds/contigs              95/101
Number of pseudo-chromosomes (sizes)     29 (403,774,580 bp)
N50 scaffold/contig length (Mb)          14.58/14.27
GC (%)                                   37.16
BUSCO completeness (%)                   99.2
Table 2. Genome assembly statistics for chromosome-level assembly of Theretra japonica.
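Table 2 reports scaffold and contig N50 values. For readers less familiar with the statistic, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the calculation follows; the toy length list is made up, since the individual scaffold lengths are not given in the text.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

For the toy list the total is 30, and the two longest sequences (10 and 8) are the first to cover half of it, so the N50 is 8.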
We used two strategies to annotate the functions of protein-coding genes (PCGs). The first was to search the UniProtKB database using Diamond v2.0.11.149 [30] in its highly sensitive mode (‘--very-sensitive -e 1e-5’) and to predict gene functions from the matching entries. The second was to search five comprehensive databases, Pfam [31], SMART [32], Superfamily [33], CDD [34], and eggNOG v5.0 [35], to predict conserved sequences and structural domains of the proteins in the gene set, together with Gene Ontology (GO), KEGG and Reactome annotations. The first four databases were searched with InterProScan 5.53-87.0 [36], and the eggNOG v5.0 database was searched with eggNOG-mapper v2.1.5 [37]. Finally, the results of the two methods were integrated to obtain the final prediction of gene function. In total, 14,221 (97.31%) genes matched entries in the UniProtKB database. InterProScan identified protein domains in 11,890 protein-coding genes. InterProScan and eggNOG-mapper together assigned GO terms to 10,425 genes and KEGG pathway entries to 4,896 genes.
Data Records
The raw sequencing data and genome assembly of Theretra japonica have been deposited at the National Center for Biotechnology Information (NCBI) and the China National GeneBank DataBase (CNGBdb) [38]. The Hi-C, HiFi, WGS, and transcriptome data can be found under identification numbers SRR26855496-SRR26855499 [39–42] in NCBI and under CNP0004835 in CNGBdb. The assembled genome has been deposited in the NCBI Assembly database with the accession number GCA_033459515.1 [43]. In addition, the annotations for repeated sequences, gene structure and functional predictions have been placed in the Figshare database [44].
Fig. 3 Characterization of the Theretra japonica genome. From the outer ring to the inner ring are the distributions of chromosome length, GC content, gene density, TEs (DNA, SINE, LINE, and LTR), and simple repeats. (Ring labels in the figure: Chr, GC, Gene, DNA, SINE, LINE, LTR, SR; chromosomes Chr01 to Chr29.)
Technical Validation
The quality of the genome assembly was assessed in two steps. First, we assessed the completeness of the assembly using BUSCO v5.2.2 [45] based on the insecta_odb10 database (n = 1,367). The final genome assembly displayed a BUSCO completeness of 99.2%, comprising 99.1% single-copy and 0.1% duplicated BUSCOs, with 0.2% fragmented and 0.6% missing BUSCOs. We then calculated mapping rates to measure assembly accuracy. The mapping rates of the BGI, HiFi, and RNA-sr data reached 96.84%, 99.97%, and 89.21%, respectively. Overall, these assessments reflect the high quality of the genome assembly.
Code availability
No specific script was used in this work. All commands and pipelines used in data processing were executed according to the manuals and protocols of the corresponding bioinformatic software.
Received: 22 November 2023; Accepted: 10 June 2024;
Published: xx xx xxxx
References
1. Li, J. et al. Characterization of the complete mitochondrial DNA of Theretra japonica and its phylogenetic position within the Sphingidae (Lepidoptera, Sphingidae). ZooKeys 754, 127–139 (2018).
2. Kaila, E. J. et al. Order Lepidoptera Linnaeus, 1758. In: Zhang, Z.-Q. (Ed.) Animal biodiversity: An outline of higher-level classification and survey of taxonomic richness. Zootaxa 3148, 212–221 (2011).
3. Zhu, H. F. & Wang, L. Y. Fauna Sinica: Insecta. Vol. 11, Lepidoptera, Sphingidae. 359 pp. (Science Press, Beijing, 1997).
4. Westfall, A. K. et al. A chromosome-level genome assembly for the eastern fence lizard (Sceloporus undulatus), a reptile model for physiological and evolutionary ecology. GigaScience 10 (2021).
5. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
6. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11, 1432 (2020).
7. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
8. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
9. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
10. Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
11. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
12. Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems 3, 95–98 (2016).
13. Birolo, G. & Telatin, A. BamToCov: an efficient toolkit for sequence coverage calculations. Bioinformatics 38, 2617–2618 (2022).
14. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38, 4647–4654 (2021).
15. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
16. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017).
17. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457 (2020).
18. Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA 12, 2 (2021).
19. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6, 11 (2015).
20. Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
21. Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences. Methods in Molecular Biology 1962, 1–14 (2019).
22. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
23. Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics and Bioinformatics 3, lqaa108 (2021).
24. Keilwagen, J., Hartung, F., Paulini, M., Twardziok, S. O. & Grau, J. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 19, 189 (2018).
25. Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
26. Brůna, T., Lomsadze, A. & Borodovsky, M. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genomics and Bioinformatics 2, lqaa026 (2020).
27. Kriventseva, E. V. et al. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Research 47, D807–D811 (2019).
28. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019).
29. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278 (2019).
30. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18, 366–368 (2021).
31. El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Research 47, D427–D432 (2019).
32. Letunic, I., Khedkar, S. & Bork, P. SMART: recent updates, new developments and status in 2020. Nucleic Acids Research 49, D458–D460 (2021).
33. Wilson, D. et al. SUPERFAMILY - sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Research 37, D380–D386 (2009).
34. Wang, J. et al. The conserved domain database in 2023. Nucleic Acids Research 51, D384–D388 (2023).
35. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research 47, D309–D314 (2019).
36. Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Research 49, D344–D354 (2021).
37. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Molecular Biology and Evolution 38, 5825–5829 (2021).
38. Yan Ming (严明); Anhui Normal University. Theretra japonica genome sequencing and assembly. CNGBdb. https://doi.org/10.26036/CNP0004835 (2023).
39. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR26855496 (2023).
40. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR26855497 (2023).
41. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR26855498 (2023).
42. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR26855499 (2023).
43. Yan, M. & Wang, X. Theretra japonica isolate JX, whole genome shotgun sequencing project. GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_033459515.1 (2023).
44. Huang, Y. X. Genome assembly and annotations of Theretra japonica (Lepidoptera: Sphingidae). figshare. https://doi.org/10.6084/m9.figshare.24276991.v1 (2023).
45. Waterhouse, R. M. et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018).
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41597-024-03500-z.
Correspondence and requests for materials should be addressed to X.W.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2024