DATA NOTE
The genome sequence of the Webb’s Wainscot, Globia
sparganii (Esper, 1790) [version 1; peer review: awaiting peer
review]
Gavin R. Broad 1, Natural History Museum Genome Acquisition Lab,
Darwin Tree of Life Barcoding collective,
Wellcome Sanger Institute Tree of Life programme,
Wellcome Sanger Institute Scientific Operations: DNA Pipelines collective,
Tree of Life Core Informatics collective, Darwin Tree of Life Consortium
1Natural History Museum, London, England, UK
First published: 06 Dec 2023, 8:565
https://doi.org/10.12688/wellcomeopenres.20181.1
Latest published: 06 Dec 2023, 8:565
https://doi.org/10.12688/wellcomeopenres.20181.1
Abstract
We present a genome assembly from an individual male Globia
sparganii (the Webb’s Wainscot; Arthropoda; Insecta; Lepidoptera;
Noctuidae). The genome sequence is 676.7 megabases in span. Most
of the assembly is scaffolded into 31 chromosomal pseudomolecules,
including the Z sex chromosome. The mitochondrial genome has also
been assembled and is 15.36 kilobases in length. Gene annotation of
this assembly on Ensembl identified 18,385 protein coding genes.
Keywords
Globia sparganii, Webb’s Wainscot, genome sequence, chromosomal,
Lepidoptera
This article is included in the Tree of Life
gateway.
Wellcome Open Research 2023, 8:565 Last updated: 06 DEC 2023
Corresponding author: Darwin Tree of Life Consortium (mark.blaxter@sanger.ac.uk)
Author roles: Broad GR: Investigation, Resources, Writing – Original Draft Preparation, Writing – Review & Editing;
Competing interests: No competing interests were disclosed.
Grant information: This work was supported by Wellcome through core funding to the Wellcome Sanger Institute (206194) and the
Darwin Tree of Life Discretionary Award (218328).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2023 Broad GR et al. This is an open access article distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
How to cite this article: Broad GR, Natural History Museum Genome Acquisition Lab, Darwin Tree of Life Barcoding collective et al. The
genome sequence of the Webb’s Wainscot, Globia sparganii (Esper, 1790) [version 1; peer review: awaiting peer review] Wellcome
Open Research 2023, 8:565 https://doi.org/10.12688/wellcomeopenres.20181.1
Species taxonomy
Eukaryota; Metazoa; Eumetazoa; Bilateria; Protostomia;
Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota;
Neoptera; Endopterygota; Amphiesmenoptera; Lepidoptera;
Glossata; Neolepidoptera; Heteroneura; Ditrysia; Obtectomera;
Noctuoidea; Noctuidae; Noctuinae; Globia; Globia sparganii
(Esper, 1790) (NCBI:txid1660644).
Background
Globia sparganii, Webb’s Wainscot, is one of many pale,
buff-coloured noctuids called wainscots in English, not all
particularly closely related (e.g., Davis et al., 2022). Webb’s
Wainscot is relatively distinctive, with a small white kidney
mark on the fore wing, partly enclosed by a black rim,
part of a dark central streak down the wing. The extent of
dark markings varies. Found across the Palaearctic, Webb’s
Wainscot was formerly very localised in Britain, on the
south coasts of England and Wales, but has been expanding
its range across South-east England and East Anglia, with a
more than 200% increase in range occupancy since the 1990s
(Randle et al., 2019). The moth was named in English after
Sydney Webb, who was the first person to find G. sparganii in
Britain, in Kent in 1879 (South, 1907), and Kent has always
been a stronghold of this species. The Latin name refers to
one of the foodplant genera, Sparganium, or Bur-reeds.
Moths are on the wing from July to October, and larvae feed
in the stems of bulrushes and some other freshwater plants,
such as Bur-reed and Yellow Flag Iris, in various freshwater
bodies, from marshes to ponds and ditches (Waring et al.,
2017). Adults often wander, and the first author has light-
trapped them at home, far from any suitable habitat.
Along with some other stem-feeding noctuids, Globia
sparganii has been sequenced in experiments designed to ascertain
whether bracoviruses integrate into its genome, which they
do, and whether that makes the species a potential non-target
host for wasps used as biocontrol agents against the crop pest
Sesamia nonagrioides (Lefèbvre), which it appears not to be
(Muller et al., 2022).
Genome sequence report
The genome was sequenced from one male Globia sparganii
(Figure 1) collected from Hever Castle, Kent, UK (51.19,
0.12). A total of 31-fold coverage in Pacific Biosciences
single-molecule HiFi long reads was generated. Primary
assembly contigs were scaffolded with chromosome conformation
Hi-C data. Manual assembly curation corrected 11
missing joins or mis-joins and removed 6 haplotypic duplications,
reducing the assembly length by 0.19% and the scaffold
number by 3.92%.
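As a rough cross-check of the sequencing depth quoted above, fold coverage can be converted into an approximate read yield by multiplying it by the genome span. The sketch below assumes that coverage was computed against the 676.7 Mb assembly span (the text does not state the denominator), and the function name is illustrative only.

```python
def approx_yield_gb(fold_coverage: float, span_mb: float) -> float:
    """Approximate total sequenced bases (in Gb) implied by a fold coverage."""
    # coverage = total bases / genome span, so total bases = coverage * span
    return fold_coverage * span_mb / 1000.0


# Values from the text: 31-fold HiFi coverage, 676.7 Mb span.
hifi_yield = approx_yield_gb(31, 676.7)
print(f"~{hifi_yield:.1f} Gb of HiFi reads")
```

Under that assumption, 31-fold coverage corresponds to roughly 21 Gb of HiFi data.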
The final assembly has a total length of 676.7 Mb in 48
sequence scaffolds with a scaffold N50 of 23.9 Mb (Table 1).
The snailplot in Figure 2 provides a summary of the assembly
statistics, while the distribution of assembly scaffolds on GC
proportion and coverage is shown in Figure 3. The cumulative
assembly plot in Figure 4 shows curves for subsets of
scaffolds assigned to different phyla. Most (99.92%) of the
assembly sequence was assigned to 31 chromosomal-level
scaffolds, representing 30 autosomes and the Z sex chromosome.
The Z chromosome was identified based on synteny with
Apamea epomidion (GCA_947507525.1). Chromosome-scale
scaffolds confirmed by the Hi-C data are named in order of
size (Figure 5; Table 2). While not fully phased, the assembly
deposited is of one haplotype. Contigs corresponding to the
second haplotype have also been deposited. The mitochondrial
genome was also assembled and can be found as a contig
within the multifasta file of the genome submission.
Figure 1. Photographs of the Globia sparganii (ilGloSpar1) specimen used for genome sequencing. A. Live specimen. B. Dorsal
view and C. Ventral view of specimen during preservation and processing.
Table 1. Genome data for Globia sparganii, ilGloSpar1.1.
Project accession data
Assembly identifier | ilGloSpar1.1
Assembly release date | 2023-03-10
Species | Globia sparganii
Specimen | ilGloSpar1
NCBI taxonomy ID | 1660644
BioProject | PRJEB59770
BioSample ID | SAMEA7849226
Isolate information | ilGloSpar1
Assembly metrics* | Value | Benchmark
Consensus quality (QV) | 65 | ≥ 50
k-mer completeness | 100% | ≥ 95%
BUSCO** | C:99.0%[S:98.6%,D:0.5%],F:0.2%,M:0.8%,n:5,286 | C ≥ 95%
Percentage of assembly mapped to chromosomes | 99.92% | ≥ 95%
Sex chromosomes | Z chromosome | localised homologous pairs
Organelles | Mitochondrial genome assembled | complete single alleles
Raw data accessions
Pacific Biosciences SEQUEL II | ERR10879923, ERR10879922
Hi-C Illumina | ERR10890717
Genome assembly
Assembly accession | GCA_949316385.1
Accession of alternate haplotype | GCA_949316295.1
Span (Mb) | 676.7
Number of contigs | 135
Contig N50 length (Mb) | 11.4
Number of scaffolds | 48
Scaffold N50 length (Mb) | 23.9
Longest scaffold (Mb) | 29.3
Genome annotation
Number of protein-coding genes | 18,385
Number of gene transcripts | 18,570
* Assembly metric benchmarks are adapted from column VGP-2020 of “Table 1: Proposed standards and metrics
for defining genome assembly quality” from (Rhie et al., 2021).
** BUSCO scores based on the lepidoptera_odb10 BUSCO set using v5.3.2. C = complete [S = single copy,
D = duplicated], F = fragmented, M = missing, n = number of orthologues in comparison. A full set of BUSCO scores
is available at https://blobtoolkit.genomehubs.org/view/Globia%20sparganii/dataset/CASGFQ01/busco.
Figure 2. Genome assembly of Globia sparganii, ilGloSpar1.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO
gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1%
of the 676,688,718 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest
scaffold present in the assembly (29,284,430 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths
(23,857,693 and 16,301,409 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines
showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT
and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the
lepidoptera_odb10 set is shown in the top right. An interactive version of this figure is available at
https://blobtoolkit.genomehubs.org/view/Globia%20sparganii/dataset/CASGFQ01/snail.
The estimated Quality Value (QV) of the final assembly is
65 with k-mer completeness of 100%, and the assembly has
a BUSCO v5.3.2 completeness of 99.0% (single = 98.6%,
duplicated = 0.5%), using the lepidoptera_odb10 reference set
(n = 5,286).
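The scaffold N50 reported in this section can be reproduced from the chromosome lengths listed in Table 2. The sketch below is illustrative and is not the BlobToolKit implementation: it uses the rounded Mb values from Table 2 together with the 676.7 Mb total span, so unplaced scaffolds count towards the total but are not listed individually. The function name `nx` is an assumption made for this example.

```python
def nx(lengths, total, fraction=0.5):
    """Return the length L such that scaffolds of length >= L
    together cover at least `fraction` of `total`."""
    running = 0.0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths do not reach the requested fraction of total")


# Chromosome lengths in Mb, copied from Table 2 (Z first, then 1 to 30).
chromosomes_mb = [
    29.28, 27.33, 26.48, 26.07, 25.98, 25.71, 25.69, 25.47, 25.25, 24.81,
    24.79, 24.26, 23.86, 23.86, 23.68, 23.45, 23.29, 22.61, 22.49, 22.29,
    20.46, 19.99, 19.87, 19.55, 19.34, 16.3, 15.95, 13.44, 11.93, 11.35,
    11.25,
]
TOTAL_SPAN_MB = 676.7  # total assembly span from Table 1

n50 = nx(chromosomes_mb, TOTAL_SPAN_MB, 0.5)
n90 = nx(chromosomes_mb, TOTAL_SPAN_MB, 0.9)
print(f"scaffold N50 = {n50} Mb, scaffold N90 = {n90} Mb")
```

With these inputs the function returns 23.86 Mb and 16.3 Mb, consistent with the N50 and N90 scaffold lengths of 23,857,693 bp and 16,301,409 bp given in the Figure 2 legend.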
Metadata for specimens, barcode results, spectra estimates,
sequencing runs, contaminants and pre-curation assembly
statistics are given at
https://links.tol.sanger.ac.uk/species/1660644.
Genome annotation report
The Globia sparganii genome assembly (GCA_949316385.1)
was annotated using the Ensembl rapid annotation pipeline
(Table 1; https://rapid.ensembl.org/Globia_sparganii_GCA_949316385.1/Info/Index).
The resulting annotation includes 18,570 transcribed mRNAs
from 18,385 protein-coding genes.
Methods
Sample acquisition and nucleic acid extraction
A male Globia sparganii (specimen ID NHMUK010635080,
ToLID ilGloSpar1) was collected from Hever Castle, Hever,
Kent, UK (latitude 51.19, longitude 0.12) on 2020-08-27
using a light trap. The specimen was collected and identified
by Gavin Broad (Natural History Museum) and preserved on
dry ice.
Figure 3. Genome assembly of Globia sparganii, ilGloSpar1.1: BlobToolKit GC-coverage plot. Scaffolds are coloured by phylum.
Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis. An interactive
version of this figure is available at https://blobtoolkit.genomehubs.org/view/Globia%20sparganii/dataset/CASGFQ01/blob.
DNA was extracted at the Tree of Life laboratory, Wellcome
Sanger Institute (WSI). The ilGloSpar1 sample was weighed
and dissected on dry ice with tissue set aside for Hi-C
sequencing. Abdomen tissue was cryogenically disrupted
to a fine powder using a Covaris cryoPREP Automated Dry
Pulveriser, receiving multiple impacts. High molecular weight
(HMW) DNA was extracted using the Qiagen MagAttract
HMW DNA extraction kit. HMW DNA was sheared into an
average fragment size of 12–20 kb in a Megaruptor 3 system
with speed setting 30. Sheared DNA was purified by solid-phase
reversible immobilisation using AMPure PB beads with a
1.8X ratio of beads to sample to remove the shorter fragments
and concentrate the DNA sample. The concentration of the
sheared and purified DNA was assessed using a Nanodrop
spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA
High Sensitivity Assay kit. Fragment size distribution was
evaluated by running the sample on the FemtoPulse system.
Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing
libraries were constructed according to the manufacturers’
instructions. DNA and RNA sequencing was performed
by the Scientific Operations core at the WSI on a Pacific
Biosciences SEQUEL II (HiFi) instrument. Hi-C data were
also generated from head and thorax tissue of ilGloSpar1
using the Arima2 kit and sequenced on the Illumina NovaSeq
6000 instrument.
Figure 4. Genome assembly of Globia sparganii, ilGloSpar1.1: BlobToolKit cumulative sequence plot. The grey line shows cumulative
length for all scaffolds. Coloured lines show cumulative lengths of scaffolds assigned to each phylum using the buscogenes taxrule.
An interactive version of this figure is available at
https://blobtoolkit.genomehubs.org/view/Globia%20sparganii/dataset/CASGFQ01/cumulative.
Figure 5. Genome assembly of Globia sparganii, ilGloSpar1.1: Hi-C contact map of the ilGloSpar1.1 assembly, visualised using
HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed
at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=G0o_g7xYQeOY77HCZpQi2A.
Table 2. Chromosomal pseudomolecules in the genome assembly of Globia sparganii, ilGloSpar1.
INSDC accession | Chromosome | Length (Mb) | GC%
OX438653.1 | 1 | 27.33 | 37.5
OX438654.1 | 2 | 26.48 | 37.5
OX438655.1 | 3 | 26.07 | 37.5
OX438656.1 | 4 | 25.98 | 38.0
OX438657.1 | 5 | 25.71 | 37.5
OX438658.1 | 6 | 25.69 | 37.5
OX438659.1 | 7 | 25.47 | 37.5
OX438660.1 | 8 | 25.25 | 37.5
OX438661.1 | 9 | 24.81 | 37.5
OX438662.1 | 10 | 24.79 | 37.5
OX438663.1 | 11 | 24.26 | 37.5
OX438664.1 | 12 | 23.86 | 37.5
OX438665.1 | 13 | 23.86 | 37.5
OX438666.1 | 14 | 23.68 | 37.5
OX438667.1 | 15 | 23.45 | 37.5
OX438668.1 | 16 | 23.29 | 37.5
OX438669.1 | 17 | 22.61 | 38.0
OX438670.1 | 18 | 22.49 | 37.5
OX438671.1 | 19 | 22.29 | 38.0
OX438672.1 | 20 | 20.46 | 38.0
OX438673.1 | 21 | 19.99 | 38.0
OX438674.1 | 22 | 19.87 | 37.5
OX438675.1 | 23 | 19.55 | 38.0
OX438676.1 | 24 | 19.34 | 38.0
OX438677.1 | 25 | 16.3 | 37.5
OX438678.1 | 26 | 15.95 | 38.0
OX438679.1 | 27 | 13.44 | 38.0
OX438680.1 | 28 | 11.93 | 38.0
OX438681.1 | 29 | 11.35 | 38.0
OX438682.1 | 30 | 11.25 | 39.5
OX438652.1 | Z | 29.28 | 37.5
OX438683.1 | MT | 0.02 | 19.5
Genome assembly, curation and evaluation
Assembly was carried out with Hifiasm (Cheng et al., 2021)
and haplotypic duplication was identified and removed with
purge_dups (Guan et al., 2020). The assembly was then
scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou
et al., 2023). The assembly was checked for contamination
and corrected as described previously (Howe et al., 2021).
Manual curation was performed using HiGlass (Kerpedjiev
et al., 2018) and Pretext (Harry, 2022). The mitochondrial
genome was assembled using MitoHiFi (Uliano-Silva et al.,
2023), which runs MitoFinder (Allio et al., 2020) or MITOS
(Bernt et al., 2013) and uses these annotations to select the
final mitochondrial contig and to ensure the general quality of
the sequence.
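The order of the assembly steps described above can be sketched as a list of command lines. This is a hedged illustration only: the options used by the authors are not given in the text, the file names are hypothetical, and the commands are built but not executed. The multi-step purge_dups procedure and the conversion of hifiasm GFA output to FASTA are omitted, so the purged contig file name is an assumption.

```python
import shlex


def assembly_commands(hifi_reads, hic_bam, prefix="asm", threads=16):
    """Return (without running) illustrative command lines for contig
    assembly with hifiasm and Hi-C scaffolding with YaHS."""
    contigs = f"{prefix}.purged.fa"  # hypothetical name for the purge_dups output
    return [
        # 1. Assemble HiFi reads into contigs (hifiasm writes GFA files under `prefix`).
        ["hifiasm", "-o", prefix, "-t", str(threads), hifi_reads],
        # 2. purge_dups runs as several sub-steps and is not shown; its result
        #    is assumed to be the contig FASTA named above.
        # 3. YaHS expects a FASTA index next to the contig file.
        ["samtools", "faidx", contigs],
        # 4. Scaffold the contigs using Hi-C reads aligned to them.
        ["yahs", "-o", f"{prefix}.yahs", contigs, hic_bam],
    ]


for cmd in assembly_commands("reads.hifi.fastq.gz", "hic_to_contigs.bam"):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists, rather than shell strings, makes it straightforward to pass them to a workflow manager or to `subprocess.run` once real paths are substituted.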
A Hi-C map for the final assembly was produced using
bwa-mem2 (Vasimuddin et al., 2019) in the Cooler file format
(Abdennur & Mirny, 2020). To assess the assembly metrics,
the k-mer completeness and QV consensus quality values
were calculated in Merqury (Rhie et al., 2020). This work
was done using Nextflow (Di Tommaso et al., 2017) DSL2
pipelines “sanger-tol/readmapping” (Surana et al., 2023a) and
“sanger-tol/genomenote” (Surana et al., 2023b). The genome
was analysed within the BlobToolKit environment (Challis
et al., 2020) and BUSCO scores (Manni et al., 2021; Simão
et al., 2015) were calculated.
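To make the consensus quality values concrete, the short sketch below converts a Phred-scaled QV into a per-base error probability using the standard relation QV = -10 log10(error rate). It is a worked illustration of the scale, not a reimplementation of how Merqury estimates QV from k-mers, and the function names are chosen for this example.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Inverse conversion: Phred-scaled QV for a per-base error probability."""
    return -10 * math.log10(error_rate)


reported = qv_to_error_rate(65)   # QV reported for ilGloSpar1.1
benchmark = qv_to_error_rate(50)  # benchmark value listed in Table 1
print(f"QV 65: about one error per {1 / reported:,.0f} bases")
print(f"QV 50: about one error per {1 / benchmark:,.0f} bases")
```

On this scale, the reported QV of 65 corresponds to roughly one consensus error per 3.2 Mb, compared with one per 100 kb at the benchmark QV of 50.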
Table 3 contains a list of relevant software tool versions and
sources.
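The one-line BUSCO summary reported in Table 1 packs several scores into a compact notation. The sketch below shows one way to parse that string into named fields; the regular expression and the function name are assumptions written for this example and are not part of the BUSCO software.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>[\d,]+)"
)


def parse_busco(summary: str) -> dict:
    """Parse a one-line BUSCO summary into percentages and the BUSCO count."""
    match = BUSCO_PATTERN.fullmatch(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognised BUSCO summary: {summary!r}")
    # C, S, D, F and M are percentages; n is the number of orthologues searched.
    scores = {key: float(match[key]) for key in "CSDFM"}
    scores["n"] = int(match["n"].replace(",", ""))
    return scores


scores = parse_busco("C:99.0%[S:98.6%,D:0.5%],F:0.2%,M:0.8%,n:5,286")
print(scores)
# Complete, fragmented and missing should account for every BUSCO, within rounding.
assert abs(scores["C"] + scores["F"] + scores["M"] - 100.0) < 0.2
```

Note that the single-copy and duplicated percentages (98.6% and 0.5%) sum to 99.1% rather than 99.0% because each value is rounded independently.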
Genome annotation
The BRAKER2 pipeline (Brůna et al., 2021) was used in the
default protein mode to generate annotation for the Globia
sparganii assembly (GCA_949316385.1) in Ensembl Rapid
Release.
Table 3. Software tools: versions and sources.
Software tool | Version | Source
BlobToolKit | 4.1.7 | https://github.com/blobtoolkit/blobtoolkit
BUSCO | 5.3.2 | https://gitlab.com/ezlab/busco
Hifiasm | 0.16.1-r375 | https://github.com/chhylp123/hifiasm
HiGlass | 1.11.6 | https://github.com/higlass/higlass
Merqury | MerquryFK | https://github.com/thegenemyers/MERQURY.FK
MitoHiFi | 2 | https://github.com/marcelauliano/MitoHiFi
PretextView | 0.2 | https://github.com/wtsi-hpag/PretextView
purge_dups | 1.2.3 | https://github.com/dfguan/purge_dups
sanger-tol/genomenote | v1.0 | https://github.com/sanger-tol/genomenote
sanger-tol/readmapping | 1.1.0 | https://github.com/sanger-tol/readmapping/tree/1.1.0
YaHS | 1.2a | https://github.com/c-zhou/yahs
Wellcome Sanger Institute – Legal and Governance
The materials that have contributed to this genome note have
been supplied by a Darwin Tree of Life Partner. The submission
of materials by a Darwin Tree of Life Partner is subject to the
‘Darwin Tree of Life Project Sampling Code of Practice’,
which can be found in full on the Darwin Tree of Life
website here. By agreeing with and signing up to the
Sampling Code of Practice, the Darwin Tree of Life Partner
agrees they will meet the legal and ethical requirements and
standards set out within this document in respect of all
samples acquired for, and supplied to, the Darwin Tree of Life
Project.
Further, the Wellcome Sanger Institute employs a process
whereby due diligence is carried out proportionate to the
nature of the materials themselves, and the circumstances
under which they have been/are to be collected and provided
for use. The purpose of this is to address and mitigate any
potential legal and/or ethical implications of receipt and use
of the materials as part of the research project, and to ensure
that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material
• Legality of collection, transfer and use (national and
international)
Each transfer of samples is further undertaken according
to a Research Collaboration Agreement or Material Transfer
Agreement entered into by the Darwin Tree of Life Partner,
Genome Research Limited (operating as the Wellcome Sanger
Institute), and in some circumstances other Darwin Tree of
Life collaborators.
Data availability
European Nucleotide Archive: Globia sparganii (Webb’s
wainscot). Accession number PRJEB59770;
https://identifiers.org/ena.embl/PRJEB59770. (Wellcome Sanger Institute, 2023)
The genome sequence is released openly for reuse. The
Globia sparganii genome sequencing initiative is part of the
Darwin Tree of Life (DToL) project. All raw sequence data
and the assembly have been deposited in INSDC databases.
Raw data and assembly accession identifiers are reported
in Table 1.
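For readers who wish to locate the deposited runs programmatically, the sketch below builds a query URL for the project accession. It assumes the European Nucleotide Archive portal API `filereport` endpoint and the field names shown; these are not stated in this data note and should be checked against the current ENA documentation before use. No network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA portal API (not given in this data note).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_run_report_url(accession: str) -> str:
    """Build a URL requesting a tab-separated list of runs for a study."""
    params = {
        "accession": accession,
        "result": "read_run",
        # Assumed field names; adjust to the fields actually required.
        "fields": "run_accession,instrument_platform,submitted_ftp",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


print(ena_run_report_url("PRJEB59770"))
```

The run accessions returned for PRJEB59770 would be expected to include those listed under "Raw data accessions" in Table 1.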
Author information
Members of the Natural History Museum Genome Acquisition
Lab are listed here: https://doi.org/10.5281/zenodo.4790042.
Members of the Darwin Tree of Life Barcoding collective are
listed here: https://doi.org/10.5281/zenodo.4893703.
Members of the Wellcome Sanger Institute Tree of Life
programme are listed here: https://doi.org/10.5281/zenodo.4783585.
Members of Wellcome Sanger Institute Scientific Operations:
DNA Pipelines collective are listed here: https://doi.org/10.5281/zenodo.4790455.
Members of the Tree of Life Core Informatics collective are
listed here: https://doi.org/10.5281/zenodo.5013541.
Members of the Darwin Tree of Life Consortium are listed
here: https://doi.org/10.5281/zenodo.4783558.
References
Abdennur N, Mirny LA: Cooler: Scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020; 36(1): 311–316.
Allio R, Schomaker-Bastos A, Romiguier J, et al.: MitoFinder: Efficient automated large-scale extraction of mitogenomic data in target enrichment phylogenomics. Mol Ecol Resour. 2020; 20(4): 892–905.
Bernt M, Donath A, Jühling F, et al.: MITOS: Improved de novo metazoan mitochondrial genome annotation. Mol Phylogenet Evol. 2013; 69(2): 313–319.
Brůna T, Hoff KJ, Lomsadze A, et al.: BRAKER2: Automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 2021; 3(1): lqaa108.
Challis R, Richards E, Rajan J, et al.: BlobToolKit - interactive quality assessment of genome assemblies. G3 (Bethesda). 2020; 10(4): 1361–1374.
Cheng H, Concepcion GT, Feng X, et al.: Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021; 18(2): 170–175.
Davis RB, Õunap E, Tammaru T: A supertree of Northern European macromoths. PLoS One. 2022; 17(2): e0264211.
Di Tommaso P, Chatzou M, Floden EW, et al.: Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017; 35(4): 316–319.
Guan D, McCarthy SA, Wood J, et al.: Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020; 36(9): 2896–2898.
Harry E: PretextView (Paired REad TEXTure Viewer): A desktop application for viewing pretext contact maps. 2022; [Accessed 19 October 2022].
Howe K, Chow W, Collins J, et al.: Significantly improving the quality of genome assemblies through curation. GigaScience. Oxford University Press, 2021; 10(1): giaa153.
Kerpedjiev P, Abdennur N, Lekschas F, et al.: HiGlass: web-based visual exploration and analysis of genome interaction maps. Genome Biol. 2018; 19(1): 125.
Manni M, Berkeley MR, Seppey M, et al.: BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 2021; 38(10): 4647–4654.
Muller H, Heisserer C, Fortuna T, et al.: Investigating bracovirus chromosomal integration and inheritance in lepidopteran host and nontarget species. Mol Ecol. 2022; 31(21): 5538–5551.
Randle Z, Evans-Hill LJ, Parsons MS, et al.: Atlas of Britain & Ireland’s Larger Moths. Newbury: NatureBureau, 2019.
Rao SSP, Huntley MH, Durand NC, et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159(7): 1665–1680.
Rhie A, McCarthy SA, Fedrigo O, et al.: Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021; 592(7856): 737–746.
Rhie A, Walenz BP, Koren S, et al.: Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020; 21(1): 245.
Simão FA, Waterhouse RM, Ioannidis P, et al.: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015; 31(19): 3210–3212.
South R: The Moths of the British Isles. London: Frederick Warne & Co, 1907.
Surana P, Muffato M, Qi G: sanger-tol/readmapping: sanger-tol/readmapping v1.1.0 - Hebridean Black (1.1.0). Zenodo. 2023a; [Accessed 21 July 2023].
Surana P, Muffato M, Sadasivan Baby C: sanger-tol/genomenote (v1.0.dev). Zenodo. 2023b; [Accessed 21 July 2023].
Uliano-Silva M, Ferreira JGRN, Krasheninnikova K, et al.: MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics. 2023; 24(1): 288.
Vasimuddin M, Misra S, Li H, et al.: Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2019; 314–324.
Waring P, Townsend M, Lewington R: Field Guide to the Moths of Great Britain and Ireland: Third Edition. Bloomsbury Wildlife Guides, 2017.
Wellcome Sanger Institute: The genome sequence of the Webb’s Wainscot, Globia sparganii (Esper, 1790). European Nucleotide Archive. [dataset], accession number PRJEB59770, 2023.
Zhou C, McCarthy SA, Durbin R: YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 2023; 39(1): btac808.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Background PacBio high fidelity (HiFi) sequencing reads are both long (15–20 kb) and highly accurate (> Q20). Because of these properties, they have revolutionised genome assembly leading to more accurate and contiguous genomes. In eukaryotes the mitochondrial genome is sequenced alongside the nuclear genome often at very high coverage. A dedicated tool for mitochondrial genome assembly using HiFi reads is still missing. Results MitoHiFi was developed within the Darwin Tree of Life Project to assemble mitochondrial genomes from the HiFi reads generated for target species. The input for MitoHiFi is either the raw reads or the assembled contigs, and the tool outputs a mitochondrial genome sequence fasta file along with annotation of protein and RNA genes. Variants arising from heteroplasmy are assembled independently, and nuclear insertions of mitochondrial sequences are identified and not used in organellar genome assembly. MitoHiFi has been used to assemble 374 mitochondrial genomes (368 Metazoa and 6 Fungi species) for the Darwin Tree of Life Project, the Vertebrate Genomes Project and the Aquatic Symbiosis Genome Project. Inspection of 60 mitochondrial genomes assembled with MitoHiFi for species that already have reference sequences in public databases showed the widespread presence of previously unreported repeats. Conclusions MitoHiFi is able to assemble mitochondrial genomes from a wide phylogenetic range of taxa from Pacbio HiFi data. MitoHiFi is written in python and is freely available on GitHub (https://github.com/marcelauliano/MitoHiFi). MitoHiFi is available with its dependencies as a Docker container on GitHub (ghcr.io/marcelauliano/mitohifi:master).
Article
Full-text available
We present YaHS, a user-friendly command-line tool for construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools, and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity. Availability and implementation: YaHS is implemented in C and licensed under the MIT License. The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Bracoviruses (BVs) are domesticated viruses found in braconid parasitoid wasp genomes. They are composed of domesticated genes from a nudivrius, coding viral particles in which wasp DNA circles are packaged. BVs are viewed as possible vectors of horizontal transfer of genetic material (HT) from wasp to their hosts because they are injected, together with wasp eggs, by female wasps into their host larvae, and because they undergo massive chromosomal integration in multiple host tissues. Here, we show that chromosomal integrations of the Cotesia typhae BV (CtBV) persist up to the adult stage in individuals of its natural host, Sesamia nonagrioides, that survived parasitism. However, while reproducing host adults can bear an average of nearly two CtBV integrations per haploid genome, we were unable to retrieve any of these integrations in 500 of their offspring using Illumina sequencing. This suggests either that host gametes are less targeted by CtBVs than somatic cells or that gametes bearing BV integrations are nonfunctional. We further show that CtBV can massively integrate into the chromosomes of other lepidopteran species that are not normally targeted by the wasp in the wild, including one which is divergent by at least 100 million years from the natural host. Cell entry and chromosomal integration of BVs are thus unlikely to be major factors shaping wasp host range. Together, our results shed new light on the conditions under which BV‐mediated wasp‐to‐host HT may occur and provide information that may be helpful to evaluate the potential risks of uncontrolled HT associated with the use of parasitoid wasps as biocontrol agents.
Article
Full-text available
Ecological and life-history data on the Northern European macromoth (Lepidoptera: Macroheterocera) fauna is widely available and ideal for use in answering phylogeny-based research questions: for example, in comparative biology. However, phylogenetic information for such studies lags behind. Here, as a synthesis of all currently available phylogenetic information on the group, we produce a supertree of 114 Northern European macromoth genera (in four superfamilies, with Geometroidea considered separately), providing the most complete phylogenetic picture of this fauna available to date. In doing so, we assess those parts of the phylogeny that are well resolved and those that are uncertain. Furthermore, we identify those genera for which phylogenetic information is currently too poor to include in such a supertree, or entirely absent, as targets for future work. As an aid to studies involving these genera, we provide information on their likely positions within the macromoth tree. With phylogenies playing an ever more important role in the field, this supertree should be useful in informing future ecological and evolutionary studies.
Article
Full-text available
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate dataset for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly-introduced genome workflow increases the efficiency and runtimes especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.
Article
Full-text available
High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species1–4. To address this issue, the international Genome 10K (G10K) consortium5,6 has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly. Hifiasm is a haplotype-resolved de novo genome assembler for long-read high-fidelity sequencing data based on phased assembly graphs.
Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation is actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.
Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
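The k-mer comparison described above can be sketched in a few lines. The toy example below is a simplification under stated assumptions, not Merqury's implementation: it uses distinct k-mers held in Python sets, ignores reverse complements, and does not filter low-count (likely erroneous) read k-mers. The function names `kmers` and `kmer_qv_and_completeness` are hypothetical. The consensus quality value follows the k-mer survival idea: if a fraction of assembly k-mers is supported by the reads, the per-base probability of being correct is taken as that fraction raised to the power 1/k, and QV is -10 log10 of the resulting error rate.

```python
import math


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_qv_and_completeness(assembly, reads, k):
    """Estimate consensus QV and k-mer completeness for a toy assembly.

    Simplifications (assumptions of this sketch): distinct k-mers only,
    no canonical (reverse-complement) k-mers, no read k-mer filtering.
    """
    asm_kmers = kmers(assembly, k)
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    if not asm_kmers or not read_kmers:
        raise ValueError("sequences are shorter than k")

    shared = asm_kmers & read_kmers
    # Probability that a single base is correct, from k-mer survival.
    p_base_correct = (len(shared) / len(asm_kmers)) ** (1.0 / k)
    error_rate = 1.0 - p_base_correct
    qv = math.inf if error_rate == 0 else -10.0 * math.log10(error_rate)
    # Fraction of read k-mers that the assembly recovers.
    completeness = len(shared) / len(read_kmers)
    return qv, completeness
```

For example, calling `kmer_qv_and_completeness("ACGTACGTTG", ["ACGTACG", "TACGTTG"], 3)` on an assembly fully supported by its reads gives an infinite QV and a completeness of 1.0, whereas an assembly k-mer absent from the reads lowers both values.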