Patchwork: Alignment-Based Retrieval
and Concatenation of Phylogenetic Markers from
Genomic Data
Felix Thalén¹,², Clara G. Köhne¹, and Christoph Bleidorn¹,*
¹ Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
² Cardio-CARE AG, Medizincampus Davos, Davos Wolfgang 7265, Switzerland
*Corresponding author: E-mail: christoph.bleidorn@biologie.uni-goettingen.de.
Accepted: December 06, 2023
Abstract
Low-coverage whole-genome sequencing (also known as “genome skimming”) is becoming an increasingly affordable
approach to large-scale phylogenetic analyses. While already routinely used to recover organellar genomes, genome
skimming is rarely used to recover single-copy nuclear markers. One reason might be that only a few tools exist to
work with this data type within a phylogenomic context, especially to deal with fragmented genome assemblies. Here
we present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented short-read
assemblies as well as directly from sequence reads. Patchwork is an alignment-based tool that utilizes the sequence
aligner DIAMOND and is written in the programming language Julia. Homologous regions are obtained via a sequence
similarity search, followed by a “hit stitching” phase, in which adjacent or overlapping regions are merged into a
single unit. A novel sliding window algorithm then trims away any noncoding regions from the resulting sequence. We
demonstrate the utility of Patchwork by recovering near-universal single-copy orthologs within a benchmarking study,
and we additionally assess the performance of Patchwork in comparison with other programs. We find that Patchwork
allows for accurate retrieval of (putatively) single-copy genes from genome skimming data sets at different
sequencing depths with high computational speed, outperforming existing software targeting similar tasks. Patchwork
is released under the GNU General Public License version 3. Installation instructions, additional documentation, and
the source code itself are all available via GitHub at https://github.com/fethalen/Patchwork.
Key words: genome skimming, low-coverage sequencing, museomics, phylogenomics, short reads, single-copy genes.
Significance
Even though current sequencing and computational methods allow for the completion of high-quality genomes for all
life on earth, the availability of material for sequencing has become a major bottleneck in phylogenomic studies,
especially since material stored in museum collections, or collected during barcoding campaigns, is often not
suitable for reconstructing high-quality, highly contiguous genomes. At the same time, the output of short-read
sequencing machines is increasing, and prices for these techniques are dropping. Short-read data are still routinely
used to recover organellar genomes, but this so-called genome skimming approach is rarely used to recover
single-copy nuclear markers. We present a new software tool called Patchwork for mining phylogenetic markers from
highly fragmented genome assemblies, as well as directly from short sequence reads. We demonstrate the accuracy of
this new approach and show in a benchmarking study that it also outperforms existing software for similar tasks.
Patchwork allows users to compile preselected gene sets from low-coverage short-read sequencing data sets and is
thereby ideally suited for including material from museum collections in phylogenomic studies.
© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse,
distribution, and reproduction in any medium, provided the original work is properly cited.
Genome Biol. Evol. 15(12) https://doi.org/10.1093/gbe/evad227 Advance Access publication 12 December 2023 1
Introduction
Advancements in high-throughput sequencing techniques
have revolutionized the field of phylogenetics and ultimate-
ly our understanding of the tree of life (Lemmon and
Lemmon 2013). The availability of genomic and
transcriptomic data for virtually all desired taxa at a
reasonable price has transformed the field into phylogenomics:
genome-scale phylogenetic systematic analyses
(McCormack et al. 2013). Some challenges remain, how-
ever, as many studies still show incongruent results, low
branch support, or lacking resolution (Philippe et al.
2017; Steenwyk et al. 2023). Even though complete gen-
omes are becoming available for more and more eukar-
yotes, the access to high-molecular-weight DNA is the
bottleneck in the quest for sequencing genomes of all life
on earth (Blom 2021; Dahn et al. 2022). To date,
most large-scale phylogenomic studies have been conducted
using either transcriptome sequencing or genome
subsampling methods such as target enrichment, which focuses
on a set of preselected loci (Bleidorn 2017).
Transcriptome sequencing offers a way to sequence only
the expressed portion of a genome without prior sequence
knowledge (Stark et al. 2019). Unfortunately, this approach
requires freshly collected material or specifically stored ma-
terial, e.g. deeply flash frozen or in RNAlater. Furthermore,
smaller specimens may need to be pooled together to at-
tain sufficient amounts of mRNA, and such practice risks
mixing up individuals with undetected genetic variation.
Moreover, a large number of collected specimens
exist only in natural history museum collections, and most
of these are ethanol preserved and thus not usable for
transcriptomic studies (Call et al. 2021). As taxon sampling is
considered one of the most important factors for accurate
phylogenetic tree reconstruction (Heath et al. 2008),
leaving the potential of natural history collections
untapped would be a missed opportunity. Target enrichment
approaches, on the other hand, require prior knowledge of
target sequences (e.g. from well-annotated genomes) for
the construction of oligonucleotide probes. Moreover, the
number of enriched targets is limited by the number of oli-
gonucleotides included in the enrichment kit of choice, and
the efficiency of such approaches decreases as the
bait-to-target distance increases (Bragg et al. 2016).
Another downside is that the data produced are difficult
to reuse for other types of genomic or evolutionary studies.
A viable alternative to assemble taxon-rich phyloge-
nomic data sets is low-coverage whole-genome sequen-
cing (LC-WGS; also known as “shallow genome
sequencing” or “genome skimming”) using short-read
technologies such as Illumina sequencing (Dodsworth
2015). Relying solely on this approach has been shown to
be inadequate for the reconstruction of highly contiguous
reference-quality genomes (Rhie et al. 2021). However,
due to the introduction of newer sequencing platforms
(e.g. Illumina's NovaSeq sequencing platform), short-read
WGS has become relatively cheap, and prices are expected
to drop further with Ultima Genomics, another highly competitive
sequencing platform, entering the market (Simmons et al.
2023). Moreover, short-read sequencing library construction
allows highly fragmented DNA to be used
as input (Hu et al. 2021), thereby enabling the use of material
from museum collections from all around the world
(Raxworthy and Smith 2021). Consequently, LC-WGS can
be used to generate data from various sources of targeted
organisms to retrieve marker loci on a genome scale. While
this so-called “genome skimming” approach has frequent-
ly been used to reconstruct organellar genomes or other
high-copy fractions of eukaryote genomes (Richter et al.
2015; Jin et al. 2020), it seems currently underutilized to re-
trieve single-copy nuclear markers (Liu et al. 2021). One
reason is that short-read assemblies of eukaryotic genomes
tend to be highly discontinuous, and automated annota-
tion of such large, fragmented genomes remains difficult
(Salzberg 2019), as they are characterized by the presence
of “genes in pieces,” where introns interrupt coding
sequences (Rogozin et al. 2005). Depending on the coverage,
short-read draft genomes are characterized by low N50s in
the range of a few kilobase pairs at most (kb; Salzberg et al.
2012), and consequently, exons of a single gene usually
end up on several contigs.
The underuse of genome skimming in large-scale phylogenetics
could potentially be ascribed to the lack of suitable
data analysis methods (Zhang et al. 2019). Existing software
tools for working with LC-WGS data in a phylogenomic
context, such as aTRAM 2 (Allen et al. 2017,
2018), ALiBaSeq (Knyshov et al. 2021), and GeMoMa
(Keilwagen et al. 2016, 2018), either are written in an
interpreted language (e.g. Perl or Python), which limits
how well the program scales with the large biological data
sets that are commonplace today (e.g. aTRAM 2,
ALiBaSeq), or need well-annotated reference genomes or
transcriptomes (e.g. GeMoMa). A recent addition to the
portfolio of available tools is
Read2Tree, which directly infers trees from unassembled
data (Dylus et al. 2023).
To address the limitations typically associated with
working with genome skimming data, we present
Patchwork, an alignment-based tool for mining phylogen-
etic markers directly from WGS data. Patchwork utilizes
the sequence aligner DIAMOND (Buchfink et al. 2021)
and is written in the programming language Julia
(Bezanson et al. 2017) to achieve the best possible speed,
thus allowing Patchwork to scale well with today's
genome-scale data sets. In addition, our implementation
focuses on ease of use, and our program handles each
step in the analysis—from start to finish. Using our new ap-
proach, we targeted universal single-copy orthologs
(USCOs), which are available based on careful analysis of a
curated database (OrthoDB, www.orthodb.org). A set of
954 metazoan-specific USCOs has been validated against
364 metazoan genomes and shown to be indeed (i) single-
copy and (ii) nearly universally present (Manni et al. 2021).
Results
Patchwork is a reference- and alignment-based method for
mining phylogenetic markers from WGS data, using either
assembled contigs or reads as input (Fig. 1). The aim of
Patchwork is to capture multiexon or fragmented genes,
scattered across different contigs or reads. One or more ref-
erence protein sequences guide the “stitching” process,
where the best-scoring translated query nucleotide se-
quences for any given region are merged into continuous
stretches of amino acid sequences. Merged sequences go
through a masking step in which unaligned residues, am-
biguous amino acid characters (letters that do not deter-
mine a unique amino acid; they are B, J, X, or Z, where
B = D or N, J = I or L, X = unknown, and Z = E or Q), and
stop codons are removed from query sequences.
Optionally, the removal of stop codons and ambiguous
amino acid characters may be skipped by providing the
--retain-stops and --retain-ambiguous flags, respectively.
Finally, Patchwork implements a sliding window–based
alignment trimming step to remove poorly aligned residues
(e.g. due to the presence of putative noncoding regions)
from the resulting sequences. The output is available as nu-
cleotide or amino acid sequences.
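The masking rules described above can be illustrated with a minimal Python sketch. It is not taken from the Patchwork source code (which is written in Julia); the function name `mask_residues` and its keyword arguments are hypothetical and only mirror the behavior of the --retain-stops and --retain-ambiguous flags as described in the text. Removal of unaligned residues is not covered here.

```python
# Illustrative sketch only; not Patchwork's implementation.
AMBIGUOUS = set("BJXZ")  # B = D or N, J = I or L, X = unknown, Z = E or Q
STOP = "*"               # conventional symbol for a translated stop codon


def mask_residues(seq, retain_stops=False, retain_ambiguous=False):
    """Drop stop codons and ambiguous residues unless asked to keep them."""
    kept = []
    for residue in seq.upper():
        if residue == STOP and not retain_stops:
            continue
        if residue in AMBIGUOUS and not retain_ambiguous:
            continue
        kept.append(residue)
    return "".join(kept)


if __name__ == "__main__":
    print(mask_residues("MKX*LBZ"))
    print(mask_residues("MKX*LBZ", retain_stops=True))
```

With both options left at their defaults, only unambiguous amino acid characters remain in the query sequence.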
Benchmark
To assess the performance of our approach, we (i) test an ideal
case in which the query and reference species are identical,
(ii) test a case in which the query and reference are 2 distantly
related species, and (iii) compare Patchwork v.0.5.1 with ALiBaSeq v.1.2
(Knyshov et al. 2021) and aTRAM v.2.4.3 (Allen et al.
2017). Throughout these benchmarks, we use Illumina
short-read nucleotide sequences from the marine annelid
Dimorphilus gyrociliatus (accession PRJEB37657 in the
European Nucleotide Archive). A highly contiguous
(N50 = 2.24 Mb) and complete (95.8% BUSCO genes
recovered, metazoa_odb10) annotated version of the
compact 73.8-Mb genome of this annelid is publicly avail-
able (Martín-Durán et al. 2021).
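Because contiguity is reported throughout as N50, a short Python sketch of how this statistic is commonly computed may be helpful. The function name `n50` is our own, and the definition used (the length of the contig at which the cumulative sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length) is the standard one rather than something specified in this article.

```python
# Illustrative sketch of the standard N50 definition.
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```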
FIG. 1. Graphical overview of the Patchwork algorithm. First, a) query sequences are aligned to the provided reference sequence. These alignments may
or may not be overlapping. b) Overlapping alignments are realigned but only in the area in which they overlap. The best-scoring alignment is retained while all
others are discarded. c) Nonaligned residues are then removed, and d) the remaining regions are concatenated into a single, continuous sequence.
As we only used short-read data sets at different coverages
for our benchmark analyses, we created highly discontinuous
assemblies with low N50s, as is typical for
real-world low-coverage genomic data sets. We assembled
these sequence reads using SPAdes and subsequently used
Patchwork to search for near-USCOs (Seppey et al. 2019),
using a preannotated set of USCOs from that same species
as a reference. Next, we used that same assembly of
D. gyrociliatus to search for USCOs, this time using
USCOs from the leech Helobdella robusta as the reference,
a clitellate annelid that diverged at least 400 mya from
D. gyrociliatus (Erséus et al. 2020). Finally, we compared
our program to ALiBaSeq (Knyshov et al. 2021) and
aTRAM 2 (Allen et al. 2017). For this comparison, we also
subsampled the aforementioned D. gyrociliatus short reads
in order to simulate various sequencing coverages. We
decided not to include the software GeMoMa (Keilwagen
et al. 2016) in this comparison, as it heavily relies on the
availability of reference genomic or transcriptomic data.
Read2Tree (Dylus et al. 2023) has also not been included
in the comparisons, as its focus is tree inference and not
marker retrieval.
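The relationship between read count and sequencing coverage that underlies such subsampling can be sketched as follows. The genome size of 73.8 Mb is taken from the text; the read length of 150 bp is an assumption made purely for illustration, and the function names are hypothetical. The sketch counts individual reads and ignores read pairing.

```python
# Illustrative sketch; read length is an assumed value, not from the article.
import math

GENOME_SIZE_BP = 73_800_000  # 73.8-Mb D. gyrociliatus genome, as stated in the text


def reads_for_coverage(coverage, read_length, genome_size=GENOME_SIZE_BP):
    """Number of reads needed so that reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)


def sampling_fraction(target_reads, available_reads):
    """Fraction of the available reads to keep, capped at 1.0."""
    return min(1.0, target_reads / available_reads)


if __name__ == "__main__":
    needed = reads_for_coverage(10, 150)  # 150 bp is a hypothetical read length
    print(needed, sampling_fraction(needed, 20_000_000))
```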
We compared the retrieved translated and stitched con-
tigs, hereafter called “recovered markers,” to the reference
D. gyrociliatus USCOs. For each reference sequence, the
evaluation included percent identical positions out of all
aligned positions as well as percent of reference sequence
positions covered by the recovered markers. Patchwork
automatically generates these statistics and produces a de-
tailed output for each reference as well as an aggregated
output over all references.
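As a minimal sketch of the two statistics, assuming the recovered marker and the reference are available as a gapped pairwise alignment of equal length, percent identity and coverage could be computed as below. The function name `marker_stats`, the gap symbol, and the input representation are assumptions for illustration; Patchwork's own reporting code may differ. The third returned value is the product of the two measures, i.e. identical matches relative to all reference positions.

```python
# Illustrative sketch; not Patchwork's reporting code.
def marker_stats(ref_aln, query_aln, gap="-"):
    """Return (percent identity, percent coverage, combined percent)."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_positions = aligned = matches = 0
    for ref_res, query_res in zip(ref_aln, query_aln):
        if ref_res == gap:
            continue  # column is not a reference position
        ref_positions += 1
        if query_res == gap:
            continue  # reference position not covered by the query
        aligned += 1
        if ref_res == query_res:
            matches += 1
    identity = 100 * matches / aligned if aligned else 0.0
    coverage = 100 * aligned / ref_positions if ref_positions else 0.0
    return identity, coverage, identity * coverage / 100


if __name__ == "__main__":
    print(marker_stats("MKTAYIAK", "MKSAY---"))
```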
Effect of Genome Fragmentation on Accuracy
In the initial setup, we assessed the accuracy of Patchwork
using a high-quality query assembly and 815 USCO reference
sequences from the same species, D. gyrociliatus,
thereby exploring the program's performance for the
hypothetical case where the entire set of reference se-
quences should be recoverable as exactly matching stitched
contigs from the query sequences. We retrieved all of the
initial 815 markers. On average, 95.9% of all aligned posi-
tions were identical matches, with a mean query coverage
of 92.2%; this equals a combined measure of 88.4% iden-
tical matches for all reference positions, whether aligned to
query residues or not (Fig. 2 and Table 1).
FIG. 2. Percent identity and query coverage in markers based on a Patchwork analysis of a SPAdes assembly of D. gyrociliatus, targeting 815 single-copy
orthologs from the same species.
Table 1
Results from Patchwork when using a D. gyrociliatus SPAdes assembly as the query and USCOs from a long-read assembly of D. gyrociliatus as a reference
Variable              Mean      Min     Median   Max
Reference length      447.606   77      351.0    2,748
Query length          407.953   27      322.0    2,553
Matches               385.075   24      306.0    2,549
Mismatches            21.207    0       0.0      1,075
Deletions             42.131    0       5.0      1,097
Query coverage (%)    92.181    5.22    98.71    100.0
Identity (%)          95.887    30.91   100.0    100.0
Effect of Reference Divergence
In the second iteration, we aligned a set of high-coverage
query assemblies against a very distant reference set, in
order to estimate the program's performance when using
highly divergent sequences as reference. For this purpose,
the same D. gyrociliatus SPAdes assembly as in the previous
evaluation served as the query sequence set, and 957
near-USCOs from the annotated genome of the leech
H. robusta were used as a reference. We retrieved 943
out of the 957 H. robusta reference sequences. Of these,
769 successfully aligned back to 1 and only 1 of the 778
D. gyrociliatus USCOs that were considered homologous
to the H. robusta set (Fig. 3 and Table 2). For these 769
retrieved markers, the average percent identity was
89%, with a mean query coverage of 74.5%. Put differently,
the recovered markers had an average of 67.2% identical
matches against all reference positions.
Program Comparison
In the third setup, we compared the performance and
runtime for Patchwork to that of ALiBaSeq (Knyshov
et al. 2021) and aTRAM 2 (Allen et al. 2018), using a
D. gyrociliatus short-read data set at different sequence
coverage levels (1×, 3×, 5×, 10×, 20×, and 40×). While
Patchwork can use both reads and assembled contigs as
an input, ALiBaSeq uses assembled contigs, and aTRAM 2
is read based. Performance was assessed using a combined
measure for accuracy and completeness of the recovered
USCO markers, hereafter called “total percent
identity.” Patchwork with D. gyrociliatus assembly data
performs best over almost all data sets (Fig. 4), reaching
as much as 62% total percent identity for data with at least
10× coverage. Only at 1× coverage do D. gyrociliatus
read-based data seem to be better suited for marker retrieval.
Cutoff thresholds applied during assembly might
discard part of the sequence data that is retained
when using reads, which would explain why reads achieve
higher query coverage for the 1× data set. Note that query
coverage improves especially for read data when tantan
masking in DIAMOND is disabled (i.e. by providing
--masking 0 as an argument). For data sets with higher
coverage, running Patchwork on read data still achieves
well over 50% total percent identity. Using read data is
therefore a valid option to consider if the compute
resources necessary for assembling the sequences
are scarce. The performance of Patchwork stays approximately
constant for data sets with coverages of at least
10×, regardless of the input data type. By comparison,
ALiBaSeq achieves approximately 7% less total percent
identity than Patchwork with assemblies for all data sets
and performs only slightly better than Patchwork with
read data for a coverage over 10×. aTRAM 2, on the other
hand, performs comparatively poorly, with a maximum to-
tal percent identity of about 22% for the data set with 20×
coverage. This is mostly due to the small number of recov-
ered markers; the markers themselves generally have a high
percent identity value. For a coverage of 1×, aTRAM 2 was
unable to recover any stitched contigs at all. The program
was also not evaluated for the data set of 40× coverage
as it had not completed within the cluster's maximum run-
time of 5 d.
Both Patchwork and ALiBaSeq are very fast; the programs
terminated in under 5 min when using assembly
data (Fig. 5). The runtimes fluctuated only slightly
between data sets. Using Patchwork with reads
required more time for larger data sets, but even for
the largest evaluated data set, it finished in about half an
hour. By comparison, running aTRAM 2 took days for
all data sets.
FIG. 3. Percent identity and query coverage in markers based on a Patchwork analysis of a SPAdes assembly of D. gyrociliatus, targeting 957 single-copy
orthologs from the leech H. robusta.
Table 2
Results from Patchwork when using a D. gyrociliatus SPAdes assembly as the query and USCOs from H. robusta as a reference
Variable              Mean      Min     Median   Max
Reference length      448.025   77      351.0    2,748
Query length          309.93    31      259.0    2,326
Matches               249.126   15      195.0    2,326
Mismatches            30.234    0       7.0      355
Deletions             8.319     0       2.0      268
Query coverage (%)    74.540    5.41    82.78    100.0
Identity (%)          89.008    25.46   96.99    100.0
The recovered markers were evaluated against the set of 778 D. gyrociliatus USCOs that were considered homologous to sequences in the H. robusta reference
set.
FIG. 4. Accuracy and completeness of the recovered marker sequences for the different D. gyrociliatus data sets when run against a reference set of
H. robusta USCOs. Accuracy and completeness were jointly measured as percent identical out of all aligned positions multiplied by the total percentage of
aligned nongap positions. This integrated measure avoids a distorted performance estimation, e.g. due to a small number of recovered markers but high percent
identity in the aligned positions. Patchwork was run with D. gyrociliatus assemblies unless indicated differently. ALiBaSeq received assemblies, while aTRAM 2
received reads as input.
FIG. 5. Program runtime for each D. gyrociliatus data set. Patchwork was run both as a script and as a compiled program. It received D. gyrociliatus
assemblies unless indicated differently. ALiBaSeq was run on assemblies, while aTRAM 2 received reads as input.
Patchwork in a Phylogenomic Context
To demonstrate how our software could be utilized in a
phylogenomic pipeline, we used it to retrieve a set of 957
metazoan-specific USCOs from a phthirapteran data set
(Allen et al. 2017). Reusing a set of 15 louse Operational
Taxonomic Units (Hexapoda: Phthiraptera), we were
able to retrieve all 957 USCOs for all taxa. The resulting
alignments contained few gaps for any marker; i.e. most
markers were well above the 90% aligned position trim-
ming threshold. The trimmed alignment contained
3,454,320 positions in total, compared to 5,383,303 be-
fore trimming (i.e. ∼64% positions were retained after
trimming). Our phylogenetic reconstruction resulted in a
well-supported tree (Fig. 6), which is largely congruent
with the original analysis (Allen et al. 2017), with the exception
of the position of Haematopinus macronemis. However,
this placement is the only part of the tree that is not
well supported, and the reasons for the incongruence are unclear;
one possibility is a slightly different choice of phylogenetic
markers. In general, the approach worked very
well, and for 951 of 954 USCOs, nearly complete exonic
data could be retrieved.
Discussion
Patchwork is a new software tool for quickly mining phylogenetic
markers from WGS data. Since Patchwork can retrieve
homologous regions even in distantly related taxa, this program
lends itself especially well to recovering phylogenetic
markers for phylogenomic studies. It is simultaneously an
efficient way of increasing marker occupancy in poorly assembled
genomes and/or in the presence of multilocus
exons. Finally, Patchwork allows the user to combine 2 dif-
ferent data types—i.e. transcriptomic and genomic data—
into a single data set, thus further enabling an even larger
taxon sampling and encouraging data reusability.
FIG. 6. Phylogenetic analysis of Phthiraptera relationships as recovered from a Maximum Likelihood analysis of a combined supermatrix using USCOs as
recovered by Patchwork. Analysis was conducted using IQ-TREE 2 including model and partition finding. Bootstrap values from 1,000 pseudoreplicates are
given at the branches. Taxa shown: Pediculus humanus B 2013, Pediculus humanus A 2013, Pediculus schaeffi A, Pediculus schaeffi B, Pthirus pubis,
Pthirus gorillae, Pedicinius badius, Neohaematopinus pacificus, Hoplopleura arborcicola, Brueelia antiqua, Bothriometopus macronemis,
Haematopinus macronemis, Proechionopthirus fluctus, Antarctophtirus microchir, and Linognathus spicatus. Tree scale: 0.1.
Special consideration should be taken to avoid the creation
of chimeric sequences. One way in which such sequences
may arise is when orthologous (i.e. genes related
via a speciation event) and paralogous (i.e. genes related
via a gene duplication event) sequences are merged together.
To circumvent this issue, we recommend that the
user limit the use of reference sequences to near-USCOs.
Different lineage-specific sets of such USCOs are available
based on carefully analyzed sets of homologous genes
from a curated database (Manni et al. 2021). Besides their
use in evaluating the quality of genomic and metagenomic
data, USCOs have also become prominent as preselected marker
sets in phylogenomic analyses (Sahbou et al. 2022) and
have recently been proposed as a unifying framework for
DNA-based species delimitation (Dietz et al. 2023). Many
programs, e.g. the aforementioned program BUSCO, exist
for retrieving such sequences from an already assembled
genome, and these could be used as reference sequences
(Waterhouse et al. 2018). Additionally, several downstream
analysis tools are available to control for the presence of
possible cross-contamination, (unexpected) paralogous
copies, or other artifacts confounding systematic studies
(Lozano-Fernandez 2022). To control for the possible
artifactual inclusion of stretches of noncoding sequences, the
tool PREQUAL could be used to detect and remove
such regions (Whelan et al. 2018). Finally, multiple alignment
tools such as MACSE (Ranwez et al. 2018) can be
used to deal with putative confounding problems from
the occurrence of premature stop codons, which might occur
when working with low-coverage genomic data.
The accuracy and robustness of the results depend
on how closely related the target and the reference species
under study are. The difficulty lies in accurately
identifying noncoding regions in aligned contigs; because
alignment trimming relies on gap-excluded identity,
choosing the correct cutoff threshold becomes
easier as the level of identity approaches 100% (the identity
of noncoding regions is likely to stay the same, while the
identity of coding regions increases). On the upside,
high-quality genomes for practically all major lineages exist and
are readily available online (Formenti et al. 2022).
Moreover, and not surprisingly, the coverage of the input
read data sets correlates with the performance of retrieving
single-copy marker genes. Similar to a previous study (Liu
et al. 2021), we also find that a coverage of 10× and
more should be targeted when designing genome skim-
ming studies. However, as seen in the proof of principle,
even lower coverages enable the construction of phyloge-
nomic data matrices. For very low-coverage data sets, the
read-based mode outperforms assembly-based analyses.
For the latter, assembly size seems to be more important
than contiguity.
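To make the role of the identity cutoff in alignment trimming more concrete, the following Python sketch shows one generic way a sliding window over gap-excluded identity can be used to mask poorly aligned stretches. It is not Patchwork's trimming algorithm: the window size, the threshold, the function names, and the choice to replace rejected residues with gap characters are all assumptions made for illustration.

```python
# Generic illustration of window-based trimming; not Patchwork's algorithm.
def window_identity(ref, query, gap="-"):
    """Fraction of identical residues among columns without gaps."""
    pairs = [(r, q) for r, q in zip(ref, query) if r != gap and q != gap]
    if not pairs:
        return 0.0
    return sum(r == q for r, q in pairs) / len(pairs)


def trim_by_window(ref, query, window=4, min_identity=0.75, gap="-"):
    """Keep query residues covered by at least one window passing the cutoff."""
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    keep = [False] * len(query)
    for start in range(len(query) - window + 1):
        end = start + window
        if window_identity(ref[start:end], query[start:end], gap) >= min_identity:
            for i in range(start, end):
                keep[i] = True
    return "".join(q if k else gap for q, k in zip(query, keep))


if __name__ == "__main__":
    # The trailing run of W residues stands in for a translated noncoding stretch.
    print(trim_by_window("MKTAYIAKQR", "MKTAYWWWWW", min_identity=1.0))
```

In this toy example, a strict cutoff removes the entire unrelated tail, whereas a cutoff of 0.75 lets one unrelated residue through. This mirrors the point made above that the cutoff is easier to choose when coding regions are close to 100% identical to the reference.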
In summary, Patchwork allows the retrieval of (putative-
ly) single-copy genes from genome skimming data sets at
different sequencing coverage with high computational
speed. Availability and quality of biological specimens are
becoming the major bottleneck for phylogenomic studies.
Especially for phylogenomic studies relying on collection-
based material, Patchwork offers a fast and efficient way
for marker retrieval from short-read sequence data sets.
Materials and Methods
Patchwork is implemented in Julia (Bezanson et al. 2017), a
just-in-time (JIT)–compiled programming language that
is typically faster than interpreted languages such as
Python or R. Existing Julia bioinformatics packages such as
BioAlignments.jl (https://github.com/BioJulia/BioAlignments.jl)
and BioSequences.jl (https://github.com/BioJulia/BioSequences.jl)
were used to speed up the development
process. Patchwork is obtainable from GitHub
(https://github.com/fethalen/Patchwork), is distributed under
the GPLv3 license, and targets both Linux and macOS
(Windows users may run Patchwork by using the
Windows Subsystem for Linux).
In order to facilitate reproducibility, a Docker container
(Merkel 2014) of Patchwork is also distributed via the
BioContainers framework (da Veiga Leprevost et al.
2017). Similarly, we also provide an Apptainer definition
file for users of the Apptainer/Singularity platform.
Apptainer (formerly known as Singularity; Kurtzer et al.
2017) is another container platform that targets shared sys-
tems such as High-Performance Computing platforms,
which are commonplace at universities today.
Most phylogenomic studies include more than a handful
of taxa, and concatenating these manually gets increasingly
tedious as the data set size increases. Therefore, Patchwork
also includes a set of complementary tools for streamlining
the downstream analysis. For example, the script multi_patchwork.sh
lets the user run Patchwork on multiple input
files and concatenate homologous sequences from differ-
ent taxa into 1 file.
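The idea of collecting homologous sequences from different taxa into one file per marker can be sketched in Python as follows. This is not a description of how multi_patchwork.sh works internally; the data layout (a mapping from taxon name to its recovered markers) and the function names are hypothetical.

```python
# Illustrative sketch; the data layout and names are hypothetical.
def group_by_marker(per_taxon):
    """Turn {taxon: {marker: sequence}} into {marker: [(taxon, sequence), ...]}."""
    by_marker = {}
    for taxon, markers in per_taxon.items():
        for marker, sequence in markers.items():
            by_marker.setdefault(marker, []).append((taxon, sequence))
    return by_marker


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)


if __name__ == "__main__":
    recovered = {
        "taxonA": {"marker1": "MK", "marker2": "AA"},
        "taxonB": {"marker1": "MR"},
    }
    for marker, records in group_by_marker(recovered).items():
        print(marker)
        print(to_fasta(records), end="")
```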
Initial Alignment and Database Construction
First, all reference protein sequences, regardless of whether
they are spread across multiple FASTA files or not, are
pooled together into a single FASTA file, from which a
DIAMOND database is created. There is also the option to
use an existing DIAMOND-formatted database or a BLAST
output file in a tabular format by using the --database or
--tabular options, respectively. These files are both provided
in the output of Patchwork and can thus be reutilized when
trying out different parameters. In either case, DIAMOND's
BLASTX algorithm is used to align translated nucleotide se-
quences to 1 or more reference protein sequences.
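The pooling step can be illustrated with a short Python sketch that concatenates several reference FASTA files into one file and assembles the corresponding `diamond makedb` command. The helper names are hypothetical, the command is only constructed and not executed, and Patchwork itself performs these steps internally in Julia.

```python
# Illustrative sketch; helper names are hypothetical and nothing is executed.
from pathlib import Path


def pool_fasta(paths, pooled_path):
    """Concatenate FASTA files, making sure each one ends with a newline."""
    with open(pooled_path, "w") as out:
        for path in paths:
            text = Path(path).read_text()
            out.write(text if text.endswith("\n") else text + "\n")
    return pooled_path


def makedb_command(pooled_path, db_prefix):
    """Argument list for building a DIAMOND database from the pooled file."""
    return ["diamond", "makedb", "--in", str(pooled_path), "--db", str(db_prefix)]


if __name__ == "__main__":
    print(makedb_command("pooled_references.fa", "references"))
```

The returned argument list could be passed to `subprocess.run` on a system where DIAMOND is installed.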
Like DIAMOND, Patchwork, by default, scores align-
ments using the substitution matrix BLOSUM62 (Henikoff
and Henikoff 1996), a gap open penalty of 11, and a gap
extension penalty of 1. Other built-in or custom substitu-
tion matrices may be used in place of the default option.
User-chosen gap open penalties and gap extension penal-
ties may also be set, as long as they fall within the limits
set by the substitution matrix of choice. For the users’ con-
venience, Patchwork supports a number of different
DIAMOND options that can usually be provided in the
same manner as in DIAMOND itself.
For all Patchwork benchmarks, we observed that disab-
ling DIAMOND's tantan masking (Frith 2011), by setting
--masking 0, as described in Table 2, yielded higher query
coverages. This effect was more pronounced for read
data sets but could also be detected in assembled data
sets. On the other hand, the number of exact matches in
all aligned positions (i.e. percent identity) between the
query and the reference decreased slightly. When
combining both measures, however, disabling tantan
masking improved the overall results.
Since the alignment search is likely to result in more than
1 hit per reference region, 3 measures are taken to ensure
that the retained hits do not overlap: “hit stitching” (also
known as contig or exon stitching; i.e. merging of
overlapping regions), removal of unaligned residues, and
concatenation of nonoverlapping regions.
Hit Stitching
During “hit stitching,” all alignments made between the
query region and the target sequence are merged in a
way such that only the highest-scoring segment pair
(HSP) for each region is retained. This results in a single,
continuous sequence, and, as a consequence, some hits
may be removed entirely (see also Fig. 1).
The “hit stitching” algorithm works as follows: First, all
query regions are sorted by the first and last positions at
which they align to the reference sequence. The first region
is added to a stack. Its start and end coordinates are then
compared with those of the following region to check
whether the 2 regions overlap. If they do not overlap, the
following region is added to the stack and is, in turn,
compared with its successor. If they do overlap, the region
currently at the top of the stack is removed, and the
overlapping parts of this region and the next region are
realigned to identify the best-scoring sequence for that
interval. Based on the realignment score, the sequences are
sliced such that the best-scoring sequence is retained in
the overlapping interval and the nonoverlapping, flanking
parts of both regions, if any, are preserved. Thus, a
maximum of 3 sliced parts are added to the stack as new,
separate regions: the part preceding the overlap, which
originates from the first region; the highest-scoring
sequence at the overlap, which may originate from either of
the 2 regions; and the part following the overlap, which
originates from the second region. The algorithm continues
in the same manner, comparing the topmost region of the
stack with the following region, until all overlaps are
removed and all regions have been added to the stack. This
procedure may require multiple iterations because, in each
pass, only pairs of consecutive regions are compared and
merged.
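The stack-based procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Patchwork's Julia implementation: the names Region, resolve, and stitch are hypothetical, regions are reduced to inclusive reference coordinates, and the realignment of the overlap is replaced by a simple comparison of one score per region.

```python
from dataclasses import dataclass
from operator import attrgetter

BY_POSITION = attrgetter("start", "end")


@dataclass(frozen=True)
class Region:
    start: int    # first reference position covered (inclusive)
    end: int      # last reference position covered (inclusive)
    score: float  # hypothetical stand-in for the realignment score
    source: str   # label of the hit the residues originate from


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def resolve(a, b):
    """Slice 2 overlapping regions into at most 3 nonoverlapping parts."""
    first, second = sorted((a, b), key=BY_POSITION)
    ov_start, ov_end = second.start, min(first.end, second.end)
    # Assumption: the higher score wins the overlap (no realignment here).
    winner = first if first.score >= second.score else second
    parts = []
    if first.start < ov_start:  # part preceding the overlap
        parts.append(Region(first.start, ov_start - 1, first.score, first.source))
    parts.append(Region(ov_start, ov_end, winner.score, winner.source))
    longer = first if first.end > second.end else second
    if longer.end > ov_end:  # part following the overlap
        parts.append(Region(ov_end + 1, longer.end, longer.score, longer.source))
    return parts


def stitch(regions):
    """Repeat passes over consecutive regions until no overlaps remain."""
    regions = sorted(regions, key=BY_POSITION)
    changed = True
    while changed:
        changed = False
        stack = []
        for following in regions:
            if stack and overlaps(stack[-1], following):
                top = stack.pop()
                stack.extend(resolve(top, following))
                changed = True
            else:
                stack.append(following)
        regions = sorted(stack, key=BY_POSITION)
    return regions


hits = [Region(8, 15, 3.0, "B"), Region(1, 10, 2.0, "A")]
for part in stitch(hits):
    print(part)
```

In this sketch, the part following the overlap is taken from whichever region extends further, which also covers the case of one region being nested inside another. Every call to resolve shortens the total length of all regions, so the repeated passes terminate.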
Different aligned regions from the same contig are al-
lowed to be stitched together. While “hit stitching” may re-
sult in the creation of chimeric sequences (i.e. 2 or more
biological sequences incorrectly joined together), this pro-
cedure has the potential to increase coverage and to
(correctly) join 2 or more regions that are located on separ-
ate contigs due to incomplete assembly or sequencing
errors.
Alignment Masking
At this step, unaligned residues, ambiguous amino acid char-
acters, and stop codons (also known as “termination co-
dons”) are all removed from the resulting query sequence.
Query sequences may contain residues that do not align to
any particular region of the subject sequence. Such regions
may be noncoding regions or simply insertions. In either
case, unaligned residues are removed because insertions are
less likely to constitute phylogenetically informative sites
and risk introducing untranslated regions, thereby biasing
the downstream analysis. Similarly, ambiguous amino acids
are most likely noninformative, and stop codons are a clear
indicator that noncoding characters have been included in
the alignment. Although such regions are likely to be
removed in the subsequent step (see below), the user may
choose to keep stop codons and/or ambiguous amino acid
characters by providing the flags --retain-stops and/or
--retain-ambiguous.
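The masking step can be illustrated with a short Python sketch operating on a gapped pairwise alignment. The function name mask_alignment and the set of ambiguous codes (B, J, X, and Z) are assumptions made for this example; the 2 keyword arguments merely mirror the --retain-stops and --retain-ambiguous flags and do not reproduce Patchwork's internal code.

```python
AMBIGUOUS = frozenset("BJXZ")  # assumed set of ambiguous amino acid codes


def mask_alignment(query, reference, retain_stops=False, retain_ambiguous=False):
    """Return the query with unaligned, stop, and ambiguous residues removed."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have the same length")
    kept = []
    for q, r in zip(query, reference):
        if r == "-":
            # Query residue without a reference partner, i.e. unaligned.
            continue
        if q == "*" and not retain_stops:
            continue
        if q in AMBIGUOUS and not retain_ambiguous:
            continue
        kept.append(q)
    return "".join(kept)


print(mask_alignment("MKT*AXLQ", "MK-SAGLQ"))
print(mask_alignment("MKT*AXLQ", "MK-SAGLQ", retain_stops=True))
```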
Sliding Window–Based Alignment Trimming
One side effect of aligning translated nucleotide sequences
to amino acid sequences is that noncoding portions of DNA
may be recovered, provided that the following 2 conditions
are fulfilled: (i) the noncoding DNA is located between 2 or
more coding portions and (ii) there is a region in the
reference sequence that the noncoding region can align to.
In the resulting alignment, noncoding portions are
characterized by many indels, interrupted by occasional
matches. Such alignments of noncoding DNA can already be
observed in the alignments produced by DIAMOND; thus, this
side effect does not stem from Patchwork itself. In fact,
the Patchwork algorithm will only include noncoding parts if
nothing else aligns better to the affected region of the
reference sequence.
To mitigate this effect, we have implemented a sliding
window–based alignment trimming approach to rid the
alignments of these unwanted regions. It works by scanning
the alignment from left to right and cutting all regions in
which the average distance between query and reference
exceeds a distance threshold. Both the window size and the
distance threshold can be set by the user; default values
are used otherwise. This step can also be skipped entirely.
Averaging over a window, rather than evaluating single
positions, avoids cutting out an isolated poor, but
correct, match.
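The idea behind the window-based trimming can be sketched in Python as shown below. Several details are assumptions made for illustration only: the function name trim_alignment, the default values (window of 5, threshold of 0.5), the use of a window centered on each column, and a distance of 0 for identical residues and 1 for mismatches or gaps. Patchwork's own distance measure and defaults may differ.

```python
def trim_alignment(query, reference, window=5, max_distance=0.5):
    """Drop columns whose surrounding window is, on average, too distant."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have the same length")
    # Per-column distance: 0 for identical residues, 1 for mismatch or gap.
    dist = [0 if q == r and q != "-" else 1 for q, r in zip(query, reference)]
    half = window // 2
    keep = []
    for i in range(len(dist)):
        lo, hi = max(0, i - half), min(len(dist), i + half + 1)
        mean = sum(dist[lo:hi]) / (hi - lo)
        keep.append(mean <= max_distance)
    trimmed_query = "".join(q for q, k in zip(query, keep) if k)
    trimmed_reference = "".join(r for r, k in zip(reference, keep) if k)
    return trimmed_query, trimmed_reference


# A run of mismatches and gaps is cut; flanking matches are kept.
print(trim_alignment("AAAAC-C-AAAA", "AAAAAAAAAAAA"))
# A single mismatch surrounded by matches is retained.
print(trim_alignment("AAAACAAAA", "AAAAAAAAA"))
```

Because each column is judged by the mean over its neighborhood, one isolated mismatch contributes at most 1/5 to the window average here and stays below the threshold.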
Concatenation and Realignment of Remaining Regions
Finally, the resulting set of ordered, nonoverlapping
sequence regions is concatenated into 1 continuous
sequence. The concatenated sequence is then realigned to
the reference to obtain the final output sequence and align-
ment score.
Benchmark
Patchwork v.0.5.1 was run throughout using Julia v.1.8.2
and DIAMOND v.2.0.13 (Buchfink et al. 2021), with the
DIAMOND options --ultra-sensitive --frameshift 15
--masking 0. All analyses were performed on the
high-performance computing cluster maintained by the
Gesellschaft für wissenschaftliche Datenverarbeitung mbH
Göttingen (GWDG), running Scientific Linux release 7.9
(Nitrogen) with Linux kernel version 3.10.0. All runs were
allocated 32 Intel Xeon Platinum 9242 CPUs running at
2.30 GHz. Elapsed time was taken as reported by Slurm.
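For orientation, the following Python sketch assembles a standalone DIAMOND BLASTX command line that combines the default scoring scheme (BLOSUM62, gap open penalty of 11, gap extension penalty of 1) with the benchmark options given above. It only approximates the search that Patchwork triggers internally; the function name and all file names are hypothetical, and the command is printed rather than executed.

```python
import shlex


def diamond_blastx_command(query, database, output, threads=32):
    """Build (but do not run) a DIAMOND BLASTX call as an argument list."""
    return [
        "diamond", "blastx",
        "--query", query,        # nucleotide contigs or reads
        "--db", database,        # DIAMOND-formatted protein database
        "--out", output,
        "--outfmt", "6",         # BLAST tabular output
        "--matrix", "BLOSUM62",
        "--gapopen", "11",
        "--gapextend", "1",
        "--ultra-sensitive",
        "--frameshift", "15",
        "--masking", "0",        # disable tantan masking
        "--threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = diamond_blastx_command("contigs.fasta", "reference.dmnd", "hits.tsv")
print(shlex.join(cmd))
```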
Effect of Genome Fragmentation on Accuracy
A publicly available set of Illumina short-read sequences
of D. gyrociliatus (Martín-Durán et al. 2021) was used
for the query set. We used SPAdes v.3.15.3 to generate
the de novo assembly, using a K-mer size of 55. A set
of 815 D. gyrociliatus USCOs retrieved from the published
high-quality genome assembly (GenBank, accession:
GCA_904063045.1) served as the reference.
Effect of Reference Divergence
We reused the de novo assembly from the previous evalu-
ation for the query, while a set of 957 near-USCOs from
the annotated genome of the leech H. robusta (GenBank,
accession: GCA_000326865.1) were used as reference se-
quences. We used the same parameter settings described
above. For this evaluation, we did not use Patchwork's
own accuracy and completeness assessment, because the
true number of identical matches and the amount of query
coverage are not known between the 2 divergent species
D. gyrociliatus and H. robusta. We therefore chose to com-
pare the recovered markers to a subset of the D. gyrociliatus
USCOs described in the previous benchmark. More specif-
ically, only those D. gyrociliatus USCOs that produced a hit
when searching against the H. robusta USCOs with
DIAMOND v2.0.13 in ultrasensitive mode were used, since
only these were considered “recoverable” in this setup. The
resulting D. gyrociliatus USCO set contains 778 sequences;
37 sequences were discarded. The set of recovered markers
was searched against the reference USCO set using
DIAMOND in --ultra-sensitive mode. For each reference se-
quence, we retrieved only the marker that produced the
highest bit score during the alignment step. We then eval-
uated percent identical positions out of all aligned positions
as well as percent of reference sequence positions covered
by the recovered markers.
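Selecting, for each reference sequence, the marker with the highest bit score can be sketched in Python as follows. The sketch assumes the standard 12-column BLAST tabular output, with recovered markers as queries and reference USCOs as subjects; the function name and the example rows are invented for illustration.

```python
import csv
import io

FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def best_hit_per_reference(tabular_text):
    """Map each reference ID to (marker ID, bit score, percent identity)."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(FIELDS, row))
        score = float(hit["bitscore"])
        reference = hit["sseqid"]
        if reference not in best or score > best[reference][1]:
            best[reference] = (hit["qseqid"], score, float(hit["pident"]))
    return best


# Invented example rows (marker, reference, percent identity, ..., bit score).
example = "\n".join([
    "m1\tuscoA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180.0",
    "m2\tuscoA\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-60\t200.5",
    "m3\tuscoB\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60.0",
])
print(best_hit_per_reference(example))
```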
Program Comparison
In order to generate data sets at different sequencing cov-
erages, we subsampled the trimmed D. gyrociliatus reads
downloaded from NCBI GenBank. Corresponding read
pairs were selected randomly from the paired-end data.
Subsampling was done using Subsample.jl, a Julia package
distributed together with Patchwork. The resulting data
sets have coverages of 1×, 3×, 5×, 10×, 20×, and 40×.
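The arithmetic behind subsampling to a target coverage can be sketched in Python as shown below. This is not the Subsample.jl implementation; the function names are hypothetical, and the genome size and read length are placeholder values rather than those of the D. gyrociliatus data. The sketch assumes that coverage equals the total number of sequenced bases divided by the genome size and that both mates of a pair have the same length.

```python
import random


def pairs_for_coverage(coverage, genome_size, read_length):
    """Number of read pairs needed to reach the target coverage."""
    return round(coverage * genome_size / (2 * read_length))


def sample_pair_indices(total_pairs, n_pairs, seed=0):
    """Randomly pick pair indices so that both mates are kept together."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_pairs), min(n_pairs, total_pairs)))


# Placeholder values: 1-Mb genome, 100-bp reads.
for target in (1, 3, 5, 10, 20, 40):
    print(target, pairs_for_coverage(target, 1_000_000, 100))
```

Sampling indices of pairs, rather than of individual reads, keeps corresponding mates together, in line with the random selection of read pairs described above.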
For each of the data sets, we produced a short-read-only
de novo assembly, as ALiBaSeq is designed for assembly
data, while aTRAM 2 requires read data and Patchwork
can process both. We used the assembler SPAdes
v.3.15.3 (Nurk et al. 2013), with a K-mer size of 33, and
the quality of the assembly was assessed using QUAST
v.5.0.2 (Gurevich et al. 2013). We aligned the D. gyrocilia-
tus reads and assemblies against the same set of H. robusta
USCOs mentioned before.
ALiBaSeq v.1.2 was run with the D. gyrociliatus assem-
blies described above. The program requires BLAST; the
version here used was 2.11.0. The program builds a data-
base from the D. gyrociliatus sequences and searches this
database with the H. robusta sequences before stitching
the hits together. We set the parameters according to
the guide for a protein-based search without reciprocal
search, as explained in the ALiBaSeq documentation on GitHub
(see the README file): -x a [extract all hits and join into
(super)contigs], -f S [single alignment table (TBLASTN result
file)], -e 1e-10 [e-value cutoff for further processing of
TBLASTN hits], -c 1 [extract single best (super)contig],
--amalgamate-hits [scoring scheme for (super)contigs], --is
[enable contig stitching], and --ac aa-tdna [search protein
“baits” (H. robusta USCOs) against tDNA “target” database
(D. gyrociliatus reads)].
We ran aTRAM v.2.4.3 with the sampled D. gyrociliatus
read data sets. The program further requires BLAST, a de
novo assembler, and exonerate. We used BLAST v.2.11.0 and
exonerate v.2.2.0 (Slater and Birney 2005) and employed
SPAdes v.3.15.3 for the assembly step. The full aTRAM 2
pipeline consists of 3 consecutive steps: first, the
preparation of a database from the D. gyrociliatus reads;
second, the assembly of the different loci; and last, a
reference-guided stitching process. The parameter settings
for the core module of aTRAM 2 as well as the stitcher were
as follows: --evalue 1e-10 --file-filter
“*.filtered_contigs.fasta” --overlap N.
Patchwork v.0.5.1 was run with both the sampled
D. gyrociliatus read data sets and the assemblies we pro-
duced for these sampled read data. We ran the uncom-
piled program using Julia v.1.8.2 as well as the compiled
version on each data set in order to perform runtime com-
parisons. Patchwork achieves all its objectives in a 1-step
procedure, i.e. can be called with a single command,
unlike ALiBaSeq and aTRAM 2. The program builds a
DIAMOND database from the H. robusta sequences and,
after obtaining D. gyrociliatus hits, proceeds to stitch
them together. All nondefault parameter settings for
Patchwork were as described above. They were used for
both read-based runs and assembly-based runs.
We ran the 3 programs with their respective parameter
settings on the different D. gyrociliatus data sets against
the H. robusta USCO set described above, which contains
957 sequences. The aTRAM 2 run for the 40× coverage read
data set was aborted because it had not finished after
5 days. Subsequently, the recovered markers produced by each
program for each data set were evaluated with respect to
completeness and accuracy by comparing them to the same set
of 778 D. gyrociliatus USCOs mentioned above, again because
only this subset could be recovered by the programs in this
setup. ALiBaSeq and aTRAM 2 output DNA sequences that
contain the ambiguous nucleotide N at all positions that
could not be recovered during stitching. These Ns were
removed before the subsequent evaluation steps because they
distort the query coverage measure: the uninformative
inserted Ns artificially increase the proportion of a
reference sequence that is covered by the recovered marker.
Completeness and accuracy were measured jointly
as percent identical aligned positions multiplied by
the total amount of aligned, or recovered, positions (here
called $p_{\mathrm{identical,cov}}$):

$$p_{\mathrm{identical,cov}} = \frac{n_{\mathrm{match}}}{n_{\mathrm{aligned}}} \cdot \mathrm{cov},$$

$$\mathrm{cov} = \frac{\sum_{s_{\mathrm{recovered}}} \mathrm{length}(s_{\mathrm{recovered}})}{\sum_{s_{\mathrm{USCOs}}} \mathrm{length}(s_{\mathrm{USCOs}})},$$

with $n_{\mathrm{match}}$ being the total number of exact matches in
all alignments between recovered markers and reference USCOs
and $n_{\mathrm{aligned}}$ the total number of aligned, i.e. nongap,
positions. The coverage cov was computed as the ratio of the
total lengths of all recovered markers $s_{\mathrm{recovered}}$ and all
reference USCOs $s_{\mathrm{USCOs}}$. We chose to combine the measures for
accuracy of the recovered markers, i.e. percent identical out
of all aligned positions, and completeness or query cover-
age, i.e. percent recovered positions, in order to avoid a dis-
torted outcome. For example, a program might recover
only a very small number of markers but these with high
percent identity, such that using only the percent identity
measure would have resulted in an overestimation of the
program's performance.
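The combined measure can be written out as a small Python sketch: the fraction of identical positions among all aligned positions is multiplied by the coverage, i.e. the total length of the recovered markers divided by the total length of the reference USCOs. The function names and the numbers in the usage example are invented for illustration; strip_ns corresponds to the removal of the ambiguous nucleotide N described above.

```python
def strip_ns(sequence):
    """Remove the ambiguous nucleotide N before measuring lengths."""
    return sequence.replace("N", "")


def coverage(recovered_lengths, reference_lengths):
    """Total recovered length divided by total reference length."""
    return sum(recovered_lengths) / sum(reference_lengths)


def p_identical_cov(n_match, n_aligned, recovered_lengths, reference_lengths):
    """Fraction of identical aligned positions, weighted by coverage."""
    return (n_match / n_aligned) * coverage(recovered_lengths, reference_lengths)


# Invented numbers: high identity but low coverage yields a low combined score.
print(p_identical_cov(99, 100, [100], [1000]))
# Moderate identity with full coverage scores much higher.
print(p_identical_cov(90, 100, [500, 500], [500, 500]))
```

The 2 invented cases reflect the motivation given above: a program that recovers few positions at high identity is not rated above one that recovers everything at slightly lower identity.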
Patchwork in a Phylogenomic Context
We retrieved the raw reads from the NCBI Sequence Read
Archive (SRA accessions SRR5088465, SRR5088468,
SRR5308129, SRR5308123, SRR5088469, SRR5088471,
SRR5088472, SRR5088473, SRR1182279, SRR5308136,
SRR5308138, SRR5088474, SRR5088475, SRR5308112,
and SRR5088466) using prefetch, vdb-validate, and
fasterq-dump (with the flag --split-spot), all from the NCBI
SRA toolkit (Leinonen et al. 2011). We ran Patchwork v.0.5.1
with each of the specimens as query input and a set of 957
near-USCOs from the leech H. robusta as reference sequences
(see Table 2 for parameter settings). A multiple sequence
alignment (MSA) was constructed for each of these 957 loci
using MAFFT (Katoh and Standley 2013) with the options
--globalpair --ep 0.123. The resulting alignments were
trimmed with trimAl (Capella-Gutiérrez et al. 2009),
removing all positions with more than 90% gaps but retaining
at least 60% of each alignment (options -gt 0.9 and
-cons 60, respectively). We used FASconCAT-G (Kück and
Longo 2014) to concatenate the trimmed alignments into
a supermatrix. This supermatrix was then passed to
IQ-TREE 2 (Minh et al. 2020), alongside its corresponding
gene partition file, to reconstruct the phylogeny using the
maximum likelihood (ML) approach. We ran IQ-TREE 2
with extended model selection and tree inference,
calculating 1,000 replicates for the ultrafast bootstrap
(command line options -m MFP and -B 1000, respectively).
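The command lines implied by this workflow can be assembled with the following Python sketch. Only the options named in the text are taken from the source; the input and output options (-in and -out for trimAl, -s and -p for IQ-TREE 2), the executable name iqtree2, and all file names are assumptions for illustration. The FASconCAT-G step is omitted because its options are not given here, and the commands are printed rather than executed.

```python
import shlex


def sra_commands(accession):
    """Download, validate, and convert 1 SRA run to FASTQ."""
    return [
        ["prefetch", accession],
        ["vdb-validate", accession],
        ["fasterq-dump", "--split-spot", accession],
    ]


def mafft_command(fasta_in):
    # MAFFT writes the alignment to standard output.
    return ["mafft", "--globalpair", "--ep", "0.123", fasta_in]


def trimal_command(alignment_in, alignment_out):
    return ["trimal", "-in", alignment_in, "-out", alignment_out,
            "-gt", "0.9", "-cons", "60"]


def iqtree_command(supermatrix, partitions):
    return ["iqtree2", "-s", supermatrix, "-p", partitions,
            "-m", "MFP", "-B", "1000"]


# Hypothetical file names, for illustration only.
commands = sra_commands("SRR5088465") + [
    mafft_command("locus_0001.fasta"),
    trimal_command("locus_0001.aln", "locus_0001.trimmed.aln"),
    iqtree_command("supermatrix.fas", "partitions.txt"),
]
for command in commands:
    print(shlex.join(command))
```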
Acknowledgments
This work used the Scientific Compute Cluster at GWDG,
the joint data center of Max Planck Society for the
Advancement of Science (MPG) and University of
Göttingen. We acknowledge support by the Open Access
Publication Funds of the Göttingen University.
Funding
This work was supported by the German Research
Foundation (DFG) BL787/8-1.
Data Availability
The data underlying this article are available via GitHub at
https://github.com/animal-evolution-and-biodiversity/benchmarking-patchwork.
Patchwork is distributed under the GPLv3 license via GitHub at
https://github.com/fethalen/patchwork.
Literature Cited
Allen JM, Boyd B, Nguyen NP, Vachaspati P, Warnow T, Huang DI,
Grady PGS, Bell KC, Cronk QCB, Mugisha L, et al.
Phylogenomics from whole genome sequences using aTRAM.
Syst Biol. 2017:66(5):786–798. https://doi.org/10.1093/sysbio/
syw105.
Allen JM, LaFrance R, Folk RA, Johnson KP, Guralnick RP. aTRAM 2.0:
an improved, flexible locus assembler for NGS data. Evol
Bioinform. 2018:14:1176934318774546. https://doi.org/10.
1177/1176934318774546.
Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: a fresh approach to
numerical computing. SIAM Rev. 2017:59(1):65–98. https://doi.
org/10.1137/141000671.
Bleidorn C. Phylogenomics. An introduction. Cham: Springer
International Publishing; 2017.
Blom MPK. Opportunities and challenges for high-quality bio-
diversity tissue archives in the age of long-read sequencing.
Mol Ecol. 2021:30(23):5935–5948. https://doi.org/10.1111/
mec.15909.
Bragg JG, Potter S, Bi K, Moritz C. Exon capture phylogenomics: effi-
cacy across scales of divergence. Mol Ecol Res. 2016:16(5):
1059–1068. https://doi.org/10.1111/1755-0998.12449.
Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at
tree-of-life scale using DIAMOND. Nature Meth. 2021:18(4):
366–368. https://doi.org/10.1038/s41592-021-01101-x.
Call E, Mayer C, Twort V, Dietz L, Wahlberg N. Museomics:
phylogenomics of the moth family Epicopeiidae (Lepidoptera)
using target enrichment. Insect Syst Divers. 2021:5(2):6.
https://doi.org/10.1093/isd/ixaa021.
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for
automated alignment trimming in large-scale phylogenetic analyses.
Bioinformatics 2009:25(15):1972–1973.
https://doi.org/10.1093/bioinformatics/btp348.
Dahn HA, Mountcastle J, Balacco J, Winkler S, Bista I, Schmitt AD,
Pettersson OV, Formenti G, Oliver K, Smith M, et al.
Benchmarking ultra-high molecular weight DNA preservation
methods for long-read and long-range sequencing. GigaScience
2022:11:giac068. https://doi.org/10.1093/gigascience/giac068.
da Veiga Leprevost F, Grüning BA, Alves Aflitos S, Röst HL, Uszkoreit J,
Barsnes H, Perez-Riverol Y. BioContainers: an open-source and
community-driven framework for software standardization.
Bioinformatics 2017:33(16):2580–2582. https://doi.org/10.1093/
bioinformatics/btx192.
Dietz L, Eberle J, Mayer C, Kukowka S, Bohacz C, Baur H, Espeland M,
Huber BA, Hutter C, Mengual X, et al. Standardized nuclear mar-
kers improve and homogenize species delimitation in Metazoa.
Methods Ecol Evol. 2023:14(2):543–555. https://doi.org/10.
1111/2041-210X.14041.
Dodsworth S. Genome skimming for next-generation biodiversity ana-
lysis. Trends Plant Sci. 2015:20(9):525–527. https://doi.org/10.
1016/j.tplants.2015.06.012.
Dylus D, Altenhoff A, Majidian S, Sedlazeck FJ, Dessimoz C. Inference of
phylogenetic trees directly from raw sequencing reads using
Read2Tree. Nat Biotechnol. 2023. https://doi.org/10.1038/s41587-
023-01753-4.
Erséus C, Williams BW, Horn KM, Halanych KM, Santos SR, James SW,
Des Creuzé Châtelliers M, Anderson FE. Phylogenomic analyses re-
veal a Palaeozoic radiation and support a freshwater origin for cli-
tellate annelids. Zool Scr. 2020:49(5):614–640. https://doi.org/10.
1111/zsc.12426.
Formenti G, Theissinger K, Fernandes C, Bista I, Bombarely A,
Bleidorn C, Ciofi C, Crottini A, Godoy JA, Höglund J, et al. The
era of reference genomes in conservation genomics. Trends
Ecol Evol. 2022:37(3):197–202. https://doi.org/10.1016/j.tree.
2021.11.008.
Frith MC. A new repeat-masking method enables specific detection of
homologous sequences. Nucleic Acids Res. 2011:39(4):e23.
https://doi.org/10.1093/nar/gkq1212.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment
tool for genome assemblies. Bioinformatics 2013:29(8):
1072–1075. https://doi.org/10.1093/bioinformatics/btt086.
Heath T, Hedtke SM, Hillis DM. Taxon sampling and the accuracy
of phylogenetic analyses. J Syst Evol. 2008:46:239–257.
https://doi.org/10.3724/SP.J.1002.2008.08016.
Henikoff JG, Henikoff S. Blocks database and its applications. Meth
Enzymol. 1996:266:88–105. https://doi.org/10.1016/S0076-
6879(96)66008-X.
Hu T, Chitnis N, Monos D, Dinh A. Next-generation sequencing tech-
nologies: an overview. Human Immunol. 2021:82(11):801–811.
https://doi.org/10.1016/j.humimm.2021.02.012.
Jin J-J, Yu W-B, Yang J-B, Song Y, dePamphilis CW, Yi T-S, Li D-Z.
GetOrganelle: a fast and versatile toolkit for accurate de novo as-
sembly of organelle genomes. Genome Biol. 2020:21(1):241.
https://doi.org/10.1186/s13059-020-02154-5.
Katoh K, Standley DM. MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. Mol Biol
Evol. 2013:30(4):772–780. https://doi.org/10.1093/molbev/
mst010.
Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J. Combining
RNA-seq data and homology-based gene prediction for plants, an-
imals and fungi. BMC Bioinformatics 2018:19(1):189. https://doi.
org/10.1186/s12859-018-2203-5.
Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F.
Using intron position conservation for homology-based gene pre-
diction. Nucleic Acids Res. 2016:44(9):e89. https://doi.org/10.
1093/nar/gkw092.
Knyshov A, Gordon ERL, Weirauch C. New alignment-based sequence
extraction software (ALiBaSeq) and its utility for deep level phylo-
genetics. PeerJ 2021:9:e11019. https://doi.org/10.7717/peerj.
11019.
Kück P, Longo GC. FASconCAT-G: extensive functions for multiple se-
quence alignment preparations concerning phylogenetic studies.
Frontiers Zool. 2014:11(1):81. https://doi.org/10.1186/s12983-
014-0081-x.
Kurtzer GM, Sochat V, Bauer MW. Singularity: scientific containers for
mobility of compute. PLoS One 2017:12(5):e0177459. https://doi.
org/10.1371/journal.pone.0177459.
Leinonen R, Sugawara H, Shumway M. The sequence read archive.
Nucleic Acids Res. 2011:39(Database):D19–D21. https://doi.org/
10.1093/nar/gkq1019.
Lemmon EM, Lemmon AR. High-throughput genomic data in systema-
tics and phylogenetics. Annu Rev Ecol Evol Syst. 2013:44(1):
99–121. https://doi.org/10.1146/annurev-ecolsys-110512-135822.
Liu B-B, Liu B-B, Ma Z-Y, Ren C, Hodel RGJ. Capturing single-copy
nuclear genes, organellar genomes, and nuclear ribosomal
DNA from deep genome skimming data for plant phylogenetics:
a case study in Vitaceae. Appl Plant Sci. 2021:11(4):e11537.
https://doi.org/10.1111/jse.12806.
Lozano-Fernandez J. A practical guide to design and assess a phyloge-
nomic study. Genome Biol Evol. 2022:14(9):evac129. https://doi.
org/10.1093/gbe/evac129.
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO up-
date: novel and streamlined workflows along with broader and
deeper phylogenetic coverage for scoring of eukaryotic, prokaryot-
ic, and viral genomes. Mol Biol Evol. 2021:38(10):4647–4654.
https://doi.org/10.1093/molbev/msab199.
Martín-Durán JM, Vellutini BC, Marlétaz F, Cetrangolo V, Cvetesic N,
Thiel D, Henriet S, Grau-Bové X, Carrillo-Baltodano AM, Gu W,
et al. Conservative route to genome compaction in a miniature an-
nelid. Nat Ecol Evol. 2021:5(2):231–242. https://doi.org/10.1038/
s41559-020-01327-6.
McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT.
Applications of next-generation sequencing to phylogeography
and phylogenetics. Mol Phylogenet Evol. 2013:66(2):526–538.
https://doi.org/10.1016/j.ympev.2011.12.007.
Merkel D. Docker: lightweight linux containers for consistent
development and deployment. Linux J. 2014:239(2):2.
https://doi.org/10.5555/2600239.2600241.
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von
Haeseler A, Lanfear R. IQ-TREE 2: new models and efficient methods
for phylogenetic inference in the genomic era. Mol Biol Evol.
2020:37(5):1530–1534. https://doi.org/10.1093/molbev/msaa015.
Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A,
Lapidus A, Prjibelski AD, Pyshkin A, Sirotkin A, Sirotkin Y, et al.
Assembling single-cell genomes and mini-metagenomes from chi-
meric MDA products. J Comput Biol. 2013:20(10):714–737.
https://doi.org/10.1089/cmb.2013.0084.
Philippe H, de Vienne DM, Ranwez V, Roure B, Baurain D. Pitfalls
in supermatrix phylogenomics. Eur J Taxon. 2017:283:1–25.
https://doi.org/10.5852/ejt.2017.283.
Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F. MACSE v2:
toolkit for the alignment of coding sequences accounting for fra-
meshifts and stop codons. Mol Biol Evol. 2018:35(10):
2582–2584. https://doi.org/10.1093/molbev/msy159.
Raxworthy CJ, Smith BT. Mining museums for historical DNA: advances
and challenges in museomics. Trends Ecol Evol. 2021:36(11):
1049–1060. https://doi.org/10.1016/j.tree.2021.07.009.
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S,
Uliano-Silva M, Chow W, Fungtammasan A, Kim J, et al.
Towards complete and error-free genome assemblies of all verte-
brate species. Nature 2021:592(7856):737–746. https://doi.org/
10.1038/s41586-021-03451-0.
Richter S, Schwarz F, Hering L, Böggemann M, Bleidorn C. The utility of
genome skimming for phylogenomic analyses as demonstrated for
glycerid relationships (Annelida, Glyceridae). Genome Biol Evol.
2015:7(12):3443–3462. https://doi.org/10.1093/gbe/evv224.
Rogozin IB, Sverdlov AV, Babenko VN, Koonin EV. Analysis of evolution
of exon-intron structure of eukaryotic genes. Brief Bioinformatics
2005:6(2):118–134. https://doi.org/10.1093/bib/6.2.118.
Sahbou A-E, Iraqi D, Mentag R, Khayi S. BuscoPhylo: a webserver for
Busco-based phylogenomic analysis for non-specialists. Sci Rep.
2022:12(1):17352. https://doi.org/10.1038/s41598-022-22461-0.
Salzberg SL. Next-generation genome annotation: we still struggle to
get it right. Genome Biol. 2019:20(1):92. https://doi.org/10.
1186/s13059-019-1715-2.
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S,
Treangen TJ, Schatz MC, Delcher AL, Roberts M, et al. GAGE: a crit-
ical evaluation of genome assemblies and assembly algorithms.
Genome Res. 2012:22(3):557–567. https://doi.org/10.1101/gr.
131383.111.
Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assem-
bly and annotation completeness. Methods Mol Biol. 2019:1962:
227–245. https://doi.org/10.1007/978-1-4939-9173-0_14.
Simmons SK, Lithwick-Yanai G, Adiconis X, Oberstrass F, Iremadze N,
Geiger-Schuller K, Thakore PI, Frangieh CJ, Barad O, Almogy G,
et al. Mostly natural sequencing-by-synthesis for scRNA-seq using
Ultima sequencing. Nat Biotechnol. 2023:41(2):204–211. https://
doi.org/10.1038/s41587-022-01452-6.
Slater GSC, Birney E. Automated generation of heuristics for biological
sequence comparison. BMC Bioinformatics 2005:6(1):31. https://
doi.org/10.1186/1471-2105-6-31.
Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat
Rev Genet. 2019:20(11):631–656. https://doi.org/10.1038/
s41576-019-0150-2.
Steenwyk JL, Li Y, Zhou X, Shen XX, Rokas A. Incongruence in the phy-
logenomics era. Nat Rev Genet. 2023:24(12):834–850. https://doi.
org/10.1038/s41576-023-00620-x.
Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P,
Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO applica-
tions from quality assessments to gene prediction and phyloge-
nomics. Mol Biol Evol. 2018:35(3):543–548. https://doi.org/10.
1093/molbev/msx319.
Whelan S, Irisarri I, Burki F. PREQUAL: detecting non-homologous
characters in sets of unaligned homologous sequences.
Bioinformatics 2018:34(22):3929–3930. https://doi.org/10.1093/
bioinformatics/bty448.
Zhang F, Ding Y, Zhu C-D, Zhou X, Orr MC, Scheu S, Luan Y-X.
Phylogenomics from low-coverage whole-genome sequencing.
Methods Ecol Evol. 2019:10(4):507–517. https://doi.org/10.
1111/2041-210X.13145.
Associate editor: Dennis Lavrov