SCIENTIFIC DATA | (2024) 11:540 | https://doi.org/10.1038/s41597-024-03277-1
www.nature.com/scientificdata
The reference genome of Macropodus opercularis (the paradise fish)
Erika Fodordo
Czimer
LowLiew
Rhie
1Department of Genetics, ELTE Eötvös Loránd University, Budapest, Hungary. 2Translational and Functional Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA. 3Frontline Fish Genomics Research Group, Department of Applied Fish Biology, Institute of Aquaculture and Environmental Safety, Hungarian University of Agriculture and Life Sciences, Georgikon Campus, Keszthely, Hungary. 4Science Unit, Lingnan University, Hong Kong, China. 5Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA. 6Department of Ethology, ELTE Eötvös Loránd University, Budapest, Hungary. 7These authors contributed equally: Erika Fodor, Javan Okendo. ✉e-mail: mvarga@ttk.elte.hu; burgess@mail.nih.gov
During the 20th century experimental biology gained increased influence over descriptive biology and concomitantly most research efforts began to narrow into a small number of “model” species. These organisms were not only selected because they were considered to be representative models for the examined phenomena but were also easy and cheap to maintain in laboratory conditions1,2. Working with these convenient experimental models had several advantages and made a rapid accumulation of knowledge possible. It enabled scientists to compare and build on each other’s findings efficiently as well as to share valuable data and resources that accelerated discovery. As a result of this, a handful of model species have dominated the field of biomedical studies.
Despite their broad success, these models also brought limitations. As Bolker pointed out: “The extraordinary resolving power of core models comes with the same trade-off as a high-magnification lens: a much-reduced field of view”3. In the case of zebrafish research this trade-off has been perhaps most apparent for behavioral studies. Zebrafish are an inherently social (shoaling) species, but most behavioral studies use them in solitary settings, which arguably is a non-natural environment for them. Therefore, the use of other teleost species with more solitary behavioral profiles is warranted for studies of individual behaviors.
Paradise fish (Macropodus opercularis Linnaeus, 1758) are a relatively small (8–11 cm long) freshwater fish native to East Asia, Southern China, Northern Vietnam, and Laos where they are commonly found in shallow waters with dense vegetation and reduced dissolved oxygen4. Similar to all other members of the suborder Anabantoidei, they are characterized by the capacity to take up oxygen directly from the air through a highly vascularized structure covered with respiratory epithelium, the labyrinth organ (LO)5. The ability to “air-breathe” allows anabantoids to inhabit swamps and small ponds with low levels of dissolved oxygen that would be impossible for other fish species, therefore the LO can be considered an adaptation to hypoxic conditions6. The evolution of the LO has also improved hearing in some species7,8, and may have led to the emergence of novel and elaborate mating behaviors, including courtship, territorial display, and parental care6,9. Another interesting behavior of these fish is that they build egg “nests” by blowing bubbles on the surface of the water6,10. These intricate and complex behaviors made paradise fish an important ethological model during the 1970s and 1980s, which resulted in a detailed ethogram of the species11,12.
We propose that with recently developed husbandry protocols13 and the advent of novel molecular techniques for genome editing and transgenesis, paradise fish could become an important complementary model species for neurogenetic studies14. Furthermore, several genomes are now available for the Siamese fighting fish (Betta splendens), a closely related species to the paradise fish15–17, so a good quality genome sequence of paradise fish would enable comparative ecological and evolutionary (eco-evo) studies.
While the mitochondrial genome was already available for this species18, a full genome sequence was lacking. Here, we provide a brief description and characterization of a high quality, de novo paradise fish reference genome and transcriptome assembly.
The paradise fish used to establish our colony and the source of the transcriptome samples were purchased from a local pet store (Trioker Ltd., Érd, Hungary). Adult paradise fish were kept in aerated glass aquariums in the animal facility of the Institute of Biology at ELTE Eötvös Loránd University. Husbandry conditions were specified previously13. Embryos were raised at 28.5 °C and staged as described before19. All experimental procedures were approved by the Hungarian National Food Chain Safety Office (Permit Number: PE/EA/406-7/2020). Animal experiments in Hungarian academic research centres are regulated by decree no. 40/2013 (14.II.) issued by the Hungarian Government, which was drafted based on Directive 2010/63/EU on the protection of animals used for scientific purposes. The research on paradise fish in Dr. Varga’s laboratory was made possible by permit no. PE/EA/406-7/2020, issued by the Pest County Government Office on the basis of the above-mentioned government regulation. Wild-caught adult paradise fish were captured in the areas surrounding Hong Kong and the specimens were handled in accordance with protocols outlined in the Research Ethics Approval Application via Lingnan University (Reference number: EC051/2021). Permission to collect wild specimens was also granted in a permit obtained from the AFCD (Agriculture, Fisheries, and Conservation Department). The permit number is “AF GR CON 11/17 Pt. 7”.
RNA samples were collected from a mix of embryonic stages (from stage 9 to 5 days post fertilization), from caudal tail blastema taken at 3 and 5 days post amputation, from the kidney, heart, brain, and ovaries of an adult female, and from the brain and testis of an adult male paradise fish. Total RNA was isolated using TRIzol (Invitrogen, 15596026), following the manufacturer’s protocol. Samples were purified twice with ethanol and eluted in water. Quality and integrity of the samples were tested on an agarose gel, by Nanodrop, and using an Agilent 2100. Ribosomal RNA (rRNA) was removed using the Illumina Ribo-Zero kit and paired-end (PE) libraries were prepared using standard Illumina protocols. Samples were processed on an Illumina NovaSeq PE150 platform, and a total of 218,715,409 PE reads (2 × 150 bp) were sequenced, resulting in ~65 Gbp of raw transcriptomic data.
Genomic DNA samples were isolated from the tail fin of the parental F0 male and female paradise fish using the Qiagen DNeasy Blood and Tissue Kit (cat no: 69504). Samples were eluted in TE and sent for library preparation and sequencing. Sample quality checks were performed using standard agarose gel electrophoresis and with a Qubit 2.0 instrument. For Illumina short-read sequencing a size-selected 150 bp insert DNA library was prepared and processed on the Illumina NovaSeq 5000 platform. Approximately 100 million PE reads (2 × 150 bp) were sequenced for each parent, resulting in approximately 60X coverage for each genome. For PacBio HiFi long-read single molecule real-time (SMRT) sequencing libraries, genomic DNA was prepared using whole tissue from the 6-month-old F1 offspring and the Circulomics Nanobind tissue kit. Sequence libraries were prepared using the PacBio SMRTbell Template Preparation Kit and HiFi sequenced on a Sequel II platform. A total of 4,885,238 reads (average length: 15.5 kbp) resulted in ~73 Gbp of raw genomic sequence data.
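The yield and coverage figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It is an illustration only, not part of the published workflow: the function names are hypothetical, and the 0.5 Gbp genome size is assumed here to be the same haploid estimate that was supplied to the assembler.

```python
def raw_yield_bp(n_reads: int, read_len_bp: float, paired: bool = False) -> float:
    """Total sequenced bases: reads x read length (x2 when each entry is a read pair)."""
    return n_reads * read_len_bp * (2 if paired else 1)


def fold_coverage(yield_bp: float, genome_size_bp: float) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    return yield_bp / genome_size_bp


# Assumption: 0.5 Gbp haploid genome size estimate.
GENOME_SIZE_BP = 0.5e9

# ~100 million paired-end reads (2 x 150 bp) per parent, as stated in the text.
illumina = raw_yield_bp(100_000_000, 150, paired=True)
# 4,885,238 HiFi reads with a (rounded) average length of 15.5 kbp.
hifi = raw_yield_bp(4_885_238, 15_500)

for label, bases in (("Illumina, per parent", illumina), ("PacBio HiFi", hifi)):
    depth = fold_coverage(bases, GENOME_SIZE_BP)
    print(f"{label}: {bases / 1e9:.1f} Gbp, about {depth:.0f}X")
```

Because the average HiFi read length is rounded, the product lands slightly above the ~73 Gbp reported from the actual read lengths; the Illumina estimate matches the stated ~60X per parent.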
All software versions used are listed in Supplementary Table S4. Raw data pre-processing consisted of quality control, adapter trimming, and filtration of low-quality reads using Trim Galore, a wrapper around FastQC and Cutadapt20. The genome assembly was generated with the hifiasm genome assembler21 using the High-Performance Computing facility at the National Institutes of Health. For the assembly, 32 processing cores and 512 GB of memory were used. The lower and upper bounds for binned k-mers were set to 25 and 75, respectively. The estimated haploid genome size used for inferring read depth was set to 0.5 Gbp. The remaining hifiasm default settings were used to assemble the homozygous genome, with the built-in duplication purging parameter set to -l1. The primary assembly Graphical Fragment Assembly (GFA) file was converted to FASTA format using the awk command.
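The GFA-to-FASTA conversion amounts to extracting the segment ("S") records from the assembly graph. The Python sketch below shows an equivalent of that step; it is not the awk command actually used, the function names are hypothetical, and the demo records are made-up sequences that only imitate hifiasm-style contig names.

```python
from typing import Iterable, Iterator, Tuple


def gfa_segments(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every GFA segment line that carries a sequence."""
    for line in lines:
        if not line.startswith("S\t"):
            continue  # skip header, link, and other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq != "*":  # "*" marks a segment stored without its sequence
            yield name, seq


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format records as FASTA text with fixed-width sequence lines."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


# Toy input (not real assembly data).
demo_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tptg000001l\tACGTACGTAC\tLN:i:10\n",
    "S\tptg000002l\tGGGTTA\tLN:i:6\n",
]
fasta = to_fasta(gfa_segments(demo_gfa), width=8)
print(fasta, end="")
```

For a real assembly the same functions can be fed an open file handle instead of the demo list, since both accept any iterable of lines.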
The Trinity assembler22 was used to create a set of RNA transcripts from the bulk RNA-seq data. To aid in gene prediction, we downloaded the reviewed Swissprot/Uniprot vertebrate proteins (download date 12/01/2022; 97,804 protein entries) for homology comparisons in annotation pipelines. Gene prediction was done using AUGUSTUS23 and GeneMark-ES24 as part of the BRAKER pipeline25 to train the AUGUSTUS parameters. Final annotation using the assembled transcripts and the vertebrate protein database was done using the MAKER pipeline26 with the EVidenceModeler27 tool switched on to improve gene structure annotation.
Sources for the reference data used to create Figs. 2, 3: refs. 28–34.
We performed OrthoFinder analysis35,36 with default parameters, using predicted peptides (including all alternative splice versions) of the zebrafish genome assembly GRCz11, medaka genome assembly ASM223467v1 and B. splendens genome assembly fBetSpl5.3. Sequences were downloaded from the ENSEMBL and NIH/NCBI Assembly homepages, respectively.
The Illumina short read files (accession number ERR3332352) were downloaded from the Vertebrate Genome Project (VGP) database (https://vertebrategenomesproject.org/). Illumina short read sequencing was also performed on genomic DNA obtained from the tails of 3 wild-caught fish from the Hong Kong region. Trim Galore version 0.6.10 (https://github.com/FelixKrueger/TrimGalore), a wrapper around Cutadapt and FastQC, was then used to trim the Illumina adapter sequences and to discard reads shorter than 25 bp. DRAGMAP version 1.3.0 (https://github.com/Illumina/DRAGMAP) was used to map the reads to the reference genome. The resultant sequence alignment map (SAM) file was then converted to binary alignment map (BAM) format, sorted, and indexed using samtools. Picard was then used to add read group information to the BAM file. The Genome Analysis Toolkit (GATK) was then used to call variants with the DRAGEN mode turned on. The bamtools stats and plot-vcfstats tools were used for downstream analysis and visualization of the genomic variants in the variant call format (VCF) file.
The assembly and all DNA and RNA raw reads have been deposited in the NCBI under the BioProject study accession PRJNA824432. Within that project there is the GenBank assembly macOpe2 (GCA_030770545.1), 9 RNA-seq raw sequence data files (SRX20729884, SRX15898419, SRX15898418, SRX15898417, SRX15898416, SRX15898415, SRX15898414, SRX15898413, SRX15898412), one PacBio HiFi genomic raw sequence data file (SRX15948463), one Illumina PE short read genomic raw sequence data file for the assembly (SRX15948462) and Illumina short read genomic raw sequence data files for the 3 wild-caught samples (SAMN39260618, SAMN39260619, SAMN39260620)37,38. The variant data for this study have been deposited in the European Variation Archive (EVA) at EMBL-EBI under accession number PRJEB74481 (ref. 39).
We generated the de novo reference genome sequence for this species using 150X coverage of PacBio SMRT HiFi long-read sequencing and the hifiasm genome assembly pipeline21. The final assembly consisted of 483,077,705 base pairs (bp) on 152 contigs (Supplementary Table S1). The assembled genome demonstrated very high contiguity, with an N50 of 19.2 megabases (Mb) in 12 contigs. The largest contig was 24,022,457 bp and the shortest contig was 14,205 bp. More than 98% of the canonical k-mers were present at 1x copy number, indicating that our genome assembly is of very good quality (Fig. 1a). The paradise fish genome repeat content is estimated to be ~10.4%. The “trio binning”40 mode of hifiasm was attempted using single nucleotide variant (SNV) data collected from short read sequencing of the F0 parents; however, the heterozygosity rate of the lab-raised fish was very low at ~0.07%, making it impossible to efficiently separate maternal and paternal haplotypes. The resulting assembled reference genome is therefore a pseudohaplotype. The sequence of the mitochondrial genome (mtDNA) was essentially identical to the previously published mtDNA sequence for this species (16,495/16,496 identities)18. We followed the B. splendens example and numbered the M. opercularis chromosomes based on their similarity to medaka chromosomes, resulting in the chromosomes being numbered 1–19 and 21–24. We performed a whole genome alignment to a recent Betta splendens assembly34: 21 chromosomes had a 1-to-1 relationship, with B. splendens chromosome 9 aligning to two separate paradise fish contigs (Fig. 1b) and M. opercularis chromosome 18 having no significant homology to a B. splendens chromosome. This is explained by the number of chromosomes for each species, with B. splendens having 21 and M. opercularis reportedly having 23 chromosomes41. We have not determined whether the B. splendens chromosomes fused or the M. opercularis chromosomes split.
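For readers unfamiliar with contiguity statistics such as N50, the following Python sketch shows how they are conventionally computed from a list of contig lengths. It is a generic illustration with a hypothetical function name and toy lengths, not the tool used to produce the values reported here.

```python
from typing import Iterable, Tuple


def nx(lengths: Iterable[int], percent: int = 50) -> Tuple[int, int]:
    """Return (Nx, Lx): the length of the contig at which the running total of
    contigs, sorted from longest to shortest, first reaches `percent` of the
    assembly size, and the number of contigs needed to get there."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("no contig lengths supplied")


# Toy contig lengths (arbitrary units), not the real assembly.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx(toy_lengths, 50)
n90, l90 = nx(toy_lengths, 90)
print(f"N50 = {n50} (L50 = {l50}); N90 = {n90} (L90 = {l90})")
```

Read this way, an N50 of 19.2 Mb in 12 contigs means that the 12 longest contigs, each at least 19.2 Mb long, together contain at least half of the assembled bases.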
The genome is relatively compressed in size, with relatively small introns (mean paradise fish intron length = 566 bp, whereas mean teleost intron size = 1,214 bp; ref. 28) (Fig. 2) and shorter intergenic regions. The N90 for our assembly consists of 23 contigs, suggesting that most chromosomes are primarily represented by a single contig from the de novo assembly, even without any scaffolding performed. Searching the contigs with zebrafish telomeric sequences revealed “telomere-to-telomere” assemblies, i.e. contigs that had telomeric sequences at both ends in the correct orientation, for contigs ptg000004l, ptg000010l, ptg000024l, ptg000026l, ptg000028l, and ptg000030l representing chromosomes 3, 8, 9, 15, 17 and 21, respectively (Supplementary Table S3). These contigs have vertebrate telomeres at both ends while the remaining contigs have one or no stretches of telomeric sequence at the end of the contig.
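One simple way to screen contigs for telomeric ends is sketched below in Python. It is not the search procedure used for this assembly, whose details are not given here: it assumes the canonical vertebrate repeat TTAGGG, which reads as CCCTAA at the start of a forward-strand contig, and the window size, minimum copy number and function names are arbitrary, hypothetical choices.

```python
import re
from typing import Tuple


def has_repeat_run(seq: str, unit: str, min_copies: int) -> bool:
    """True if `seq` contains at least `min_copies` tandem copies of `unit`."""
    return re.search(f"(?:{unit}){{{min_copies},}}", seq) is not None


def telomere_status(contig: str, window: int = 1000, min_copies: int = 5) -> Tuple[bool, bool]:
    """Check both contig ends for telomeric repeats in the expected orientation.

    Assumptions: the start of the contig should carry CCCTAA repeats and the
    end TTAGGG repeats; `window` and `min_copies` are illustrative values.
    """
    seq = contig.upper()
    at_start = has_repeat_run(seq[:window], "CCCTAA", min_copies)
    at_end = has_repeat_run(seq[-window:], "TTAGGG", min_copies)
    return at_start, at_end


# Toy contigs (not real sequence data).
both_ends = "CCCTAA" * 6 + "ACGT" * 50 + "TTAGGG" * 6
one_end = "ACGT" * 50 + "TTAGGG" * 6
print(telomere_status(both_ends))  # telomere-to-telomere candidate
print(telomere_status(one_end))    # telomeric repeats at one end only
```

A contig for which both values are true would be a "telomere-to-telomere" candidate in the sense used above; real data would also call for tolerance of degenerate repeat copies, which this sketch omits.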
Benchmarking Universal Single-Copy Orthologs (BUSCO) was used to evaluate the completeness of our reference genome assembly with the Actinopterygii_odb10 dataset42,43. The result showed that 98.5% of the sequences in the reference dataset had a complete ortholog in our genome, including 97.3% complete and single-copy genes and 1.2% complete and duplicated genes. Additionally, 1.2% of the genes were reported as fragmented and 0.3% of the genes were completely missing.
Using RepeatMasker44, we analysed and characterized the repeat content in our reference genome assembly. By using a custom-built repeat prediction library, we identified 32,955,420 bp (6.78%) in retroelements and 11,076,209 bp (2.28%) in DNA transposons (Supplementary Table S2). The retroelements were further categorised into repeat families which were made up of short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), or long terminal repeats (LTR) (Supplementary Table S3). The LINEs were the most abundant repetitive sequence in the retroelement family at 3.38% (16,447,763 bp) followed by LTRs, 3.19% (15,490,642 bp), and SINEs occurred at the lowest frequency (0.21%) (Supplementary Table S2). In the LINEs sub-family, we identified L2/CR1/Rex as the most abundant repetitive sequence (2.15%), followed closely by the retroviral (1.73%) LTR sub-family (Supplementary Table S2). The proportion of the DNA transposons was estimated to be 11,076,209 bp (2.28%). Overall, the proportion of retroelements (6.78%) was much higher in the genome compared to that of DNA transposons (2.28%).
Fig. 1 Basic genome assessment. (a) K-mer comparison plot showing copy number of the k-mers as a stacked histogram colored by the copy numbers found in the paradise fish draft genome assembly. The y-axis represents the number of distinct k-mers, and the x-axis shows the k-mer multiplicity (coverage). Most k-mers are represented once (red peak) indicating high quality for the genomic assembly. (b) Whole genome alignment of the B. splendens genome assembly to the M. opercularis assembly. B. splendens is on the X-axis and M. opercularis on the Y-axis. B. splendens chromosome 9 appears to map to two separate M. opercularis chromosomes and M. opercularis chromosome 18 does not appear to have a clear homologous chromosome in B. splendens. Chromosome numbering is based on B. splendens alignment to medaka chromosomes14,20.
The Vertebrate Genome Project (VGP)45 had performed short read sequencing on a single paradise fish purchased from a German pet shop (NCBI accession: PRJEB19273), and we captured 9 “wild” samples from the New Territories in Hong Kong. We performed short read sequencing to ≥20X coverage for 3 of the wild-caught fish and used the data in combination with the VGP effort to establish the SNP rates within the paradise fish populations. We identified 5,867,521 variants having a quality score greater than or equal to 30 (Table 1) across 4 individual fish. The transition/transversion ratio was 1.41. Our analysis identified a total of 663,781 insertions or deletions ranging from 1 to 60 bp. The rates of SNPs and indels were 0.5% and 0.1%, respectively.
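The transition/transversion ratio summarises how many single nucleotide changes stay within a base class (purine to purine or pyrimidine to pyrimidine) relative to those that switch class. The Python sketch below illustrates the calculation on a handful of invented (reference, alternate) allele pairs; it is not the statistics tool used in this study and the function names are hypothetical.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition keeps the base class: A<->G or C<->T."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) allele pairs.

    Entries that are not simple single-base A/C/G/T substitutions (for example
    indels) are skipped.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in "ACGT" or alt not in "ACGT" or len(ref) != 1 or len(alt) != 1:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")


# Invented example calls: three transitions, two transversions, one indel (ignored).
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"), ("AT", "A")]
ratio = ts_tv_ratio(toy_calls)
print(f"ts/tv = {ratio:.2f}")
```

In practice the allele pairs would be read from the REF and ALT columns of the filtered VCF file, restricted to records passing the quality threshold.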
The Trinity transcriptome assembler was used to assemble the Illumina short reads from the RNA-sequencing data22 into predicted transcripts. The transcriptome assembly consisted of 366,029 contigs in 20,157 loci. The integrity of the transcriptome assembly was evaluated by mapping the Illumina short reads back to the assembled transcriptome using bowtie246; a 98.4% overall alignment rate was achieved. The BUSCO analysis confirmed 99.6% completeness with 8.2% single copy orthologs and 91.4% duplicated genes (i.e. multiple isoforms). A total of 0.4% of the genes were fragmented and 0.0% were missing completely.
Genome annotation. We analyzed the predicted genes using OrthoFinder35 compared to the Betta splendens17, medaka47 and zebrafish48 genomes (Fig. 3). Our analysis shows that 89.6% of the predicted genes (18,057/20,157) of paradise fish could be assigned to orthogroups (Fig. 3b), of which only a very low percentage, 2.5% (511/20,157), were present in species-specific orthogroups (Fig. 3c). A vast majority of the annotated genes (17,546/20,157) had orthologs in at least one of the analyzed species, with 70% (14,067/20,157) having orthologs in all the other species (Fig. 3d). The ratio of shared orthogroups also supports the expected phylogeny (Fig. 3e).
Fig. 2 Distribution of intron sizes for various species of fish. M. opercularis is at the low end of the spectrum for fish genome size, similar to puffer fish species. The x-axis shows the (estimated) genome size (Gb) and the y-axis the mean intron size (bp). The line denotes the linear regression line fitted over the data. Grey areas denote the 0.95 confidence interval of the fit. The diameter of the circles correlates to the approximate gene count (20 to 50 thousand). Species plotted: Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Cyprinus carpio, Macropodus opercularis, Polyodon spathula, Ictalurus punctatus, Lates calcarifer, Oreochromis niloticus, Kryptolebias marmoratus, Mola mola, Poecilia mexicana, Xiphophorus maculatus, Fundulus heteroclitus, Astyanax mexicanus, Paedocypris micromegethes, Betta splendens.
Table 1. Paradise fish variant call summary statistics from four fish compared to the reference assembly.
SNV* (n)            ts/tv**    Indels*** (n)
5,867,521 (0.3%)    1.35       633,781 (0.03%)
*SNV - single nucleotide variants. **ts/tv - transitions to transversions ratio. ***Indels - insertions/deletions.
No custom code was generated for this project. All software with parameters are listed in Supplementary
information.
Received: 29 August 2023; Accepted: 18 April 2024;
Published: xx xx xxxx
1. Ankeny, R. A. & Leonelli, S. What’s so special about model organisms? Stud Hist Philosophy Sci Part A 42, 313–323 (2011).
2. Farris, S. M. The rise to dominance of genetic model organisms and the decline of curiosity-driven organismal research. Plos One 15, e0243088 (2020).
3. Bolker, J. There’s more to life than rats and flies. Nature 491, 31–33 (2012).
4. Ward, R. W. Ethology of the Paradise Fish, Macropodus opercularis I. Differences between Domestic and Wild Fish. Copeia 1967, 809 (1967).
5. Peters, H. M. On the mechanism of air ventilation in anabantoids (Pisces: Teleostei). Zoomorphologie 89, 93–123 (1978).
6. Tate, M., McGoran, R. E., White, C. R. & Portugal, S. J. Life in a bubble: the role of the labyrinth organ in determining territory, mating and aggressive behaviours in anabantoids. J Fish Biol 91, 723–749 (2017).
7. Ladich, F. & Yan, H. Y. Correlation between auditory sensitivity and vocalization in anabantoid fishes. J Comp Physiology 182, 737–746 (1998).
8. Schneider, H. Die Bedeutung der Atemhöhle der Labyrinthfische für ihr Hörvermögen. Zeitschrift Für Vergleichende Physiologie 29, 172–194 (1942).
9. Rüber, L., Britz, R. & Zardoya, R. Molecular Phylogenetics and Evolutionary Diversification of Labyrinth Fishes (Perciformes: Anabantoidei). Systematic Biol 55, 374–397 (2006).
10. Szabó, N. et al. The paradise fish, an advanced animal model for behavioral genetics and evolutionary developmental biology. J. Exp. Zoöl. Part B: Mol. Dev. Evol. https://doi.org/10.1002/jez.b.23223 (2023).
11. Hall, D. D. A Qualitative Analysis of Courtship and Reproductive Behavior in the Paradise Fish, Macropodus opercularis (Linnaeus). Zeitschrift Für Tierpsychologie 25, 834–842 (1968).
12. Csányi, V., Tóth, P., Altbacker, V., Dóka, A. & Gerlai, J. Behavioral elements of the paradise fish (Macropodus opercularis). I. Regularities of defensive behaviour. Acta biologica Hungarica 36, 93–114 (1985).
13. Rácz, A. et al. Housing, Husbandry and Welfare of a “Classic” Fish Model, the Paradise Fish (Macropodus opercularis). Animals 11, 786 (2021).
14. Matthews, B. J., Vosshall, L. B., Dickinson, M. H. & Dow, J. A. T. How to turn an organism into a model organism in 10 ‘easy’ steps. J Exp Biol 223, jeb218198 (2020).
15. Fan, G. et al. Chromosome-level reference genome of the Siamese fighting fish Betta splendens, a model species for the study of aggression. Gigascience 7, giy087 (2018).
16. Wang, L. et al. Genomic Basis of Striking Fin Shapes and Colors in the Fighting Fish. Mol Biol Evol 38, msab110 (2021).
17. Kwon, Y. M. et al. Genomic consequences of domestication of the Siamese fighting fish. Sci Adv 8, eabm4950 (2022).
18. Wang, M., Zhong, L., Bian, W., Qin, Q. & Chen, X. Complete mitochondrial genome of paradise fish Macropodus opercularis (Perciformes: Macropodusinae). Mitochondr Dna 27, 1–3 (2015).
19. Yu, T. & Guo, Y. Early Normal Development of the Paradise Fish Macropodus opercularis. Russ J Dev Biol 49, 240–244 (2018).
20. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. Embnet J 17, 10–12 (2011).
Fig. 3 Comparison of predicted proteins across 4 species shows the close relationships between M. opercularis and B. splendens. (a) Evolutionary relationship between paradise fish, Siamese fighting fish, zebrafish, and Japanese medaka. (b) The number of genes per species that could be placed in an orthogroup (the set of genes descended from a single gene in the last common ancestor of all the species). This value is close to 90% for all four species. (c) The number of orthogroups that are specific to each species. (d) The total number of transcripts with orthologs in at least one other species. (e) Heat map of the orthogroups for each species pair (top) and orthologs between each species (bottom). The relative evolutionary distances between the different species are consistent with the overall levels of gene conservation.
21. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021).
22. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011).
23. Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
24. Borodovsky, M. & Lomsadze, A. Eukaryotic Gene Prediction Using GeneMark.hmm-E and GeneMark-ES. Curr. Protoc. Bioinform. 35, 4.6.1–4.6.10 (2011).
25. Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom. Bioinform. 3, lqaa108 (2021).
26. Cantarel, B. L. et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
27. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
28. Moss, S. P., Joyce, D. A., Humphries, S., Tindall, K. J. & Lunt, D. H. Comparative Analysis of Teleost Genome Sequences Reveals an Ancient Intron Size Expansion in the Zebrafish Lineage. Genome Biol Evol 3, 1187–1196 (2011).
29. Xu, P. et al. Genome sequence and genetic diversity of the common carp, Cyprinus carpio. Nat Genet 46, 1212–1219 (2014).
30. Gregory, T. R. et al. Eukaryotic genome size databases. Nucleic Acids Res 35, D332–D338 (2007).
31. Cheng, P. et al. The American Paddlefish Genome Provides Novel Insights into Chromosomal Evolution and Bone Mineralization in Early Vertebrates. Mol Biol Evol 38, 1595–1607 (2020).
32. Jakt, L. M., Dubin, A. & Johansen, S. D. Intron size minimisation in teleosts. BMC Genom. 23, 628 (2022).
33. Malmstrøm, M. et al. The Most Developmentally Truncated Fishes Show Extensive Hox Gene Loss and Miniaturized Genomes. Genome Biol. Evol. 10, 1088–1103 (2018).
34. Zhang, W. et al. The genetic architecture of phenotypic diversity in the Betta fish (Betta splendens). Sci Adv 8, eabm4955 (2022).
35. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019).
36. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16, 157 (2015).
37. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP383622 (2023).
38. Fodor, E. et al. Macropodus opercularis isolate:MV0001. Genbank https://identifiers.org/ncbi/insdc.gca:GCA_030770545.1 (2023).
39. ENA European Nucleotide Archive. https://identifiers.org/ena.embl:PRJEB74481 (2024).
40. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 36, 1174–1182 (2018).
41. Abe, S. Karyotypes of 6 species of anabantoid fishes. CIS 5–7 (1975).
42. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654 (2021).
43. Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods Mol. Biol. (Clifton, NJ) 1962, 227–245 (2019).
44. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0.
45. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
46. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).
47. Ichikawa, K. et al. Centromere evolution and CpG methylation during vertebrate speciation. Nat Commun 8, 1833 (2017).
48. Howe, K. et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature 496, 498–503 (2013).
The authors thank Lars Martin Jakt and his team for early access to their unpublished data. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). We would like to thank Adam Phillippy and Brandon Pickett for helpful discussions. The research project was part of the ELTE Thematic Excellence Programme 2020 supported by the National Research, Development, and Innovation Office (TKP2020-IKA-05) and by the ÚNKP-22-5 New National Excellence Program of the Ministry of Culture and Innovation from the source of the National Research, Development and Innovation Fund. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute (ZIAHG000183-22) for SB. LO and ISz were supported by the Frontline Research Excellence Grant of the NRDI (KKP 140353). MV is a János Bolyai fellow of the Hungarian Academy of Sciences.
Conceptualization: Á.M., L.O., S.B., M.V. Data curation: J.O., M.V., S.B. Funding acquisition: Á.M., S.B., M.V.
Investigation: E.F., J.O., N.S., K.S., D.C., A.R., S.K., A.R. Methodology: E.F., J.O., N.S., D.C., I.S.z., L.O., M.V., S.K.,
A.R. Project administration and supervision: S.B., M.V. Writing – original and revised text: E.F., J.O., S.B., M.V.
Open access funding provided by the National Institutes of Health.
The authors declare no competing interests.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41597-024-03277-1.
Correspondence and requests for materials should be addressed to M.V. or S.M.B.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may
apply 2024