
Abstract

Amongst fishes, zebrafish (Danio rerio) has gained popularity as a model system over most other species, and while its value as a model is well documented, its usefulness is limited in certain fields of research such as behavior. By embracing other, less conventional experimental organisms, opportunities arise to gain broader insights into evolution and development, as well as to study behavioral aspects not available in current popular model systems. The anabantoid paradise fish (Macropodus opercularis), an “air-breather” species, has a highly complex behavioral repertoire and has been the subject of many ethological investigations but lacks genomic resources. Here we report the reference genome assembly of M. opercularis using long-read sequences at 150-fold coverage. The final assembly consisted of 483,077,705 base pairs (~483 Mb) on 152 contigs. Within the assembled genome we identified and annotated 20,157 protein coding genes and assigned ~90% of them to orthogroups.
SCIENTIFIC DATA | (2024) 11:540 | https://doi.org/10.1038/s41597-024-03277-1
www.nature.com/scientificdata
The reference genome of Macropodus opercularis (the paradise fish)
Erika Fodor, Javan Okendo, […] Czimer, […] Low, […] Liew, […] Rhie […]

During the 20th century experimental biology gained increased influence over descriptive biology and concomitantly most research efforts began to narrow into a small number of “model” species. These organisms were not only selected because they were considered to be representative models for the examined phenomena but were also easy and cheap to maintain in laboratory conditions1,2. Working with these convenient experimental models had several advantages and made a rapid accumulation of knowledge possible. It enabled scientists to compare and build on each other’s findings efficiently as well as to share valuable data and resources that accelerated discovery. As a result, a handful of model species have dominated the field of biomedical studies.
Despite their broad success, these models also brought limitations. As Bolker pointed out: “The extraordinary resolving power of core models comes with the same trade-off as a high-magnification lens: a much-reduced field of view”3. In the case of zebrafish research this trade-off has been perhaps most apparent for behavioral studies. Zebrafish are an inherently social (shoaling) species, but most behavioral studies use them in solitary settings, which arguably is a non-natural environment for them. Therefore, the use of other teleost species with more solitary behavioral profiles is warranted for studies of individual behaviors.
Paradise fish (Macropodus opercularis Linnaeus, 1758) are a relatively small (8–11 cm long) freshwater fish native to East Asia, Southern China, Northern Vietnam, and Laos, where they are commonly found in shallow waters with dense vegetation and reduced dissolved oxygen4. Similar to all other members of the suborder Anabantoidei, they are characterized by the capacity to take up oxygen directly from the air through a highly vascularized structure covered with respiratory epithelium, the labyrinth organ (LO)5. The ability to “air-breathe” allows anabantoids to inhabit swamps and small ponds with low levels of dissolved oxygen that would be uninhabitable for other fish species; the LO can therefore be considered an adaptation to hypoxic conditions6. The evolution of the LO has also improved hearing in some species7,8, and may have led to the emergence of novel and elaborate mating behaviors, including courtship, territorial display, and parental care6,9. Another interesting behavior of these fish is that they build egg “nests” by blowing bubbles on the surface of the water6,10. These intricate and complex behaviors made paradise fish an important ethological model during the 1970s and 1980s, which resulted in a detailed ethogram of the species11,12.

1Department of Genetics, ELTE Eötvös Loránd University, Budapest, Hungary. 2Translational and Functional Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA. 3Frontline Fish Genomics Research Group, Department of Applied Fish Biology, Institute of Aquaculture and Environmental Safety, Hungarian University of Agriculture and Life Sciences, Georgikon Campus, Keszthely, Hungary. 4Science Unit, Lingnan University, Hong Kong, China. 5Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA. 6Department of Ethology, ELTE Eötvös Loránd University, Budapest, Hungary. 7These authors contributed equally: Erika Fodor, Javan Okendo. e-mail: mvarga@ttk.elte.hu; burgess@mail.nih.gov
We propose that with recently developed husbandry protocols13 and the advent of novel molecular techniques for genome editing and transgenesis, paradise fish could become an important complementary model species for neurogenetic studies14. Furthermore, several genomes are now available for the Siamese fighting fish (Betta splendens), a closely related species to the paradise fish15–17, so a good quality genome sequence of paradise fish would enable comparative ecological and evolutionary (eco-evo) studies.
While the mitochondrial genome was already available for this species18, a full genome sequence was lacking. Here, we provide a brief description and characterization of a high quality, de novo paradise fish reference genome and transcriptome assembly.

The paradise fish used to establish our colony and as the source of the transcriptome samples were purchased from a local pet store (Trioker Ltd., Érd, Hungary). Adult paradise fish were kept in aerated glass aquariums in the animal facility of the Institute of Biology at ELTE Eötvös Loránd University. Husbandry conditions were specified previously13. Embryos were raised at 28.5 °C and staged as described before19. All experimental procedures were approved by the Hungarian National Food Chain Safety Office (Permit Number: PE/EA/406-7/2020). Animal experiments in Hungarian academic research centres are regulated by decree no. 40/2013 (14.II.) issued by the Hungarian Government, which was drafted based on Directive 2010/63/EU on the protection of animals used for scientific purposes. The research on paradise fish in Dr. Varga’s laboratory was made possible by permit no. PE/EA/406-7/2020, issued by the Pest County Government Office on the basis of the above-mentioned government regulation. Wild-caught adult paradise fish were captured in the areas surrounding Hong Kong and the specimens were handled in accordance with protocols outlined in the Research Ethics Approval Application via Lingnan University (Reference number: EC051/2021). Permission to collect wild specimens was also granted in a permit obtained from the AFCD (Agriculture, Fisheries, and Conservation Department). The permit number is “AF GR CON 11/17 Pt. 7”.
RNA samples were collected from a mix of embryonic stages (stage 9 to 5 days post fertilization), from caudal tail blastema taken at 3 and 5 days post amputation, from the kidney, heart, brain, and ovaries of an adult female, and from the brain and testis of an adult male paradise fish. Total RNA was isolated using TRIzol (Invitrogen, 15596026), following the manufacturer’s protocol. Samples were purified twice with ethanol and eluted in water. Quality and integrity of the samples were tested on an agarose gel, by Nanodrop, and using an Agilent 2100. Ribosomal RNA (rRNA) was removed using the Illumina Ribo-Zero kit and paired-end (PE) libraries were prepared using standard Illumina protocols. Samples were processed on an Illumina NovaSeq PE150 platform, and a total of 218,715,409 PE reads (2 × 150 bp) were sequenced, resulting in ~65 Gbp of raw transcriptomic data.
Genomic DNA samples were isolated from the tail fin of the parental F0 male and female paradise fish using the Qiagen DNeasy Blood and Tissue Kit (cat no: 69504). Samples were eluted in TE and sent for library preparation and sequencing. Sample quality checks were performed using standard agarose gel electrophoresis and a Qubit 2.0 instrument. For Illumina short-read sequencing, a size-selected 150 bp insert DNA library was prepared and processed on the Illumina NovaSeq 5000 platform. Approximately 100 million PE reads (2 × 150 bp) were sequenced for each parent, resulting in approximately 60X coverage for each genome. For PacBio HiFi long-read single molecule real-time (SMRT) sequencing libraries, genomic DNA was prepared from whole tissue of the 6-month-old F1 offspring using the Circulomics Nanobind tissue kit. Sequence libraries were prepared using the PacBio SMRTbell Template Preparation Kit and HiFi sequenced on a Sequel II platform. A total of 4,885,238 reads (average length: 15.5 kbp) resulted in ~73 Gbp of raw genomic sequence data.
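As a rough cross-check of the yields and coverage figures quoted above, the arithmetic can be sketched as follows. This is a minimal illustration rather than part of the authors' pipeline: the function names are hypothetical, the assembled size of 483,077,705 bp is assumed as the genome size, and because the read counts and mean lengths are rounded, the products only approximate the reported totals.

```python
# Back-of-the-envelope sequencing yield and fold coverage.
# Assumption: genome size is taken as the final assembly size reported in the paper.
GENOME_SIZE_BP = 483_077_705


def sequencing_yield_bp(n_reads: int, read_length_bp: float, paired: bool = False) -> float:
    """Total bases = number of reads (or read pairs) x read length x mates per fragment."""
    mates = 2 if paired else 1
    return n_reads * read_length_bp * mates


def fold_coverage(yield_bp: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Average depth = sequenced bases / genome size."""
    return yield_bp / genome_size_bp


# RNA-seq: 218,715,409 read pairs of 2 x 150 bp.
rna_yield = sequencing_yield_bp(218_715_409, 150, paired=True)

# Illumina genomic reads: ~100 million pairs of 2 x 150 bp per parent.
illumina_cov = fold_coverage(sequencing_yield_bp(100_000_000, 150, paired=True))

# PacBio HiFi: coverage implied by the reported ~73 Gbp of raw data.
hifi_cov = fold_coverage(73e9)

print(f"RNA-seq yield: {rna_yield / 1e9:.1f} Gbp")
print(f"Illumina coverage per parent: {illumina_cov:.0f}X")
print(f"HiFi coverage: {hifi_cov:.0f}X")
```

Under these assumptions the Illumina figure lands near the stated 60X and the HiFi figure near the stated 150X.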
All software versions used are listed in Supplementary Table S4. Raw data pre-processing consisted of quality control, adapter trimming, and filtering of low-quality reads using Trim Galore, a wrapper around FastQC and Cutadapt20. The genome assembly was generated with the hifiasm genome assembler21 using the High-Performance Computing facility at the National Institutes of Health. For the assembly, 32 processor cores and 512 GB of memory were used. The lower and upper bounds for binned k-mers were set to 25 and 75, respectively. The estimated haploid genome size used for inferring read depth was set to 0.5 Gbp. The remaining hifiasm settings were left at their defaults to assemble the homozygous genome, with the built-in duplication purging parameter set to -l1. The primary assembly Graphical Fragment Assembly (GFA) file was converted to FASTA format using the awk command.
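The GFA-to-FASTA conversion is straightforward because each assembled contig is stored on a segment ("S") line of the GFA file, with the contig name in the second tab-separated field and the sequence in the third. The exact awk command used is not given here; the Python sketch below is an assumed equivalent, with hypothetical function names and a toy input rather than the real assembly graph.

```python
from typing import Iterable, Iterator, List, Tuple


def gfa_segments(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every GFA segment line that carries a sequence."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Segment records start with "S"; "*" means the sequence is not stored.
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield fields[1], fields[2]


def gfa_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Return FASTA text with sequences wrapped to `width` characters per line."""
    out: List[str] = []
    for name, seq in gfa_segments(lines):
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(record_line + "\n" for record_line in out)


# Toy GFA content (hypothetical contig names and sequences).
toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tptg000001l\tACGTAC\tLN:i:6\n",
    "S\tptg000002l\tGGCC\tLN:i:4\n",
    "L\tptg000001l\t+\tptg000002l\t+\t0M\n",
]
fasta_text = gfa_to_fasta(toy_gfa, width=4)
print(fasta_text, end="")
```

For a real file, the same function can be fed an open file handle, for example `gfa_to_fasta(open(path))`, where `path` points to the primary-contig GFA written by the assembler.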
The Trinity assembler22 was used to create a set of RNA transcripts from the bulk RNA-seq data. To aid in gene prediction, we downloaded the reviewed Swiss-Prot/UniProt vertebrate proteins (download date 12/01/2022; 97,804 proteins) for homology comparisons in annotation pipelines. Gene prediction was done using the AUGUSTUS23 and GeneMark-ES24 software as part of the BRAKER pipeline25 to train the AUGUSTUS parameters. Final annotation using the assembled transcripts and the vertebrate protein database was done using the MAKER pipeline26 with the EVidenceModeler27 tool switched on to improve gene structure annotation.
Sources for the reference data used to create Figs. 2, 3: refs. 28–34. We performed OrthoFinder analysis35,36 with default parameters, using predicted peptides (including all alternative splice versions) of the zebrafish genome assembly GRCz11, medaka genome assembly ASM223467v1 and B. splendens genome assembly fBetSpl5.3. Sequences were downloaded from the ENSEMBL and NIH/NCBI Assembly homepages, respectively.
The Illumina short read files (accession number ERR3332352) were downloaded from the Vertebrate Genome Project (VGP) database (https://vertebrategenomesproject.org/). Illumina short read sequencing was also performed on genomic DNA obtained from the tails of 3 wild-caught fish from the Hong Kong region. Trim Galore version 0.6.10 (https://github.com/FelixKrueger/TrimGalore), a wrapper around Cutadapt and FastQC, was then used to trim the Illumina adapter sequences and to discard reads shorter than 25 bp. DRAGMAP version 1.3.0 (https://github.com/Illumina/DRAGMAP) was used to map the reads to the reference genome. The resultant sequence alignment map (SAM) file was then converted to binary alignment map (BAM) format, sorted, and indexed using samtools. Picard was then used to add the read group information to the BAM file. The Genome Analysis Toolkit (GATK) was then used to call the variants with the DRAGEN mode turned on. bamtools stats and plot-vcfstats were used in the downstream analysis and visualization of the genomic variants in the variant call format (VCF) file.

The assembly and all DNA and RNA raw reads have been deposited in the NCBI under the BioProject study accession PRJNA824432. Within that project there is the GenBank assembly macOpe2 (GCA_030770545.1), 9 RNA-seq raw sequence data files (SRX20729884, SRX15898419, SRX15898418, SRX15898417, SRX15898416, SRX15898415, SRX15898414, SRX15898413, SRX15898412), one PacBio HiFi genomic raw sequence data file (SRX15948463), one Illumina PE short read genomic raw sequence data file for the assembly (SRX15948462) and Illumina short read genomic raw sequence data files for the 3 wild-caught samples (SAMN39260618, SAMN39260619, SAMN39260620)37,38. The variant data for this study have been deposited in the European Variation Archive (EVA) at EMBL-EBI under accession number PRJEB74481 (ref. 39).

We generated the de novo reference genome sequence for this species using 150X coverage of PacBio SMRT HiFi long-read sequencing and the hifiasm genome assembly pipeline21. The final assembly consisted of 483,077,705 base pairs (bp) on 152 contigs (Supplementary Table S1). The assembled genome demonstrated very high contiguity, with an N50 of 19.2 megabases (Mb) in 12 contigs. The largest contig was 24,022,457 bp and the shortest contig was 14,205 bp. More than 98% of the canonical k-mers were present at 1x copy number, indicating that our genome assembly is of very good quality (Fig. 1a). The paradise fish genome repeat content is estimated to be ~10.4%. The “trio binning”40 mode of hifiasm was attempted using single nucleotide variant (SNV) data collected from short read sequencing of the F0 parents; however, the heterozygosity rate of the lab-raised fish was very low at ~0.07%, making it impossible to efficiently separate maternal and paternal haplotypes. The resulting assembled reference genome is therefore a pseudohaplotype. The sequence of the mitochondrial genome (mtDNA) was essentially identical to the previously published mtDNA sequence for this species (16,495/16,496 identities)18. We followed the B. splendens example and numbered the M. opercularis chromosomes based on their similarity to medaka chromosomes, resulting in the chromosomes being numbered 1–19 and 21–24. We performed a whole genome alignment to a recent Betta splendens assembly34. Twenty-one chromosomes had a 1-to-1 relationship, with B. splendens chromosome 9 aligning to two separate paradise fish contigs (Fig. 1b) and M. opercularis chromosome 18 having no significant homology to a B. splendens chromosome. This is explained by the number of chromosomes in each species, with B. splendens having 21 and M. opercularis reportedly having 23 chromosomes41. We have not determined whether the B. splendens chromosomes fused or the M. opercularis chromosomes split.
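For readers less familiar with contiguity statistics, the N50 quoted above is the length of the contig at which the running total of contig lengths, sorted from largest to smallest, first reaches half of the assembly size; the number of contigs needed to get there is often called L50. The sketch below shows the calculation on a toy list of lengths. The function name and the example values are illustrative assumptions, not the paradise fish contig lengths.

```python
from typing import Iterable, Tuple


def nx(lengths: Iterable[int], fraction: float) -> Tuple[int, int]:
    """Return (Nx, Lx): the contig length and contig count at which the
    cumulative length of the largest contigs reaches `fraction` of the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reachable if fraction > 1; fall back to the smallest contig.
    return ordered[-1], len(ordered)


# Toy assembly: five contigs totalling 150 units.
toy_lengths = [10, 50, 30, 20, 40]
n50, l50 = nx(toy_lengths, 0.5)
n90, l90 = nx(toy_lengths, 0.9)
print(f"N50 = {n50} in {l50} contigs; N90 = {n90} in {l90} contigs")
```

The same function with `fraction=0.9` gives the N90 and the corresponding contig count.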
The genome is relatively compact and has relatively small introns (mean paradise fish intron length = 566 bp, whereas mean teleost intron size = 1,214 bp; ref. 28) (Fig. 2) and shorter intergenic regions. The N90 for our assembly consists of 23 contigs, suggesting that most chromosomes are primarily represented by a single contig from the de novo assembly, even without any scaffolding performed. Searching the contigs with zebrafish telomeric sequences revealed “telomere-to-telomere” assemblies, i.e. contigs that had telomeric sequences at both ends in the correct orientation, for contigs ptg000004l, ptg000010l, ptg000024l, ptg000026l, ptg000028l, and ptg000030l representing chromosomes 3, 8, 9, 15, 17 and 21, respectively (Supplementary Table S3). These contigs have vertebrate telomeres at both ends, while the remaining contigs have one or no stretches of telomeric sequence at the end of the contig.
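One simple way to screen contigs for telomeric ends can be sketched as below. It assumes the canonical vertebrate telomeric repeat TTAGGG, which on the forward strand reads CCCTAA at the start of a chromosome and TTAGGG at its end. The window size, the minimum number of tandem copies, and the function names are illustrative choices; the search actually used for this assembly is not described in that detail.

```python
import re
from typing import Tuple


def has_repeat_run(window: str, motif: str, min_copies: int) -> bool:
    """True if `window` contains at least `min_copies` tandem copies of `motif`."""
    pattern = f"(?:{motif}){{{min_copies},}}"
    return re.search(pattern, window) is not None


def telomere_status(seq: str, window: int = 1000, min_copies: int = 5) -> Tuple[bool, bool]:
    """Return (start_has_telomere, end_has_telomere) for one contig sequence.

    Orientation matters: the start is searched for CCCTAA runs and the end
    for TTAGGG runs, as expected for a forward-strand chromosome sequence.
    """
    seq = seq.upper()
    start_ok = has_repeat_run(seq[:window], "CCCTAA", min_copies)
    end_ok = has_repeat_run(seq[-window:], "TTAGGG", min_copies)
    return start_ok, end_ok


# Toy contigs (hypothetical sequences).
both_ends = "CCCTAA" * 6 + "ACGT" * 50 + "TTAGGG" * 6
one_end = "ACGT" * 50 + "TTAGGG" * 6

print(telomere_status(both_ends, window=50))
print(telomere_status(one_end, window=50))
```

A contig for which both values are true would be a candidate “telomere-to-telomere” contig under these assumptions.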
Benchmarking Universal Single-Copy Orthologs (BUSCO) was used to evaluate the completeness of our reference genome assembly with the Actinopterygii_odb10 dataset42,43. The result showed that 98.5% of the sequence in the reference dataset had a complete ortholog in our genome including 97.3% complete and single-copy genes and 1.2% complete and duplicate genes. Additionally, 1.2% of the genes were reported as fragmented and 0.3% of the genes were completely missing.
Fig. 1 Basic genome assessment. (a) K-mer comparison plot showing copy number of the k-mers as a stacked histogram colored by the copy numbers found in the paradise fish draft genome assembly. The y-axis represents the number of distinct k-mers, and the x-axis shows the k-mer multiplicity (coverage). Most k-mers are represented once (red peak), indicating high quality for the genomic assembly. (b) Whole genome alignment of the B. splendens genome assembly to the M. opercularis assembly. B. splendens is on the X-axis and M. opercularis on the Y-axis. B. splendens chromosome 9 appears to map to two separate M. opercularis chromosomes and M. opercularis chromosome 18 does not appear to have a clear homologous chromosome in B. splendens. Chromosome numbering is based on B. splendens alignment to medaka chromosomes14,20.

Using RepeatMasker44, we analysed and characterized the repeat content in our reference genome assembly. By using a custom-built repeat prediction library, we identified 32,955,420 bp (6.78%) in retroelements and 11,076,209 bp (2.28%) in DNA transposons (Supplementary Table S2). The retroelements were further categorised into repeat families made up of short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), or long terminal repeats (LTRs) (Supplementary Table S3). LINEs were the most abundant repetitive sequences among the retroelements at 3.38% (16,447,763 bp), followed by LTRs at 3.19% (15,490,642 bp), while SINEs occurred at the lowest frequency (0.21%) (Supplementary Table S2). In the LINE sub-family, we identified L2/CR1/Rex as the most abundant repetitive sequence (2.15%), followed closely by the retroviral LTR sub-family (1.73%) (Supplementary Table S2). Overall, the proportion of retroelements (6.78%) was much higher in the genome than that of DNA transposons (2.28%).
The Vertebrate Genome Project (VGP)45 had performed short read sequencing on a single paradise fish purchased from a German pet shop (NCBI accession: PRJEB19273), and we captured 9 “wild” samples from the New Territories in Hong Kong. We performed short read sequencing to 20X coverage for 3 of the wild-caught fish and used the data in combination with the VGP effort to establish the SNP rates within the paradise fish populations. We identified 5,867,521 variants having a quality score greater than or equal to 30 (Table 1) across 4 individual fish. The transition/transversion ratio was 1.41. Our analysis identified a total of 663,781 insertions or deletions ranging from 1 to 60 bp. The rates of SNPs and indels were 0.5% and 0.1%, respectively.
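The transition/transversion (ts/tv) statistic counts substitutions within a base class (A↔G or C↔T, transitions) against substitutions between classes (purine↔pyrimidine, transversions). In this study the value came from standard VCF statistics tools; the sketch below only illustrates the definition on a toy list of reference/alternate allele pairs, and its function names and input are assumptions.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A substitution is a transition when both bases belong to the same class."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            raise ValueError(f"not a substitution: {ref}>{alt}")
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


# Toy input: three transitions (A>G, C>T, G>A) and two transversions (A>C, G>T).
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
toy_ratio = ts_tv_ratio(toy_snvs)
print(f"ts/tv = {toy_ratio:.2f}")
```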
The Trinity transcriptome assembler was used to assemble the Illumina short reads from the RNA-sequencing data22 into predicted transcripts. The transcriptome assembly consisted of 366,029 contigs in 20,157 loci. The integrity of the transcriptome assembly was evaluated by mapping the Illumina short reads back to the assembled transcriptome using bowtie246; a 98.4% overall alignment rate was achieved. The BUSCO analysis confirmed 99.6% completeness with 8.2% single copy orthologs and 91.4% duplicated genes (i.e. multiple isoforms). A total of 0.4% of the genes were fragmented and 0.0% were missing completely.
Fig. 2 Distribution of intron sizes for various species of fish. M. opercularis is at the low end of the spectrum for fish genome size, similar to puffer fish species. The line denotes the linear regression line fitted over the data. Grey areas denote the 0.95 confidence interval of the fit. The diameter of the circles correlates to the approximate gene count (20 to 50 thousand). Axes: (estimated) genome size (Gb) and mean intron size (bp). Species plotted: Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Cyprinus carpio, Macropodus opercularis, Polyodon spathula, Ictalurus punctatus, Lates calcarifer, Oreochromis niloticus, Kryptolebias marmoratus, Mola mola, Poecilia mexicana, Xiphophorus maculatus, Fundulus heteroclitus, Astyanax mexicanus, Paedocypris micromegethes, Betta splendens.

Table 1. Paradise fish variant call summary statistics from four fish compared to the reference assembly.
SNV* (n) | ts/tv** | Indels*** (n)
5,867,521 (0.3%) | 1.35 | 633,781 (0.03%)
*SNV: single nucleotide variants. **ts/tv: transitions to transversions ratio. ***Indels: insertions/deletions.

Genome annotation. We analyzed the predicted genes using OrthoFinder35 compared to the Betta splendens17, medaka47 and zebrafish48 genomes (Fig. 3). Our analysis shows that 89.6% of the predicted paradise fish genes (18,057/20,157) could be assigned to orthogroups (Fig. 3b), of which only a very low percentage, 2.5% (511/20,157), were present in species-specific orthogroups (Fig. 3c). A vast majority of the annotated genes (17,546/20,157) had orthologs in at least one of the analyzed species, with 70% (14,067/20,157) having orthologs in all the other species (Fig. 3d). The ratio of shared orthogroups also supports the expected phylogeny (Fig. 3e).

No custom code was generated for this project. All software and parameters are listed in the Supplementary information.
Received: 29 August 2023; Accepted: 18 April 2024;
Published: xx xx xxxx

1. Aneny, . A. & Leonelli, S. What’s so special about model organisms? Stud Hist Philosophy Sci Part 42, 313–323 (2011).
2. Farris, S. M. e rise to dominance of genetic model organisms and the decline of curiosity-driven organismal research. Plos One
15, e0243088 (2020).
3. Boler, J. ere’s more to life than rats and ies. Nature 491, 31–33 (2012).
4. Ward, . W. Ethology of the Paradise Fish, Macropodus opercularis I. Dierences between Domestic and Wild Fish. Copeia 1967, 809
(1967).
5. Peters, H. M. On the mechanism of air ventilaton in anabantoids (Pisces: Teleostei). Zoomorphologie 89, 93–123 (1978).
6. Tate, M., McGoran, . E., White, C. . & Portugal, S. J. Life in a bubble: the role of the labyrinth organ in determining territory,
mating and aggressive behaviours in anabantoids. J Fish Biol 91, 723–749 (2017).
7. Ladich, F. & Yan, H. Y. Correlation between auditory sensitivity and vocalization in anabantoid shes. J Comp Physiology 182,
737–746 (1998).
8. Schneider, H. Die Bedeutung der Atemhöhle der Labyrinthsche für ihr Hörvermögen. Zeitschri Für Vergleichende Physiologie 29,
172–194 (1942).
9. über, L., Britz, . & Zardoya, . Molecular Phylogenetics and Evolutionary Diversication of Labyrinth Fishes (Perciformes:
Anabantoidei). Systematic Biol 55, 374–397 (2006).
10. Szabó, N. et al . e paradise sh, an advanced animal model for behavioral genetics and evolutionary developmental biology. J. Exp.
Zoöl. Part B: Mol. Dev. Evol. https://doi.org/10.1002/jez.b.23223 (2023).
11. Hall, D. D. A Qualitative Analysis of Courtship and eproductive Behavior in the Paradise Fish, Macropodus opercularis (Linnaeus).
Zeitschri Für Tierpsychologie 25, 834–842 (1968).
12. Csányi, V., Tóth, P., Altbacer, V., Dóa, A. & Gerlai, J. Behavioral elements of the paradise fish (Macropodus opercularis). I.
egularities of defensive behaviour. Acta biologica Hungarica 36, 93–114 (1985).
13. ácz, A. et al. Housing, Husbandry and Welfare of a “Classic” Fish Model, the Paradise Fish (Macropodus opercularis). Animals 11,
786 (2021).
14. Matthews, B. J., Vosshall, L. B., Dicinson, M. H. & Dow, J. A. T. How to turn an organism into a model organism in 10 ‘easy’ steps.
J Exp Biol 223, jeb218198 (2020).
15. Fan, G. et al. Chromosome-level reference genome of the Siamese ghting sh Betta splendens, a model species for the study of
aggression. Gigascience 7, giy087 (2018).
16. Wang, L. et al. Genomic Basis of Striing Fin Shapes and Colors in the Fighting Fish. Mol Biol Evol 38, msab110 (2021).
17. won, Y. M. et al. Genomic consequences of domestication of the Siamese ghting sh. Sci Adv 8, eabm4950 (2022).
18. Wang, M., Zhong, L., Bian, W., Qin, Q. & Chen, X. Complete mitochondrial genome of paradise sh Macropodus opercularis
(Perciformes: Macropodusinae). Mitochondr Dna 27, 1–3 (2015).
19. Yu, T. & Guo, Y. Early Normal Development of the Paradise Fish Macropodus opercularis. Russ J Dev Biol 49, 240–244 (2018).
20. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. Embnet J 17, 10–12 (2011).

Fig. 3 Comparison of predicted proteins across 4 species shows the close relationships between M. opercularis and B. splendens. (a) Evolutionary relationship between paradise fish, Siamese fighting fish, zebrafish, and Japanese medaka. (b) The number of genes per species that could be placed in an orthogroup (the set of genes descended from a single gene in the last common ancestor of all the species). This value is close to 90% for all four species. (c) The number of orthogroups that are specific to each species. (d) The total number of transcripts with orthologs in at least one other species. (e) Heat map of the orthogroups for each species pair (top) and orthologs between each species (bottom). The relative evolutionary distances between the different species are consistent with the overall levels of gene conservation.

21. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021).
22. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011).
23. Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
24. Borodovsky, M. & Lomsadze, A. Eukaryotic Gene Prediction Using GeneMark.hmm-E and GeneMark-ES. Curr. Protoc. Bioinform. 35, 4.6.1–4.6.10 (2011).
25. Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom. Bioinform. 3, lqaa108 (2021).
26. Cantarel, B. L. et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
27. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
28. Moss, S. P., Joyce, D. A., Humphries, S., Tindall, K. J. & Lunt, D. H. Comparative Analysis of Teleost Genome Sequences Reveals an Ancient Intron Size Expansion in the Zebrafish Lineage. Genome Biol Evol 3, 1187–1196 (2011).
29. Xu, P. et al. Genome sequence and genetic diversity of the common carp, Cyprinus carpio. Nat Genet 46, 1212–1219 (2014).
30. Gregory, T. R. et al. Eukaryotic genome size databases. Nucleic Acids Res 35, D332–D338 (2007).
31. Cheng, P. et al. The American Paddlefish Genome Provides Novel Insights into Chromosomal Evolution and Bone Mineralization in Early Vertebrates. Mol Biol Evol 38, 1595–1607 (2020).
32. Jakt, L. M., Dubin, A. & Johansen, S. D. Intron size minimisation in teleosts. BMC Genom. 23, 628 (2022).
33. Malmstrøm, M. et al. The Most Developmentally Truncated Fishes Show Extensive Hox Gene Loss and Miniaturized Genomes. Genome Biol. Evol. 10, 1088–1103 (2018).
34. Zhang, W. et al. The genetic architecture of phenotypic diversity in the Betta fish (Betta splendens). Sci Adv 8, eabm4955 (2022).
35. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019).
36. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16, 157 (2015).
37. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP383622 (2023).
38. Fodor, E. et al. Macropodus opercularis isolate:MV0001. Genbank https://identifiers.org/ncbi/insdc.gca:GCA_030770545.1 (2023).
39. ENA European Nucleotide Archive. https://identifiers.org/ena.embl:PRJEB74481 (2024).
40. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 36, 1174–1182 (2018).
41. Abe, S. Karyotypes of 6 species of anabantoid fishes. CIS 5–7 (1975).
42. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654 (2021).
43. Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods Mol. Biol. (Clifton, NJ) 1962, 227–245 (2019).
44. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0.
45. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
46. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).
47. Ichikawa, K. et al. Centromere evolution and CpG methylation during vertebrate speciation. Nat Commun 8, 1833 (2017).
48. Howe, K. et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature 496, 498–503 (2013).

The authors thank Lars Martin Jakt and his team for early access to their unpublished data. This work utilized
the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). We would like to thank Adam
Phillippy and Brandon Pickett for helpful discussions. The research project was part of the ELTE Thematic
Excellence Programme 2020 supported by the National Research, Development, and Innovation Office
(TKP2020-IKA-05) and by the ÚNKP-22-5 New National Excellence Program of the Ministry of Culture and
Innovation from the source of the National Research, Development and Innovation Fund. This research was
supported in part by the Intramural Research Program of the National Human Genome Research Institute
(ZIAHG000183-22) for SB. LO and ISz were supported by the Frontline Research Excellence Grant of the NRDI
(KKP 140353). MV is a János Bolyai fellow of the Hungarian Academy of Sciences.

Conceptualization: Á.M., L.O., S.B., M.V. Data curation: J.O., M.V., S.B. Funding acquisition: Á.M., S.B., M.V.
Investigation: E.F., J.O., N.S., K.S., D.C., A.R., S.K., A.R. Methodology: E.F., J.O., N.S., D.C., I.S.z., L.O., M.V., S.K.,
A.R. Project administration and supervision: S.B., M.V. Writing – original and revised text: E.F., J.O., S.B., M.V.

Open access funding provided by the National Institutes of Health.

The authors declare no competing interests.

Supplementary information e online version contains supplementary material available at https://doi.org/
10.1038/s41597-024-03277-1.
Correspondence and requests for materials should be addressed to M.V. or S.M.B.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may
apply 2024