Haplotype-resolved genome assembly of the tetraploid potato cultivar Désirée
Tim Godec* (1,2), Sebastian Beier (3), Natalia Yaneth Rodriguez-Granados (4), Rashmi Sasidharan (4),
Lamis Abdelhakim (5), Markus Teige (6), Björn Usadel (3,7), Kristina Gruden (1), Marko Petek (1)
1 National Institute of Biology, Department of Biotechnology and Systems Biology, Ljubljana,
Slovenia
2 Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
3 Institute of Bio- and Geosciences (IBG-4 Bioinformatics), Bioeconomy Science Center
(BioSC), CEPLAS, Forschungszentrum Jülich GmbH, Jülich, Germany
4 Plant Stress Resilience, Institute of Environmental Biology, Utrecht University, Utrecht, The
Netherlands
5 PSI (Photon Systems Instruments), Drásov, Czech Republic
6 Molecular Systems Biology (MOSYS), Department of Functional and Evolutionary Ecology,
University Vienna, Vienna, Austria
7 Faculty of Mathematics and Natural Sciences, Institute for Biological Data Science, Cluster
of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf,
Düsseldorf, Germany
* corresponding author
corresponding author email: tim.godec@nib.si
The copyright holder for this preprint (which was not certified by peer review) is the author/funder,
who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
CC-BY-NC 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2025.01.14.631659;
this version posted January 14, 2025.
Abstract
Cultivar Désirée is an important model for potato functional genomics studies to assist
breeding strategies. Here, we present a haplotype-resolved genome assembly of Désirée,
achieved by assembling PacBio HiFi reads and Hi-C scaffolding, resulting in a
high-contiguity chromosome-level assembly. We implemented a comprehensive annotation
pipeline incorporating gene models and functional annotations from the Solanum tuberosum
Phureja DM reference genome alongside RNA-seq reads to provide high-quality gene and
transcript annotations. Additionally, we provide a genome-wide DNA methylation profile
using Oxford Nanopore reads, enabling insights into potato epigenetics. The assembled
genome, annotations, methylation and expression data are visualised in a publicly
accessible genome browser ( https://desiree.nib.si ), providing a valuable resource for the
potato research community.
Background & Summary
Potato (Solanum tuberosum) is one of the most important and widely cultivated crops
worldwide, with a significant role in global food security and agricultural research. Despite its
significance, many studies still rely on the genome of the doubled monoploid (DM) clone of
group Phureja DM1–3 516 R44 [1,2], which lacks a substantial portion of the gene repertoire
and variability found in cultivated tetraploid potato varieties.
The potato cultivar Désirée is a red-skinned late-season potato variety, originally bred in the
Netherlands in 1962 by crossing parent cultivars Urgenta and Depesche (Potato Pedigree
Database) [3]. It is still cultivated due to its favourable agronomic traits, such as predictable
yields and high tolerance to drought and some pathogens [4]. It has also been used in
breeding programs, yet a genome assembly for the Désirée cultivar has not been available.
In research, it has been propagated in tissue cultures and used for genetic manipulation,
including gene overexpression [5], gene silencing [6], and CRISPR-Cas gene editing [7].
Although haplotype-resolved genome assemblies are becoming common in diploid
organisms, the high heterozygosity rate, extensive repeat content, and the autopolyploid
nature of cultivated potatoes still present significant challenges for generating high-quality
haplotype-resolved assemblies. Currently, five haplotype-resolved genomes of autotetraploid
potato cultivars are publicly available [8–12], as well as several phased diploid genomes [13–15].
The recently published haplotype-resolved tetraploid potato assemblies rely on labour-intensive
techniques such as single-pollen sequencing [10] or the use of parental and crossing material [11],
which may not always be available.
Adding to existing publicly available genomes, we provide a reference-quality (CRAQ overall
AQI of 97.5) haplotype-resolved genome assembly of the tetraploid cultivar Désirée,
assembled using solely PacBio HiFi and Illumina Hi-C data. The assembly comes with
a comprehensive structural and functional gene annotation reaching 99.4% BUSCO
completeness for Solanaceae, together with orthology assignments to DM genes. For the potato
research community, we provide an online resource featuring a genome browser and
downloadable genomic assembly and annotation files, a valuable tool for studies
involving allele-specific expression or promoter analysis.
Methods
Sample preparation and sequencing
Leaves from 4-week-old S. tuberosum cv. Désirée plants were collected and flash-frozen.
High molecular weight genomic DNA (HMW gDNA) used for PacBio HiFi, Illumina and
Oxford Nanopore Technologies (ONT) sequencing was extracted from the leaf tissues using
a modified CTAB method [16]. The concentration and quality of the extracted DNA were
assessed using a NanoDrop spectrophotometer.
PacBio HiFi
HMW gDNA was sent to National Genomics Infrastructure (NGI) Sweden for library
preparation and sequencing on the PacBio Sequel II platform. We obtained 79.4 Gbp of raw
data, consisting of 4.1 million reads.
Illumina Hi-C
Leaves from 4-week-old S. tuberosum cv. Désirée plants were collected, flash-frozen in
liquid nitrogen and ground using mortar and pestle. Hi-C library prep using the Omni-C kit
(Dovetail Genomics) and sequencing were performed on an Illumina NovaSeq 6000 platform
by NGI Sweden. Sequencing generated 2018.4 million paired-end (2 × 150 bp) reads.
ONT
The HMW gDNA was used for ONT DNA library prep using the SQK-LSK110 kit and
sequenced on a MinION using the FLO-MIN106 flow cell. Reads were basecalled using
Dorado (v0.7.2) with the model dna_r9.4.1_e8_sup@v3.3 which generated 5.8 Gbp. The
reads with methylation-related tags were converted to bedMethyl format using modkit
(v0.4.1).
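As an illustration of how bedMethyl records of this kind can be summarised, the following minimal sketch computes a coverage-weighted modified fraction per modification code. It assumes the 18-column bedMethyl layout described in the modkit documentation (column 4: modification code, column 10: valid coverage, column 12: number of modified calls). The sequence name, the three example records and the min_cov threshold are invented for illustration and are not taken from the Désirée data.

```python
from collections import defaultdict


def methylation_fraction(lines, min_cov=5):
    """Coverage-weighted modified fraction per modification code.

    Assumed column layout (1-based): 4 = modification code,
    10 = valid coverage, 12 = number of modified calls.
    """
    n_mod = defaultdict(int)
    n_valid = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # tolerates tab- or space-delimited columns
        code = fields[3]
        valid_cov = int(fields[9])
        modified = int(fields[11])
        if valid_cov < min_cov:  # hypothetical filter for low-coverage sites
            continue
        n_mod[code] += modified
        n_valid[code] += valid_cov
    return {code: n_mod[code] / n_valid[code] for code in n_valid}


# Invented example records; "chr01_1" is a hypothetical sequence name.
example = [
    "chr01_1\t100\t101\tm\t10\t+\t100\t101\t255,0,0\t10\t80.00\t8\t2\t0\t0\t0\t0\t0",
    "chr01_1\t200\t201\tm\t20\t-\t200\t201\t255,0,0\t20\t10.00\t2\t18\t0\t0\t0\t0\t0",
    "chr01_1\t300\t301\tm\t2\t+\t300\t301\t255,0,0\t2\t50.00\t1\t1\t0\t0\t0\t0\t0",
]
result = methylation_fraction(example)
print(result)
```

Weighting by valid coverage rather than averaging the per-site percentages keeps low-coverage sites from dominating the summary; whether that is the appropriate choice depends on the downstream analysis.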
Illumina short reads
Illumina short-read library was constructed from the HMW gDNA and sequenced on Illumina
NextSeq 2000 by ELIXIR Slovenia node to generate 150 bp paired-end reads. The
short-read sequencing generated approximately 138 Gbp of raw data, consisting of 460.1
million paired-end (2 × 150 bp) reads.
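The reported yields can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in this section and assumes that "460.1 million paired-end reads" refers to read pairs, which is the reading under which the reported ~138 Gbp is reproduced.

```python
def total_bases(n_pairs, read_length, reads_per_pair=2):
    """Total sequenced bases for a paired-end run."""
    return n_pairs * read_length * reads_per_pair


# Illumina short reads: 460.1 million pairs of 2 x 150 bp reads.
illumina_gbp = total_bases(460.1e6, 150) / 1e9

# PacBio HiFi: 79.4 Gbp in 4.1 million reads gives the mean read length.
hifi_mean_read_kbp = 79.4e9 / 4.1e6 / 1e3

print(f"Illumina yield: {illumina_gbp:.1f} Gbp")
print(f"Mean HiFi read length: {hifi_mean_read_kbp:.1f} kbp")
```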
Genome size and heterozygosity estimation
The genome characteristics of S. tuberosum cv. Désirée, including genome size,
heterozygosity, and repeat content, were estimated using Illumina short-read data and a
k-mer based approach. A 21-mer frequency distribution was generated with Jellyfish
(v2.2.10), and the genome's key features were inferred using GenomeScope2 (v2.0). The
haploid genome size was estimated at 669.6 Mbp, with a heterozygosity rate estimated at
3.8–5.7%.
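To make the k-mer based approach concrete, the sketch below builds a k-mer frequency histogram of the kind that GenomeScope2 takes as input. It is a toy pure-Python stand-in for Jellyfish, not the pipeline used here: it counts canonical k-mers (a k-mer and its reverse complement are treated as one), which is a common choice for unstranded reads but an assumption with respect to this study. The input reads are invented, and k = 3 is used only so that the result can be checked by hand.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map multiplicity -> number of distinct canonical k-mers with that count."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Toy input (invented): the two reads are reverse complements of each other,
# so each of the four canonical 3-mers is observed exactly twice.
toy_hist = kmer_histogram(["AAACCC", "GGGTTT"], k=3)
print(dict(toy_hist))
```

In a real spectrum for a heterozygous tetraploid, the histogram is expected to show several coverage peaks, and the model fitted to those peaks is what yields estimates such as genome size and heterozygosity.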
De novo genome assembly, Hi-C scaffolding and quality assessment
PacBio HiFi and Illumina Hi-C reads were initially assembled into four sets of
haplotype-resolved contigs using Hifiasm (v0.19.8-r603) [17–19]. Hifiasm primary unitigs were
searched against the DM genome assembly with blastn (v2.5.0) [20], and best matches were
visualised on the Graphical Fragment Assembly with Bandage (v0.8.1, Fig. 1a) [21]. We performed
quality control of the contigs using Merqury (v1.3, Fig. 1b) [22] k-mer spectra and BUSCO
completeness scores (v5.4.7, solanales_odb10 dataset) [23]. The length of haplotype draft
assemblies ranged from 761.6 Mbp to 888.4 Mbp with contig N50 sizes ranging from
7.0 Mbp to 13.7 Mbp (Table 1).
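For readers unfamiliar with the N50 statistic quoted above, the following sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The example lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Invented example: total length is 20, and the two longest sequences
# (6 + 5 = 11) are the first to cover at least half of it.
example_n50 = n50([2, 3, 4, 5, 6])
print(example_n50)
```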
Contigs identified as contaminants were removed based on blastn (v2.5.0) searches against
a custom-built contaminant database, which includes Solanum plastid and mitochondrial
sequences and bacterial NCBI RefSeq sequences.
Decontaminated scaffolds were anchored to chromosomes by mapping Hi-C reads to each
haplotype set separately following the manufacturer’s recommended pipeline for Omni-C
data (https://omni-c.readthedocs.io). Briefly, Hi-C reads were mapped using BWA-MEM
(v0.7.17-r1188) [24], then the mappings were parsed with pairtools (v0.3.0) [25] followed by
samtools (v1.3.1) [26] to identify and extract valid pairs. Valid pairs were used to anchor and
orient scaffolds into chromosomes using YaHS (v1.2a.1) [27] and Juicebox Assembly Tools
(v2.17.00) [28,29].
Chromosomes 11 and 12 of haplotype 4 lacked ~20 Mbp and ~30 Mbp of the
pericentromeric region, respectively, and haplotype 1 contained two additional unplaced
scaffolds (scaffold_22 and scaffold_23). Alignment of these scaffolds to the reference genome
(DM v6.1) and inspection of Hi-C contacts suggested that these scaffolds are the missing
regions of chromosomes 11 and 12 in haplotype 4. Therefore, we remapped Hi-C reads and
incorporated these two scaffolds into haplotype 4 using Juicebox Assembly Tools (v2.17.00).
The final scaffolded assembly size amounts to 3.3 Gbp, with individual haplotypes ranging
between 762 and 888 Mbp. As expected, one haplotype is highly similar to the DM haplotype,
whereas the other haplotypes can be more dissimilar (Fig. 1c). A comparison of Merqury k-mer
spectra between the initial contigs and the scaffolded chromosomes (Fig. 1b) reveals that
many apparent duplications in the contigs are resolved during scaffolding. A small proportion
of sequences remains missing from the chromosomes; these can be found in the
whole-genome FASTA.
The haplotype assemblies were sequentially aligned using minimap2 (v2.28) and analysed
with SyRI (v1.7.0) to identify syntenic regions and structural rearrangements, which were
visualised using plotsr (v1.1.1, Fig. 1d).
                                haplotype 1  haplotype 2  haplotype 3  haplotype 4  all haplotypes
Genome length (Mb)              888.4        862.7        761.6        858.5        3371.2
GC content (%)                  35.31        35.27        35.12        35.47        35.3
Contig N50 (Mb)                 11.5         13.7         11.7         7.0          10.8
Number of contigs               1126         867          1048         2695         5736
Chromosome length (Mb)          721.9        729.9        698.5        709.4        2859.6
Scaffold N50 (Mb)               56.9         61.4         60.1         57.1         58.0
Number of scaffolds             705          496          523          1350         3074
Complete BUSCO (%)              96.2%        96.1%        96.6%        95.7%        99.6%
Size of repeat sequences (Mb)   514.2        534.1        489.3        503.6        2041.2
Total gene number               76903        81184        75816        75550        309453
Table 1. Summary of the four haplotypes of the Désirée genome assembly.
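The additive rows of Table 1 can be checked against the "all haplotypes" column with a few lines of code. The sketch below uses only values from the table; the 0.1 Mb tolerance is an assumption that allows for rounding of the per-haplotype values.

```python
# Per-haplotype values and reported totals, copied from Table 1.
table1 = {
    "Genome length (Mb)": ([888.4, 862.7, 761.6, 858.5], 3371.2),
    "Number of contigs": ([1126, 867, 1048, 2695], 5736),
    "Chromosome length (Mb)": ([721.9, 729.9, 698.5, 709.4], 2859.6),
    "Number of scaffolds": ([705, 496, 523, 1350], 3074),
    "Size of repeat sequences (Mb)": ([514.2, 534.1, 489.3, 503.6], 2041.2),
    "Total gene number": ([76903, 81184, 75816, 75550], 309453),
}

# Difference between the sum over haplotypes and the reported total.
differences = {
    name: round(sum(values) - total, 1)
    for name, (values, total) in table1.items()
}

for name, diff in differences.items():
    status = "ok" if abs(diff) <= 0.1 + 1e-9 else "check"
    print(f"{name}: difference {diff:+} ({status})")
```

All rows agree with the reported totals; only the chromosome length differs, by 0.1 Mb, which is within what rounding of the four per-haplotype values could produce.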
Fig. 1 General characteristics of Désirée genome assembly. a) Assembly graph of primary
unitigs coloured by best match to DM chromosomes (also designated with numbers on the
graph). b) Merqury k-mer spectra for initial contigs and scaffolded chromosomes. The k = 21
was used. K-mers are categorized as read-only (grey), unique (red), and shared (blue,
green, purple, orange). Peaks corresponding to higher multiplicities indicate the presence of
highly repeated k-mers. c) Dot plot comparing cv. Désirée chromosome-anchored contigs
with DM v8.1 chromosomes. The colour designates contig identity. d) Genomic synteny of
cv. Désirée haplotype-resolved assembly.
Genome annotation
Repeat elements in the S. tuberosum cv. Désirée genome were identified using the
Extensive de novo TE Annotator (EDTA, v2.2.1) [30]. Repetitive sequences cover 489–534
Mbp per haplotype, representing more than 70% of the genome (Table 2).
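The repeat percentages in Table 2 can be reproduced by dividing the total repeat length per haplotype by the chromosome length given in Table 1. The text does not state which denominator was used, so treating the chromosome-anchored length as the denominator is an inference from the numbers rather than a documented choice.

```python
# Total repeat length per haplotype (Table 2) and chromosome length (Table 1), in Mbp.
repeat_mbp = {
    "haplotype 1": 514.2,
    "haplotype 2": 534.1,
    "haplotype 3": 489.3,
    "haplotype 4": 503.6,
}
chromosome_mbp = {
    "haplotype 1": 721.9,
    "haplotype 2": 729.9,
    "haplotype 3": 698.5,
    "haplotype 4": 709.4,
}

# Repeat content as a percentage of chromosome-anchored sequence.
repeat_pct = {
    hap: round(100 * repeat_mbp[hap] / chromosome_mbp[hap], 1)
    for hap in repeat_mbp
}
print(repeat_pct)
```

Dividing by the full per-haplotype genome length from Table 1 (for example 888.4 Mb for haplotype 1) would instead give values below 70%, which supports the chromosome-length reading.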
Protein-coding genes in the assembled S. tuberosum cv. Désirée genome were predicted
using five complementary approaches: de novo, homology-based,
transcriptome-based, deep-learning, and reference-based predictions (Fig. 2).
Fig. 2 Workflow overview of S. tuberosum cv. Désirée genome annotation.
For transcriptome-based prediction, two methods were applied, one for short reads and one for
Iso-Seq reads. Short reads from multiple tissues were aligned to each haplotype using
STAR (v2.7.10a) [31], and transcripts were assembled with StringTie2 (v2.2.1) [32], followed by
Portcullis (v1.2.4) [33] for junction validation. Iso-Seq reads from five S. tuberosum cultivars
were mapped to each haplotype using minimap2 (v2.28) [34], and transcripts were generated
using IsoQuant (v3.3.1) [35] and TAMA Collapse (tc_version_date_2023_03_28) [36].
BRAKER3 (v3.0.8) [37] was used in ETP mode to predict gene models by integrating de novo,
homology-based, and transcriptome-based predictions. Repeat masking of the assembly
was performed with RepeatMasker (v4.1.2), using EDTA annotations. Protein sequences
from OrthoDB (green plant orthologs) were provided as evidence, and short-read STAR
alignments with invalid junctions removed were included.
Helixer (v0.3.3) [38,39] was used for deep-learning-based gene prediction via its web interface
(https://www.plabipd.de/helixer_main.html). Gene models from the S. tuberosum reference
genome (DM v6.1, UniTato annotation) were transferred to the Désirée assembly using
Liftoff (v1.6.3) [40]. All five transcript or gene model sets were consolidated using Mikado
(v2.3.4) [41] to generate a non-redundant set of transcripts. Protein-coding gene completeness
was assessed using BUSCO (Table 2, v5.4.7, solanales_odb10 dataset) and OMArk (v0.3.0,
omamer v2.0.2) [42].
The predicted protein-coding genes were functionally annotated using EggNOG Mapper
(v2.1.11) [43] with the EggNOG database (version 5.0.2) [44] for the Viridiplantae subset. This
included categories such as gene names, Gene Ontologies (GOs), enzyme functions (EC),
and KEGG pathways, reactions, and modules, along with CAZy families, PFAM domains,
and more. Additionally, functional land-plant protein annotations were predicted using
Mercator4 (v7) [45] via the web platform (https://www.plabipd.de/mercator_main.html).
Annotations from EggNOG and Mercator4 were combined into the final GFF3 annotation file.
Orthologous groups between haplotypes and UniTato genes were identified using
OrthoFinder (v2.5.5) [46]. Across haplotypes, 55.3% of orthogroups contained genes from all
four haplotypes, 22.9% from three haplotypes, 19.2% from two haplotypes, and 2.7% from a
single haplotype. When comparing the Désirée annotation to UniTato, 17.24% of genes were
specific to the Désirée annotation.
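The haplotype occupancy figures above can be derived from a per-orthogroup gene count table. The sketch below shows one way to do this, assuming a mapping from orthogroup identifier to gene counts per annotation set, similar in shape to a gene count table produced by an orthology tool. The orthogroup identifiers, the set names (hap1 to hap4, UniTato) and the counts are hypothetical.

```python
from collections import Counter


def haplotype_occupancy(orthogroups, haplotypes):
    """Percentage of orthogroups containing genes from 1, 2, ... haplotypes.

    orthogroups: dict mapping orthogroup id -> {set name: gene count}.
    Orthogroups without any gene from the listed haplotypes are ignored.
    """
    tally = Counter()
    for counts in orthogroups.values():
        n_present = sum(1 for hap in haplotypes if counts.get(hap, 0) > 0)
        if n_present:
            tally[n_present] += 1
    total = sum(tally.values())
    return {n: 100 * tally[n] / total for n in sorted(tally)}


# Hypothetical toy data: four orthogroups with haplotype genes, one without.
toy_orthogroups = {
    "OG1": {"hap1": 1, "hap2": 2, "hap3": 1, "hap4": 1, "UniTato": 1},
    "OG2": {"hap1": 1, "hap2": 1, "hap3": 0, "hap4": 1},
    "OG3": {"hap1": 1, "hap2": 1},
    "OG4": {"hap4": 3},
    "OG5": {"UniTato": 2},
}
occupancy = haplotype_occupancy(toy_orthogroups, ["hap1", "hap2", "hap3", "hap4"])
print(occupancy)
```

Note that the four percentages reported in the text sum to 100.1%, which is consistent with each value having been rounded independently.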
Type                                haplotype 1         haplotype 2         haplotype 3         haplotype 4
DNA                                 46.6 Mbp (6.4%)     54.1 Mbp (7.4%)     44.6 Mbp (6.4%)     45.9 Mbp (6.5%)
Helitron                            36.1 Mbp (5.0%)     38.0 Mbp (5.2%)     33.6 Mbp (4.8%)     42.3 Mbp (6.0%)
LINE                                12.4 Mbp (1.7%)     8.1 Mbp (1.1%)      7.5 Mbp (1.1%)      8.1 Mbp (1.1%)
LTR                                 176.5 Mbp (24.4%)   188.0 Mbp (25.8%)   165.9 Mbp (23.7%)   193.6 Mbp (27.3%)
LTR/Copia                           16.8 Mbp (2.3%)     19.3 Mbp (2.6%)     20.3 Mbp (2.9%)     23.9 Mbp (3.4%)
LTR/Gypsy                           136.2 Mbp (18.9%)   133.6 Mbp (18.3%)   130.0 Mbp (18.6%)   102.8 Mbp (14.5%)
MITE                                11.9 Mbp (1.6%)     10.2 Mbp (1.4%)     13.0 Mbp (1.9%)     10.6 Mbp (1.5%)
Other                               72.8 Mbp (10.1%)    76.2 Mbp (10.4%)    69.7 Mbp (10.0%)    71.3 Mbp (10.1%)
SINE                                5.1 Mbp (0.7%)      6.6 Mbp (0.9%)      4.7 Mbp (0.7%)      4.9 Mbp (0.7%)
Total                               514.2 Mbp (71.2%)   534.1 Mbp (73.2%)   489.3 Mbp (70.1%)   503.6 Mbp (71.0%)
Total gene number                   76903               81184               75816               75550
Mean gene length (bp)               1695.85             1610.97             1687.71             1677.79
Mean CDS length (bp)                1062.59             1032.74             1060.23             1061.68
Mean exon number                    5.28                5.04                5.31                5.28
Mean intron number                  4.28                4.04                4.31                4.28
Complete BUSCO (%)                  94.1%               93.3%               95.4%               93.7%
Single OMArk HOGs                   82.9%               82.5%               84.3%               82.8%
Duplicated OMArk HOGs               11.6%               11.6%               11.5%               11.9%
Missing OMArk HOGs                  5.5%                5.9%                4.2%                5.4%
Mercator4 proteins annotated (%)    93.5%               93.5%               93.7%               93.5%
Mercator4 proteins classified (%)   50.5%               46.5%               50.7%               50.0%
Mercator4 bins occupied (%)         94.2%               93.9%               94.6%               94.3%
Table 2. Summary of genome annotations for each haplotype.
Data Records
The raw sequencing data, including Illumina Hi-C, Illumina paired-end, PacBio HiFi, and
ONT reads, have been deposited at the National Center for Biotechnology Information
(NCBI) Sequence Read Archive (SRA) under BioProject number PRJNA1185028. Plastid,
mitochondrial and bacterial sequences used for removal of contaminant contigs were
downloaded from NCBI RefSeq release 218. Transcriptomic data used for gene annotation
was downloaded from public repositories: SRA under accessions PRJNA1192223,
PRJNA1186376, PRJNA718240, PRJNA803222, PRJNA1209787 and PRJNA1191209; the
Gene Expression Omnibus (GEO) under accession GSE232028; and the National Genomics
Data Center (NGDC) under accession CRA006012. Existing gene models used in the gene
annotation pipeline were downloaded from https://unitato.nib.si and https://spuddb.uga.edu .
The genome assemblies of the four haplotypes have been submitted to NCBI GenBank
under the BioProject accessions PRJNA1196677, PRJNA1196678, PRJNA1196679 and
PRJNA1196680. The assembled genome, including annotations, methylation profile and
identified orthologs, is hosted in a Zenodo repository under DOI: 10.5281/zenodo.14609304
and is also accessible via an interactive genome browser at https://desiree.nib.si .
Technical Validation
We assessed the assembly quality and completeness using DNA sequencing read mapping,
CRAQ, BUSCO analysis, and Merqury k-mer based evaluation. Illumina reads were mapped
with BWA (v0.7.17), while PacBio and ONT reads were aligned using minimap2 (v2.28).
Mapping rates were 99.90%, 100.00%, and 99.74% for Illumina paired-end, PacBio, and
ONT reads, respectively. CRAQ (v1.0.9) [47] analysis of PacBio and Illumina mappings yielded
a regional AQI of 96.3 and an overall AQI of 97.5, classifying the assembly as reference
quality (AQI > 90). Assembly completeness was assessed with BUSCO (v5.4.7) using the
solanales_odb10 lineage database, identifying 5930 (99.6%) of the 5950 BUSCO
orthologous groups in both the whole genome and chromosome-only assemblies (Table 1).
Merqury (v1.3) analysis, using a Meryl (v1.3) database constructed from Illumina reads,
estimated genome completeness at 98.57% for the whole genome and 95.73% for the
chromosomes. The estimated QV values were 54.30 and 58.53 for the whole genome and
chromosomes, respectively.
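The QV values reported above are Phred-scaled, so they can be converted to an approximate per-base error rate with the standard relation QV = -10 log10(E). The sketch below applies this conversion to the two reported values; it illustrates the scale only and does not reproduce how Merqury derives the error estimate from k-mers.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


whole_genome_error = qv_to_error_rate(54.30)
chromosome_error = qv_to_error_rate(58.53)

print(f"Whole genome: ~1 error per {1 / whole_genome_error:,.0f} bp")
print(f"Chromosomes:  ~1 error per {1 / chromosome_error:,.0f} bp")
```

On this scale, both values correspond to fewer than one expected error per 100 kbp (QV 50).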
Completeness of gene annotation was assessed using OMArk (v0.3.0, omamer v2.0.2),
BUSCO (v5.4.7) and Mercator4 (v7). OMArk analysis demonstrated that our annotation
captured 94.1%–94.6% of Hierarchical Orthologous Groups (HOGs) per haplotype, with
duplication rates ranging from 11.5% to 11.9% (Fig. 3a). When combining genes from all
haplotypes, the proportion of complete HOGs reaches 99.3%, meaning that not all
conserved genes are present in all haplotypes. Similarly, BUSCO analysis reported a
haplotype completeness range of 93.3%–95.4% (Table 2), while the whole genome
annotation achieved 99.4% completeness. Protein classification via Mercator4 revealed that
93.9%–94.6% of Mercator bins were occupied per haplotype, increasing to 97.5% when
combining all proteins (Table 2). As expected, the Mercator bin with the largest proportion of
missing proteins was associated with clade-specific metabolism (Fig. 3b). Additionally, the
classified proteins showed no significant deviation from the median protein length,
confirming consistency in annotation quality (Fig. 3c).
Fig. 3 Validation of gene annotation. a) OMArk quality assessment showing consistency,
completeness and count of proteins across all four haplotypes. b) Histogram showing the
percentage of Mercator4 functional bins occupied by the Désirée proteins. c) Histogram
displaying the distribution of proteins grouped by their percentage deviation from the median
protein length.
Usage Notes
The presented Désirée genome assembly is of high contiguity, completeness and phasing
quality and presents a valuable resource for haplotype-aware transcriptomics, proteomics
and epigenomics analyses. The transfer of UniTato annotations [48] provides translation of
gene identifiers from the DM to the Désirée genome. The RNA-seq datasets used to
supplement gene model annotation are predominantly from mature leaf and root tissue, thus
genes specifically expressed in other tissue and developmental stages may not be fully
captured in the current annotation.
The genome was produced from a plant propagated in tissue culture for over a decade. A
recent pangenome study [49] found that in vitro propagated plants of the Solanum section
Petota have greater numbers of TEs in their genomes. While this seems to hold for LTR
elements and DNA transposons in the Désirée genome, overall TE expansion is not evident.
Examining the DNA methylation profile available in the Désirée genome browser might
provide more insight into specific transposable element expansion in this cultivar.
Recently, efforts were made to generate potato pangenomes [9,49]. However, the number of
included phased tetraploid genomes is still limited. Including Désirée and more phased
tetraploid genomes will improve the completeness of the potato pangenome. This will bridge
knowledge gaps in potato genomics and give potato breeders a powerful toolkit for
developing more resilient and productive cultivars.
Code Availability
The code, scripts and command-line tool commands used for genome assembly, annotation
and quality control are freely available in the GitHub repository
https://github.com/NIB-SI/desiree-genome .
Acknowledgement
This work benefits from resources and services provided by ELIXIR, a distributed
infrastructure for life science data, funded by national governments and the European
Commission, particularly the Elixir-SI node for performing Illumina paired-end sequencing.
Funding for this work was provided by the European Union's Horizon 2020 research and
innovation programme project ADAPT (grant agreement No GA 2020 862-858), Slovenian
Research and Innovation Agency (ARIS) project grants P4-0165, P4-0431, and J4-3089. SB
and BU are supported by the German Federal Ministry of Education and Research (BMBF)
in the frame of the German Network for Bioinformatics Infrastructure (de.NBI).
Author contributions
TG : Methodology, Data curation, Investigation, Visualization, Writing - Original Draft. SB :
Investigation, Writing - Review & Editing. BU : Writing - Review & Editing. NYRG : Resources,
Writing - Review & Editing. RS : Resources, Writing - Review & Editing. LA : Resources,
Writing - Review & Editing. MT : Funding acquisition, Writing - Review & Editing. KG : Funding
acquisition, Conceptualization, Writing - Review & Editing. MP : Conceptualization, Validation,
Resources, Supervision, Project administration, Writing - Review & Editing.
Competing interests
The author(s) declare no competing interests.
References
1. Yang, X. et al. The gap-free potato genome assembly reveals large tandem gene
clusters of agronomical importance in highly repeated genomic regions. Molecular Plant
16 , 314–317 (2023).
2. Pham, G. M. et al. Construction of a chromosome-scale long-read reference genome
assembly for potato. GigaScience 9 , giaa100 (2020).
3. van Berloo, R., Hutten, R. C. B., van Eck, H. J. & Visser, R. G. F. An Online Potato
Pedigree Database Resource. Potato Res. 50 , 45–57 (2007).
4. The European Cultivated Potato Database.
https://www.europotato.org/varieties/view/Desiree-E.
5. Tomaž, Š. et al. A mini-TGA protein modulates gene expression through heterogeneous
association with transcription factors. Plant Physiology 191 , 1934–1952 (2023).
6. Halim, V. A. et al. PAMP-induced defense responses in potato require both salicylic acid
and jasmonic acid. The Plant Journal 57 , 230–242 (2009).
7. Lukan, T. et al. CRISPR/Cas9-mediated fine-tuning of miRNA expression in tetraploid
potato. Horticulture Research 9 , uhac147 (2022).
8. Bao, Z. et al. Genome architecture and tetrasomic inheritance of autotetraploid potato.
Molecular Plant 15 , 1211–1226 (2022).
9. Hoopes, G. et al. Phased, chromosome-scale genome assemblies of tetraploid potato
reveal a complex genome, transcriptome, and predicted proteome landscape
underpinning genetic diversity. Molecular Plant 15 , 520–536 (2022).
10. Sun, H. et al. Chromosome-scale and haplotype-resolved genome assembly of a
tetraploid potato cultivar. Nat Genet 54, 342–348 (2022).
11. Serra Mari, R. et al. Haplotype-resolved assembly of a tetraploid potato genome using
long reads and low-depth offspring data. Genome Biology 25, 26 (2024).
12. Reyes-Herrera, P. H. et al. Chromosome-scale genome assembly and annotation of the
tetraploid potato cultivar Diacol Capiro adapted to the Andean region. G3
Genes|Genomes|Genetics 14, jkae139 (2024).
13. Freire, R. et al. Chromosome-scale reference genome assembly of a diploid potato
clone derived from an elite variety. G3 Genes|Genomes|Genetics 11, jkab330 (2021).
14. van Lieshout, N. et al. Solyntus, the New Highly Contiguous Reference Genome for
Potato (Solanum tuberosum). G3 Genes|Genomes|Genetics 10, 3489–3495 (2020).
15. Zhou, Q. et al. Haplotype-resolved genome analyses of a heterozygous diploid potato.
Nat Genet 52, 1018–1023 (2020).
16. Doyle, J. DNA extraction by using DTAB-CTAB procedures. Phytochemical Bulletin 19,
11–17 (1987).
17. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo
assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175
(2021).
18. Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data.
Nat Biotechnol 40, 1332–1335 (2022).
19. Cheng, H., Asri, M., Lucas, J., Koren, S. & Li, H. Scalable telomere-to-telomere
assembly for diploid and polyploid genomes with double graph. Nat Methods 21,
967–970 (2024).
20. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421
(2009).
21. Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of
de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
22. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality,
completeness, and phasing assessment for genome assemblies. Genome Biology 21,
245 (2020).
23. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update:
Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic
Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology
and Evolution 38, 4647–4654 (2021).
24. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv:1303.3997 [q-bio] (2013).
25. Open2C et al. Pairtools: From sequencing data to chromosome contacts. PLOS
Computational Biology 20, e1012164 (2024).
26. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008
(2021).
27. Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool.
Bioinformatics 39, btac808 (2023).
28. Dudchenko, O. et al. The Juicebox Assembly Tools module facilitates de novo assembly
of mammalian genomes with chromosome-length scaffolds for under $1000. 254797
Preprint at https://doi.org/10.1101/254797 (2018).
29. Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom.
Cell Systems 3, 99–101 (2016).
30. Ou, S. et al. Benchmarking transposable element annotation methods for creation of a
streamlined, comprehensive pipeline. Genome Biology 20, 275 (2019).
31. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21
(2013).
32. Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using
a hybrid of long and short reads with StringTie. PLOS Computational Biology 18,
e1009730 (2022).
33. Mapleson, D., Venturini, L. & Swarbreck, D. EI-CoreBioinformatics/portcullis.
EI-CoreBioinformatics (2024).
34. Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37,
4572–4574 (2021).
35. Prjibelski, A. D. et al. Accurate isoform discovery with IsoQuant using long reads. Nat
Biotechnol 41, 915–918 (2023).
36. Kuo, R. I. et al. Illuminating the dark side of the human transcriptome with long read
transcript sequencing. BMC Genomics 21, 751 (2020).
37. Gabriel, L. et al. BRAKER3: Fully automated genome annotation using RNA-seq and
protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Res. 34,
769–777 (2024).
38. Holst, F. et al. Helixer–de novo Prediction of Primary Eukaryotic Gene Models
Combining Deep Learning and a Hidden Markov Model. 2023.02.06.527280 Preprint at
https://doi.org/10.1101/2023.02.06.527280 (2023).
39. Stiehler, F. et al. Helixer: cross-species gene annotation of large eukaryotic genomes
using deep learning. Bioinformatics 36, 5291–5298 (2021).
40. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations.
Bioinformatics 37, 1639–1643 (2021).
41. Venturini, L., Caim, S., Kaithakottil, G. G., Mapleson, D. L. & Swarbreck, D. Leveraging
multiple transcriptome assembly methods for improved gene structure annotation.
GigaScience 7, giy093 (2018).
42. Nevers, Y. et al. Quality assessment of gene repertoire annotations with OMArk. Nat
Biotechnol 1–10 (2024) doi:10.1038/s41587-024-02147-w.
43. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J.
eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain
Prediction at the Metagenomic Scale. Molecular Biology and Evolution 38, 5825–5829
(2021).
44. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically
annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids
Research 47, D309–D314 (2019).
45. Schwacke, R. et al. MapMan4: A Refined Protein Classification and Annotation Framework Applicable to
Multi-Omics Data Analysis. Molecular Plant 12, 879–892 (2019).
46. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative
genomics. Genome Biology 20, 238 (2019).
47. Li, K., Xu, P., Wang, J., Yi, X. & Jiao, Y. Identification of errors in draft genome
assemblies at single-nucleotide resolution for quality assessment and improvement. Nat
Commun 14, 6556 (2023).
48. Zagorščak, M. et al. Evidence-based unification of potato gene models with the UniTato
collaborative genome browser. Front. Plant Sci. 15 (2024).
49. Bozan, I. et al. Pangenome analyses reveal impact of transposable elements and ploidy
on the evolution of potato species. Proceedings of the National Academy of Sciences
120, e2211117120 (2023).