The Blepharisma stoltei macronuclear genome: towards the origins of whole genome reorganization

Abstract

The germ-soma distinction is a defining feature of multicellular eukaryotes. Analogous to this, ciliates, a ubiquitous microbial eukaryote lineage, have morphologically and functionally distinct nuclei, but within single cells: the germline micronucleus (MIC) and somatic macronucleus (MAC). The origins and mechanisms of the MIC to MAC transformation, especially the extensive elimination of abundant internally eliminated sequences (IESs) and transposons during genome reorganization, are great biological mysteries. Blepharisma represents one of the two earliest diverging ciliate classes, and has unique, dual pathways of MAC development, making it ideal for investigating the functioning, origins and evolution of these processes. Here, we report the MAC genome assembly of Blepharisma stoltei strain ATCC 30299 (41 Mb), arranged as numerous alternative telomere-capped minichromosomes, tens to hundreds of kilobases long. The B. stoltei MAC genome encodes eight PiggyBac transposase homologs liberated from transposons. All are subject to purifying selection, but just one, the putative Blepharisma IES excisase, has a complete catalytic amino acid triad. Numerous genes encoding other domesticated transposases are present in B. stoltei, and often are comparably strongly upregulated in a similar timeframe to model ciliate genome reorganization homologs. Our phylogenetic investigations suggest the PiggyBac homologs may have been ancestral ciliate IES excisases. The B. stoltei MAC genome, together with the upcoming MIC genome, highlights the evolution and complex interplay between transposons, domesticated transposases, and genome reorganization in the context of germline-soma differentiation within single cells.
Short title: Macronuclear genome of Blepharisma stoltei strain ATCC 30299

Minakshi Singh¹, Kwee Boon Brandon Seah¹, Christiane Emmerich¹, Aditi Singh¹, Christian Woehle², Bruno Huettel², Adam Byerly³, Naomi Alexandra Stover⁴, Mayumi Sugiura⁵, Terue Harumoto⁵, Estienne Carl Swart¹,*

1. Max Planck Institute for Biology, Tuebingen, Germany
2. Max Planck Genome Center Cologne, Max Planck Institute for Plant Breeding, Cologne, Germany
3. Department of Computer Science and Information Systems, Bradley University, Peoria, IL, USA
4. Department of Biology, Bradley University, Peoria, IL, USA
5. Nara Women’s University, Nara, Japan

* Corresponding author
bioRxiv preprint doi: https://doi.org/10.1101/2021.12.14.471607; this version posted December 16, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Abbreviations
MIC - micronucleus
MAC - macronucleus
IES - internally eliminated sequence
MDS - macronuclear-destined sequence
PacBio - Pacific Biosciences
CLR - continuous long read (PacBio)
CCS - circular consensus sequence (PacBio)
HiFi - high-fidelity read (PacBio)
ATAS - alternative telomere addition site
PBLE - PiggyBac-like element
PGBD - PiggyBac element-derived
Pgm - PiggyMac
PgmL - PiggyMac-like
Introduction

Acquisition of new organelles defines and is responsible for the origins of eukaryotes [1]. Ciliates represent one of a couple of eukaryotic lineages with more than one kind of nucleus within individual cells [2]. Profound differences between the ciliate nuclei and the genomes they contain now exist, with some parallels to the germline-soma distinction in multicellular organisms [3]. Ciliate nuclear and genomic dimorphism provides a special opportunity to observe the consequences of the evolution of a specialized organelle clearly derived from a general one, and to study the evolution of new, functionally differentiated genes that enable this unusual situation.

The processes responsible for large-scale developmental DNA elimination [4] in ciliate nuclei are under active study, but what the responsible ancestral molecules were remains to be determined. Knowledge of the molecules responsible for ciliate genome reorganization is dominated by Tetrahymena and Paramecium (class Oligohymenophorea), with additional input from Oxytricha, Stylonychia and Euplotes (class Spirotrichea) [5,6], whereas these molecules remain to be investigated in the remaining nine or so classes. To gain fresh insights into genome reorganization and tackle questions about its origin we focused on the ciliate species Blepharisma stoltei. Together with its sister-class, Karyorelictea, the class Heterotrichea, to which B. stoltei belongs, represents the earliest branching ciliate lineages, more distantly related to current model ciliates than those models are to each other [7]. Furthermore, Blepharisma exhibits distinctive alternative somatic developmental pathways, which have potential utility in disentangling processes involved in genome reorganization from preceding ones that may indirectly influence it.

Blepharisma is a distinctive genus of single-celled ciliates (Figure 1) known for the red, light-sensitive pigment, blepharismin, in the sub-pellicular membranes of representative species [8], and unusual nuclear/developmental biology (Figure 2) [9]. To date, molecular investigations and genomics of ciliates have predominantly focused on oligohymenophoreans and spirotrichs (Figure 3, Table S1). In recent years, publication of a draft genome for the heterotrich ciliate Stentor has facilitated revival of this genus for investigations of cellular regeneration [10–12]. However, significant hurdles still need to be overcome to investigate genome reorganization in Stentor coeruleus, since requisite cell mating has not been observed in the reference somatic genome strain (personal communication, Mark Slabodnick), and very high lethality has been reported for other strains in which mating occurred [13]. We therefore focused on Blepharisma, which is amenable to such investigations, with controlled induction of mating, and, critically, procedures for investigating cellular and nuclear development have been established from more than a century of meticulous cytological research [8,14–25].

The Blepharisma stoltei strains used in the present study were originally isolated in Germany (strain ATCC 30299) and Japan (strain HT-IV), with the former continuously cultured for over fifty years, and the latter for over a decade. The cells are comparatively straightforward to maintain, e.g., stable cultures can be established in a simple salt medium on a few grains of rice passaged every few months. Due to their distinctive pigmentation and large size, several Blepharisma species are excellent subjects for introducing cell biology concepts to non-specialists, and are thus readily available for educational purposes from commercial suppliers like Carolina Biological Supply Company (USA). They are ideal subjects for behavioural and developmental investigations, e.g., as voracious predators of smaller ciliates and other unicellular species, and also exhibit pronounced phenotypic plasticity, including forming cysts and giant, cannibal cells under suitable conditions [8].

Like all ciliates [3], Blepharisma cells have two types of nuclei: a macronucleus (MAC) which is very large and transcriptionally active during vegetative growth, and a small, generally transcriptionally inactive micronucleus (MIC), which serves as the germline. Each Blepharisma cell has one MAC and several MICs (Figure 1A, B). In vegetative propagation (asexual replication) of Blepharisma, the MAC pinches into two roughly equal parts during cell fission, one of which is distributed to each of the resulting daughter cells together with the mitotically divided MICs. Upon starvation, Blepharisma cells, like other ciliates, are also capable of conjugation and the associated sexual processes. Essential for developmental investigations, the intricate ballet of nuclear movements and morphological changes occurring during Blepharisma conjugation is well-documented [9] (Figure 2). During this process about half of the MICs in each of the cells undergo meiosis (meiotic MICs) and the rest do not (somatic MICs) (Figure 2B). One of the meiotic MICs eventually gives rise to two haploid gametic nuclei. One gametic MIC (the migratory nucleus) from each conjugating cell is exchanged with that of its partner. In parallel in partnered cells, subsequent fusion of the migratory and stationary haploid nuclei generates a zygotic nucleus (synkaryon), which, after successive mitotic divisions, gives rise to both new MICs and new MACs (known as anlagen). The new MACs continue to mature, eventually growing in size and DNA content [9].

Conveniently for investigations of development and genome reorganization, Blepharisma is one of only two ciliate genera, along with Euplotes [26–29], where conjugation has been shown to be mediated through pheromone-like substances called gamones. Blepharisma has two mating types, distinguished by their gamone production. Mating type I cells release gamone 1, a ~30 kDa tyrosine-rich [30] glycoprotein [31,32]; mating type II cells release gamone 2, calcium-3-(2’-formylamino-5’-hydroxybenzoyl) lactate, a small-molecule effector [33]. Blepharisma cells commit to conjugation when complementary mating types recognize each other's gamones, with the cells remaining paired while meiosis and fertilization occur and eventually new MACs begin to form.

As in model ciliates, during Blepharisma anlagen development, MIC-specific sequences are removed from a MIC genome copy to form a functional MAC genome (Seah et al. in prep.). The new MAC genomes of model ciliate species are largely free of mobile elements and other forms of “junk” DNA contained in the MIC genome [34]. In the best studied ciliates, a form of genome reorganization assisted by small RNAs (sRNAs) is thought to occur [5]. Specific genome segments limited to the MIC, termed internally eliminated sequences (IESs), are excised by domesticated transposases, and may have originated from transposons [3,5,35,36]. Large scale genome-wide DNA amplification accompanies the genome reorganization process, producing thousands of copies in mature MACs of larger ciliate species [3,34].

We were motivated to investigate genome reorganization in Blepharisma, as, unlike model ciliates, these cells can produce two kinds of anlagen, and particularly because one of the two pathways of their development skips the complex series of mitoses, meioses, nuclear exchanges and fertilization [9] (Figure 2). Primary anlagen mature in the conventional manner from zygotic nuclei. Somatic MICs which have not undergone meiosis can give rise to secondary anlagen, which can develop into mature macronuclei [9]. This occurs frequently in strains with a high frequency of selfing (conjugation among cells within a clonal population), in preference to development of primary macronuclear anlagen [9]. This alternative pathway of MAC development has also been observed experimentally after removal of primary macronuclear anlagen by microsurgery [9]. As conjugation progresses, the old (maternal) MACs are progressively degraded [9]. Since the B. stoltei MIC genome has numerous IESs which interrupt genes (Seah et al. in prep.), in principle, to produce functional, mature MAC genomes, it is essential that reorganization of DNA occurs in both primary and secondary anlagen.

Here we provide essential macronuclear genome and developmental, transcriptomic resources for B. stoltei and present the first investigations of possible molecules involved in its genome reorganization. Like those of Stentor coeruleus [10], Blepharisma stoltei genes have the shortest known spliceosomal introns, predominantly 15 or 16 nt long. From long-read sequencing the B. stoltei MAC genome appears to be organized in the form of numerous minichromosomes. Among the MAC-encoded transposase genes we identified in Blepharisma were PiggyBac transposase homologs, which, as far as we are aware, have not been reported in any ciliates other than Paramecium and Tetrahymena. A few Blepharisma PiggyBac homologs are substantially upregulated in MAC development, including one with a complete catalytic triad which is the main candidate IES excisase. Raising the possibility of additional IES excisases, transposases from a few different classes are also present in the MAC genome, including some with complete catalytic triads. Consistent with ancient origins of genome reorganization in ciliates, Blepharisma shares pronounced development-specific upregulation of homologs known to be involved in this process, notably Dicer-like and Piwi proteins responsible for sRNA biogenesis. Blepharisma therefore represents an invaluable outgroup for investigations of the evolution of genome reorganization.

Results

A compact, extensively fragmented genome

The draft Blepharisma stoltei ATCC 30299 MAC genome is compact, at 41 Mb, and AT rich (66%), like most sequenced ciliate MAC genomes, and relatively complete (Figure 3; Table S1, 2, Figure S1). From joint variant calling of reads from strains ATCC 30299 and HT-IV mapped to this assembly, strain ATCC 30299 appears to be virtually homozygous, with only 1277 heterozygous single-nucleotide polymorphisms (SNPs) observed compared to 193725 in strain HT-IV (i.e., individual heterozygosity of 3.08 × 10⁻⁵ vs. 4.67 × 10⁻³ respectively). Low SNP levels were likely beneficial for the overall assembly contiguity, since heterozygosity poses significant algorithmic challenges for genome assembly software [37]. For brevity’s sake, we refer to this draft genome as the Blepharisma MAC genome (and likewise “Blepharisma” for the associated strain). Though the final assembly comprises 64 telomere-to-telomere sequences, it is not possible to define conventional chromosome ends given the extensive natural fragmentation of the Blepharisma MAC genome (characterized in the next section), hence we refer to them simply as “contigs”.
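The heterozygosity figures quoted above can be approximately reproduced from the SNP counts. The following is a minimal sketch, assuming that the denominator is the rounded 41 Mb assembly size; the exact number of callable sites used for the reported values is not stated in the text, so the results only land close to 3.08 × 10⁻⁵ and 4.67 × 10⁻³ rather than matching them exactly.

```python
# Individual heterozygosity = heterozygous SNP count / number of callable sites.
# ASSUMPTION: the denominator is approximated by the rounded 41 Mb assembly
# size; the exact callable-site count is not given in the text.
ASSEMBLY_SIZE_BP = 41_000_000

# Heterozygous SNP counts per strain, as reported in the text.
HET_SNPS = {
    "ATCC 30299": 1_277,
    "HT-IV": 193_725,
}


def heterozygosity(het_snps: int, callable_sites: int) -> float:
    """Fraction of sites that are heterozygous within one strain."""
    if callable_sites <= 0:
        raise ValueError("callable_sites must be positive")
    return het_snps / callable_sites


rates = {
    strain: heterozygosity(count, ASSEMBLY_SIZE_BP)
    for strain, count in HET_SNPS.items()
}
# The ratio of SNP counts does not depend on the assumed denominator.
fold_difference = rates["HT-IV"] / rates["ATCC 30299"]

for strain, rate in rates.items():
    print(f"{strain}: {rate:.2e}")
print(f"HT-IV / ATCC 30299: {fold_difference:.0f}-fold")
```

Whatever denominator is used, the ratio between the two strains is fixed by the SNP counts alone, which is why strain ATCC 30299 can be described as virtually homozygous relative to HT-IV.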

The basic telomere unit of B. stoltei is a permutation of CCCTAACA, like that of its heterotrich relative Stentor coeruleus [10] (Figure S2). However, a compelling candidate for a telomerase ncRNA (TERC) could not be found in either Blepharisma or Stentor using Infernal [38] and RFAM models (RF00025 - ciliate TERC; RF00024 - vertebrate TERC). Thus, without a template sequence, it was not possible to delimit the ends of the repeat. Heterotrichs may use a different kind of ncRNA, or one which is very divergent from known TERCs. In contrast to the extremely short (20 bp) macronuclear telomeres of spirotrichs like Oxytricha with extreme MAC genome fragmentation [39], sequenced Blepharisma macronuclear telomeres are moderately long (Figure S2A), with a mode of 209 bp (i.e., about 26 repeats of the 8 bp motif), extending to a few kilobases.
198
Approximately one in eight reads in the Blepharisma HiFi library were telomere-bearing and
199
distributed across the entire genome (Figure 4A), using a moderately strict definition of
200
possessing at least three consecutive telomeric repeats. In contrast, the telomere-bearing reads
201
of the model ciliate Tetrahymena thermophila predominantly map to chromosome ends (Figure
202
S3A), and only one in fifty nine Tetrahymena CLR reads are telomere-bearing, as identified by
203
three consecutive telomeric subunit repeats (i.e., 3×CCCCAA).
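The "at least three consecutive telomeric repeats" criterion can be illustrated with a small sketch. It assumes exact string matching, and it assumes that repeats may occur in any rotation of the unit and on either strand; the function names are hypothetical and this is not the pipeline used for the reported counts.

```python
TELOMERE_UNIT = "CCCTAACA"  # B. stoltei unit from the text (a permutation thereof)
MIN_REPEATS = 3             # "at least three consecutive telomeric repeats"


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def rotations(unit: str) -> set:
    """All cyclic rotations of the repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def is_telomere_bearing(read: str, unit: str = TELOMERE_UNIT,
                        min_repeats: int = MIN_REPEATS) -> bool:
    """True if the read contains min_repeats exact tandem copies of the unit.

    ASSUMPTION: exact matches only; any rotation and either strand counts.
    """
    read = read.upper()
    candidates = rotations(unit) | rotations(reverse_complement(unit))
    return any(candidate * min_repeats in read for candidate in candidates)


def telomere_bearing_fraction(reads, unit: str = TELOMERE_UNIT) -> float:
    """Fraction of reads that are telomere-bearing (0.0 for an empty input)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(is_telomere_bearing(r, unit) for r in reads) / len(reads)
```

Passing `unit="CCCCAA"` applies the same criterion to the Tetrahymena repeat mentioned above. Real reads contain sequencing errors, so an exact-match rule like this one would be expected to undercount, particularly for lower-accuracy CLR reads.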

Typically, only a minority of mapped reads are telomere-bearing at individual internal contig positions, and so we term these positions alternative telomere addition sites (ATASs) (Figure 4A). We identified a total of 46705 potential ATASs in Blepharisma by mapping PacBio HiFi reads onto the MAC reference, and searching for partially mapped reads that contained telomeric repeats in the unmapped (clipped) segments without any intervening sequence between the mapped segment and the beginning/end of the telomeric repeat. The majority of these sites (38686) were represented by only one mapped HiFi read.
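The clipped-read criterion described above can be sketched for a single alignment record. This is an illustrative sketch only: it assumes SAM-style fields (1-based leftmost mapped position, CIGAR string, read sequence including soft-clipped bases), exact repeat matching, and a three-repeat threshold that is carried over from the read-level definition rather than stated for this step. The helper names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def telomeric_units(unit: str) -> set:
    """All rotations of the unit and of its reverse complement."""
    rc = unit.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return {u[i:] + u[:i] for u in (unit, rc) for i in range(len(u))}


def find_atas(pos: int, cigar: str, seq: str, unit: str = "CCCTAACA",
              min_repeats: int = 3) -> list:
    """Return (reference_position, side) for telomere-bearing soft clips.

    A site is reported only if the telomeric repeat directly abuts the mapped
    segment, i.e. there is no intervening sequence at the junction.
    ASSUMPTION: min_repeats=3 is an illustrative threshold.
    """
    # Hard-clipped bases are absent from seq, so they are ignored here.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    units = telomeric_units(unit)
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    seq = seq.upper()
    sites = []
    if ops and ops[0][1] == "S":
        clip = seq[:ops[0][0]]
        # Left clip: the repeat must end exactly where the mapping starts.
        if any(clip.endswith(u * min_repeats) for u in units):
            sites.append((pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S":
        clip = seq[len(seq) - ops[-1][0]:]
        # Right clip: the repeat must start exactly where the mapping ends.
        if any(clip.startswith(u * min_repeats) for u in units):
            sites.append((pos + ref_len - 1, "right"))
    return sites
```

Counting how many reads support each reported position would then distinguish sites seen in a single read from those seen repeatedly.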

The expected distance between telomeres, and hence the average MAC DNA molecule length, is about 130 kb. This is consistent with the raw input MAC DNA lengths, which were mostly longer than 10 kb and as long as 1.5 Mb (Figure S3C, D), and our observation that a small fraction (1.3%) of Blepharisma’s HiFi reads are bound by telomeres on both ends. Excluding the length of the telomeres, these telomere-bound reads may be as short as 4 kb (Figure S2B). Given the frequency of telomere-bearing reads, we expect many additional two-telomere DNA molecules longer than 12 kb, the approximate maximum length of the HiFi reads (Figure S3C, D).

Since the lengths of the sequenced two-telomere DNA molecules on average imply that they encode multiple genes, we propose classifying them as “minichromosomes”. This places them, in length, between the “nanochromosomes” of ciliates like Oxytricha and Stylonychia, which typically encode single genes and are on the order of a few kilobases long [39,40], and the MAC chromosomes of Paramecium tetraurelia and Tetrahymena thermophila, which are hundreds of kilobases to megabases long [41–43]. Recently it was reported that the Paramecium bursaria MAC genome is considerably more fragmented than those of other previously examined Paramecium species [44]. The DNA molecules from P. bursaria have thus also been classified as minichromosomes [44].

Alternative telomere addition sites in the MAC genome tend to be intergenic in model ciliates like Oxytricha trifallax [39]. In Blepharisma, we found more intergenic ATASs (28309) than intragenic ones (18396). As intergenic regions only make up 10.1 Mb of the assembly, the intergenic frequency of ATASs is about five-fold higher (2.81 per 1 kb) than the intragenic frequency (0.562 per 1 kb). The presence of intragenic ATASs raises the question of how the cell tolerates or deals with mRNAs encoding partial proteins transcribed from 3’ truncated genes. Since the sequence data was from a clonal population, it is not possible to tell how much ATAS variability there is within individual cells. However, it is conceivable that their positional variation in single cells reflects that of the population. In this case, together with redundancy from massive DNA amplification, there would likely be sufficient intact copies of every gene.
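The site densities quoted above follow from simple arithmetic, sketched here using only numbers given in the text. The intergenic density is recomputed from the rounded 10.1 Mb figure, so it agrees with the reported 2.81 per kb only approximately; the genic span is not stated, so the intragenic density is taken as reported rather than recomputed.

```python
# Counts and spans as reported in the text.
INTERGENIC_ATAS = 28_309
INTRAGENIC_ATAS = 18_396
INTERGENIC_BP = 10_100_000            # "10.1 Mb" (rounded)
INTERGENIC_PER_KB_REPORTED = 2.81
INTRAGENIC_PER_KB_REPORTED = 0.562


def per_kb(count: int, span_bp: int) -> float:
    """Number of sites per kilobase of sequence."""
    return count / (span_bp / 1000)


# The two classes sum to the 46705 potential ATASs reported earlier.
total_atas = INTERGENIC_ATAS + INTRAGENIC_ATAS

# Recomputed from the rounded span, hence slightly below the reported value.
intergenic_density = per_kb(INTERGENIC_ATAS, INTERGENIC_BP)

# "About five-fold higher" follows directly from the reported densities.
fold_enrichment = INTERGENIC_PER_KB_REPORTED / INTRAGENIC_PER_KB_REPORTED

print(total_atas, round(intergenic_density, 2), round(fold_enrichment, 1))
```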

Beyond the first 2-5 bp corresponding to the junction sequences, the average base composition of the chromosomal sequence flanking ATAS junctions shows an asymmetrical bias (Figure 4B). From position +6 onwards there is an enrichment of T to about 40% and A to 35-39%, compared to the genome-wide frequencies of 33% each. At position +19 to +23, there is a slight decrease in T to 37-39%. A and T frequencies gradually decline back to about 35% each by position +150. Correspondingly, G and C are depleted downstream of ATAS junctions, dropping to a minimum of 8.6% and 11% respectively around position +37, compared to the genome-wide average of 17% each. AT enrichment and GC depletion upstream of ATAS junctions are less pronounced.
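A position-specific base composition profile of the kind summarized above can be computed as in the following sketch. It assumes the flanking sequences have already been extracted and oriented so that index 0 corresponds to position +1 after the junction; the toy sequences are made up purely for illustration and are not data from this study.

```python
from collections import Counter


def base_composition(flanks, alphabet: str = "ACGT") -> list:
    """Per-position base frequencies across a set of flanking sequences.

    flanks: sequences oriented so that index 0 is position +1 after the
    junction. Positions beyond the shortest sequence are ignored, and bases
    outside the alphabet (e.g. N) are excluded from the totals.
    """
    flanks = [s.upper() for s in flanks]
    if not flanks:
        return []
    length = min(len(s) for s in flanks)
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in flanks)
        total = sum(counts[b] for b in alphabet)
        profile.append(
            {b: (counts[b] / total if total else 0.0) for b in alphabet}
        )
    return profile


# Toy input, invented for illustration only.
toy_flanks = ["ATTA", "ATGA", "TTCA", "ATTA"]
profile = base_composition(toy_flanks)
for position, freqs in enumerate(profile, start=1):
    print(f"+{position}", {b: round(f, 2) for b, f in freqs.items()})
```

Comparing each position's frequencies against the genome-wide values (33% each for A and T, 17% each for G and C) would then expose the downstream enrichment and depletion described in the text.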

If breakage and chromosome healing were random, we would not expect such an asymmetry. This suggests that there is a nucleotide bias, whether in the initiation of breaks, telomere addition, or in the processing of breaks before telomere addition. However, we have not yet identified any conserved motif like the 15 bp chromosome breakage site (CBS) in Tetrahymena [45] nor a short 10-bp sequence periodicity in base composition like in Oxytricha trifallax [46]. Therefore, telomere addition in B. stoltei appears to involve base-pairing of short segments of about 2 bp between the telomere and chromosome, with a bias centered on the “CT” in the telomere unit, and an asymmetrical preference for AT-rich sequences on the chromosomal side of the junction.

Tiny spliceosomal introns

The Blepharisma stoltei ATCC 30299 macronuclear genome is gene-dense (25,711 predicted genes), with short intergenic regions, tiny introns and untranslated regions (UTRs) (Figure 5A). In contrast to Stentor which, unusually for ciliates, uses the standard genetic code, B. stoltei uses an alternative genetic code with UGA codons reassigned from stops to tryptophan (Figure S5).
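The practical consequence of the UGA reassignment can be seen by translating the same coding sequence under both codes. The sketch below builds the standard code from the conventional TCAG-ordered amino acid string and changes only the UGA (TGA in DNA) assignment; treating UAA and UAG as remaining stop codons is an assumption, since the text mentions only UGA, and the example sequence is invented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AA = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

STANDARD_CODE = {
    b1 + b2 + b3: STANDARD_AA[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Reassignment described in the text: UGA encodes tryptophan.
# ASSUMPTION: all other codons, including UAA and UAG stops, are unchanged.
BLEPHARISMA_CODE = {**STANDARD_CODE, "TGA": "W"}


def translate(cds: str, code: dict) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = code.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


example_cds = "ATGTGAAAATAA"  # invented example: ATG TGA AAA TAA
print(translate(example_cds, STANDARD_CODE))
print(translate(example_cds, BLEPHARISMA_CODE))
```

Under the standard code an in-frame UGA truncates the predicted protein, which is why gene prediction for B. stoltei needs the alternative code.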

As in Stentor [10], most (82%) Blepharisma genes have no introns. In line with genome compactness, during our inspections we also observed numerous overlapping poly(A)-tailed RNA-seq reads on opposite strands derived from convergently transcribed gene pairs. The correlation of the lengths of different noncoding region classes (intergenic regions, introns and UTRs) can be explained by them being subject to common, neutral evolutionary processes [47].

Blepharisma introns are mostly (97%) 15 or 16 nucleotides (nt) long, like those of Stentor (Figure 5B). Though intron reduction (7389 introns predicted in the reference B. stoltei MAC genome, i.e., 0.29 introns per gene) is not as extreme as in some other microbial eukaryotes, like Giardia lamblia [48], where almost all have been lost, both Blepharisma and Stentor have far fewer introns than other ciliates (e.g., intron densities of 1.6, 2.3 and 4.8 introns per gene in Paramecium, Oxytricha and Tetrahymena, respectively [49]) and than the putative, relatively intron-rich eukaryotic common ancestor [50], in addition to the extreme reduction in intron length.

Blepharisma 15 nt introns possess a characteristic branch-point “A”, as would be expected in classical models of lariat formation during mRNA splicing (Figure 5C). Introns 16 nt long almost invariably have an “A” at either 10 or 11 nt downstream of the donor site (i.e., only one of 499 does not, but has “A” at 9 nt), although this is not obvious in the consensus sequence logo because the position is variable (Figure S6D). Similarly, 17 nt introns all possess “A” at 10-12 nt downstream of the donor site. Only a few intron bases, 5-8 and 12, of Blepharisma’s 15 nt introns are relatively unconstrained (Figure 5C). This leaves little room for the presence of any additional regulatory elements in the mRNA or underlying DNA.
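The positional rule for candidate branch-point adenines in 16 nt and 17 nt introns can be checked with a few lines of code. This sketch assumes positions are counted 1-based from the first intron base, which is one possible reading of "nt downstream of the donor site"; the text does not give a position for 15 nt introns, so other lengths are deliberately left unclassified. The function name and example sequences are hypothetical.

```python
# Candidate branch-point "A" positions by intron length, from the text.
# ASSUMPTION: positions are 1-based, counted from the first intron base.
EXPECTED_A_POSITIONS = {
    16: (10, 11),
    17: (10, 11, 12),
}


def has_candidate_branch_point(intron: str):
    """True/False for 16 or 17 nt introns; None for lengths not covered."""
    positions = EXPECTED_A_POSITIONS.get(len(intron))
    if positions is None:
        return None
    intron = intron.upper().replace("U", "T")
    return any(intron[p - 1] == "A" for p in positions)


# Invented example introns (not taken from the annotation).
examples = {
    "GTAAGTTTTAATTTAG": has_candidate_branch_point("GTAAGTTTTAATTTAG"),
    "GTTTGTTTTTTTTTAG": has_candidate_branch_point("GTTTGTTTTTTTTTAG"),
    "GTAAGTTTTATTTAG": has_candidate_branch_point("GTAAGTTTTATTTAG"),
}
for intron, result in examples.items():
    print(len(intron), intron, result)
```

Applied to all annotated 16 nt introns, a check of this kind would be expected to flag the single exception noted above, whose adenine lies one position upstream of the expected window.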

Extensive duplications of transmembrane protein genes

A notable extended ~220 kb region encoding 53 genes belonging to a single orthologous group (orthogroup), OG0000085, is present on Contig_1 (Figure 5A). Four additional OG0000085 genes are present at the opposite end of Contig_1, and 24 copies are found on other contigs, often clustered together (Figure 5D). The DNA coverage across this region is lower (74×) than that of the rest of Contig_1 (185×). Though there is uncertainty in the exact extent, given the sheer volume of reads involved, the assembled sequences certainly correspond to highly repetitive regions of the MAC genome. At the junction between the lower and higher coverage regions more than 30 HiFi reads link the two regions of coverage, and a similar number of telomere-bearing reads are in close proximity. At the junction we also observe at least two potential locations of IESs, corresponding to regions that may be partially IES/partially MDS.

Large clusters of genes from particular orthogroups can be found on additional contigs (Figure 5D). In total, 551 (2%) of the predicted B. stoltei genes belong to the orthogroups with the largest clusters per contig. Some of the largest contiguous clusters of genes from these orthogroups are situated at the ends of contigs, suggesting they may have caused assembly breaks beyond them. One contig, split off from other connected components in the assembly graph, predominantly encodes genes from a single orthogroup (contig_64, 43× coverage; Figure 5D; Figure S7). Further increases in read length and accuracy may allow assemblers to fully resolve these in future. Curiously, all the orthogroups corresponding to the largest contiguous clusters of genes appear to encode transmembrane proteins, or decayed remnants thereof. The nature of these proteins is described in Supplemental Text (“Properties of proteins encoded by extensive duplications”).

Features of gene expression during new MAC development

To gain an overview of the molecular processes during genome reorganization in Blepharisma we examined gene expression trends across a developmental time series. Complementary mating strains of B. stoltei were treated with gamones of the opposite mating type, before mixing to initiate conjugation [9,51]. Samples for morphological staging and RNA-seq were taken at intervals from the time of mixing ("0 hour" time point) up to 38 hours (Figure S8). The progression of nuclear morphological development and the proportion of cells in each stage at various time points are illustrated in Figure S9.

During conjugation in Blepharisma, meiosis begins around 2 h after conjugating cell pairs form and continues up to around 18 h, by which time gametic nuclei generated by meiosis have been exchanged (Figure 2A, B, Figure S9). This is followed by karyogamy and mitotic multiplication of the zygotic nucleus (22 h). At the 26 h time point, new, developing primary MACs can be observed in the conjugating pairs, in the form of large, irregular bodies (Figure 2A, Figure S9). These macronuclear anlagen mature into the new MACs of the exconjugant cell by 38 h, after which cell division generates two daughter cells. Smaller secondary MACs, derived directly from MICs without all the intermediate nuclear stages, can also be seen from 22 h to 30 h, but eventually disappear, leaving only the primary MACs (Figure 2A, Figure S9).
Examining gene expression at 26 hours, when the majority of cells are forming a new MAC (Figure S9), we observe two broad trends: relatively stable constitutive gene expression (Table S5; Data S3), e.g., an actin homolog (ENA accession: BSTOLATCC_MAC19444) and a bacteria-like globin protein (BSTOLATCC_MAC21846), versus pronounced development-specific upregulation (Table S6; Data S3), e.g., a histone (BSTOLATCC_MAC21995), an HMG box protein (BSTOLATCC_MAC14030), and a translation initiation factor (eIF4E, BSTOLATCC_MAC5291).
We eschewed a crude, large-scale Gene Ontology (GO) enrichment analysis in favour of close scrutiny of a smaller subset of genes strongly upregulated during new MAC formation, which we expect are more relevant to molecular biologists. For this, computational gene annotations in combination with BLASTP searches and examination of literature associated with homologs were used. Ranking genes in descending order of their relative expression at 26 hours vs. the average expression of starved, gamone-treated, and 0 hour cells reveals numerous genes of interest in the context of genome reorganization, including homologs of proteins known to be involved in genome reorganization in model ciliate species (Table S6).
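The ranking described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used for Table S6: the gene names and expression values are invented, the condition labels are hypothetical, and the pseudocount (added to avoid division by zero) is our own choice rather than a parameter taken from the study.

```python
from statistics import mean

def rank_by_upregulation(expr, target="26h",
                         baseline=("starved", "gamone_treated", "0h"),
                         pseudocount=1.0):
    """Rank genes by expression at `target` relative to the mean of the baseline conditions.

    `expr` maps gene -> {condition: normalized expression}. The pseudocount is an
    assumption of this sketch (it guards against division by zero).
    """
    ranked = []
    for gene, values in expr.items():
        base = mean(values[c] for c in baseline)
        ratio = (values[target] + pseudocount) / (base + pseudocount)
        ranked.append((gene, ratio))
    # Descending order: most strongly upregulated genes first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Invented toy values, for illustration only.
toy = {
    "gene_A": {"starved": 1.0, "gamone_treated": 1.0, "0h": 1.0, "26h": 199.0},
    "gene_B": {"starved": 50.0, "gamone_treated": 48.0, "0h": 52.0, "26h": 49.0},
    "gene_C": {"starved": 3.0, "gamone_treated": 3.0, "0h": 3.0, "26h": 39.0},
}
ranking = rank_by_upregulation(toy)
for gene, ratio in ranking:
    print(f"{gene}\t{ratio:.2f}")
```

In this toy example gene_B behaves like a constitutively expressed gene (ratio close to 1), whereas gene_A and gene_C behave like development-specific genes.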
Among the top 100 genes ranked this way (69× to 825× upregulation), nine contain PFAM transposase domains: DDE_Tnp_1_7, DDE_3, MULE and DDE_Tnp_IS1595 (e.g., BSTOLATCC_MAC2188, BSTOLATCC_MAC14490, BSTOLATCC_MAC18054, BSTOLATCC_MAC18052, respectively). We also observe small RNA (sRNA) biogenesis and transport proteins, i.e., a Piwi protein (BSTOLATCC_MAC5406) and a Dicer-like protein (BSTOLATCC_MAC1138; see also “Supplemental text”, “Homologs of small RNA-related proteins involved in ciliate genome reorganization” and Figure S10), and a POT1 telomere-binding protein homolog (POT1.4; BSTOLATCC_MAC1496; Supplemental Text “Telomere-binding protein paralogs”). Numerous homologs of genes involved in DNA repair and chromatin are also present among these highly developmentally upregulated genes (“Supplemental text”, “Development-specific upregulation of proteins associated with DNA repair and chromatin” and “Supplemental text”, “Development-specific histone variant upregulation”). The presence of proteins involved in either transcription initiation or translation initiation among these highly upregulated genes suggests a possible means of coordinating the regulation of development-specific gene expression (“Supplemental text”, “Development-specific upregulation of proteins associated with initiation of transcription and translation”).
PiggyBac homologs in Blepharisma are candidate IES excisases
In Paramecium tetraurelia and Tetrahymena thermophila, PiggyBac transposases are responsible for IES excision during genome reorganization occurring in the developing new MAC [52,53]. These transposases appear to have been domesticated, i.e., their genes are no longer contained in transposons but are encoded in the somatic genome where they provide a necessary function in genome development [52,53]. PiggyBac belongs to the DDE/D superfamily of transposases due to the presence of a protein domain containing a catalytic triad of three acidic residues: two aspartic acid residues (D) followed by a third residue that is either an aspartic acid or a glutamate (E). PiggyBac homologs typically have a DDD catalytic triad, rather than the more common DDE triad of other DDE/D transposases [54]. The DDD catalytic motif is present in the PiggyMac (Pgm) of Paramecium and the Tetrahymena PiggyBac homologs Tpb1 and Tpb2 [53,55]. Among ciliates, domesticated PiggyBac transposases have so far only been reported in the model oligohymenophorean genera Paramecium and Tetrahymena, and have not been detected in either the MAC or MIC genomes of the spirotrich Oxytricha trifallax [39,56].
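As a minimal illustration of the DDD/DDE criterion described above, the following sketch classifies the three residues found at the catalytic positions of an aligned homolog. It assumes that those three positions have already been extracted from a multiple sequence alignment; the function name and its return labels are hypothetical and are not part of any published tool.

```python
def classify_triad(residues):
    """Classify three aligned catalytic-position residues of a DDE/D transposase.

    `residues` is a 3-letter string (one-letter amino acid codes), assumed to have
    been taken from the catalytic columns of an alignment. Returns "DDD", "DDE",
    or "incomplete".
    """
    if len(residues) != 3:
        raise ValueError("expected exactly three residues")
    triad = residues.upper()
    # The first two positions must be aspartate; the third may be aspartate or glutamate.
    if triad[0] == "D" and triad[1] == "D" and triad[2] in ("D", "E"):
        return triad
    return "incomplete"

print(classify_triad("DDD"))  # complete, PiggyBac-like triad
print(classify_triad("DDE"))  # complete, triad typical of other DDE/D transposases
print(classify_triad("DND"))  # incomplete triad
```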
379
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 16, 2021. ; https://doi.org/10.1101/2021.12.14.471607doi: bioRxiv preprint
17
380
We detected more transposase domains (9 distinct PFAM identifiers) in B. stoltei than any of the
381
other ciliate species we examined (Figure 6A). Seven of these domains are also found in at
382
least one other ciliate species. Using HMMER searches with the PFAM domain characteristic of
383
PiggyBac homologs, DDE_Tnp_1_7 (PF13843), we found eight homologs in B. stoltei ATCC,
384
none of which were flanked by terminal repeats identified by RepeatModeler. We also found
385
PiggyBac homologs in B. stoltei HT-IV and B. japonicum R1072. Reminiscent of Paramecium
386
tetraurelia, which, among ten PiggyMac homologs, has just one homolog with a complete
387
catalytic triad [55], the DDD triad is preserved in just a single B. stoltei ATCC 30299 PiggyBac
388
homolog (Figure 6B; Contig_49.g1063, BSTOLATCC_MAC17466). This gene is strongly
389
upregulated during development from 22 to 38 hours after formation of heterotypic pairs, when
390
new MACs develop and IES excision is required (Figure 6B). Furthermore, there are significant
391
similarities in the basic properties of Blepharisma and Paramecium IESs, which will be reported
392
in detail in the subsequent B. stoltei MIC genome paper (Seah et al., in prep.). Consequently,
393
adopting the Paramecium nomenclature, we refer to our primary candidate IES excisase as
394
Blepharisma PiggyMac (BPgm) and the other homologs as BPgm-Likes (BPgmLs).
PiggyMac homologs are also present in other heterotrich ciliates but have not yet been described because of genome assembly or annotation challenges. Using BPgm as a query sequence, we found convincing homologs containing the conserved catalytic DDD motif in a genome assembly of the heterotrichous ciliate Condylostoma magnum (TBLASTN e-value 2e-24 to 2e-37). While we failed to detect the DDE_Tnp_1_7 domain in predicted genes of the heterotrich Stentor coeruleus, we were able to detect relatively weak, adjacent TBLASTN matches split across two frames in its draft MAC genome (e-value 7e-15; SteCoe_contig_741 positions 6558-5475). After joining ORFs corresponding to this region and translating them, we obtained a more convincing DDE_Tnp_1_7 match with HMMER3 (e-value 2e-24). This corresponds either to a pseudogene or to a poorly assembled genomic region. It is also possible that additional PiggyBac homologs were missed in unassembled Stentor MAC genome regions.
In addition, we searched for PiggyMac homologs in the MAC genome of the pathogenic oligohymenophorean ciliate Ichthyophthirius multifiliis [57]. TBLASTN searches using the T. thermophila Tpb2 as a query returned no hits. A HMMER search using hmmscan with a six-frame translation of the I. multifiliis MAC genome against the PFAM-A database also did not return any matches with independent E-values (i-E-value) less than 1. We note that, based on BUSCO analyses (Figure S1), the I. multifiliis genome appears to be less complete than those of the other ciliates we examined, so a better genome assembly will be needed to investigate the possibility that PiggyBac homologs are encoded elsewhere in this MAC genome.
In addition to the PFAM DDE_Tnp_1_7 domain, three of the Blepharisma PiggyBac homologs also possess a short, characteristic cysteine-rich domain (CRD) (Figure 6C). This domain is essential for Pgm activity and Paramecium IES excision [58]. PiggyBac CRDs have been classified into three different groups [58]. In Blepharisma, the CRD consists of five cysteine residues arranged as CxxC-CxxCxxxxH-Cxxx(Y)H (where C, H, Y and x respectively denote cysteine, histidine, tyrosine and any other residue). Two Blepharisma homologs possess this CRD without the penultimate tyrosine residue, while the third contains a tyrosine residue before the final histidine. This -YH feature towards the end of the CxxC-CxxCxxxxH-Cxxx(Y)H CRD is shared by all the PiggyBac homologs we found in Condylostoma, as well as the PiggyBac-like element (PBLE) from the bat Myotis lucifugus (Piggybat) and the human PiggyBac element-derived (PGBD) proteins PGBD2 and PGBD3. In contrast, PiggyBac homologs from Paramecium and Tetrahymena have a CRD with six cysteine residues arranged in variants of the motif CxxC-CxxC-Cx{2-7}Cx{3,4}H, and group together with human PGBD4 and Spodoptera frugiperda PBLE (Figure 6C).
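The two CRD arrangements can be expressed as regular expressions for a rough motif scan. The sketch below rests on several assumptions that are ours rather than the study's: hyphens in the motifs are interpreted as variable-length linkers (arbitrarily capped at 40 residues), "(Y)" is interpreted as an optional tyrosine immediately before the final histidine, and the test sequences are synthetic strings, not real proteins. A match is only a hint that a CRD may be present and is no substitute for alignment-based classification.

```python
import re

# Assumption: each hyphen in the written motif is a variable-length linker.
# The 40-residue cap is an arbitrary choice for this sketch.
GAP = r".{0,40}?"

# Five-cysteine arrangement: CxxC-CxxCxxxxH-Cxxx(Y)H
FIVE_CYS = re.compile(r"C.{2}C" + GAP + r"C.{2}C.{4}H" + GAP + r"C.{3}Y?H")

# Six-cysteine arrangement: CxxC-CxxC-Cx{2-7}Cx{3,4}H
SIX_CYS = re.compile(r"C.{2}C" + GAP + r"C.{2}C" + GAP + r"C.{2,7}C.{3,4}H")

def classify_crd(protein_seq):
    """Return 'six-cysteine', 'five-cysteine' or 'none' for a protein sequence.

    The six-cysteine pattern is tested first, so a sequence matching both
    patterns is reported as 'six-cysteine'.
    """
    seq = protein_seq.upper()
    if SIX_CYS.search(seq):
        return "six-cysteine"
    if FIVE_CYS.search(seq):
        return "five-cysteine"
    return "none"

# Synthetic examples (A and G used as filler residues).
with_tyrosine = "MACAACGGGCAACAAAAHGGCAAAYHK"
without_tyrosine = "MACAACGGGCAACAAAAHGGCAAAHK"
six_cysteines = "MKCAACGGCAACGGCAAACAAAHK"

for s in (with_tyrosine, without_tyrosine, six_cysteines, "MAAAGGG"):
    print(classify_crd(s))
```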
Previous experiments showed that individual or paired gene knockdowns of most of the ten Paramecium tetraurelia PiggyMac(-like) paralogs lead to substantial IES retention, even though only one PiggyMac gene (PGM) has the complete catalytic triad, indicating that all these proteins are functional [55]. To examine whether PiggyMac homologs in Paramecium are functionally constrained, we examined ratios of non-synonymous (dN) to synonymous (dS) substitution rates (ω = dN/dS) for pairwise codon sequence alignments within and between species, with the latter using two closely related Paramecium species (P. tetraurelia and P. octaurelia). All the dN/dS values for pairwise comparisons of each of the catalytically incomplete P. tetraurelia PgmLs versus the complete Pgm were less than 1, ranging from 0.01 to 0.25 (Table S7). All dN/dS values for pairwise comparisons between P. tetraurelia and P. octaurelia PiggyBac orthologs were also substantially less than 1, ranging from 0.02 to 0.11 (Table S8). Since dN/dS = 1 indicates genes evolving neutrally [59], none of these genes are likely pseudogenes, and all are subject to similar strong purifying selection.
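The interpretation of ω used in this argument can be written as a small helper. This is a sketch of the reasoning only, not of how dN/dS was estimated: the tolerance around ω = 1 is an arbitrary assumption of ours, and the only values passed in are the range endpoints quoted in the text for Tables S7 and S8.

```python
def interpret_omega(omega, tolerance=0.05):
    """Give a coarse reading of a pairwise dN/dS (omega) estimate.

    `tolerance` defines how close to 1 counts as 'approximately neutral';
    it is an arbitrary assumption of this sketch, not a published threshold.
    """
    if omega < 0:
        raise ValueError("omega cannot be negative")
    if abs(omega - 1.0) <= tolerance:
        return "approximately neutral"
    if omega < 1.0:
        return "purifying selection"
    return "positive selection"

# Range endpoints reported in the text for the Paramecium comparisons.
reported = {
    "P. tetraurelia PgmLs vs Pgm (min)": 0.01,
    "P. tetraurelia PgmLs vs Pgm (max)": 0.25,
    "P. tetraurelia vs P. octaurelia (min)": 0.02,
    "P. tetraurelia vs P. octaurelia (max)": 0.11,
}
for label, omega in reported.items():
    print(f"{label}: omega = {omega} -> {interpret_omega(omega)}")
```

A pseudogene free of constraint would be expected to drift towards ω ≈ 1, which is why values well below 1 are read as evidence of retained function.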
Since only one of the eight homologs has the complete DDD catalytic triad characteristic of functional PiggyBac transposases, it is the only PiggyMac homolog in B. stoltei that could be catalytically active. The in vitro catalytic activity of the T. ni PiggyBac transposase is abolished when any of the catalytic residues is substituted by alanine [60]. Accordingly, we expect all Blepharisma PiggyMac homologs with an incomplete catalytic triad to be catalytically inactive. In pairwise comparisons of each of the catalytically incomplete homologs versus the complete one, dN/dS ranges from 0.0076 to 0.1351 (Table S9). As these estimates fall in a similar range to those obtained for Paramecium PiggyMac/PiggyMac-likes, the B. stoltei PiggyMac homologs appear to be predominantly subject to comparable purifying selection and are unlikely to be pseudogenes, but may have differentiated and acquired new functional roles.
A common origin for most ciliate PiggyBac homologs
To determine whether the Blepharisma PiggyBac homologs share a common ciliate ancestor, or whether they arose from independent acquisitions in major ciliate groups, we explored a range of phylogenetic methods. We created a phylogeny of 155 sequences of DDE_Tnp_1_7 domains of PiggyBac homologs, comprising representatives of putative domesticated transposases from the genus Blepharisma (B. stoltei ATCC 30299, B. stoltei HT-IV, B. japonicum), Condylostoma magnum, Paramecium spp. and Tetrahymena thermophila, together with PiggyBac-like element (PBLE [61]) transposases from a stramenopile species, an archaeplastid species, several opisthokont species and an amoebozoan, and homologs of the human PiggyBac-derived genes PGBD1, PGBD3, PGBD4 and PGBD5 (Figure 7; Data S1). The three BPgms from the three Blepharisma strains whose MAC genomes we sequenced form a monophyletic clade, with the BPgmLs 1 and 2 as outgroups. The BPgmLs 3, 4, 5, 6 and 7 are grouped together with the Condylostoma Pgms (each of which appears to have a complete catalytic triad). These two clades form a heterotrich-specific group in the phylogeny. We observe that the ciliate Pgms and PgmLs largely cluster to form a single clade. T. thermophila Tpb7 is an exception: it appears to group together with homologs of PGBD5 from H. sapiens, G. gallus and D. rerio, although this grouping has an aBayes support value of 0.84, much lower than the mode (1.0) and average (0.91) for this phylogeny. The other exceptions are two fungal PBLEs and two red algal PBLEs with low aBayes support within the predominantly monophyletic ciliate clade.
We applied tree topology tests to assess support for a monophyletic clade of ciliate Pgms and PgmLs. Constrained trees were compared by different methods, including bootstrap proportion using the RELL method [62], a one-sided Kishino-Hasegawa test, a Shimodaira-Hasegawa test [63,64], expected likelihood weights [65], and an approximately unbiased test [66] (Table S10). A tree with the constraint of monophyly for all ciliate Pgms and PgmLs lacks support in comparison to an unconstrained tree with all the ciliate Pgms and PgmLs and eukaryotic PBLEs. After excluding T. thermophila Tpb7, a tree constrained with a clade consisting of the remaining ciliate Pgms and PgmLs is supported by all tests. In other words, the ciliate PiggyBac homologs except Tpb7 likely originated in a ciliate common ancestor.
Additional domesticated transposases
Three B. stoltei ATCC 30299 MAC genome-encoded proteins possess PFAM domain DDE_1 (PF03184; Figure 8A). In PFAM version 35 the most common domain combinations (architectures) with DDE_1, aside from proteins with just this domain detected (5898 sequences), are with an N-terminal PFAM domain HTH_Tnp_Tc5 (PF03221) alone (2240 sequences), and with both an N-terminal CENP-B_N domain (PF04218) and a central HTH_Tnp_Tc5 domain (1255 sequences). All the B. stoltei proteins possess the HTH_Tnp_Tc5 domain. Though pairwise sequence identity is low amongst the Blepharisma proteins (avg. 28.3%) in their multiple sequence alignment, the CENP-B_N domain in one of them appears to align reasonably well to corresponding regions in the two proteins lacking this domain, suggesting that it deteriorated in this pair beyond the recognition capabilities of HMMER3 with the given PFAM domain model and search parameters.
The N-terminal domains of Homo sapiens CENP-B proteins (corresponding to PFAM: CENP-B_N, PF04218, and HTH_Tnp_Tc5) are able to bind 17 bp centromeric DNA repeats and are involved in centromeric binding during chromosome segregation [67]. Potential convergent domestication of transposases with these domains has been proposed for mammalian and fission yeast lineages [68]. The PDC2 protein in budding yeast Saccharomyces cerevisiae, which regulates pyruvate decarboxylase transcription [69], is considered structurally homologous to human CENP-B and shares the same pair of domains, but has not been reported to play a role in centromere binding [68,70]. The CENP-B_N domain abbreviation is somewhat unfortunate, since the domain is also characteristic of numerous transposases with no known or expected role in centromeric function, notably the Tigger and PogoR families, and so presumably originated as a transposase domain [71]. As judged from searches of the UniProt database, the characteristic N-terminal CENP-B DNA-binding domain (PF04218) is detectable in a few protist clades (e.g., Rhodophyta and Cryptophyceae), but, among alveolates, only in Stentor coeruleus. BLASTP matches for all three proteins in GenBank are annotated either as Jerky or Tigger homologs (Jerky transposases belong to the Tigger transposase family [71]). Given that none of the B. stoltei DDE_1 domain proteins appears to have a complete catalytic triad, we think it is unlikely they are involved in transposition or IES excision.
Six MAC-encoded transposases containing the DDE_3 (PF13358) domain are present in B. stoltei, all of which are substantially upregulated in MAC development (Figure 8B). Five of these possess the complete DDE catalytic triad. The DDE_3 domain is characteristic of DDE transposases encoded by the Telomere-Bearing Element transposons (TBEs) of Oxytricha trifallax [72,73], which, despite being MIC genome-limited, are thought to be involved in IES excision, in place of a domesticated PiggyMac [74]. Other DDE_3-containing transposons, called Tec elements, are found in another spirotrichous ciliate, Euplotes crassus, but no role in genome reorganization has been established for these [75]. TBEs and Tec elements do not share obvious features with one another, other than both encoding a protein belonging to the IS630-Tc1 transposase (super)family [76]. All six DDE_3 genes in B. stoltei ATCC 30299 have at least 150× HiFi read coverage, consistent with the regions encoding these being bona fide macronuclear DNA rather than MIC DNA contaminants.
The B. stoltei DDE_3 domain transposases appear to be more closely related to the IS630 family than to Oxytricha TBE transposases and Euplotes Tec transposases, and thus were probably acquired independently of them. BLASTP searches of GenBank NR with these proteins as queries returned bacterial or fungal proteins for most of the hundred top hits, and among these the most common classifications are “IS630 family” transposases. One of the top hits is a MIC genome-encoded protein in Oxytricha trifallax with a DDE_3 domain which is not a TBE transposase (GenBank accession: KEJ83017.1). IS630 transposases diverge considerably from Tc1-Mariner transposases, and hence are considered an outgroup to them [77]. On the other hand, IS630-related transposases encoded by Anchois transposons have been detected in the Paramecium tetraurelia MIC genome [35]. Since all but one of the B. stoltei paralogs appear to possess complete catalytic triads, the possibility that they may excise a subset of IESs needs to be considered.
Among other ciliates with draft MAC genomes we examined, the PFAM IS1595- and MULE transposase-like domains (PF12762 and PF10551) have so far only been observed in the spirotrichs Oxytricha and Stylonychia [39,40]. DDE_Tnp_IS1595 domains are characteristic of the Merlin transposon superfamily and MULE is part of the Mutator transposon superfamily [54]. Currently no particular functions have been ascribed to these proteins in these ciliates, but their genes are substantially upregulated during development [39,56]. Both transposase-like domains are found in MAC-encoded proteins in B. stoltei ATCC 30299 and the underlying genes are also upregulated during Blepharisma MAC development (Figure 9A, B). The genes encoding DDE_Tnp_IS1595 and MULE proteins, like the genes encoding the DDE_1 and DDE_3 proteins, appear to lack the flanking terminal inverted repeats characteristic of a functional transposon, as identified by RepeatModeler, consistent with transposase domestication. Several members of both the IS1595 and MULE transposase families, including those that are upregulated during new MAC formation, also appear to have complete catalytic triads.
In addition to the transposases, we also detected a family (> 30 copies) of APE-type non-LTR retrotransposase genes with the two domains characteristic of such retrotransposases, i.e., an APE endonuclease domain (PFAM “exo_endo_phos_2”; PF14529) and a reverse transcriptase domain (PFAM “RVT_1”; PF00078), present on adjacent genes. Unlike the conventional transposase-derived genes in B. stoltei, the expression of all these genes throughout the conditions we examined is very low to negligible, and some of them also appear to be truncated pseudogenes (Data S3; workbook “RVT1 + exo_endo_phos_2”). Since it is necessary to understand the impact of the presence of IESs in some of these genes, their detailed analyses will be reported in the context of the Blepharisma stoltei MIC genome, where their nature and origin will be more evident.
Discussion
The genus Blepharisma represents one of the earliest diverging ciliate lineages, the heterotrichs, forming an outgroup to the best-studied and deeply divergent oligohymenophorean and spirotrich ciliates [7]. Blepharisma species thus provide an excellent vantage point to compare unique processes that have accompanied the evolution of nuclear and genomic dimorphism in ciliates, particularly the extensive genomic reorganization that occurs during development of the new macronucleus (MAC). Comparisons between Blepharisma and other ciliates also enable deeper consideration of the potential state of the ciliate common ancestor. The annotated draft B. stoltei ATCC 30299 MAC genome and associated transcriptomic data provide the basis for comparative studies of key characteristics of genome reorganization processes.
A minichromosomal architecture with pronounced telomere addition site variation
Ciliate MAC genome architectures span from longer DNA molecules in the oligohymenophoreans Paramecium tetraurelia and Tetrahymena thermophila, which are more similar to conventional eukaryotic chromosomes, to numerous, predominantly single-gene nanochromosomes of spirotrichs like Oxytricha trifallax and Euplotes crassus [3]. The B. stoltei ATCC 30299 MAC genome organization (like that of Paramecium bursaria [44]) lies between these two extremes, with many minichromosomes representing numerous alternative telomere addition sites. This form of natural sequence heterogeneity creates significant new challenges for generating single representative consensus sequences, because genome assembly methods assume chromosome boundaries are generally consistent, as they are in multicellular eukaryote models, down to the terminology used to describe them [78]. Fortunately, the reference Blepharisma MAC genome is largely homozygous, promoting contiguity in the final, 41 Mb assembly of 64 two-telomere-capped contigs. Nevertheless, plenty of room for refinement remains, including development of new ways to better assemble, describe and organize the constituents of genomes with such extreme sequence variability.
Despite the abundance of Blepharisma MAC genome telomeres, we did not detect a typical ncRNA gene in the MAC genome corresponding to the telomerase RNA component (TERC) of the ribonucleoprotein responsible for telomere synthesis. We suspect this is because ncRNAs are far more challenging to detect than protein-coding genes, and because a highly divergent ncRNA may have insufficient similarity to the handful of taxonomically restricted TERCs identified so far in oligohymenophorean and spirotrich ciliates and other eukaryotes. On the other hand, five homologs of POT1, the canonical telomere-binding protein, are present (Figure S4). One of these is highly upregulated when the new MAC genome is forming, and presumably, as in model ciliates, the DNA is fragmented and replicated, requiring further telomere synthesis and telomere-binding proteins.
A potentially catalytically active PiggyMac is present in Blepharisma
In current models of IES excision, MIC-limited sequences are demarcated by deposition of methylation marks on histones in an sRNA-dependent process [5]. These sequences are then recognized by domesticated transposases whose excision activity is supported by additional proteins, including those that recognize these marks [5]. In conjunction with MIC sequencing, and using the same time course we employed for RNA-seq in the present study, we have observed abundant, development-specific sRNA production in Blepharisma like that in other ciliates (Seah et al. in prep), and so we examined the expression of potential genes responsible for their biogenesis, along with other genes highly upregulated during new MAC formation. Homologs of proteins implicated in ciliate genome reorganization were present among the genes most highly differentially upregulated during new MAC development, notably including Dicer-like and Piwi proteins, and domesticated transposases.
Following this study, we will report detailed analyses of a draft B. stoltei ATCC 30299 MIC genome, which possesses abundant and typically TA-bound IESs (Seah et al. in prep). In the oligohymenophorean ciliates Tetrahymena and Paramecium there is a considerable body of evidence that PiggyBac homologs are responsible for IES excision [35,52,53,55,79]. The responsible IES excisases in the less-studied spirotrich ciliates, Oxytricha, Stylonychia and Euplotes, are not as evident. In Oxytricha the TBE transposases are considered to be involved in IES excision, but, unlike the MAC genome-encoded primary IES excisase (Tpb2) in Tetrahymena and all the Paramecium PiggyMacs and PiggyMac-likes, they are encoded by full-length transposons present in the MIC genome, but absent from the MAC [74]. The considerable heterogeneity in Oxytricha IES boundaries and “unscrambling” of a large subset of them [56], together with the observation of pronounced developmental upregulation of numerous additional MAC- and MIC-encoded transposases, raises the possibility that transposases other than those of TBEs could also be involved in IES excision [39,56]. Knowledge of IESs in other ciliates is sparse (primarily confined to the phyllopharyngean Chilodonella uncinata [80,81]), and, as far as we are aware, no specific IES excisases have been proposed for them.
Since the oligohymenophorean ciliate PiggyBac homologs are clear IES excisases, we sought and found eight homologs of these genes in the B. stoltei ATCC 30299 MAC genome. Blepharisma is the first ciliate genus aside from Tetrahymena and Paramecium in which such proteins have been reported. Additional searches revealed clear PiggyBac homologs in Condylostoma magnum, and a weaker pair of matches in Stentor coeruleus, suggesting that these are a common feature of heterotrich ciliates. Reminiscent of Paramecium tetraurelia, in which just one of the ten PiggyBac homologs, PiggyMac, has a complete DDD catalytic triad [55], a single B. stoltei PiggyBac homolog has a complete DDD catalytic triad. The gene encoding this protein in B. stoltei ATCC 30299 is highly upregulated during MAC development, together with two other homologs that lack a complete catalytic triad. As is characteristic of PiggyBac homologs, each of these three homologs also has a C-terminal, cysteine-rich, zinc finger domain, which most closely resembles those of Condylostoma magnum homologs. The organization of the heterotrich PiggyBac homolog zinc finger domains appears to be more similar to that of comparable domains of Homo sapiens PGBD2 and PGBD3 homologs than that of the zinc finger domains in Paramecium and Tetrahymena PiggyBac homologs (which are more similar to the Homo sapiens PGBD4 zinc fingers [58]).
650
651
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 16, 2021. ; https://doi.org/10.1101/2021.12.14.471607doi: bioRxiv preprint
In Paramecium, the discovery of multiple PiggyBac homologs (PiggyMac-likes) raised
questions about their role: aside from PiggyMac, they all have incomplete
catalytic triads, and are thus likely catalytically inactive, yet their gene knockdowns
nevertheless lead to pronounced IES retention [55]. It has therefore been proposed that the
PiggyMac-likes may function as heteromeric multi-subunit complexes in conjunction with PiggyMac
during DNA excision [55]. On the other hand, cryo-EM structures available for moth PiggyBac transposase
support a model in which these proteins function as a homodimeric complex in vitro [82].
Furthermore, the primary Tetrahymena PiggyBac, Tpb2, is able to perform cleavage in vitro,
without the assistance of other PiggyBac homologs [53]. In other eukaryotes, domesticated
PiggyBacs without complete catalytic triads are thought to be retained by virtue of useful
DNA-binding roles [83]. One possibility for such purely DNA-binding transposases in ciliates could be
in competitively regulating (taming) the excision of DNA by the catalytically active transposases.
Future experimental analyses of the BPgm and the BPgm-likes could aid in resolving these
conundrums and in understanding possible interactions between catalytically active and inactive
transposases.

Domesticated PiggyBac homologs are the main candidate IES excisases in Blepharisma

In addition to the PiggyBac homologs, we also found MAC genome-encoded transposases with
the PFAM domains “DDE_1”, “DDE_3”, “DDE_Tnp_IS1595” and “MULE” in Blepharisma. All the
genes encoding proteins with these domains lack flanking terminal repeats characteristic of
active transposons, suggesting these genes are further classes of domesticated transposases.
In Blepharisma and numerous other organisms, the DDE_1 domains co-occur with CENPB
domains. Two such proteins represent totally different proposed exaptations in mammals
(centromere-binding protein) and fission yeast (regulatory protein) [68-70]. Given the great
evolutionary distances involved, there is no specific reason to expect that the Blepharisma
homologs have either function. None of the three proteins with co-occurring DDE_1 and CENPB
domains has a complete catalytic triad, making it unlikely that these are active transposases or
IES excisases, though all three are noticeably upregulated during MAC development. Six
proteins with the PFAM domain DDE_3 are also encoded by Blepharisma MAC genes, of which
five possess a complete catalytic triad. DDE_3 domains are also characteristic of TBE
transposases in Oxytricha and Tec transposases in Euplotes. All the “DDE_3” protein genes are
upregulated during conjugation in B. stoltei, and are expressed at particularly high levels during
development of the new MAC. A number of DDE_Tnp_IS1595 and MULE domain-containing
proteins have complete catalytic triads and also show pronounced upregulation during B. stoltei
MAC development.

Upon excision, classical cut-and-paste transposases in eukaryotes typically leave behind
additional bases, notably including those of the target-site duplication that arose when they were
inserted, forming a “footprint” [84]. PiggyBac homologs are unique in performing precise,
“seamless” excision in eukaryotes [85], conserving the number of bases at the site of
transposon insertion after excision, a property that makes them popular for genetic engineering
[82]. Tetrahymena Tpb2 is the one exception among PiggyBac homologs, being associated with
imprecise excision [53]. Since intragenic IESs are abundant in Blepharisma,
as in Paramecium and unlike in Tetrahymena, it is essential that these are excised precisely, without
addition or removal of nucleotides, which, at best, could result in translation of additional amino
acids and, at worst, translation frameshifts.

Though there are clearly numerous additional domesticated transposases with complete
catalytic triads whose genes are substantially upregulated during Blepharisma
development, whether they are capable of excision, and whether this is precise, needs to be
established. Tetrahymena has distinct domesticated transposases that excise different subsets
of IESs, namely those that are predominant, imprecisely excised and intergenic (by Tpb2) [53],
versus those that are rare, precisely excised and intragenic (by Tpb1 and Tpb6) [79,86]. If the
additional Blepharisma domesticated transposases are still capable of excision, but not of a
precise form, we could envisage their involvement in excision of a subset of the numerous
intergenic IESs. Thus, while we cannot rule out the involvement of other possible domesticated
transposases and PiggyMac-likes, B. stoltei PiggyMac is currently the primary candidate IES
excisase.

A possible PiggyBac homolog ciliate common ancestor

In light of the discovery of PiggyMacs in heterotrichous ciliates, we investigated whether the last
common ancestor of ciliates also possessed a PiggyMac. Phylogenetic analyses, together with
tree topology tests, indicate that the oligohymenophorean and heterotrichous PiggyMacs and
PiggyMac-likes form a monophyletic clade. The lack of PiggyBac homologs in other ciliate
classes such as the Spirotrichea, and potentially the oligohymenophorean Ichthyophthirius
multifiliis (with the caveat of apparent lower genome completeness), raises the question of whether
the PiggyMacs were lost in these lineages or were gained independently from the same source
by both heterotrichs and a subset of oligohymenophoreans. We think the former is more likely.
However, the alternative cannot be completely dismissed, because non-model ciliates whose
genome assembly quality allows reliable gene and domain annotations have only been
sparsely sampled.

Future directions

The B. stoltei ATCC 30299 MAC genome, together with a corresponding MIC genome (Seah et
al., in prep.), paves the way for future investigations of a peculiar, direct pathway to new MAC
genome development which skips the upstream complexity of meiotic and fertilization-associated
nuclear developmental processes of the standard pathway [9]. The pair of B. stoltei
strains we have are both now low frequency selfers, in which the conventional, indirect MAC
development pathway dominates. Comparisons with fresh, high frequency Blepharisma selfers
collected from the wild will facilitate comparative gene expression analyses with the direct MAC
development pathway, which will assist in distinguishing expression upregulation due to meiotic
and mitotic processes preceding indirect new MAC development. As a representative of one of
the two deepest branching ciliate lineages, the assembled somatic genome will facilitate
investigations beyond those reported here into the enigmatic origins of nuclear and genomic
dualism within single cells.

Materials and Methods

Strains and localities

The strains used and their original isolation localities were: Blepharisma stoltei ATCC 30299,
Lake Federsee, Germany [24]; Blepharisma stoltei HT-IV, Aichi prefecture, Japan; Blepharisma
japonicum R1072, from an isolate from Bangalore, India [22].

Cell cultivation, harvesting and cleanup

For genomic DNA isolation, B. stoltei ATCC 30299 and HT-IV cells were cultured in Synthetic
Medium for Blepharisma (SMB) [87] at 27°C. Blepharismas were fed Chlorogonium elongatum
grown in Tris-acetate phosphate (TAP) medium [88] at room temperature. Chlorogonium cells
were pelleted at 1500 g at room temperature for 3 minutes to remove most of the TAP medium,
and resuspended in 50 mL SMB. 50 mL of dense Chlorogonium was used to feed 1 litre of
Blepharisma culture once every three days.

Blepharisma stoltei ATCC 30299 and HT-IV cells used for RNA extraction were cultured in
Lettuce medium inoculated with Enterobacter aerogenes and maintained at 25°C [89].

Blepharisma cultures were concentrated by centrifugation in pear-shaped flasks at 100 g for 2
minutes using a Hettich Rotanta 460 centrifuge with swing-out buckets. Pelleted cells were
washed with SMB and centrifuged again at 100 g for 2 minutes. The washed pellet was then
transferred to a cylindrical tube capped with a 100 µm-pore nylon membrane at the base and
immersed in SMB to filter residual algal debris from the washed cells. The cells were allowed to
diffuse through the membrane overnight into the surrounding medium. The next day, the
cylinder with the membrane was carefully removed while attempting to minimize dislodging any
debris collected on the membrane. Cell density after harvesting was determined by cell counting
under the microscope.

DNA isolation, library preparation and sequencing

B. stoltei macronuclei were isolated by sucrose gradient centrifugation [4]. DNA was isolated
with a Qiagen 20/G genomic-tip kit according to the manufacturer’s instructions. Purified DNA
from the isolated MACs was fragmented, size selected and used to prepare libraries according
to standard PacBio HiFi SMRTbell protocols. The libraries were sequenced in circular
consensus mode to generate HiFi reads.

Total genomic DNA from B. stoltei HT-IV and B. stoltei ATCC 30299 was isolated with the
SigmaAldrich GenElute Mammalian genomic DNA kit. A sequencing library was prepared with a
NEBnext FS DNA Library Prep Kit for Illumina and sequenced on an Illumina HiSeq 3000
sequencer, generating 150 bp paired-end reads.

Total genomic DNA from B. japonicum was isolated with the Qiagen MagAttract HMW DNA kit.
A long-read PacBio sequencing library was prepared using the SMRTbell® Express Template
Preparation Kit 2.0 according to the manufacturer’s instructions and sequenced on a PacBio
Sequel platform with 1 SMRT cell. Independently, total genomic DNA from B. japonicum was
isolated with the SigmaAldrich GenElute Mammalian genomic DNA kit, and a sequencing
library was prepared with the TruSeq Nano DNA Library Prep Kit (Illumina) and sequenced on
an Illumina NovaSeq6000 to generate 150 bp paired-end reads.

Gamone 1/Cell-Free Fluid (CFF) isolation and conjugation activity assay

B. stoltei ATCC 30299 cells were cultured, harvested and concentrated to a density of 2000
cells/mL according to the procedure described in “Cell cultivation, harvesting and cleanup”. This
concentrated cell culture was incubated overnight at 27°C. The next day, the cells were
harvested, and the supernatant was collected and kept at 4°C at all times after extraction. The
supernatant was then filtered through a 0.22 µm-pore filter. BSA (10 mg/mL) was added to
produce the final CFF at a final BSA concentration of 0.01%.

To assess the activity of the CFF, serial dilutions of the CFF were made to obtain the gamone
activity in terms of units (U) [90]. The activity of the isolated CFF was 210 U.

Conjugation time course and RNA isolation for high-throughput sequencing

B. stoltei cells of the complementary strains, ATCC 30299 and HT-IV, were cultivated and
harvested by gentle centrifugation to achieve a final cell concentration of 2000 cells/mL for each
strain. Non-gamone-treated ATCC 30299 (A1) and HT-IV (H1) cells were collected (time point:
-3 hours). Strain ATCC 30299 cells were then treated with synthetic gamone 2 (final
concentration 1.5 µg/mL) and strain HT-IV cells were treated with cell-free fluid with a gamone 1
activity of ~210 U/mL for three hours (Figure S8).

Homotypic pair formation in both cultures was checked after three hours. More than 75% of the
cells in both cultures formed homotypic pairs. At this point the samples A2 (ATCC 30299) and
H2 (HT-IV) were independently isolated for RNA extraction as gamone-treated control cells just
before mixing. For the rest of the culture, homotypic pairs in both cultures were separated by
pipetting them gently with a wide-bore pipette tip. Once all pairs had been separated, the two
cultures were mixed together. This constitutes the experiment’s 0-h time point. The conjugating
culture was observed and samples were collected for RNA isolation or cell fixation at 2 h, 6 h, 14 h,
18 h, 22 h, 26 h, 30 h and 38 h (Figure S8). Further details of the sample staging approach are
described in [9] and [51]. At each time point, including samples A1, H1, A2 and H2, 7 mL of
culture was harvested for RNA extraction using Trizol. The total RNA obtained was then
separated into a small RNA fraction (< 200 nt) and a fraction with RNA fragments > 200 nt using
the Zymo RNA Clean and Concentrator-5 kit according to the manufacturer's instructions.
RNA-seq libraries were prepared by BGI according to their standard protocols and sequenced on a
BGISeq 500 instrument.

Separate 2 mL aliquots of cells at each time point for which RNA was extracted were
concentrated by centrifuging gently at 100 rcf. 50 µL of the concentrated cells were fixed with
Carnoy’s fixative (ethanol:acetic acid, 6:1), stained with DAPI and imaged to determine the state
of nuclear development [9].

Cell fixation and imaging

B. stoltei cells were harvested as above (“Cell cultivation”), and fixed with an equal volume of
“ZFAE” fixative, containing zinc sulfate (0.25 M, Sigma Aldrich), formalin, glacial acetic acid and
ethanol (Carl Roth), freshly prepared by mixing in a ratio of 10:2:2:5. Fixed cells were pelleted
(1000 g; 1 min), resuspended in 1% TritonX-100 in PHEM buffer to permeabilize (5 min; room
temperature), pelleted and resuspended in 2% (w/v) formaldehyde in PHEM buffer to fix further
(10 min; room temp.), then pelleted and washed twice with 3% (w/v) BSA in TBSTEM buffer
(~10 min; room temp.). For indirect immunofluorescence, washed cells were incubated with
primary antibody rat anti-alpha tubulin (Abcam, ab6161; 1:100 dilution in 3% w/v BSA/TBSTEM;
60 min; room temp.) then secondary antibody goat anti-rat IgG H&L labeled with Alexa Fluor 488
(Abcam, ab150157, 1:500 dilution in 3% w/v BSA/TBSTEM; 20 min; room temp.). Nuclei were
counterstained with DAPI (1 µg/mL) in 3% (w/v) BSA/TBSTEM. A z-stack of images was
acquired using a confocal laser scanning microscope (Leica TCS SP8), equipped with an HC PL
APO 40× 1.30 Oil CS2 objective, one photomultiplier tube and three HyD detectors, for DAPI
(405 nm excitation, 420-470 nm emission) and Alexa Fluor 488 (488 nm excitation, 510-530 nm
emission). Scanning was performed in sequential exposure mode. Spatial sampling was
achieved according to Nyquist criteria. ImageJ (Fiji) [91] was used to adjust image contrast and
brightness and overlay the DAPI and Alexa Fluor 488 channels. The z-stack was temporally
color-coded.

For a nuclear 3D reconstruction (Figure 1B), cells were fixed in 1% (w/v) formaldehyde and
0.25% (w/v) glutaraldehyde. Nuclei were stained with Hoechst 33342 (Invitrogen) (5 µM in the
culture media), and imaged with a confocal laser scanning microscope (Zeiss, LSM780)
equipped with an LD C-Apochromat 40×/1.1 W Korr objective and a 32 channel GaAsP array
detector, with 405 nm excitation and 420-470 nm emission. Spatial sampling was achieved
according to Nyquist criteria. The IMARIS (Bitplane) software v8.0.2 was used for
three-dimensional reconstructions and contrast adjustments.

For scanning electron microscopy (SEM), cells were fixed in 0.5% osmium tetroxide/2%
formaldehyde/2.5% glutaraldehyde in 0.06x PHEM for 5-10 minutes at room temperature. Cells
were post-fixed with 2% formaldehyde/2.5% glutaraldehyde in 0.06x PHEM for 4-6 hours at
room temperature. Subsequently, samples were dehydrated in a graded ethanol series with 1
day per step followed by critical point drying (Polaron) with CO2. Finally, the cells were
sputter-coated with a 4 nm thick layer of platinum (CCU-010, Safematic) and examined with a field
emission scanning electron microscope (Regulus 8230, Hitachi High Technologies) at an
accelerating voltage of 3 kV.

Genome assembly

Two MAC genome assemblies for B. stoltei ATCC 30299 (70× and 76× coverage) were
produced with Flye (version 2.7-b1585) [92] for the two separate PacBio Sequel II libraries
(independent replicates) using default parameters and the switches: --pacbio-hifi -g 45m. The
approximate genome assembly size was chosen based on preliminary Illumina genome
assemblies of approximately 40 Mb. Additional assemblies using the combined coverage (145×)
of the two libraries were produced using either Flye version 2.7-b1585 or 2.8.1-b1676, and the
same parameters. Two rounds of extension and merging were then used, first comparing the
70× and 76× assemblies to each other, then comparing the 145× assembly to the former
merged assembly. Assembly graphs were all relatively simple, with few tangles to be resolved
(Figure S7). Minimap2 [93] was used for pairwise comparison of the assemblies using the
parameters: -x asm5 --frag=yes --secondary=no, and the resultant aligned sequences were
visually inspected and manually merged or extended where possible using Geneious (version
2020.1.2) [94].

Visual inspection of read mapping to the combined assembly was then used to trim off contig
ends where there was little correspondence between the assembly consensus and the mapped
reads, which we classify as “cruft”. Read mapping to cruft regions was often lower or uneven,
suggestive of repeats. Alternatively, these features could be due to trace MIC sequences, or
sites of alternative chromosome breakage during development which lead to sequences that are
neither purely MAC nor MIC. A few contigs with similarly dubious mapping of reads at internal
locations, which were also clear sites of chromosome fragmentation (evident from abundant
telomere-bearing reads in the vicinity), were split apart and trimmed back as for the contig ends.
Telomere-bearing reads mapped to the non-trimmed region nearest to the trimmed site were
then used to define contig ends, adding representative telomeric repeats from one of the
underlying sequences mapped to each of the ends. The main genome assembly with gene
predictions can be obtained from the European Nucleotide Archive (ENA) (PRJEB40285;
accession GCA_905310155). “Cruft” sequences are also available from the same project
accession.

Two separate assemblies were generated for Blepharisma japonicum. A genome assembly for
Blepharisma japonicum strain R1072 was generated from Illumina reads, using the SPAdes
genome assembler (v3.14.0) [95]. An assembly with PacBio Sequel long reads was produced
with Ra (v0.2.1) [96], which uses the Overlap-Layout-Consensus paradigm. The assembly
produced with Ra was more contiguous, with 268 contigs, in comparison to 1510 contigs in the
SPAdes assembly, and was chosen as the reference assembly for Blepharisma japonicum
(ENA accession: ERR6474383).

Condylostoma magnum genomic reads (study accession PRJEB9019) from a previous study
[97] were reassembled to improve contiguity and remove bacterial contamination. Reads were
trimmed with bbduk.sh from the BBmap package v38.22
(https://sourceforge.net/projects/bbmap/), using minimum PHRED quality score 2 (both ends)
and k-mer trimming for Illumina adapters and Phi-X phage sequence (right end), retaining only
reads ≥25 bp. Trimmed reads were error-corrected and reassembled with SPAdes v3.13.0 [95]
using k-mer values 21, 33, 55, 77, 99. To identify potential contaminants, the unassembled
reads were screened with phyloFlash v3.3b1 [98] against SILVA v132 [99]; the coding density
under the standard genetic code and prokaryotic gene model was also estimated using
Prodigal v2.6.3 [100]. Plotting the coverage vs. GC% of the initial assembly showed that most of
the likely bacterial contigs (high prokaryotic coding density, lower coverage, presence of
bacterial SSU rRNA sequences) had ≥40% GC, so we retained only contigs with <40% GC as
the final C. magnum genome bin. The final assembly is available from the ENA bioproject
PRJEB48875 (accession GCA_920105805).
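The GC-based binning step for C. magnum can be illustrated with a minimal Python sketch. It assumes that contigs are already loaded as a mapping from name to sequence; the function names `gc_percent` and `select_bin` are hypothetical and do not belong to any of the tools cited above. Coverage and prokaryotic coding density, which were also inspected, are not modeled here.

```python
def gc_percent(seq):
    """GC content (%) of a sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def select_bin(contigs, max_gc=40.0):
    """Keep only contigs whose GC content is strictly below max_gc (%)."""
    return {name: seq for name, seq in contigs.items()
            if gc_percent(seq) < max_gc}


# Toy sequences for illustration only (not real contigs).
toy_contigs = {
    "low_gc_contig": "ATATATGC" * 10,    # 25% GC, retained
    "high_gc_contig": "GCGCGCAT" * 10,   # 75% GC, discarded
}
retained = select_bin(toy_contigs)
```

The strict inequality mirrors the text, in which contigs with ≥40% GC were treated as likely bacterial.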

All assemblies were inspected with the quality assessment tool QUAST [101].

Variant calling

Illumina total genomic DNA-seq libraries for B. stoltei strains ATCC 30299 (ENA accession:
ERR6061285) and HT-IV (ERR6064674) were mapped to the ATCC 30299 reference assembly
with bowtie2 v2.4.2 [102]. Alignments were tagged with the MC tag (CIGAR string for mate/next
segment) using samtools [103] fixmate. The BAM file was sorted and indexed, read groups were
added with bamaddrg (commit 9baba65, https://github.com/ekg/bamaddrg), and duplicate reads
were removed with Picard MarkDuplicates v2.25.1 (http://broadinstitute.github.io/picard/).
Variants were called from the combined BAM file with freebayes v1.3.2 [104] in diploid mode,
with maximum coverage 1000 (option -g). The resultant VCF file was combined and indexed
with bcftools v1.12 [103], then filtered to retain only SNPs with quality score > 20, and at least
one alternate allele.
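The final filtering criterion can be sketched in plain Python on single VCF data lines. This is an illustration only and not the bcftools command that was used: the function name `is_retained_snp` is hypothetical, a SNP is assumed to mean a single-base REF with single-base ALT alleles, and "at least one alternate allele" is interpreted here as a non-missing ALT field.

```python
def is_retained_snp(vcf_line, min_qual=20.0):
    """Return True for a VCF data line describing a SNP with QUAL > min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual = fields[3], fields[4], fields[5]
    # A missing ALT or QUAL value is written as "." in VCF.
    if alt in (".", "") or qual == ".":
        return False
    alts = alt.split(",")
    is_snp = len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alts)
    return is_snp and float(qual) > min_qual


# Toy records for illustration only (contig names and values are made up).
records = [
    "contig_1\t100\t.\tA\tG\t35.2\t.\t.\tGT\t0/1",   # SNP, QUAL > 20
    "contig_1\t200\t.\tC\tT\t12.0\t.\t.\tGT\t0/1",   # SNP, low quality
    "contig_1\t300\t.\tAT\tA\t50.0\t.\t.\tGT\t0/1",  # deletion, not a SNP
    "contig_1\t400\t.\tG\t.\t60.0\t.\t.\tGT\t0/0",   # no alternate allele
]
kept = [r for r in records if is_retained_snp(r)]
```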

Annotation of alternative telomere addition sites

Alternative telomere addition sites (ATASs) were annotated by mapping PacBio HiFi reads to
the curated reference MAC assembly described above, using minimap2 and the following flags:
-x asm20 --secondary=no --MD. We expect reads representing alternative telomere additions to
have one portion mapping to the assembly (excluding telomeric regions), with the other portion
containing telomeric repeats being soft-clipped in the BAM record. For each mapped read with a
soft-clipped segment, we extracted the clipped sequence, and the coordinates and orientation of
the clip relative to the reference. We searched for ≥ 24 bp tandem direct repeats of the telomere
unit (i.e., ≥3 repeats of the 8 bp unit) in the clipped segment with NCRF v1.01.02 [105], which
can detect tandem repeats in the presence of noise, e.g., from sequencing error. The orientation
of the telomere sequence, the distance from the end of the telomeric repeat to the clip junction
(‘gap’), and the number of telomere-bearing reads vs. total mapped reads at each junction were
also recorded. Junctions with zero gap between telomere repeat and clip junction were
annotated as ATASs. The above procedure was implemented in the MILTEL module of the
software package BleTIES v0.1.3 [106].
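The logic of the soft-clip search can be sketched in Python as follows. This is a simplified illustration and not the MILTEL implementation: the function names are hypothetical, the 8 bp unit used in the example is a placeholder supplied by the caller, and exact regular-expression matching replaces the noise-tolerant search performed by NCRF.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def soft_clips(cigar, seq):
    """Return [(side, clipped_sequence)] for the soft-clipped ends of a read."""
    # Hard-clipped bases (H) are absent from SEQ, so they are ignored here.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    clips = []
    if ops and ops[0][1] == "S":
        clips.append(("left", seq[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S":
        clips.append(("right", seq[len(seq) - ops[-1][0]:]))
    return clips


def telomere_gap(clip, side, unit, min_copies=3):
    """Distance from the clip junction to the nearest exact telomeric run.

    Returns None if no run of at least min_copies tandem units is found.
    All rotations of the unit and of its reverse complement are tried.
    """
    gaps = []
    for u in (unit, revcomp(unit)):
        for i in range(len(u)):
            rot = u[i:] + u[:i]
            for m in re.finditer("(?:%s){%d,}" % (rot, min_copies), clip):
                # The junction is the clip end that abuts the aligned bases.
                gaps.append(len(clip) - m.end() if side == "left" else m.start())
    return min(gaps) if gaps else None


# Toy example: placeholder unit and a made-up read with a left soft clip.
unit = "CCCTAACA"
read = unit * 3 + "ACGTACGTAC"
for side, clip in soft_clips("24S10M", read):
    gap = telomere_gap(clip, side, unit)
    is_atas = gap == 0   # zero gap between repeat and junction
```

Under the criterion in the text, only clips whose telomeric run directly abuts the junction (gap of zero) would be annotated as ATASs.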

MILTEL output was processed with Python scripts depending on Biopython [107], pybedtools
[108], Bedtools [109], and Matplotlib [110], to summarize statistics of junction sequences and
telomere permutations at ATAS junctions, and to extract genomic sequences flanking ATASs for
sequence logos. Logos were drawn with Weblogo v3.7.5 [111], with sequences oriented such
that the telomere would be added on the 5’ end of the ATAS junctions.

To calculate the expected minichromosome length, we assumed that ATASs were independent
and identically distributed in the genome following a Poisson distribution. About 47×10³ ATASs
were annotated, supported on average by a single read. Given a genome of 42 Mbp at 145×
coverage, the expected rate of encountering an ATAS is 47×10³ / (145 × 42 Mbp), so the
distance between ATASs (i.e., the minichromosome length) is exponentially distributed with
expectation (145 × 42 Mbp) / (47×10³) ≈ 130 kbp.
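The arithmetic behind this estimate can be reproduced with a few lines of Python. The function name is hypothetical; the inputs are the rounded values quoted in the text.

```python
def expected_minichromosome_length(n_atas, genome_size_bp, coverage):
    """Mean spacing between ATASs (bp) under the Poisson assumption."""
    # Rate of encountering an ATAS per sequenced base.
    rate = n_atas / (coverage * genome_size_bp)
    # Spacing between events of a Poisson process is exponential, mean 1/rate.
    return 1.0 / rate


mean_len = expected_minichromosome_length(47e3, 42e6, 145)
print(f"{mean_len / 1e3:.0f} kbp")
```

With these inputs the mean is approximately 129.6 kbp, which rounds to the 130 kbp stated in the text.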

RNA-seq read mapping

To permit correct mapping of tiny introns, RNA-seq data was mapped to the B. stoltei ATCC
30299 MAC genome using a version of HISAT2 [112] with modified source code, with the static
variable minIntronLen in hisat2.cpp lowered to 9 from 20 (change available in the HISAT2 GitHub
fork: https://github.com/Swart-lab/hisat2/; commit hash 86527b9). HISAT2 was run with default
parameters plus the options --min-intronlen 9 --max-intronlen 500. It should be noted that
RNA-seq reads from timepoints in which B. stoltei ATCC 30299 and B. stoltei HT-IV cells were mixed
together were only mapped to the former genome assembly, and so reads for up to three alleles
may map to each of the genes in this assembly.

Genetic code prediction

We used the program PORC (Prediction Of Reassigned Codons; available from
https://github.com/Swart-lab/PORC), previously written to predict genetic codes in protist
transcriptomes [97], to predict the B. stoltei genetic code. This program was used to translate the
draft B. stoltei ATCC 30299 genome assembly in all six frames (with the standard genetic
code). As in the program FACIL [113] that inspired PORC, the frequencies of amino acids in
PFAM (version 34.0) protein domain profiles aligned to the six-frame translation by HMMER
3.1b2 [114] (default search parameters; domains used for prediction with conditional E-values <
1e-20), and correspondingly also to the underlying codons, are used to infer the most likely amino
acid encoded by each codon (Figure S5).
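The final inference step can be illustrated with a toy Python sketch. It assumes that the profile alignments have already been reduced to (codon, profile amino acid) pairs; the function name `infer_codon_table` and the input format are hypothetical, and this is not PORC's actual code. The example data are invented.

```python
from collections import Counter, defaultdict


def infer_codon_table(aligned_pairs, min_obs=1):
    """Assign to each codon the amino acid most often aligned to it.

    aligned_pairs: iterable of (codon, amino_acid) tuples, where amino_acid
    is the residue at the aligned position of the protein domain profile.
    """
    counts = defaultdict(Counter)
    for codon, aa in aligned_pairs:
        counts[codon.upper()][aa] += 1
    return {codon: ctr.most_common(1)[0][0]
            for codon, ctr in counts.items()
            if sum(ctr.values()) >= min_obs}


# Invented toy observations: a codon that is a stop in the standard code
# but aligns mostly to one amino acid would be inferred as reassigned.
pairs = [("TGA", "W")] * 8 + [("TGA", "C")] * 2 + [("AAA", "K")] * 5
table = infer_codon_table(pairs)
```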

Gene prediction

We created a wrapper program, Intronarrator, to predict genes in Blepharisma and other
heterotrichs, accommodating their tiny introns. Intronarrator can be downloaded and installed
together with dependencies via Conda from GitHub (https://github.com/Swart-lab/Intronarrator).
Intronarrator directly infers introns from spliced RNA-seq reads mapped by HISAT2 from the
entire developmental time course we generated. RNA-seq reads densely cover almost the entire
Blepharisma MAC genome, aside from intergenic regions, and most potential protein-coding
genes (Figure 5A). After predicting the introns and removing them to create an intron-minus
genome, Intronarrator runs AUGUSTUS (version 3.3.3) using its intronless model. It then adds
back the introns to the intronless gene predictions to produce the final gene predictions.
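The remove-then-restore strategy can be illustrated with a small coordinate-mapping sketch in Python. The function names and the 1-based inclusive intron coordinates are assumptions made for illustration; Intronarrator's own implementation may differ.

```python
def remove_introns(seq, introns):
    """Return seq with the given 1-based inclusive (start, end) introns removed.

    Introns are assumed to be non-overlapping.
    """
    pieces, prev_end = [], 0
    for start, end in sorted(introns):
        pieces.append(seq[prev_end:start - 1])
        prev_end = end
    pieces.append(seq[prev_end:])
    return "".join(pieces)


def to_original_coord(pos, introns):
    """Map a 1-based position in the intron-minus sequence back to the original."""
    shift = 0
    for start, end in sorted(introns):
        if pos + shift >= start:
            shift += end - start + 1   # skip over this intron
        else:
            break
    return pos + shift


# Toy sequence with two lower-case "introns" at positions 3-6 and 9-12.
original = "AAgtagCCgtagTT"
introns = [(3, 6), (9, 12)]
intron_minus = remove_introns(original, introns)
```

Feature coordinates predicted on the intron-minus sequence can then be shifted back with `to_original_coord`, after which the intron intervals themselves are re-inserted into the gene models.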
Introns are inferred from “CIGAR” string annotations in mapped RNA-seq BAM files, using the
regular expression “[0-9]+M([0-9][0-9])N[0-9]+M” to select spliced reads. For intron inference we
only used primary alignments with: MAPQ >= 10; just a single “N”, indicating one potential
intron, per read; and at least 6 mapped bases flanking both the 5’ and 3’ intron boundaries (to
limit spurious chance matches of a few bases that might otherwise lead to incorrect intron
prediction). The most important parameters for Intronarrator are a cut-off of 0.2 for the fraction of
spliced reads covering a potential intron, and a minimum of 10 spliced reads to call an
intron. The splicing fraction cut-off was chosen based on the overall distribution of splicing
(Figure S6A-C). From our visual examination of mapped RNA-seq reads and gene predictions,
candidate introns with splicing fractions below this cut-off were typically “cryptic” excision
events [115] which remove potentially essential protein-coding sequences, rather than genuine
introns. Intronarrator classifies an intron as sense (7389 in total, excluding alternative splicing)
when the majority of reads (irrespective of splicing) mapping to the intron are on the same strand,
and antisense (554 in total) when they are not. In rare cases of overlapping alternative intron
splicing, the most frequently spliced intron was chosen.

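The read filters and calling thresholds described above can be sketched in a few lines. This is a simplified illustration, not Intronarrator's implementation: the function names are hypothetical, `ref_start` is assumed to be a 0-based leftmost alignment position, and treating the 0.2 cut-off as inclusive is an assumption.

```python
import re

SPLICED = re.compile(r"[0-9]+M([0-9][0-9])N[0-9]+M")  # regex quoted in the text
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_MAPQ = 10
MIN_FLANK = 6
MIN_SPLICED_READS = 10
MIN_SPLICE_FRACTION = 0.2


def intron_from_read(ref_start, cigar, mapq, is_primary=True):
    """Return the intron (start, end), 0-based end-exclusive, or None."""
    if not is_primary or mapq < MIN_MAPQ or cigar.count("N") != 1:
        return None
    if not SPLICED.search(cigar):
        return None  # no M-N-M pattern with a two-digit intron length
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    pos = ref_start
    for i, (length, op) in enumerate(ops):
        if op == "N":
            # The regex guarantees M operations on both sides of the N.
            if ops[i - 1][0] < MIN_FLANK or ops[i + 1][0] < MIN_FLANK:
                return None
            return (pos, pos + length)
        if op in "MD=X":  # operations that consume reference bases
            pos += length
    return None


def call_intron(n_spliced, n_covering):
    """Apply the minimum spliced-read count and splicing-fraction cut-offs."""
    if n_spliced < MIN_SPLICED_READS or n_covering == 0:
        return False
    return n_spliced / n_covering >= MIN_SPLICE_FRACTION
```

For example, a primary alignment starting at position 100 with CIGAR `20M15N30M` and MAPQ 60 yields the candidate intron (120, 135), whereas `3M15N30M` is rejected because the 5’ flank is shorter than 6 bases.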
To eliminate spurious prediction of protein-coding genes overlapping ncRNA genes, we also
incorporated ncRNA prediction in Intronarrator. Infernal [38] (default parameters; E-value < 1e-6)
was used to predict a restricted set of conserved ncRNA models (i.e., tRNAs, rRNAs, SRP, and
spliceosomal RNAs) from RFAM 14.0 [116]. These ncRNAs were hard-masked (with “N”
characters) before AUGUSTUS gene prediction. Both Infernal ncRNA predictions (excluding
tRNAs) and tRNAscan-SE 2.0 [117] (default parameters) tRNA predictions are annotated in the
B. stoltei ATCC 30299 assembly deposited in the European Nucleotide Archive.

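Hard-masking amounts to overwriting the predicted ncRNA intervals with “N” so that the gene predictor cannot place coding sequence there. A minimal sketch, assuming 0-based end-exclusive intervals; the function name and toy coordinates are hypothetical:

```python
def hard_mask(seq, intervals):
    """Replace each (start, end) interval of seq with 'N' characters."""
    masked = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so the length never changes.
        start, end = max(0, start), min(len(seq), end)
        if start < end:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy example: mask a hypothetical ncRNA hit at positions 2-4.
masked_contig = hard_mask("ACGTACGTAC", [(2, 5)])
```

Because masking preserves sequence length, coordinates of genes predicted on the masked sequence remain valid on the unmasked assembly.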
Since we found that Blepharisma stoltei, like Blepharisma japonicum [97], uses a non-standard
genetic code, with the UGA codon translated as tryptophan, gene predictions use “The Mold,
Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
(transl_table=4)” from the NCBI genetic codes. The default AUGUSTUS gene prediction
parameters override the alternative (mitochondrial) start codons, other than ATG, permitted by
NCBI genetic code 4, so all predicted B. stoltei gene coding sequences begin with ATG.

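The effect of NCBI translation table 4 can be shown with a short, self-contained sketch. The amino acid string follows the NCBI compact table format (codons ordered TCAG at each position), and table 4 differs from the standard code only in reading TGA as tryptophan. The function name `translate_table4` is illustrative.

```python
from itertools import product

BASES = "TCAG"
# NCBI transl_table=4 amino acids, in TTT, TTC, TTA, TTG, TCT, ... order.
AAS_TABLE4 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE4 = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AAS_TABLE4)
}


def translate_table4(cds):
    """Translate a coding sequence with genetic code 4 (TGA = W)."""
    cds = cds.upper().replace("U", "T")
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    # Codons containing ambiguous bases (e.g. N) are translated as X.
    return "".join(CODON_TABLE4.get(codon, "X") for codon in codons)


protein = translate_table4("ATGTGGTGATAA")
```

Here both TGG and TGA are read as tryptophan, and TAA remains a stop codon, so `protein` is `MWW*`.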
RNA-seq read mapping relative to gene predictions of Contig_1 of B. stoltei ATCC 30299 was
visualized with pyGenomeTracks [118].
Assessment of genome completeness
A BUSCO (version 4.0.2) [119] analysis of the assembled MAC genomes of B. stoltei and B.
japonicum was performed on the set of predicted proteins (BUSCO mode -prot) using the
BUSCO Alveolata database. The completeness of the Blepharisma genomes was compared to
the protein-level BUSCO analysis of the published genome assemblies of the ciliates T.
thermophila, P. tetraurelia, S. coeruleus and I. multifiliis (Figure S1).