Accessing the genomic information of unculturable oceanic picoeukaryotes by combining multiple single cells

Article in Scientific Reports 7:41498 · January 2017
DOI: 10.1038/srep41498
Abstract
Pico-sized eukaryotes play key roles in the functioning of marine ecosystems, but we still have limited knowledge of their ecology and evolution. The MAST-4 lineage is of particular interest, since it is widespread in surface oceans, presents ecotypic differentiation and has defied culturing efforts so far. Single cell genomics (SCG) is a promising tool to retrieve genomic information from these uncultured organisms. However, SCG is based on whole genome amplification, which normally introduces amplification biases that limit the amount of genomic data retrieved from a single cell. Here, we increase the recovery of genomic information from two MAST-4 lineages by co-assembling short reads from multiple Single Amplified Genomes (SAGs) belonging to evolutionarily closely related cells. We found that complementary genomic information is retrieved from different SAGs, generating a co-assembly that features >74% genome recovery, against about 20% when SAGs are assembled individually. Even though this approach is not aimed at generating high-quality draft genomes, it gives access to the genomic information of microbes that would otherwise remain unreachable. Since most picoeukaryotes remain uncultured, our work serves as a proof-of-concept that can be applied to other taxa in order to extract genomic data and address new ecological and evolutionary questions.
Scientific Reports | 7:41498 | DOI: 10.1038/srep41498
www.nature.com/scientificreports
Jean-François Mangot1, Ramiro Logares1, Pablo Sánchez1, Fran Latorre1,
Yoann Seeleuthner2,3,4, Samuel Mondy2,3,4, Michael E. Sieracki5,6, Olivier Jaillon2,3,4,
Patrick Wincker2,3,4, Colomban de Vargas7,8 & Ramon Massana1
1Department of Marine Biology and Oceanography, Institute of Marine Sciences (ICM)–CSIC, Pg. Marítim de la Barceloneta, 37-49, Barcelona E-08003, Spain.
2CEA, Institut de Génomique, Génoscope, 2 Rue Gaston Crémieux, Evry F-91000, France.
3CNRS, UMR 8030, CP5706, Evry, F-91000, France.
4Université d’Evry, UMR 8030, CP5706, Evry, F-91000, France.
5National Science Foundation, 4201 Wilson Boulevard, Arlington, VA 22230, USA.
6Bigelow Laboratory for Ocean Sciences, 60 Bigelow Drive, East Boothbay, ME 04544, USA.
7CNRS, UMR 7144, Station Biologique de Roscoff, Place Georges Teissier, Roscoff, F-29680, France.
8Sorbonne Universités, UPMC Université Paris 06, UMR 7144, Station Biologique de Roscoff, Place Georges Teissier, Roscoff, F-29680, France.
Correspondence and requests for materials should be addressed to J.-F.M. (email: jean-francois.mangot@wanadoo.fr) or R.M. (email: ramonm@icm.csic.es)
Received: 30 August 2016
Accepted: 21 December 2016
Published: 27 January 2017
Most marine biodiversity is constituted by microbes that dominate in biomass and are fundamental for ecosystem functioning and biogeochemical processes1–3. Among them, microbial eukaryotes play significant roles in primary production4, nutrient cycling, and food-web dynamics as grazers and parasites5,6. In particular, pico- and nano-sized Heterotrophic Flagellates (HF; 1–5 μm) are important mortality agents of planktonic prokaryotes. Furthermore, HF constitute a key link in the transfer of organic carbon to upper trophic levels7. For a long time, marine HF were studied as homogeneous assemblages, but molecular surveys have revealed that they include evolutionarily very diverse groups6. A notable component of HF assemblages are the MArine STramenopiles or MASTs, which are constituted by at least 18 groups8 with widespread distributions9–12. MASTs may reach up to 35% of cells in the HF assemblage13 and one group in particular, the geographically widespread MAST-4, can present cell abundances averaging 9% of HF13 in the marine euphotic layer. Thus, MAST-4 may be one of the most abundant HF in the oceans. Unfortunately, MAST species have escaped cultivation so far, with only one exception within the clade MAST-314. Despite the obvious importance of MAST cells, both in terms of abundance and diversity, little is known about their biology and evolution. To address the latter, genome sequencing appears as a powerful approach, but the lack of sufficient genomic DNA material due to cells’ unculturability prevents traditional shotgun sequencing and genome assembly.
Currently, there is still a big knowledge gap in protist genomics caused by the reluctance of many species to grow in culture. Indeed, only 15% of the completed or ongoing projects in the Genomes OnLine Database (GOLD; https://gold.jgi.doe.gov)15 concern protistan taxa, and most of them come from cultured phototrophic16–18 or parasitic species19, resulting in a biased view of the full eukaryotic diversity20. In this context, a natural option is single cell genomics (SCG), which produces Single Amplified Genomes (SAGs) that can later be sequenced. This technology was initially used to produce genomic information from single prokaryotic cells collected from the ocean21. The first SCG studies targeting microeukaryotes aimed at getting an accurate assessment of community composition22, or exploring complex biotic interactions at the single-cell level, such as the presence of prey or pathogens within a host cell23,24. To the best of our knowledge, only three studies have applied SCG to retrieve the genomes of uncultured microeukaryotes including Picozoa (3 SAGs)23, Paulinella ovalis (6 SAGs)25 and MAST-4 (1 SAG)26. Whereas the genome completeness in SCG studies of prokaryotes ranges from 10% to 100%21, the genomes obtained in the previous eukaryotic studies are still partial. For instance, Roy and colleagues26 retrieved about one third of the conserved eukaryotic protein coding genes, used as proxy for genome completeness, in their MAST-4 SAG assembly. This limited recovery is likely produced by the bias introduced during the whole-genome amplification, which seems to preferentially amplify certain genomic regions27. A recent study has revealed an average genome recovery per SAG of 81% when compared against the 10.4 Mbp reference genome of Cryptosporidium parvum28, a pathogenic protist infecting both humans and animals29. Unfortunately, most protist species have larger and more complex genomes and also lack reference genomes that can help during assembly.
Here, we increase the recovery of genomic information from two marine HF species by co-assembling separately-sequenced SAGs belonging to the same species (Supplementary Fig. S1). Different cells from two MAST-4 lineages (clades A and E) were isolated during the Tara Oceans expedition30 and SAGs were produced. Before co-assembling different SAGs, we ensured that they had identical 18S rDNA, shared comparable tetranucleotide frequencies, and had > 95% overall nucleotide identity. We observed a significant increase in the genome recovery in both MAST-4 clades, from around 20% in individual SAGs to 68–74% in the co-assemblies. Our approach allowed the recovery of genomic functions from previously unknown genomes, which will be pivotal for understanding the ecological role of these uncultured flagellates in the ocean.
Results and Discussion
Limitations of using only one SAG to investigate genomics of marine picoeukaryotes. A total of 22 pico-heterotrophic cells affiliating to MAST-4A (n = 13) and MAST-4E (n = 9) were isolated during the Tara Oceans expedition. All MAST-4E cells and most MAST-4A cells were isolated from the same station in the Mediterranean Sea (station 23, Adriatic Sea), whereas two additional MAST-4A cells derived from the Indian Ocean (station 41, Arabian Sea) (Supplementary Tables S1 and S2). A first Sanger sequencing of the 18S rDNA gene from the MDA product revealed identical sequences among all MAST-4E cells, as well as among most MAST-4A cells (only 2 of the 13 cells had 1 mismatch with the rest). Overall, 23 SAGs were sequenced (one SAG was sequenced twice) producing on average 24.2 million paired-end Illumina HiSeq 2000 reads per SAG (Supplementary Table S2). The sequencing depth of individual SAGs was similar, with a mean value of 4.9 (± 1.4) Gbp (Supplementary Table S2).
Sequenced SAGs were individually assembled, resulting in assembly sizes from 1.6 to 20.3 Mbp in MAST-4A (mean of 9.2 ± 4.5) and from 2.3 to 9.7 Mbp in MAST-4E (mean of 6.2 ± 2.4) (considering contigs > 1 kbp; Fig. 1a). This is in agreement with a previous study of one SAG from the MAST-4D lineage that resulted in an assembly size of 16.9 Mbp using a sequencing depth of 6.6 Gbp26. Assembly sizes in the other uncultured protist SAGs were similar as well, around 5 Mbp for Picozoa23 and Paulinella ovalis25. Based on these three studies and our own data, it appears that the assembly size obtained by using one SAG may vary by a factor of 10, typically between 2 and 20 Mbp. Similarly, the number of contigs assembled (Fig. 1b) and their respective N50 (Fig. 1c) also varied among SAGs. Within MAST-4A, the contig number varied from 467 to 3,323 (mean of 1,696 ± 764), and the N50 from 5.2 to 18.7 kbp (mean of 11 ± 3.1), while for MAST-4E the contig number varied from 478 to 1,649 (mean of 1,099 ± 351), and N50 from 7.5 to 13.0 kbp (mean of 10.6 ± 1.9). Again, both parameters were similar to the values found in MAST-4D26. The GC content averaged 33.9% in MAST-4A SAGs and 44.1% in MAST-4E SAGs (Fig. 1d) and showed very little variability (1 and 0.5%, respectively). Such differences in GC content suggest that MAST-4A and MAST-4E are evolutionarily divergent. One MAST-4A SAG had a slightly higher GC content (Fig. 1d), which could be due to (i) “non-targeted” DNA found inside the cell due to infection, prey capture, or symbiosis, (ii) externally associated material such as attached cells or free DNA, or (iii) contamination during the cell sorting or sequencing. Since a substantial part of this foreign DNA could highlight true organismal interactions23, we chose to leave it in our analyses for possible further exploitation.
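The assembly metrics compared above (assembly size, number of contigs, N50 and GC content, all on contigs > 1 kbp) can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the function name `assembly_stats` and its input format (a list of contig sequences) are hypothetical, and N50 is taken with its usual definition, i.e. the length of the contig at which the cumulative length, sorted from longest to shortest, first reaches half of the total assembly size.

```python
def assembly_stats(contigs, min_length=1000):
    """Summarise an assembly given as a list of contig sequences (strings).

    Only contigs strictly longer than `min_length` bp are considered,
    mirroring the "contigs > 1 kbp" filter mentioned in the text.
    """
    kept = [seq.upper() for seq in contigs if len(seq) > min_length]
    lengths = sorted((len(seq) for seq in kept), reverse=True)
    total = sum(lengths)

    # N50: walk from the longest contig down until half the assembly is covered.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    # GC content over unambiguous bases only (N and other symbols are ignored).
    gc = sum(seq.count("G") + seq.count("C") for seq in kept)
    acgt = sum(seq.count(base) for seq in kept for base in "ACGT")

    return {
        "size_bp": total,
        "n_contigs": len(kept),
        "n50_bp": n50,
        "gc_percent": 100.0 * gc / acgt if acgt else 0.0,
    }
```

Applied to each individual SAG assembly, such a summary would give the four quantities plotted in Fig. 1; dedicated tools report the same metrics and should be preferred for real data.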
The variability in assembly size did not depend on sequencing depth, as shown by the comparison of estimated genome recovery (as percentage of ultra-conserved eukaryotic genes retrieved with CEGMA) vs. sequencing depth (Fig. 2a). Genome recovery averaged 18.7% (± 9.7) in MAST-4A and 14.1% (± 5.4) in MAST-4E SAGs. In some SAGs, genome recovery was similar to the 37.5% that we estimated for the previously sequenced MAST-4D26. Furthermore, SAGs that were independently sequenced in two sequencing centers (AA538-G20 and AA538-G20_bis, Supplementary Table S2) produced a similar assembly size (9.2 and 10.2 Mbp) despite different sequencing depths (4.6 and 6.8 Gbp, Table S2).
The lack of correlation between genome recovery and sequencing depth (Fig. 2a) suggests near-saturation of the sequencing effort per SAG. This was further tested by assessing the genome recovery of the two SAGs with the largest assemblies using decreasing fractions of the sequenced reads (Fig. 2b). For each subsampled level, the five replicates behaved similarly (SE < 1.5% in both cases) and the dynamics of recovery vs. sequencing depth followed a Michaelis-Menten relationship31, levelling off at the performed sequencing depth (Fig. 2b). In fact, MAST-4A and MAST-4E SAGs reached about 65% of their genome recovery with only 17% of the reads, and about 80% with half of the reads (Fig. 2b).
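The saturation behaviour described above can be illustrated by fitting a Michaelis-Menten curve, y = Vmax · x / (K + x), to recovery values obtained at increasing read fractions. The sketch below is only one possible way to do this, assuming a Hanes-Woolf linearisation (x/y regressed on x by ordinary least squares); the fitting procedure actually used in the study is not described in this section, and the function name and the data points are hypothetical.

```python
def fit_michaelis_menten(x, y):
    """Estimate (vmax, k) for y = vmax * x / (k + x).

    Uses the Hanes-Woolf form x/y = x/vmax + k/vmax, so that a straight
    line fitted to (x, x/y) has slope 1/vmax and intercept k/vmax.
    """
    points = [(xi, xi / yi) for xi, yi in zip(x, y) if xi > 0 and yi > 0]
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_z = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    sxz = sum((p[0] - mean_x) * (p[1] - mean_z) for p in points)
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    vmax = 1.0 / slope
    k = intercept * vmax
    return vmax, k


# Hypothetical example: synthetic recovery values (%) at the read fractions
# used for subsampling, generated from a known curve (vmax = 25, k = 0.15).
fractions = [0.17, 0.33, 0.50, 0.67, 0.83, 1.00]
recovery = [25.0 * f / (0.15 + f) for f in fractions]
vmax, k = fit_michaelis_menten(fractions, recovery)
```

In this parameterisation Vmax is the plateau that recovery approaches with unlimited sequencing, and K is the read fraction at which half of that plateau is reached. A small K relative to the full sequencing depth corresponds to the near-saturation reported here.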
From the previous analyses, it appeared that individual SAGs of uncultured protist cells were variable and recovered only a fraction (about 20%) of their genomes, which did not improve by increasing the sequencing effort. This might depend on intrinsic properties of selected cells, their DNA integrity, as well as MDA biases21,28,32. An option to improve genome recovery of uncultured cells would be using partial SAG assemblies to recruit metagenome reads and/or contigs or to serve as training sets for supervised binning efforts of metagenomic data from the same sample. This reassembly of reads has been recently tested on one archaeal SAG of Korarchaeota33, resulting in only a slight increase of the genome recovery, from 87% to 89%. Another option is sorting multiple natural cells and performing a targeted metagenomic analysis34–36. Thus, the complete chloroplast genome (91 kbp) of Pelagomonas calceolata was generated from natural communities35. In the case of protists having larger genomes and living in complex communities, sorting natural populations seems less promising, and instead the co-assembly of closely related SAGs that belong to the same species seems a good option. This approach was first used by Rinke and colleagues37, who obtained prokaryotic genomes with an estimated completeness of over 90%.
Determining which SAGs can be co-assembled? Although all SAGs within each of the two MAST lineages had virtually the same 18S rDNA sequence, this could be insufficient to infer genomic homogeneity for co-assembly38,39, as cells with identical 18S rDNA could be genomically too different40. Therefore, we ran pairwise comparisons of the SAG sequences using BLASTn (Supplementary Fig. S2). SAGs affiliating to MAST-4A had a slightly lower average nucleotide identity (ANI) among them (95.1% to 99.9%, mean of 97.6 ± 0.8%; Supplementary Fig. S2a) than MAST-4E SAGs (98.4% to 99.6%, mean of 99.1 ± 0.3%; Supplementary Fig. S2b). In each pairwise comparison, and since each SAG contains a different region of the genome (see section below), only a fraction of the assembly could be compared (Supplementary Fig. S2c and S2d). Thus, most SAGs shared less than 50% of their genomic content (average of 32.5% ± 16.2 for MAST-4A and 27.5% ± 9.6 for MAST-4E), except the two replicated SAGs and the pair AA538-E21/AA538-C11, which shared 71.0% and 95.9% respectively. Among MAST-4A, AA538-K07 was atypical, presenting the lowest ANI (95.9% ± 0.4) and the lowest genome overlap with other SAGs (from 3.2% to 8.5%). This SAG presented a second peak in its GC content (data not shown), suggesting the presence of foreign DNA. In the current genomics era, the use of ANI becomes essential to define microbial species. Among prokaryotes, the ANI threshold to adequately define species is above 95–96% in at least 20% of the genome41. Similar data on microbial eukaryotic species are not yet available, but a threshold of 97–99% seems reasonable based on our results.
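The two quantities used in these pairwise comparisons, ANI and the fraction of the assembly that could be compared, can be derived from tabular BLASTn output. The sketch below is an assumption about how such a summary might be computed, not the procedure used in the study: it expects the standard 12-column tabular format (`-outfmt 6`, where column 3 is percent identity and column 4 is alignment length), weights identity by alignment length, and ignores the fact that overlapping hits would be counted twice. The function name and the example hits are hypothetical.

```python
def ani_and_shared_fraction(blast_lines, query_assembly_bp, min_identity=0.0):
    """Return (ANI %, shared fraction %) for one SAG-vs-SAG comparison.

    blast_lines        iterable of tab-separated BLASTn hits (outfmt 6)
    query_assembly_bp  total size of the query SAG assembly in bp
    min_identity       hits below this percent identity are skipped
    """
    aligned_bp = 0
    identical_bp = 0.0
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        pident, length = float(fields[2]), int(fields[3])
        if pident < min_identity:
            continue
        aligned_bp += length
        identical_bp += pident / 100.0 * length
    if aligned_bp == 0:
        return 0.0, 0.0
    ani = 100.0 * identical_bp / aligned_bp          # length-weighted identity
    shared = 100.0 * aligned_bp / query_assembly_bp  # part of the query that aligned
    return ani, shared


# Hypothetical hits: qseqid sseqid pident length mismatch gapopen
#                    qstart qend sstart send evalue bitscore
hits = [
    "contig_1\tcontig_9\t98.0\t1000\t20\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "contig_2\tcontig_4\t96.0\t3000\t120\t0\t1\t3000\t1\t3000\t0.0\t5000",
]
ani, shared = ani_and_shared_fraction(hits, query_assembly_bp=10000)
```

A pair of SAGs would then qualify for co-assembly under the criteria above when its ANI exceeds the chosen threshold, keeping in mind that the shared fraction is expected to be low because each SAG covers a different part of the genome.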
Figure 1. General characteristics of the draft genomes obtained by individual SAGs. Box plots capture the variation in assembly size (a), number of contigs (b), N50 (c) and GC content (d) among MAST-4A (n = 14) and MAST-4E (n = 9) SAGs.
Besides the 18S rDNA and ANI comparisons, we also analysed tetranucleotide frequencies coupled to ESOM clustering42 to determine if different MAST-4 SAGs have the same genomic features and perhaps identify contigs with deviant signatures. In previous studies, this approach enabled the identification of genomic clusters within prokaryotic assemblages33,43,44. To explore the potential of the ESOM mapping with eukaryotic genomes, we used a selection of published genomes of six photosynthetic and two heterotrophic protists. Fragmented contigs (2.5–5 kbp in size) from each genome formed clear separate clusters (Supplementary Fig. S3). We then adapted this approach to analyse the 23 SAGs used here, represented by 17,029 fragmented contigs (about two thirds from MAST-4A and one third from MAST-4E). As expected, the obtained topography (U-Matrix) representing the structure of the tetranucleotide frequency dataset formed two large clusters that coincided with MAST-4A and MAST-4E contigs (Fig. 3). All SAGs within the same lineage were found in the same region of the map, revealing the same tetranucleotide frequency profile. In addition, we observed two small clusters that contained particular genomic signatures: the subcluster “a” of mitochondrial origin and the subcluster “b” of putative prey origin. A detailed analysis of the contigs within these subclusters identified four subgroups (Fig. 3). The first (a1) included 17 contigs from both lineages related to the mitochondrion of Cafeteria roenbergensis (mean similarity of 94%); these probably belonged to the MAST-4 mitochondrion. The three other subgroups derived from putative prey DNA. Subcluster a2 contained contigs related to algal mitochondria (2 to Ostreococcus spp. and 3 to Micromonas spp., with mean similarity of 91% and 98%, respectively), whereas contigs in subclusters b1 and b2 are related to the nuclear genome of Bathycoccus prasinos (36 contigs with mean similarity of 86%) and Ostreococcus lucimarinus (3 contigs with mean similarity of 88%), respectively. The presence of algal DNA in these draft genomes suggests that MAST-4 can ingest picosized algae. Although MAST-4 is generally considered a bacterial grazer, it has been seen eating the picoalga Micromonas pusilla in grazing experiments45. Overall, the DNA from algal prey represents a very small fraction of MAST-4 genomes (< 0.3% of fragmented contigs).
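The input to this kind of clustering is a table with one row per contig fragment and one column per tetranucleotide. A minimal sketch of how such rows could be produced is given below. It rests on several assumptions that are not taken from the study: contigs are cut into consecutive 5 kbp windows and a final piece is kept only if it is at least 2.5 kbp long, all 256 tetramers are counted on one strand without merging reverse complements, and windows containing ambiguous bases are skipped. The function names are hypothetical, and the ESOM training itself is not shown.

```python
from itertools import product

# All 256 possible tetranucleotides, in a fixed order used for every fragment.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def fragment(seq, size=5000, min_size=2500):
    """Cut a contig into consecutive windows of `size` bp.

    Pieces shorter than `min_size` (typically the last one) are dropped,
    so every returned fragment is between min_size and size bp long.
    """
    pieces = [seq[i:i + size] for i in range(0, len(seq), size)]
    return [p for p in pieces if len(p) >= min_size]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each tetramer, in TETRAMERS order."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows with N or other symbols are ignored
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]
```

Each fragment of a SAG assembly would thus become a 256-dimensional vector, and fragments with similar composition end up close to each other on the map, which is what allows prey or organelle contigs to stand out from the bulk of the MAST-4 contigs.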
Co-assembling individual SAG sequences to by-pass the MDA bias. Based on the tetranucleotide frequency profiles and genomic data (%-GC and ANI values) we decided to co-assemble the Illumina reads of SAGs from each of the two MAST-4 lineages. The co-assembly of the 14 MAST-4A SAGs yielded 48.1 Mbp and the co-assembly of the 9 MAST-4E SAGs yielded 32.3 Mbp (considering contigs > 1 kbp; Table 1). The MAST-4A final co-assembly contained 15,370 contigs with an N50 of 4.5 kbp, while MAST-4E contained 5,679 contigs with an N50 of 10 kbp. The CEGMA analysis searching for 248 core eukaryotic genes identified 184 and 169 orthologs in each genome (Table 1; Supplementary Fig. S4), resulting in an estimated genome completeness of 74.2% and 68.2% in MAST-4A and MAST-4E, respectively. The same analysis done on complete genomes of free-living unicellular eukaryotes sequenced in the standard way (shotgun sequencing from multiple cells of a clonal culture) resulted in only slightly larger recovery estimates, from 78% in Chlorella variabilis and Chlamydomonas reinhardtii to 96% in Phytophthora sojae (Table 1). Overall, the MAST-4 co-assemblies had 4–5 times more conserved proteins and were 5 times longer than individual SAG assemblies.
As stated before, SAGs from the same lineage had a low sequence overlap, typically around 30% (Supplementary Fig. S2c and S2d), suggesting that each SAG was recovering a different region of the genome. To verify this statement, we mapped the reads of each SAG back to the final co-assembly to determine the contribution of each SAG and the regions of overlap among SAGs. Although a large fraction of the co-assemblies resulted from the combination of several SAGs, a significant part of them, 17% in MAST-4A and 25% in MAST-4E, derived from only one SAG (Fig. 4). More than half of the final co-assembly was obtained with 2–3 SAGs. At the other end, a very small fraction of the final co-assembly (< 0.5%) was found in all SAGs (Fig. 4). The latter patterns are likely the result of MDA bias50, which seems to randomly amplify a different region of the genome in each SAG.
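The breakdown of a co-assembly by the number of contributing SAGs can be pictured with a small counting exercise. The sketch below assumes that read mapping has already been reduced to, for each SAG, the set of co-assembly positions covered by at least one of its reads; this representation, the function name and the toy data are hypothetical and only serve to show the bookkeeping, not the fragment recruitment analysis itself.

```python
from collections import Counter


def shared_fractions(covered_by_sag, assembly_length):
    """Percentage of the co-assembly covered by exactly k SAGs, for each k.

    covered_by_sag   dict mapping a SAG name to an iterable of covered
                     positions (0-based coordinates on the co-assembly)
    assembly_length  total length of the co-assembly in bp
    """
    depth = Counter()  # position -> number of SAGs covering it
    for positions in covered_by_sag.values():
        depth.update(set(positions))
    histogram = Counter(depth.values())  # k -> number of positions
    return {k: 100.0 * n / assembly_length for k, n in sorted(histogram.items())}


# Toy co-assembly of 10 bp covered by three hypothetical SAGs.
toy = {
    "SAG_A": range(0, 6),   # positions 0-5
    "SAG_B": range(4, 10),  # positions 4-9
    "SAG_C": range(5, 7),   # positions 5-6
}
fractions_by_k = shared_fractions(toy, assembly_length=10)
```

In the toy case most positions are contributed by a single SAG and only one position is seen in all three, which is the same qualitative pattern as in Fig. 4. For real assemblies, per-position sets would be replaced by coverage intervals or windows to keep memory use reasonable.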
Figure 2. Genome recovery estimated by CEGMA of SAGs in relation to the sequencing effort. (a) Genome recovery of the 23 SAGs in relation to their sequencing depth. (b) Genome recovery at different sequencing depths in two selected SAGs (those with the largest genome in each clade). Each point represents the mean recovery after 5 separate subsamplings (at 17%, 33%, 50%, 67%, and 83%) of the total number of reads.
The relationship between the number of co-assembled SAGs and the size and recovery of the final co-assembly followed a Michaelis-Menten curve31 in both cases, with signs of saturation at the highest number of SAGs (Fig. 5). Estimated genome sizes extrapolated from this curve were 62.7 Mbp in MAST-4A and 48.5 Mbp in MAST-4E. These values were similar to those calculated using the genome recovery by CEGMA in the co-assembled genomes, 64.8 and 47.4 Mbp.
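The CEGMA-based figures quoted above follow from simple ratios. Completeness is the number of core eukaryotic genes detected divided by the 248 genes searched, and, assuming that the CEGMA-based genome size is the co-assembly size divided by that completeness (an assumption that reproduces the reported 64.8 and 47.4 Mbp), the estimate can be recomputed from the numbers given in the text. The function names below are hypothetical.

```python
CEG_TOTAL = 248  # core eukaryotic genes searched by CEGMA


def completeness_percent(cegs_found, total=CEG_TOTAL):
    """Genome completeness as the percentage of core genes detected."""
    return 100.0 * cegs_found / total


def estimated_genome_size_mbp(assembly_mbp, cegs_found, total=CEG_TOTAL):
    """Assembly size scaled up by the inverse of the completeness."""
    return assembly_mbp * total / cegs_found


# Values reported in the text: (co-assembly size in Mbp, CEGs detected).
coassemblies = {"MAST-4A": (48.1, 184), "MAST-4E": (32.3, 169)}
estimates = {
    name: (completeness_percent(found), estimated_genome_size_mbp(size, found))
    for name, (size, found) in coassemblies.items()
}
```

With these inputs the completeness comes out at about 74.2% and 68.1 to 68.2%, and the genome size at about 64.8 and 47.4 Mbp, close to the 62.7 and 48.5 Mbp extrapolated from the saturation curves.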
Gain and loss during the co-assembly. Genomic information could be lost during co-assembling if SAGs were not identical, as was the case here. To better understand this potential loss, we compared the contigs of each individual SAG with the final co-assembly using Quast51. The vast majority of SAG contigs (> 99.5% of their length) matched at > 95% identity with the co-assemblies, whereas only a very low proportion of the genomic data present in the individual SAGs was lost, i.e. 0.3% (± 0.2) in MAST-4A and 0.2% (± 0.2) in MAST-4E. On the other hand, the final co-assemblies were indeed masking real genetic differences (single nucleotide polymorphisms, indels) of individual cells. On average, 445 (± 126) and 122 (± 11) mismatches and 125 (± 23) and 60 (± 7) indels per 100 kbp were detected in MAST-4A and MAST-4E, respectively.
Another way to estimate the potential gain and loss of genetic information during the co-assembly is by a detailed analysis of the 248 core eukaryotic genes (CEGs) found within SAGs and the co-assemblies. We illustrated this by focusing on a subset of 34 CEGs coding for proteins involved in translation, ribosomal structure and biogenesis processes (Fig. 6). On the one hand, the final co-assemblies of MAST-4A and MAST-4E retrieved 29 and 25 of these CEGs, whereas each individual SAG retrieved a lower number (1 to 16 in MAST-4A and 1 to 7 in MAST-4E), highlighting the significant gain in retrieved genomic information when co-assembling (Fig. 6). In addition, we also identified several CEGs (5 in MAST-4A and 2 in MAST-4E) found in co-assemblies but absent in individual SAGs, indicating that reads from different SAGs have participated in the assembly of those CEGs. On the other hand, a few CEGs detected in the individual SAGs were not retrieved in their respective co-assembly (KOG1770, KOG0650 and KOG0122 in MAST-4A; Fig. 6). The same analysis done on the complete set of 248 CEGs (Supplementary Fig. S4) revealed 18 and 6 CEGs exclusively found in MAST-4A and MAST-4E co-assemblies, and 33 and 1 CEGs only found in the SAGs (Table 2). These last CEGs were in fact present in the co-assembly but below the detection threshold of CEGMA (at least 70% of the protein length) or in the discarded small contigs (< 1 kbp). Overall, we found that all general functions for which conserved genes are indexed were present in both final co-assemblies (Table S3), with only a few exceptions, such as the lack of genes involved in the transport and metabolism of inorganic ions in MAST-4E. The retrieval of multiple conserved eukaryotic functions in MAST-4A and MAST-4E supported the adequacy of the co-assembly strategy to increase the amount of retrieved genomic information.
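The gain and loss accounting described above amounts to comparing two sets of gene identifiers: the CEGs detected in at least one individual SAG and the CEGs detected in the co-assembly. A minimal sketch with Python sets is shown below; the function name, the SAG names and most of the identifiers are placeholders, with only KOG1770 and KOG0650 taken from the text as examples of CEGs seen in SAGs but not in the co-assembly.

```python
def compare_cegs(cegs_per_sag, cegs_in_coassembly):
    """Split CEGs into shared, SAG-only and co-assembly-only groups.

    cegs_per_sag        dict mapping a SAG name to the CEG identifiers it contains
    cegs_in_coassembly  CEG identifiers detected in the co-assembly
    """
    in_sags = set().union(*cegs_per_sag.values())  # seen in at least one SAG
    in_coassembly = set(cegs_in_coassembly)
    return {
        "shared": in_sags & in_coassembly,
        "solely_in_sags": in_sags - in_coassembly,
        "solely_in_coassembly": in_coassembly - in_sags,
    }


# Placeholder data for illustration only.
toy_sags = {
    "SAG_1": {"KOG1770", "KOG_x1"},
    "SAG_2": {"KOG0650", "KOG_x2"},
}
toy_coassembly = {"KOG_x1", "KOG_x2", "KOG_x3"}
groups = compare_cegs(toy_sags, toy_coassembly)
```

The same bookkeeping underlies the counts in Table 2: CEGs found in both SAGs and co-assembly plus those found solely in the co-assembly give 166 + 18 = 184 for MAST-4A and 163 + 6 = 169 for MAST-4E, which are the totals detected by CEGMA in each co-assembly.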
Figure 3. Comparison of tetranucleotide frequencies of SAGs in an ESOM map. Each contig (2.5–5 kbp in size) is represented by a point placed in the map by relatedness and colored according to its provenance from SAGs of MAST-4A (bluish) or MAST-4E (reddish). Note that the map is continuous from top to bottom and side to side. Large differences in tetranucleotide frequencies (black borders) represent natural divisions between taxonomic groups. Two clusters (a and b) were identified and taxonomically assigned (see text).
We then compared the sequence of CEGs in the SAGs and in the co-assemblies (Table 2). In most cases, all retrieved sequences were identical (73 of 166 in MAST-4A and 119 of 163 in MAST-4E) or above 95% similarity (53 and 18, respectively). However, there were a few examples of very low sequence identity (for instance in KOG2971 shown in Fig. 6). The reason for these few cases of low similarity was the presence among the SAGs of two CEG variants, very distant from each other, of which only one was detected in the final co-assembly (since CEGMA only detects one CEG). However, the second variant was also present in the co-assembly. Therefore, either there were two variants of the same CEG coexisting in the population, or the second variant derived from a putative prey. At any rate, we found very high sequence identity among all CEGs retrieved from the SAGs and the co-assemblies.
Table 1. MAST-4A and MAST-4E assembly properties in comparison to complete published genomes of other small phototrophic and heterotrophic protists. Mar Stram, Marine Stramenopiles. Bacil, Bacillariophyceae. Oomyc, Oomycetes. Opist, Opisthokonta. Choan, Choanoflagellates. Mamiell, Mamiellophyceae. Treb, Trebouxiophyceae. Chlor, Chlorophyceae. Assembly features of MAST-4A and MAST-4E have been calculated on contigs longer than 1 kb. Assembly features of published genomes were retrieved from their respective publications or, when missing, from the JGI genome portal (http://genome.jgi.doe.gov). Additionally, their CEGMA completeness (contigs > 1 kb) was also calculated here. Missing data are shown by the symbol (—). *KOs, KEGG Orthology. GOs, Gene Ontology. KOGs, Eukaryotic Orthologous Groups.
Columns, in order: Group; Lineage or species; Raw assembly size (Mbp); CEGMA completeness (%); Number of genes; Mean gene size (bp); Mean intron density (introns per gene); Mean intron length (bp); Number of KOs or GOs*; Number of KOGs; Reference
Stramenopiles
Mar Stram  MAST-4A  48.1  74.2  19,909  1,657  0.56  260  2,733  2,115  This study
MAST-4E  32.3  68.2  11,850  1,723  0.36  332  2,210  1,878  This study
Bacil  Thalassiosira pseudonana  34.5  92.7  11,242  992  1.4  5,473  8,113  46
Oomyc  Phytophthora sojae  95  96.0  19,027  8,714  3,891  19
Phytophthora ramorum  65  95.2  15,743  7,633  3,830  19
Opist  Choan  Monosiga brevicollis  42  92.7  9,196  3,004  6.6  174  1,843  3,389  47
Chlorophyta
Mamiell  Micromonas pusilla CCMP1545  21.9  83.5  10,575  1,557  0.9  187  4,787  7,086  17
Micromonas pusilla RCC299  20.9  87.1  10,056  1,587  0.57  163  4,911  6,554  17
Bathycoccus prasinos  15  87.5  7,847  —  3,597  18
Ostreococcus tauri  12.5  80.6  8,116  1,257  0.39  187  3,603  5,320  16
Treb  Chlorella variabilis  46.2  77.8  9,791  2,928  209  5,372  7,938  48
Chlor  Chlamydomonas reinhardtii  121  77.8  15,143  4,312  0.92  373  6,733  9,435  49
Figure 4. Fractions of the co-assembled genomes of MAST-4A and MAST-4E shared among their respective SAGs (from 1 to 14 cells). The contribution of each SAG was determined through a fragment recruitment analysis of their reads towards the final co-assembly.
We also studied the effect of co-assembling different cells on the universal rDNA operon. We first searched for the 18S rDNA sequence and found it in 9 MAST-4A and 6 MAST-4E individual SAGs (Supplementary Fig. S5). Often, the complete rDNA operon was retrieved in a single contig among SAGs, while it was fragmented in the final co-assembly. The rDNA variability was mainly found in the variable Internal Transcribed Spacer (ITS) regions (Supplementary Fig. S5; Fig. 7). Within MAST-4A, the ITS variability correlated with sample origin (Fig. 7), as SAGs from the Indian Ocean (AB537-A17 and AB537-K04) were similar (~97% in both ITS regions) and differed from those in the Mediterranean Sea. Differentiation of MAST-4A populations based on ITS surveys was explained by temperature rather than geographic distance52, and our data followed this trend since the samples from the two stations differed by more than 10 °C (Supplementary Table S1). Furthermore, we searched for particular regions in the ITS1 and ITS2 secondary structures (helices II and III) that need to be identical among individuals to be considered from the same biological species53. These regions were indeed identical in all SAGs (Fig. 7).
Figure 5. Cumulative genome size (a) and genome recovery (b) calculated when increasing the number of SAGs used for co-assembly.
Finally, we predicted genes in the final co-assemblies: 26,676 exons were predicted for MAST-4A and 14,919 for MAST-4E, which resulted in 19,909 and 11,850 predicted proteins (Table 1). A comparable number of proteins, when normalized by the size of its draft genome, was found in MAST-4D, 6,993 genes in the 16.9 Mbp26. Compared with published genomes of picosized protists and other heterotrophic flagellates, the number of genes predicted in MAST-4 genomes is relatively high (Table 1), only comparable with the parasitic Phytophthora spp., and perhaps some could derive from foreign DNA. On the other hand, the mean gene size (1,657 and 1,723 bp in each lineage) was similar to the values found in other protist genomes (Table 1). Compared to other protists, MAST-4 has compact genomes with few but long introns (Table 1). Finally, a first gene annotation of the two co-assemblies was performed using BLASTp against the KEGG Orthology (KO) database54, revealing a total of 2,733 and 2,210 good KO hits for MAST-4A and MAST-4E, respectively (Table 1). Similarly, predicted proteins were also searched against the eukaryotic orthologous groups (KOG) using rpsBLAST and the Conserved Domains and Protein Classification database of NCBI (NCBI-CDD)55 and a total of 2,115 and 1,878 KOGs were assigned among the MAST-4A and MAST-4E co-assemblies (Table 1). Nevertheless, since foreign DNA is known to exist in the two co-assemblies, these values are solely indicative and a deeper effort in removing any trace of foreign DNA is still needed to get a better insight into the metabolic machinery of such uncultured organisms. At any rate, our observation of only a few contigs coming from putative prey (Fig. 3) suggests a minor impact of foreign DNA on gene prediction and annotation in the two MAST-4 co-assemblies.
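The statement that gene numbers are comparable once normalized by assembly size can be checked with the figures given in the text. The short sketch below computes genes per Mbp for the two co-assemblies and for the published MAST-4D SAG; the function name is hypothetical and the normalization by raw assembly size is our reading of "normalized by the size of its draft genome".

```python
def genes_per_mbp(n_genes, assembly_mbp):
    """Gene density expressed as predicted genes per Mbp of assembly."""
    return n_genes / assembly_mbp


# (predicted genes, assembly size in Mbp) as reported in the text.
assemblies = {
    "MAST-4A": (19909, 48.1),
    "MAST-4E": (11850, 32.3),
    "MAST-4D": (6993, 16.9),
}
density = {name: genes_per_mbp(*values) for name, values in assemblies.items()}
```

This gives roughly 414 genes per Mbp for both MAST-4A and MAST-4D and about 367 genes per Mbp for MAST-4E, which is the sense in which the gene counts are comparable across the three lineages.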
Conclusion
Our study shows that only a fraction (about one fifth) of the genome of a picoeukaryote can be obtained from an individual SAG. SAGs from the same species often retrieve different genome regions, and recovery is hardly improved by increasing the sequencing depth. The co-assembling strategy proposed here has proven its efficiency to bypass these limitations. To ensure the correct mixing of cells from the same species, we established two criteria in addition to identical 18S rDNA: a high ANI (> 95%) and similar tetranucleotide frequency profiles. By co-assembling SAGs we have access to more genes and functions from uncultured flagellates, although we are also missing intraspecific genetic variability. This strategy can be used and adapted to a range of uncultured protist species, whose genomes would remain unknown or partially known otherwise.
Methods
Sample collection and single-cell sorting. Samples for single-cell sorting were collected during the circumglobal Tara Oceans expedition30 and cryopreserved as described before22. Flow cytometry cell sorting, single cell lysis and genomic DNA amplification by Multiple Displacement Amplification (MDA)56,57 were performed by the Bigelow Laboratory Single Cell Genomics Center (https://scgc.bigelow.org) as previously described24,58 with a slight modification: 1x SYBR Green I (Life Technologies Corporation) was used instead of Lysotracker Green to stain the cells (Supplementary Fig. S1a). The obtained SAGs were screened by PCR using universal eukaryotic 18S rDNA primers and taxonomically assigned (Supplementary Fig. S1a). A total of 22 SAGs affiliated with the Marine Stramenopiles clade A (MAST-4A) and clade E (MAST-4E) were selected for sequencing. Sample-associated environmental metadata are reported in Supplementary Table S1 and more details can be found in PANGAEA59.
Figure 6. Identification of the 34 CEGs coding for proteins involved in translation, ribosomal structure and biogenesis processes within SAGs and co-assemblies of both lineages. CEGs present in both SAGs and co-assembly (light grey), solely in SAGs (dark grey) or solely in the co-assembly (black) are shown.
Table 2. Summary of the 248 CEGMA eukaryotic core genes (CEGs) determined in SAGs and co-assemblies of both MAST lineages. Values are numbers of CEGs detected; "In both" means detected in SAGs and in the co-assembly.
| Lineage | In both: 100%* | In both: 95% | In both: <95% | In both: NA | In both: Total | Solely in SAGs | Solely in co-assembly |
| MAST-4A | 73 | 53 | 29 | 11 | 166 | 33 | 18 |
| MAST-4E | 119 | 18 | 15 | 11 | 163 | 1 | 6 |
*Mean amino acid sequence identity of CEGs found in several SAGs. NA: Not applicable, since these CEGs are found in only one SAG.
SAG sequencing, assembly, quality control and completeness assessment. After purification of the MDA products and generation of 101 bp paired-end libraries, each SAG was sequenced in one eighth of an Illumina HiSeq lane at the Oregon Health & Science University (US) or the National Sequencing Center of Genoscope (France) (Supplementary Table S2 and Supplementary Fig. S1a). Reads were assembled or co-assembled using SPAdes60 (version 3.1 or 3.6). In all assemblies, contigs shorter than 1 kbp were discarded. Quality profiles and basic statistics (genome size, number of contigs, N50, GC content) of single SAG assemblies and co-assemblies were generated with Quast51. Genome recovery was estimated with CEGMA61 (Core Eukaryotic Genes Mapping Approach; Supplementary Fig. S1d).
In order to assess if genome completeness in each SAG depended on sequencing effort, reads from the two largest SAG assemblies within each clade were randomly subsampled to 5 different sequencing depths with the seqtk toolkit (https://github.com/lh3/seqtk). Five independent replicates were generated for each sequencing depth using different random number generator seeds. For each pool of subsampled reads, new assemblies and genome recovery estimates (contigs > 1 kbp) were generated as described above.
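The subsampling design (several depths, five seeded replicates per depth) can be sketched as follows. This is not seqtk itself: the function names, the in-memory list of read pairs and the seed numbering are assumptions for illustration, and the actual depths used are not given in this section.

```python
import random
from typing import Dict, List, Sequence, Tuple

ReadPair = Tuple[str, str]  # (forward read, reverse read), kept together


def subsample_pairs(pairs: Sequence[ReadPair], n: int, seed: int) -> List[ReadPair]:
    """Draw n read pairs without replacement; the same seed reproduces the same draw."""
    if n > len(pairs):
        raise ValueError("cannot draw more pairs than are available")
    rng = random.Random(seed)
    return rng.sample(list(pairs), n)


def subsampling_design(
    pairs: Sequence[ReadPair],
    depths: Sequence[int],
    n_replicates: int = 5,
    base_seed: int = 0,
) -> Dict[Tuple[int, int], List[ReadPair]]:
    """Build one subsample per (depth, replicate), each replicate with its own seed."""
    design = {}
    for depth in depths:
        for rep in range(n_replicates):
            # Hypothetical seed scheme: replicate index offset from base_seed.
            design[(depth, rep)] = subsample_pairs(pairs, depth, base_seed + rep)
    return design
```

Sampling whole pairs, rather than forward and reverse reads separately, keeps mates together so that each subsample remains a valid paired-end input for assembly.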
SAG comparisons based on nucleotide identity and tetranucleotide frequency. Nucleotide identity between SAGs was estimated by a pairwise BLAST analysis62 between full-length contigs of all SAGs within each clade, with a minimum similarity of 70% and a maximal e-value of 10⁻⁵. Tetranucleotide frequencies in each individual SAG were calculated using a 1 bp sliding window on both DNA strands in contigs between 2.5 and 5 kbp in size with a custom Perl script43 and clustered using ESOM42 (Emergent Self-organizing Maps; Supplementary Fig. S1b). Raw data were normalized using robust estimates of mean and variance ("Robust ZT" option) and trained according to Dick and colleagues43 with the k-Batch algorithm and Euclidean grid distance. Sub-clusters of interest were isolated to identify the corresponding contigs by a BLASTn analysis against the NCBI-nt and NCBI-RefSeq (including organelle genomes) databases63. BLAST hits (similarity > 80%, e-value < 10⁻⁵) were taxonomically assigned.
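A tetranucleotide frequency profile computed with a 1 bp sliding window on both strands can be sketched as below. This is an illustrative reimplementation, not the custom Perl script cited above; the function names and the choice to skip windows containing ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetramers


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string (non-ACGT symbols are kept as is)."""
    return seq.translate(COMPLEMENT)[::-1]


def tetranucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each tetramer, counted on both strands with a 1 bp step."""
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in ALL_TETRAMERS}
```

Because both strands are counted, every tetramer has the same frequency as its reverse complement, so the profile does not depend on the orientation in which a contig was assembled. The resulting 256-value vectors are the kind of input that a clustering tool such as ESOM would then normalize and train on.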
Genome analysis using fragment recruitment tools. Original reads were mapped back to their corresponding co-assembly using bowtie2 with default parameters64 (Supplementary Fig. S1c). The resulting read alignments (BAM file) were processed using samtools65, BEDTools66, QualiMap267 and custom Perl scripts. Each individual SAG assembly was then compared against the co-assembly, used as a reference for both MASTs, using Quast51.
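One summary that processing of such read alignments can produce is the breadth and mean depth of coverage per contig. The sketch below assumes a three-column, tab-separated per-base depth table (contig, position, depth), as written by `samtools depth -a`. The function name and the `min_depth` threshold are hypothetical, and the source does not state which metrics the custom scripts computed.

```python
from collections import defaultdict
from typing import Dict, Iterable


def coverage_summary(depth_lines: Iterable[str], min_depth: int = 1) -> Dict[str, Dict[str, float]]:
    """Per-contig breadth (% of positions with depth >= min_depth) and mean depth."""
    positions = defaultdict(int)
    covered = defaultdict(int)
    depth_sum = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, _pos, depth = line.split("\t")
        d = int(depth)
        positions[contig] += 1
        depth_sum[contig] += d
        if d >= min_depth:
            covered[contig] += 1
    return {
        contig: {
            "breadth_pct": 100.0 * covered[contig] / positions[contig],
            "mean_depth": depth_sum[contig] / positions[contig],
        }
        for contig in positions
    }
```

Running this on the reads of a single SAG mapped to the co-assembly would indicate which part of the co-assembly that SAG contributes, provided that zero-depth positions are present in the input.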
Analysis of Core Eukaryotic Genes (CEGs) and of the rDNA operon. A subset of 248 universal CEGs within each SAG and the two final co-assemblies were identified with CEGMA61. For each detected CEG, amino acid sequences were aligned using Clustal-Omega68. These alignments were then used to calculate distance matrices based on percent identities for each sequence pair.
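The percent-identity matrices derived from the CEG alignments can be sketched as follows. The convention used here (columns with a gap in either sequence are ignored) is an assumption; the source does not specify how gaps were treated, and the function names are illustrative.

```python
from itertools import combinations
from typing import Dict


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where either sequence has a gap ('-') are not counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(alignment: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Symmetric matrix of pairwise percent identities, with 100 on the diagonal."""
    names = list(alignment)
    matrix = {n: {m: 100.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        pid = percent_identity(alignment[n], alignment[m])
        matrix[n][m] = matrix[m][n] = pid
    return matrix


def mean_pairwise_identity(alignment: Dict[str, str]) -> float:
    """Mean identity over all distinct sequence pairs (0.0 if fewer than two sequences)."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    return sum(percent_identity(alignment[n], alignment[m]) for n, m in pairs) / len(pairs)
```

A mean pairwise identity per CEG is one way to obtain the kind of value summarised in Table 2 for CEGs found in several SAGs; a CEG found in a single SAG has no pair to compare, which corresponds to the NA column.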
We searched for contigs containing the 18S rDNA sequence in all individual MAST-4 SAGs and the co-assembly. The complete rDNA operon sequences were aligned using ClustalW69, as implemented in the Geneious package70. Internal Transcribed Spacer regions (ITS1 and ITS2) were identified and annotated based on previous work53 on ITS secondary structures of MAST-4.
Figure 7. Alignment of the ITS1 (a) and ITS2 (b) regions of individual SAGs and the co-assembly in MAST-4A. Conserved nucleotides in helices II and III of the two regions are highlighted according to ITS secondary structure models in MAST-4. Differences from a consensus sequence (not shown) are colored red (A positions), green (T), blue (C), and yellow (G).
Gene prediction of co-assembled genomes and taxonomic profiling. The initial set of CEGs predicted with CEGMA was used to train the Augustus ab initio gene predictor71 prior to its execution on the full co-assembly using default parameters (Supplementary Fig. S1d). Genes were annotated using BLASTp (e-value < 10⁻⁵) and rpsBLAST against, respectively, KEGG Orthology54 and the Conserved Domains and Protein Classification database of NCBI (NCBI-CDD)55. BLASTp hits with alignments of at least 100 bp, at least 30% query coverage and > 25% similarity were kept.
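The hit-filtering thresholds above can be illustrated with a short sketch. It assumes the standard 12-column BLAST tabular output (`-outfmt 6`) and a separately supplied table of query lengths. It also uses the percent identity column as a stand-in for "similarity" and applies the length threshold of 100 to the alignment length column as reported. These choices, and all function names, are assumptions rather than the authors' procedure.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float   # percent identity (used here as a proxy for similarity)
    length: int     # alignment length as reported by BLAST
    qstart: int
    qend: int
    evalue: float


def parse_outfmt6(lines: Iterable[str]) -> List[Hit]:
    """Parse standard 12-column tabular BLAST output; malformed lines are skipped."""
    hits = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[6]), int(f[7]), float(f[10])))
    return hits


def passes_filters(
    hit: Hit,
    query_len: int,
    max_evalue: float = 1e-5,
    min_length: int = 100,
    min_qcov: float = 30.0,
    min_ident: float = 25.0,
) -> bool:
    """Apply the e-value, alignment length, query coverage and similarity thresholds."""
    qcov = 100.0 * (abs(hit.qend - hit.qstart) + 1) / query_len
    return (
        hit.evalue < max_evalue
        and hit.length >= min_length
        and qcov >= min_qcov
        and hit.pident > min_ident
    )


def filter_hits(lines: Iterable[str], query_lengths: Dict[str, int]) -> List[Hit]:
    """Keep only the hits that satisfy every threshold."""
    return [h for h in parse_outfmt6(lines) if passes_filters(h, query_lengths[h.query])]
```

Query coverage is computed from the aligned query span, so a long alignment to a small part of a very long protein can still be rejected.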
References
1. Giovannoni, S. J. & Stingl, U. Molecular diversity and ecology of microbial plankton. Nature 437, 343–348 (2005).
2. DeLong, E. F. The microbial ocean from genomes to biomes. Nature 459, 200–206 (2009).
3. Falkowski, P. G., Fenchel, T. & Delong, E. F. The microbial engines that drive Earth’s biogeochemical cycles. Science 320, 1034–1039 (2008).
4. Jardillier, L., Zubkov, M. V., Pearman, J. & Scanlan, D. J. Significant CO2 fixation by small prymnesiophytes in the subtropical and tropical northeast Atlantic Ocean. ISME J. 4, 1180–1192 (2010).
5. Sherr, E. & Sherr, B. Understanding roles of microbes in marine pelagic food webs: A brief history In Microbial Ecology of the Oceans (ed. Kirchman, D. L.) 27–44 (Wiley-Liss., 2008).
6. Massana, R. Eukaryotic picoplankton in surface oceans. Annu. Rev. Microbiol. 65, 91–110 (2011).
7. Jürgens, K. & Massana, R. Protist grazing on marine bacterioplankton In Microbial Ecology of the Oceans (ed. Kirchman, D. L.) 383–424 (Wiley-Liss., 2008).
8. Massana, R., del Campo, J., Sieracki, M. E., Audic, S. & Logares, R. Exploring the uncultured microeukaryote majority in the oceans: reevaluation of ribogroups within stramenopiles. ISME J. 8, 854–866 (2014).
9. Takishita, K. et al. Genetic diversity of microbial eukaryotes in anoxic sediment of the saline meromictic lake Namako-ike (Japan): On the detection of anaerobic or anoxic-tolerant lineages of eukaryotes. Protist 158, 51–64 (2007).
10. Not, F., del Campo, J., Balagué, V., de Vargas, C. & Massana, R. New insights into the diversity of marine picoeukaryotes. PLoS One 4, e7143 (2009).
11. Lin, Y.-C. C. et al. Distribution patterns and phylogeny of marine stramenopiles in the North Pacific Ocean. Appl. Environ. Microbiol. 78, 3387–3399 (2012).
12. Logares, R. et al. Diversity patterns and activity of uncultured marine heterotrophic flagellates unveiled with pyrosequencing. ISME J. 6, 1823–1833 (2012).
13. Massana, R., Terrado, R., Forn, I., Lovejoy, C. & Pedrós-Alió, C. Distribution and abundance of uncultured heterotrophic flagellates in the world oceans. Environ. Microbiol. 8, 1515–1522 (2006).
14. Cavalier-Smith, T. & Scoble, J. M. Phylogeny of Heterokonta: Incisomonas marina, a uniciliate gliding opalozoan related to Solenicola (Nanomonadea), and evidence that Actinophryida evolved from raphidophytes. Eur. J. Protistol. 49, 328–353 (2013).
15. Pagani, I. et al. The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 40, D571–D579 (2012).
16. Derelle, E. et al. Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc. Natl. Acad. Sci. USA 103, 11647–11652 (2006).
17. Worden, A. Z. et al. Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas. Science 324, 268–272 (2009).
18. Moreau, H. et al. Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage. Genome Biol. 13, R74 (2012).
19. Tyler, B. M. et al. Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science 313, 1261–1266 (2006).
20. Del Campo, J. et al. The others: Our biased perspective of eukaryotic genomes. Trends Ecol. Evol. 29, 252–259 (2014).
21. Stepanauskas, R. Single cell genomics: an individual look at microbes. Curr. Opin. Microbiol. 15, 613–620 (2012).
22. Heywood, J. L., Sieracki, M. E., Bellows, W., Poulton, N. J. & Stepanauskas, R. Capturing diversity of marine heterotrophic protists: one cell at a time. ISME J. 5, 674–684 (2011).
23. Yoon, H. S. et al. Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science 332, 714–717 (2011).
24. Martínez-García, M. et al. Unveiling in situ interactions between marine protists and bacteria through single cell sequencing. ISME J. 6, 703–707 (2012).
25. Bhattacharya, D. et al. Single cell genome analysis supports a link between phagotrophy and primary plastid endosymbiosis. Sci. Rep. 2, 1–8 (2012).
26. Roy, R. S. et al. Single cell genome analysis of an uncultured heterotrophic stramenopile. Sci. Rep. 4, 4780 (2014).
27. Sidore, A. M., Lan, F., Lim, S. W. & Abate, A. R. Enhanced sequencing coverage with digital droplet multiple displacement amplification. Nucleic Acids Res. 44, e66 (2016).
28. Troell, K. et al. Cryptosporidium as a testbed for single cell genome characterization of unicellular eukaryotes. BMC Genomics 17, 1–12 (2016).
29. Abrahamsen, M. S. et al. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science 304, 441–445 (2004).
30. Karsenti, E. et al. A holistic approach to marine Eco-systems biology. PLoS Biol. 9, 7–11 (2011).
31. Michaelis, L., Menten, M. L., Johnson, K. A. & Goody, R. S. The original Michaelis constant: translation of the 1913 Michaelis-Menten paper. Biochemistry 50, 8264–8269 (2011).
32. Woyke, T. et al. Assembling the marine metagenome, one cell at a time. PLoS One 4, e5299 (2009).
33. Saw, J. H. et al. Exploring microbial dark matter to resolve the deep archaeal ancestry of eukaryotes. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 370, 20140328 (2015).
34. Cuvelier, M. L. et al. Targeted metagenomics and ecology of globally important uncultured eukaryotic phytoplankton. Proc. Natl. Acad. Sci. USA 107, 14679–14684 (2010).
35. Worden, A. Z. et al. Global distribution of a wild alga revealed by targeted metagenomics. Curr. Biol. 22, 675–677 (2012).
36. Vaulot, D. et al. Metagenomes of the picoalga Bathycoccus from the Chile coastal upwelling. PLoS One 7, e39648 (2012).
37. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
38. Swan, B. K. et al. Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean. Proc. Natl. Acad. Sci. USA 110, 11463–11468 (2013).
39. Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).
40. Logares, R. et al. Phenotypically different microalgal morphospecies with identical ribosomal DNA: A case of rapid adaptive evolution? Microb. Ecol. 53, 549–561 (2007).
41. Richter, M. & Rosselló-Móra, R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl. Acad. Sci. USA 106, 19126–19131 (2009).
42. Ultsch, A. & Mörchen, F. ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Tech. Rep. Dep. Math. Comput. Sci. Univ. Marburg, Ger. 46, 1–7 (2005).
43. Dick, G. J. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85 (2009).
44. Herlemann, D. P. R. et al. Metagenomic de novo assembly of an aquatic representative of the Verrucomicrobial class Spartobacteria. MBio 4, e00569–12 (2013).
45. Massana, R. et al. Grazing rates and functional diversity of uncultured heterotrophic flagellates. ISME J. 3, 588–596 (2009).
46. Armbrust, E. V. et al. The genome of the diatom Thalassiosira pseudonana: ecology, evolution, and metabolism. Science 306, 79–86 (2004).
47. King, N. et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451, 783–788 (2008).
48. Blanc, G. et al. The Chlorella variabilis NC64A genome reveals adaptation to photosymbiosis, coevolution with viruses, and cryptic sex. Plant Cell 22, 2943–2955 (2010).
49. Merchant, S. S. et al. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318, 245–250 (2007).
50. Pinard, R. et al. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 7, 216 (2006).
51. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
52. Rodríguez-Martínez, R., Rocap, G., Salazar, G. & Massana, R. Biogeography of the uncultured marine picoeukaryote MAST-4: temperature-driven distribution patterns. ISME J. 7, 1531–1543 (2013).
53. Rodríguez-Martínez, R., Rocap, G., Logares, R., Romac, S. & Massana, R. Low evolutionary diversification in a widespread and abundant uncultured protist (MAST-4). Mol. Biol. Evol. 29, 1393–1406 (2012).
54. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. & Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277–D280 (2004).
55. Marchler-Bauer, A. et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 43, D222–D226 (2014).
56. Dean, F. B., Nelson, J. R., Giesler, T. L. & Lasken, R. S. Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res. 11, 1095–1099 (2001).
57. Dean, F. B. et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl. Acad. Sci. USA 99, 5261–5266 (2002).
58. Stepanauskas, R. & Sieracki, M. E. Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proc. Natl. Acad. Sci. USA 104, 9052–9057 (2007).
59. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants: Registry of selected samples from the Tara Oceans Expedition (2009–2013), doi: 10.1594/PANGAEA.842197 (2014).
60. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
61. Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
62. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
63. Pruitt, K. D., Tatusova, T., Brown, G. R. & Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135 (2011).
64. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
65. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
66. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
67. Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294 (2016).
68. Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
69. Thompson, J. D., Gibson, T. J. & Higgins, D. G. Multiple sequence alignment using ClustalW and ClustalX In Current Protocols in Bioinformatics (eds Baxevanis, A. D., Petsko, G. A., Stein, L. D. & Stormo, G. D.) Chapter 2: Unit 2:3, (John Wiley & Sons, Inc., 2002).
70. Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–1649 (2012).
71. Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).
Acknowledgements
This work was supported by the US NSF grants DEB-1031049 and OCE-821374 (to M.E.S.), by the ANR French projects Oceanomics (ANR-11-BTBR-0008, to C.V.), France Génomique (ANR-10-INBS-09, to P.W.), and Prometheus (ANR-09-PCS-GENM_217, to O.J.), by the EU project SINGEK (H2020-MSCA-ITN-2015-675752, to R.M.), and by the Spanish project MEFISTO (CTM2013-43767-P, MINECO). J.-F.M. was supported by a Marie Curie Intra-European Fellowship (PIEF-GA-2012-331190, EU). R.L. was supported by Juan de la Cierva (JCI-2010-06594, MINECO) and Ramón y Cajal fellowships (RYC-2013-12554, MINECO). Computing resources were obtained through the MARBITS platform at the ICM-CSIC as well as through the Red Española de Supercomputación. We appreciate the efforts of the Single Cell Genomic Center of Bigelow (https://scgc.bigelow.org) in cell sorting and whole genome amplification.
Author Contributions
J.-F.M., R.L. and R.M. analysed the data and J.-F.M. and R.M. wrote the manuscript. M.E.S., O.J., P.W. and C.V.
are members of the Tara Oceans consortium that initiated this study and designed the sampling and sequencing
experiments. Y.S., S.M., O.J., P.W. performed the sequencing. P.S. and F.L. helped with the analysis. All the co-authors
have revised the manuscript.
Additional Information
Accession codes: Sequence data are available at ENA (http://www.ebi.ac.uk/services/tara-oceans-data) with accession codes ERR1138643-ERR1138646, ERR1189843-ERR1189844, ERR1189847, ERR1198925, ERR1198927-ERR1198928, ERR1198936, ERR1198938, ERR1198941, ERR1198946, ERR1198948-ERR1198950, ERR1198954 and ERR1744377-ERR1744380. Co-assemblies are available on request.
Supplementary information accompanies this paper at http://www.nature.com/srep
Competing nancial interests: e authors declare no competing nancial interests.
How to cite this article: Mangot, J.-F. et al. Accessing the genomic information of unculturable oceanic
picoeukaryotes by combining multiple single cells. Sci. Rep. 7, 41498; doi: 10.1038/srep41498 (2017).
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
© The Author(s) 2017

Supplementary resources

  • Article
    Full-text available
    Marine planktonic protists are critical components of ocean ecosystems and are highly diverse. Molecular sequencing methods are being used to describe this diversity and reveal new associations and metabolisms that are important to how these ecosystems function. We describe here the use of the single cell genomics approach to sample and interrogate the diversity of the smaller (pico- and nano-sized) protists from a range of oceanic samples. We created over 900 single amplified genomes (SAGs) from 8 Tara Ocean samples across the Indian Ocean and the Mediterranean Sea. We show that flow cytometric sorting of single cells effectively distinguishes plastidic and aplastidic cell types that agree with our understanding of protist phylogeny. Yields of genomic DNA with PCR-identifiable 18S rRNA gene sequence from single cells was low (15% of aplastidic cell sorts, and 7% of plastidic sorts) and tests with alternate primers and comparisons to metabarcoding did not reveal phylogenetic bias in the major protist groups. There was little evidence of significant bias against or in favor of any phylogenetic group expected or known to be present. The four open ocean stations in the Indian Ocean had similar communities, despite ranging from 14°N to 20°S latitude, and they differed from the Mediterranean station. Single cell genomics of protists suggests that the taxonomic diversity of the dominant taxa found in only several hundreds of microliters of surface seawater is similar to that found in molecular surveys where liters of sample are filtered.
  • Article
    Fungi are phylogenetically and functionally diverse ubiquitous components of almost all ecosystems on Earth, including aquatic environments stretching from high montane lakes down to the deep ocean. Aquatic ecosystems, however, remain frequently overlooked as fungal habitats, although fungi potentially hold important roles for organic matter cycling and food web dynamics. Recent methodological improvements have facilitated a greater appreciation of the importance of fungi in many aquatic systems, yet a conceptual framework is still missing. In this Review, we conceptualize the spatiotemporal dimensions, diversity, functions and organismic interactions of fungi in structuring aquatic food webs. We focus on currently unexplored fungal diversity, highlighting poorly understood ecosystems, including emerging artificial aquatic habitats.
  • Article
    Full-text available
    A metatranscriptome study targeting the protistan community was conducted off the coast of Southern California, at the San Pedro Ocean Time‐series station at the surface, 150 m (oxycline), and 890 m to link putative metabolic patterns to distinct protistan lineages. Comparison of relative transcript abundances revealed depth‐related shifts in the nutritional modes of key taxonomic groups. Eukaryotic gene expression in the sunlit surface environment was dominated by phototrophs, such as diatoms and chlorophytes, and high abundances of transcripts associated with synthesis pathways (e.g. photosynthesis, carbon fixation, fatty acid synthesis). Sub‐euphotic depths (150 m and 890 m) exhibited strong contributions from dinoflagellates and ciliates, and were characterized by transcripts relating to digestion or intracellular nutrient recycling (e.g. breakdown of fatty acids and V‐type ATPases). These transcriptional patterns underlie the distinct nutritional modes of ecologically important protistan lineages that drive marine food webs, and provide a framework to investigate trophic dynamics across diverse protistan communities. This article is protected by copyright. All rights reserved.
  • Article
    Full-text available
    Plastids are supported by a wide range of proteins encoded within the nucleus and imported from the cytoplasm. These plastid-targeted proteins may originate from the endosymbiont, the host, or other sources entirely. Here, we identify and characterise 770 plastid-targeted proteins that are conserved across the ochrophytes, a major group of algae including diatoms, pelagophytes and kelps, that possess plastids derived from red algae. We show that the ancestral ochrophyte plastid proteome was an evolutionary chimera, with 25% of its phylogenetically tractable nucleus-encoded proteins deriving from green algae. We additionally show that functional mixing of host and plastid proteomes, such as through dual-targeting, is an ancestral feature of plastid evolution. Finally, we detect a clear phylogenetic signal from one ochrophyte subgroup, the lineage containing pelagophytes and dictyochophytes, in plastid-targeted proteins from another major algal lineage, the haptophytes. This may represent a possible serial endosymbiosis event deep in eukaryotic evolutionary history.
  • Article
    Microbial eukaryotes are integral components of natural microbial communities and their inclusion is critical for many ecosystem studies yet the majority of published metagenome analyses ignore eukaryotes. In order to include eukaryotes in environmental studies we propose a method to recover eukaryotic genomes from complex metagenomic samples. A key step for genome recovery is separation of eukaryotic and prokaryotic fragments. We developed a k-mer-based strategy, EukRep, for eukaryotic sequence identification and applied it to environmental samples to show that it enables genome recovery, genome completeness evaluation and prediction of metabolic potential. We used this approach to test the effect of addition of organic carbon on a geyser-associated microbial community and detected a substantial change of the community metabolism, with selection against almost all candidate phyla bacteria and archaea and for eukaryotes. Near complete genomes were reconstructed for three fungi placed within the eurotiomycetes and an arthropod. While carbon fixation and sulfur oxidation were important functions in the geyser community prior to carbon addition, the organic carbon impacted community showed enrichment for secreted proteases, secreted lipases, cellulose targeting CAZymes, and methanol oxidation. We demonstrate the broader utility of EukRep by reconstructing and evaluating relatively high quality fungal, protist, and rotifer genomes from complex environmental samples. This approach opens the way for cultivation-independent analyses of whole microbial communities.
  • Article
    Full-text available
    Single-celled eukaryotes (protists) are critical players in global biogeochemical cycling of nutrients and energy in the oceans. While their roles as primary producers and grazers are well appreciated, other aspects of their life histories remain obscure due to challenges in culturing and sequencing their natural diversity. Here, we exploit single-cell genomics and metagenomics data from the circumglobal Tara Oceans expedition to analyze the genome content and apparent oceanic distribution of seven prevalent lineages of uncultured heterotrophic stramenopiles. Based on the available data, each sequenced genome or genotype appears to have a specific oceanic distribution, principally correlated with water temperature and depth. The genome content provides hypotheses for specialization in terms of cell motility, food spectra, and trophic stages, including the potential impact on their lifestyles of horizontal gene transfer from prokaryotes. Our results support the idea that prominent heterotrophic marine p
  • Article
    Full-text available
    Photosynthetic picoeukaryotes contribute a significant fraction of primary production in the upper ocean. Micromonas pusilla is an ecologically relevant photosynthetic picoeukaryote, abundantly and widely distributed in marine waters. Grazing by protists may control the abundance of picoeukaryotes such as M. pusilla, but the diversity of the responsible grazers is poorly understood. To identify protists consuming photosynthetic picoeukaryotes in a productive North Pacific Ocean region, we amended seawater with living ¹⁵N, ¹³C-labeled M. pusilla cells in a 24-hour replicated bottle experiment. DNA stable isotope probing, combined with high-throughput sequencing of V4 hypervariable regions from 18S rRNA gene amplicons (Tag-SIP), identified 19 operational taxonomic units (OTUs) of microbial eukaryotes that consumed M. pusilla. These OTUs were distantly related to cultured taxa within the dinoflagellates, ciliates, stramenopiles (MAST-1C and MAST-3 clades), and Telonema flagellates, thus far known only from their environmental 18S rRNA gene sequences. Our discovery of eukaryotic prey consumption by MAST cells confirms that their trophic role in marine microbial food webs includes grazing upon picoeukaryotes. Our study provides new experimental evidence directly linking the genetic identity of diverse uncultivated microbial eukaryotes to the consumption of picoeukaryotic phytoplankton in the upper ocean. This article is protected by copyright. All rights reserved.
  • Article
    Full-text available
    Single-cell genomics (SCG) appeared as a powerful technique to get genomic information from uncultured organisms. However, SCG techniques suffer from biases at the whole genome amplification step that can lead to extremely variable numbers of genome recovery (5–100%). Thus, it is unclear how useful can SCG be to address evolutionary questions on uncultured microbial eukaryotes. To provide some insights into this, we here analysed 3 single-cell amplified genomes (SAGs) of the choanoflagellate Monosiga brevicollis, whose genome is known. Our results show that each SAG has a different, independent bias, yielding different levels of genome recovery for each cell (6–36%). Genes often appear fragmented and are split into more genes during annotation. Thus, analyses of gene gain and losses, gene architectures, synteny and other genomic features can not be addressed with a single SAG. However, the recovery of phylogenetically-informative protein domains can be up to 55%. This means SAG data can be used to perform accurate phylogenomic analyses. Finally, we also confirm that the co-assembly of several SAGs improves the general genomic recovery. Overall, our data show that, besides important current limitations, SAGs can still provide interesting and novel insights from poorly-known, uncultured organisms.
  • Article
    The study of marine microbial ecology has been completely transformed by molecular and genomic data: after centuries of relative neglect, genomics has revealed the surprising extent of microbial diversity and how microbial processes transform ocean and global ecosystems. But the revolution is not complete: major gaps in our understanding remain, and one obvious example is that microbial eukaryotes, or protists, are still largely neglected. Here we examine various ways in which protists might be better integrated into models of marine microbial ecology, what challenges this will present, and why understanding the limitations of our tools is a significant concern. In part this is a technical challenge — eukaryotic genomes are more difficult to characterize — but eukaryotic adaptations are also more dependent on morphology and behaviour than they are on the metabolic diversity that typifies bacteria, and these cannot be inferred from genomic data as readily as metabolism can be. We therefore cannot simply follow in the methodological footsteps of bacterial ecology and hope for similar success. Understanding microbial eukaryotes will require different approaches, including greater emphasis on taxonomically and trophically diverse model systems. Molecular sequencing will continue to play a role, and advances in environmental sequence tag studies and single-cell methods for genomic and transcriptomics offer particular promise.
  • Article
    Background: Infectious disease involving multiple genetically distinct populations of pathogens is frequently concurrent, but difficult to detect or describe with current routine methodology. Cryptosporidium sp. is a widespread gastrointestinal protozoan of global significance in both animals and humans. It cannot be easily maintained in culture and infections of multiple strains have been reported. To explore the potential use of single cell genomics methodology for revealing genome-level variation in clinical samples from Cryptosporidium-infected hosts, we sorted individual oocysts for subsequent genome amplification and full-genome sequencing. Results: Cells were identified with fluorescent antibodies with an 80 % success rate for the entire single cell genomics workflow, demonstrating that the methodology can be applied directly to purified fecal samples. Ten amplified genomes from sorted single cells were selected for genome sequencing and compared both to the original population and a reference genome in order to evaluate the accuracy and performance of the method. Single cell genome coverage was on average 81 % even with a moderate sequencing effort and by combining the 10 single cell genomes, the full genome was accounted for. By a comparison to the original sample, biological variation could be distinguished and separated from noise introduced in the amplification. Conclusions: As a proof of principle, we have demonstrated the power of applying single cell genomics to dissect infectious disease caused by closely related parasite species or subtypes. The workflow can easily be expanded and adapted to target other protozoans, and potential applications include mapping genome-encoded traits, virulence, pathogenicity, host specificity and resistance at the level of cells as truly meaningful biological units.
  • Article
    Global estimates indicate the oceans are responsible for approximately half of the carbon dioxide fixed on Earth. Organisms ≤5 µm in size dominate open ocean phytoplankton communities in terms of abundance and CO2 fixation, with the cyanobacterial genera Prochlorococcus and Synechococcus numerically the most abundant and more extensively studied compared with small eukaryotes. However, the contribution of specific taxonomic groups to marine CO2 fixation is still poorly known. In this study, we show that among the phytoplankton, small eukaryotes contribute significantly to CO2 fixation (44%) because of their larger cell volume and thereby higher cell-specific CO2 fixation rates. Within the eukaryotes, two groups, herein called Euk-A and Euk-B, were distinguished based on their flow cytometric signature. Euk-A, the most abundant group, contained cells 1.8±0.1 µm in size while Euk-B was the least abundant but cells were larger (2.8±0.2 µm). The Euk-B group comprising prymnesiophytes (73±13%) belonging largely to lineages with no close cultured counterparts accounted for up to 38% of the total primary production in the subtropical and tropical northeast Atlantic Ocean, suggesting a key role of this group in oceanic CO2 fixation.
  • Article
    Sequencing small quantities of DNA is important for applications ranging from the assembly of uncultivable microbial genomes to the identification of cancer-associated mutations. To obtain sufficient quantities of DNA for sequencing, the small amount of starting material must be amplified significantly. However, existing methods often yield errors or non-uniform coverage, reducing sequencing data quality. Here, we describe digital droplet multiple displacement amplification, a method that enables massive amplification of low-input material while maintaining sequence accuracy and uniformity. The low-input material is compartmentalized as single molecules in millions of picoliter droplets. Because the molecules are isolated in compartments, they amplify to saturation without competing for resources; this yields uniform representation of all sequences in the final product and, in turn, enhances the quality of the sequence data. We demonstrate the ability to uniformly amplify the genomes of single Escherichia coli cells, comprising just 4.7 fg of starting DNA, and obtain sequencing coverage distributions that rival that of unamplified material. Digital droplet multiple displacement amplification provides a simple and effective method for amplifying minute amounts of DNA for accurate and uniform sequencing.
  • Article
    Motivation: Detection of random errors and systematic biases is a crucial step of a robust pipeline for processing high-throughput sequencing (HTS) data. Bioinformatics software tools capable of performing this task are available, either for general analysis of HTS data or targeted to a specific sequencing technology. However, most of the existing QC instruments only allow processing of one sample at a time. Results: Qualimap 2 represents a next step in the QC analysis of HTS data. Along with comprehensive single-sample analysis of alignment data, it includes new modes that allow simultaneous processing and comparison of multiple samples. As with the first version, the new features are available via both graphical and command line interface. Additionally, it includes a large number of improvements proposed by the user community. Availability: The implementation of the software along with documentation is freely available at http://www.qualimap.org.
  • Article
    The origin of eukaryotes represents an enigmatic puzzle, which is still lacking a number of essential pieces. Whereas it is currently accepted that the process of eukaryogenesis involved an interplay between a host cell and an alphaproteobacterial endosymbiont, we currently lack detailed information regarding the identity and nature of these players. A number of studies have provided increasing support for the emergence of the eukaryotic host cell from within the archaeal domain of life, displaying a specific affiliation with the archaeal TACK superphylum. Recent studies have shown that genomic exploration of yet-uncultivated archaea, the so-called archaeal 'dark matter', is able to provide unprecedented insights into the process of eukaryogenesis. Here, we provide an overview of state-of-the-art cultivation-independent approaches, and demonstrate how these methods were used to obtain draft genome sequences of several novel members of the TACK superphylum, including Lokiarchaeum, two representatives of the Miscellaneous Crenarchaeotal Group (Bathyarchaeota), and a Korarchaeum-related lineage. The maturation of cultivation-independent genomics approaches, as well as future developments in next-generation sequencing technologies, will revolutionize our current view of microbial evolution and diversity, and provide profound new insights into the early evolution of life, including the enigmatic origin of the eukaryotic cell. © 2015 The Authors.