LETTER
The predictability of a lake phytoplankton community, over time-scales of hours to years
Mridul K. Thomas (1,2,*), Simone Fontana (1,3), Marta Reyes (1), Michael Kehoe (4) and Francesco Pomati (1,5)
Abstract
Forecasting changes to ecological communities is one of the central challenges in ecology. However, nonlinear dependencies, biotic interactions and data limitations have limited our ability to assess how predictable communities are. Here, we used a machine learning approach and environmental monitoring data (biological, physical and chemical) to assess the predictability of phytoplankton cell density in one lake across an unprecedented range of time-scales. Communities were highly predictable over hours to months: model R² decreased from 0.89 at 4 hours to 0.74 at 1 month, and in a long-term dataset lacking fine spatial resolution, from 0.46 at 1 month to 0.32 at 10 years. When cyanobacterial and eukaryotic algal cell densities were examined separately, model-inferred environmental growth dependencies matched laboratory studies, and suggested novel trade-offs governing their competition. High-frequency monitoring and machine learning can set prediction targets for process-based models and help elucidate the mechanisms underlying ecological dynamics.
Keywords
Cyanobacteria, environmental monitoring, forecasting, machine learning, phytoplankton, prediction, time series.
Ecology Letters (2018) 21: 619–628
INTRODUCTION
Forecasting how environmental change will alter communities
and ecosystems is perhaps the most important task facing ecol-
ogists today, and a tremendous challenge to our ecological
understanding (Mouquet et al. 2015; Petchey et al. 2015;
Houlahan et al. 2017). Nonlinear relationships (such as
between temperature and most biological processes), stochas-
ticity and sensitive dependence on initial conditions are sources
of uncertainty that community ecology shares with other pre-
dictive disciplines, such as climate science. However, ecology
additionally has to grapple with biotic interactions in complex
food webs, evolutionary change and a paucity of data with
which to assess predictive power and refine models (Magurran
et al. 2010). Therefore, with few exceptions (notably in disease
ecology, e.g. Axelsen et al. 2014), we do not know how pre-
dictable ecological communities are (i.e. how strong the associ-
ation between present and future system states is, a proxy for
forecast ability). Quantifying this would allow us to under-
stand the time-scale over which we can provide actionable
input for management and legislative decision-making, recently
termed the ecological forecast horizon (Petchey et al. 2015).
To make accurate long-term forecasts, ecology needs to
develop process-based forecasts akin to those prevalent in cli-
mate science. Correlational approaches based on present con-
ditions and abundances are likely to make inaccurate forecasts
over decadal time-scales because patterns of environmental
covariation will change in the future (Williams et al. 2007).
Process-based models avoid this problem but are a challenge
to design because of their complexity. This arises from a lack
of knowledge of the functional forms (or shapes) relating
population/community change to environmental factors, and a
lack of data with which to parameterise them (Kremer et al.
2016). The scale of this challenge is highlighted by recent work
showing that complex, highly nonlinear interactions between
abiotic factors are a regular feature of physiological and eco-
logical processes (Edwards et al. 2016; Zhu et al. 2016; Zim-
mer et al. 2016; Thomas et al. 2017). Designing process-based
models using a traditional approach may require extensive
experimental work examining high-dimensional interactions.
High-throughput screening technologies are helping to address
this problem. But in many cases, a traditional experimental
approach to understanding interactions (i.e. through multidi-
mensional factorial experiments) may not be realistic given
present funding and experimental constraints.
Machine learning (ML) algorithms offer us an alternative path towards the creation of these process-based models. When faced with complex environmental datasets, ML allows us to avoid important constraints inherent in most traditional statistical approaches (a priori specification of functional forms, interactions and error distributions). Despite relying on underlying correlations, ML algorithms can improve substantially on traditional correlative analyses (Kehoe et al. 2012, 2015; Rivero-Calle et al. 2015). They can be used on complex datasets to assess associations (Rivero-Calle et al. 2015) and to quantify predictability in the absence of the knowledge needed for a process-based model (Ewers et al. 2017). Even more importantly, they can be used to infer the functional forms and interactions needed to develop process-based models. This approach will require large datasets, but as ecology enters the big data era, acquiring them is becoming feasible for many systems. As the cost of data acquisition continues to decrease, ML may prove a more efficient approach (relative to high-dimensional factorial experiments) to assessing community predictability and understanding the drivers of complex ecological dynamics.
1 Department of Aquatic Ecology, Eawag: Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, Switzerland
2 Centre for Ocean Life, DTU Aqua, Technical University of Denmark, Lyngby, Denmark
3 Biodiversity and Conservation Biology, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
4 Global Institute for Water Security and School of Environment and Sustainability, University of Saskatchewan, Saskatchewan, Saskatoon, Canada
5 Institute of Integrative Biology, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland
*Correspondence: E-mail: mrit@dtu.dk
©2018 John Wiley & Sons Ltd/CNRS
Ecology Letters, (2018) 21: 619–628 doi: 10.1111/ele.12927
Natural communities of microbes such as phytoplankton
are vital parts of most biogeochemical cycles and food webs
(Falkowski et al. 1998; Field et al. 1998), and so assessing their
predictability is especially important. Phytoplankton have gen-
eration times on the order of a day, and respond extremely
rapidly to environmental change: shifts on time-scales of min-
utes to hours are sufficient to elicit physiological and ecological
changes (Goldman & Glibert 1982; Demers et al. 1991; Hemme
et al. 2014). Despite this sensitivity to environmental condi-
tions, we do not know the time-scales over which phytoplank-
ton community dynamics may be predicted.
Historically, most plankton-monitoring campaigns have mea-
sured the community at coarse time-scales of once to twice a
month (Jochimsen et al. 2013), or on the order of once per 10
generations. These efforts have helped us understand broad
changes driven by eutrophication and environmental warming
(Pomati et al. 2012; Jochimsen et al. 2013), helping to make the
case for policies limiting further changes. However, with rare
exceptions (notably Hunter-Cevera et al. 2014, 2016), plank-
ton-monitoring efforts have not captured the data needed to
accurately assess community predictability across time-scales.
High-frequency monitoring campaigns that sample communi-
ties and environmental drivers on sub-daily time-scales can
partly address this (Pomati et al. 2011, 2013; Hunter-Cevera
et al. 2014, 2016), filling in pieces of the picture that coarser
long-term datasets have hinted at. They also provide us with
the quantity of data needed to profitably employ ML tools.
We quantified the predictability of phytoplankton cell density over time-scales ranging from 4 hours to 10 years, or approximately 10⁻¹ to 10³ generations. Cell density, or the
number of phytoplankton cells per unit volume, is the most
important parameter characterising a phytoplankton commu-
nity. It is a strong proxy for phytoplankton biomass (includ-
ing in our system, Fig. S1) and for primary productivity, an
important ecosystem property. We characterised predictability
of cell density in Greifensee, a meso-eutrophic peri-alpine lake
in Switzerland. The Greifensee plankton community, chem-
istry and physics have been monitored for >30 years. In this
period, there have been dramatic changes in biological and
chemical parameters as a result of eutrophication and re-oligotrophication (Bürgi et al. 2003). We make use of two complementary datasets examining the Greifensee phytoplankton
community:
(1) High-frequency time series from monitoring campaigns
carried out in the summer and autumn of 2014 and 2015. Cell
density was measured every 4 hours at six different depths
(Fig. 1) using scanning flow cytometry (SFCM), and environ-
mental data were also collected (Table S1).
(2) A long-term time series of monthly, depth-integrated phytoplankton measurements (Fig. 1), with associated environmental data (Table S2).
Although the datasets differ in methodology and size, they
measure substantially similar biological, physical and chemical
factors (Tables S1 and S2). Given the length of the two data-
sets and their sampling frequency and location, we are able to
directly compare predictability in the two datasets at a time
lag of 1 month.
In addition to community density, ecology aims to predict
the dynamics of functional groups and taxa. Especially
because toxic cyanobacterial blooms are a major health con-
cern (Chorus & Bartram 1999; Paerl & Huisman 2009; Paerl
et al. 2011), we also assessed the predictability of cyanobacte-
rial cell density over these time-scales, and the drivers of com-
petitive dynamics between cyanobacteria and eukaryotic
phytoplankton (Fig. 2). Eukaryotic densities have remained
relatively stable in Greifensee since the 1980s, while average
cyanobacterial densities first increased 100-fold during
eutrophication and then decreased by a similar amount over
this time period as a result of re-oligotrophication (Fig. 2).
Understanding the drivers of growth and competition between
these two broad phytoplankton groups can help us refine pro-
cess-based models of water quality, with important implica-
tions for the management of aquatic ecosystem services.
METHODS
Overview
Greifensee is an 8.45 km² peri-alpine lake in Switzerland (47.35 °N, 8.68 °E) with a documented history of eutrophication and re-oligotrophication (Bürgi et al. 2003). The lake is
currently meso-eutrophic, 32 m deep at its deepest point and
just over 20 m deep at the sampling locations used for both
datasets.
We use two datasets in this study: a high-frequency dataset
consisting of measurements every 4 h during the summer and
autumn of 2014 and 2015 using the automated monitoring
station Aquaprobe (Pomati et al. 2011), and a long-term data-
set consisting of monthly measurements from March 1984 to
June 2016. In both cases, important environmental data (both abiotic and biotic) were collected simultaneously near the middle of the lake, allowing similar analyses to be conducted and
thereby enabling comparisons. However, the datasets differ in
important ways:
(1) The high-frequency dataset involved measurements by
SFCM, while the long-term dataset involved microscopy mea-
surements. Therefore, sampling effort and density assessment
methods differ.
(2) The high-frequency dataset consists of measurements at
six specific depths (1.0, 2.5, 4.0, 5.5, 7.0 and 8.5 m). Abiotic
environmental data were also estimated at the same depth as
the collected sample. In contrast, the long-term dataset con-
sists of integrated phytoplankton measurements across the top
20 m of the lake. Abiotic measurements were not integrated,
but collected at specific depths (except for light, which is a
surface estimate), and so we calculated the maximum and
minimum value of each abiotic factor in the top 20 m for use
as predictors.
(3) The high-frequency dataset consists of 7161 measurements,
while the long-term dataset contains 383 measurements.
Generation of high-frequency dataset
Scanning flow cytometry description
We used a scanning flow cytometer, the CytoSense (http://www.cytobuoy.com), to quantify the density of the total phytoplankton community as well as its cyanobacterial and eukaryotic algal fractions (estimated densities are strongly correlated with estimates from microscopy, Fig. S2). The CytoSense characterises the scattering and pigment fluorescence of individual phytoplankton cells. It measures cells and colonies across a large proportion of the phytoplankton length range, between approximately 2 µm and 1 mm in length. Particles
that enter the system cross two coherent 15 mW solid-state
lasers. The instrument’s laser and sensor wavelengths are
designed to target the fluorescence signals primarily from
chlorophyll-a and phycocyanin, but also capture signals from
phycoerythrin and carotenoids. We used two different instru-
ments in 2014 and 2015, with small differences in configura-
tion. Instrument settings and data processing steps may be
found in the supplementary information.
SFCM field sampling procedure
All samples were collected from a floating platform (Aquaprobe, Pomati et al. 2011) near the middle of the lake
(47.3663 °N, 8.665 °E). Every four hours, water was sampled
automatically at each of the six depths (as described in
Pomati et al. 2011). Water samples were pumped into a
150 mL sampling chamber at the surface through a tube with
a 0.6-cm diameter opening. The sampling chamber was
flushed with water from the sampling depth three to five times
over 2 min before the CytoSense collected a subsample of up
to 500 µL for measurement.
Environmental factors
The full list of environmental parameters is found in
Table S1. We measured temperature, conductivity and irradi-
ance at all depths. We also collected weekly depth-specific
samples for dissolved nutrient (nitrate, phosphate, ammo-
nium) concentration estimation, and integrated measurements
of size-fractionated zooplankton (>150 µm). We monitored
meteorological factors including wind speed and rainfall, and
made use of additional data on water inflow (including flow
rate, temperature and nutrient concentrations) into the lake
provided by the Office of Waste, Water, Energy and Air
(AWEL) of Canton Zürich, Switzerland. Details of sampling
procedures, instruments used and measurement methodology
may be found in the supplementary information.
[Figure 1 image: three panels (2014, 2015, 1984–2016); x-axis: Time; y-axis: Log(cell density), per mL; legend: Depth (1, 2.5, 4, 5.5, 7 and 8.5).]
Figure 1 Dynamics of cell density of the total phytoplankton community, in both the high-frequency and long-term datasets from Greifensee. High-frequency measurements were made every 4 h in summer–fall 2014 and 2015, at six depths. Long-term measurements were made monthly from 1984 to 2016 and were integrated over the top 20 m. Note that X-axes are on different scales in each panel. Y-axes are identical for the top two panels but differ for the third.
Generation of long-term dataset
Field sampling procedure
Approximately every month, water samples were collected for
physical, chemical and biological measurements near the centre
of the lake (47.3525°N, 8.6748°E), about 1.5 km from the
floating platform used for the high-frequency measurements.
For microscopic counts of the phytoplankton community, an
integrated water sample was collected over the upper 20 m of
the water column with a Schröder sampler (Bürgi et al. 2003).
Environmental factors
The full list of parameters is found in Table S2. To measure
chemical and physical parameters, water samples were collected
every 2.5 m over the whole water column. These were taken at
the same location and on the same dates, and were analysed
using standard limnological methods (Rice et al. 2012). Inte-
grated zooplankton samples were collected over the upper 20 m
of the water column. More details about sampling procedures,
instruments used and measurement methodology may be found
in Bürgi et al. (2003). We also made use of a publicly available surface irradiance dataset (Schulz et al. 2008; Müller et al. 2015) to estimate the monthly averaged irradiance at the water
surface based on estimates at a location approximately 2 km
from our sampling location (47.35°N 8.65°E).
Data processing
For every time point, we calculated the maximum and mini-
mum value of every depth-specific parameter (such as
phosphate concentration) across the entire water column, and
used these for subsequent analyses. Additionally, samples
were not collected on the same day every month and, in rare
cases, more than one sample was collected in a month. We
therefore aggregated measurements by rounding to the nearest
month and then averaged duplicate values.
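The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the input layout (a depth-to-value mapping and a list of date-value pairs) and the exact rounding rule (nearest month start, ties to the current month) are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def nearest_month(d):
    """Return (year, month) of the month start closest to date d.

    Assumed rounding rule for illustration; the source does not define it.
    """
    start = date(d.year, d.month, 1)
    if d.month == 12:
        following = date(d.year + 1, 1, 1)
    else:
        following = date(d.year, d.month + 1, 1)
    # Ties are assigned to the current month.
    if (d - start) <= (following - d):
        return (start.year, start.month)
    return (following.year, following.month)


def column_extremes(profile):
    """Maximum and minimum of one depth-specific parameter.

    profile maps depth (m) to a measured value; missing values are None.
    """
    values = [v for v in profile.values() if v is not None]
    return max(values), min(values)


def aggregate_monthly(samples):
    """Average all samples that round to the same month.

    samples is an iterable of (date, value) pairs.
    """
    buckets = defaultdict(list)
    for d, value in samples:
        buckets[nearest_month(d)].append(value)
    return {month: mean(vals) for month, vals in sorted(buckets.items())}
```

Under this rule, a sample taken on 28 January and one taken on 2 February would both be assigned to February and averaged into a single monthly value.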
We found that the smallest phytoplankton species in the
lake (Synechococcus sp.) could not be detected by the flow
cytometer, so we excluded this species from the microscopy
dataset to enable a better comparison of the two datasets.
Machine learning analyses
Random forests overview
Random forests (RFs) are a robust ML tool comprising ensem-
bles of regression trees (or classification trees) (Breiman 1999).
In each regression ‘tree’ within the random ‘forest’, a randomly
selected subset of the data is recursively partitioned based on
the most strongly associated predictor. At each node, a random
subset of the total number of predictors is considered for parti-
tioning. The final tree prediction for new data is given by the
average value of the data within each branch of the tree. By
aggregating predictions across trees, RFs are able to reproduce
arbitrarily complex shapes without a priori functional form
specification.
We took advantage of three features that make RFs a flexi-
ble and useful tool for examining ecological systems: (1) per-
mutation importance, (2) easy quantification of partial effects
of individual predictors, and (3) out-of-bag prediction.
[Figure 2 image: panels for Cyanobacteria and Eukaryotes in three columns (2014, 2015, 1984–2016); x-axis: Time; y-axis: Log(cell density), per mL; legend: Depth (1, 2.5, 4, 5.5, 7 and 8.5).]
Figure 2 Dynamics of cell density of the cyanobacteria and eukaryotic phytoplankton, in both the high-frequency and long-term datasets from Greifensee. High-frequency measurements were made every 4 h in summer–fall 2014 and 2015, at six depths. Long-term measurements were made monthly from 1984 to 2016 and were integrated over the top 20 m. Note that X-axes are on different scales in each column.
(1) The importance of each predictor in a RF is assessed by
permuting the predictor across all trees in the forest and
quantifying the resulting change in the forest’s error rate.
More important predictors lead to a greater increase in error
when permuted.
(2) The partial effect of any single predictor on the dependent
variable can also be quantified, allowing us to examine the
functional form of the relationship (which may be arbitrarily
nonlinear, though non-bifurcating).
(3) Out-of-bag (OOB) prediction allows us to make accurate estimates of error rate and goodness of fit (model R²) via a process akin to cross-validation (Breiman 1999). Each data point is present in the training data of only a subset of all trees that comprise the forest. Therefore, the value of every point may be predicted using the trees that have not been trained with it. The ‘OOB prediction error’, or mean difference between the OOB predictions and the true value of all points in the dataset (see Fig. S3, S4 for examples using our data) can be used to quantify the RF’s predictive ability through a pseudo-R²:
$$R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{var}(y)}$$
where MSE is the mean squared error of the OOB predictions when compared with the true values, and var(y) is the variance in the dependent variable. As in the case of a standard R², a pseudo-R² has an upper bound of 1, indicating perfect model performance. However, note that unlike a standard R², there is no lower bound. It is possible for the pseudo-R² to be negative, if MSE > var(y). This may be interpreted as saying that the model prediction is worse than the mean value of the dependent variable in the entire dataset. In our analyses, we saw low, negative values of pseudo-R² in a few cases; we rounded these values to zero to avoid confusion, while noting this in the figure captions.
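The pseudo-R² defined above is simple to compute once OOB predictions are available. The following is a minimal Python sketch rather than the authors' implementation: the function names are hypothetical, and the use of the population variance (dividing by n) is an assumption, since the text does not state which variance estimator was used.

```python
from statistics import fmean, pvariance


def pseudo_r2(observed, oob_predicted):
    """Pseudo-R2 = 1 - MSE / var(y), computed from out-of-bag predictions.

    Assumption: var(y) is the population variance of the observed values.
    """
    mse = fmean((o - p) ** 2 for o, p in zip(observed, oob_predicted))
    return 1.0 - mse / pvariance(observed)


def reported_r2(observed, oob_predicted):
    """Round negative pseudo-R2 values up to zero, as described in the text."""
    return max(0.0, pseudo_r2(observed, oob_predicted))
```

A model that always predicts the mean of the observations has MSE equal to var(y) and therefore a pseudo-R² of zero; any model that does worse than this yields a negative value, which `reported_r2` clips to zero.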
Data pre-processing
In our analyses, we omitted: (1) all entries where the depen-
dent variable was missing, and (2) the predictors sampling
depth and sampling time. We omitted the latter in order to
accurately estimate the effects of predictors that covary with
depth and time. In other words, we believe that gradients
in light, temperature and nutrients should characterise most
of the relevant information contained within depth and
time.
Quantifying predictability
We quantified the predictability of log cell density of the total phytoplankton community, and the cyanobacterial and eukaryotic fractions, using the pseudo-R² calculated based on the OOB predictions of the fitted model (see details above).
We estimated predictability at time lags ranging from 4 h to
1 month in the high-frequency dataset, and from 1 month to
10 years in the long-term dataset. For every time lag, we fit
two models. They predicted log cell density using: (1) only log
cell density at the specified time lag, and (2) both log cell den-
sity and environmental parameters at the specified time lag.
For example, our simplest model considering a time lag of
four hours predicted log cell density at all time points using
only log cell density from the measurements four hours prior
to them.
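The construction of the simplest, density-only training set can be illustrated as follows. This is a hedged Python sketch, assuming a regularly sampled series with one value every 4 h and `None` for missing measurements; the function names are hypothetical. It also simplifies the treatment of gaps: here a pair is dropped if either value is missing, whereas in the analyses described in this paper only entries with a missing dependent variable were omitted and missing predictors were imputed.

```python
def steps_for_lag(lag_hours, interval_hours=4):
    """Convert a time lag in hours into a number of sampling steps."""
    if lag_hours % interval_hours != 0:
        raise ValueError("lag must be a multiple of the sampling interval")
    return lag_hours // interval_hours


def lagged_pairs(log_density, lag_steps):
    """Pair each value with the value observed lag_steps earlier.

    Returns a list of (predictor, target) tuples, where the predictor is
    the log cell density at time t - lag and the target is the value at t.
    Pairs with a missing predictor or target are skipped (a simplification).
    """
    pairs = []
    for t in range(lag_steps, len(log_density)):
        past, now = log_density[t - lag_steps], log_density[t]
        if past is not None and now is not None:
            pairs.append((past, now))
    return pairs
```

For a 4-hour lag the call would be `lagged_pairs(series, steps_for_lag(4))`; the second model type would extend each predictor with the environmental measurements taken at the same lagged time point.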
Predictor importance
We assessed the importance of predictors at all time lags using
the change in model error rate when the predictor values were
permuted.
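The idea behind permutation importance can be shown with a model-agnostic sketch. The Python below is an illustration only: it works with any prediction function and a single evaluation set, whereas a random forest computes the error change on out-of-bag data. The function names, the number of repeats and the choice of mean squared error as the error measure are assumptions for the example.

```python
import random
from statistics import fmean


def mean_squared_error(model, rows, targets):
    """Mean squared error of model(row) against the targets."""
    return fmean((model(r) - t) ** 2 for r, t in zip(rows, targets))


def permutation_importance(model, rows, targets, column, n_repeats=20, seed=0):
    """Average increase in error after shuffling one predictor column.

    rows is a list of tuples (one tuple of predictor values per data point).
    Larger return values indicate a more important predictor.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mean_squared_error(model, permuted, targets) - baseline)
    return fmean(increases)
```

A predictor that the model ignores produces no change in error when shuffled, while shuffling a predictor the model relies on breaks its association with the response and raises the error.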
Partial effects of environment on growth
We characterised the model-inferred effects of environmental
factors on cyanobacteria and eukaryotes. Instead of examin-
ing the effects of these predictors on cell density, we quanti-
fied how they influence the population growth rate (i.e.
specific growth rate, the rate of change in density between
time points). We did this to facilitate comparison between the
partial effects in our field dataset and extensive prior labora-
tory findings for the same predictors. However, the functional
forms remained highly similar to the model explaining cell
density.
We focussed on two factors that are known to strongly
influence phytoplankton growth (Litchman & Klausmeier
2008) and were identified as important in our analyses: light
and temperature.
Because laboratory studies typically measure the effects of environmental factors on population growth rate per day, we multiplied the estimates of growth rate over 4 h by a factor of 6 to express them in the same units. We then fitted RFs to
these population growth rates using environmental parameters
at a 4-hour lag and estimated the partial effects of light and
temperature. Note that we omitted log density as a predictor,
but model structure was otherwise identical to those previ-
ously described.
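The unit conversion described above can be made concrete with a short sketch. This Python example assumes that the growth rate over one sampling interval is the difference in natural-log cell density between consecutive time points, which is the usual definition of specific growth rate but is not spelled out in the text; the function name and the handling of missing values are likewise assumptions.

```python
import math


def daily_growth_rates(densities, interval_hours=4.0):
    """Specific growth rate per day between consecutive measurements.

    densities: cell densities (cells per mL) at a fixed sampling interval,
    with None for missing measurements.
    Assumption: rate over one interval = ln(N_t) - ln(N_(t-1)); it is then
    scaled by 24 / interval_hours (a factor of 6 for 4-hourly data).
    """
    per_day = 24.0 / interval_hours
    rates = []
    for prev, curr in zip(densities, densities[1:]):
        if prev is None or curr is None:
            rates.append(None)  # rate undefined when either value is missing
        else:
            rates.append((math.log(curr) - math.log(prev)) * per_day)
    return rates
```

The resulting per-day rates would then serve as the response variable, with environmental parameters at a 4-hour lag as predictors.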
Although we were also interested in the effects of dissolved
nitrate, phosphate and N:P ratio, we had data of lower tem-
poral resolution for these predictors, which limited the power
of analyses of these factors.
Model fitting and settings
Analyses were done in the R statistical environment v3.3.3
(R Core Team 2017) using the package randomForestSRC
(Ishwaran & Kogalur 2007; Ishwaran 2017). We used 2000
trees for every forest, and set the number of predictors to
be considered at each node to one-third of the total num-
ber of predictors. Missing data among the predictors were
imputed for model fitting, but not for predictor impor-
tance assessment (Ishwaran & Kogalur 2007; Ishwaran
2017).
RESULTS
Phytoplankton cell density was highly predictable on time-scales of hours to months. In our high-frequency dataset, pseudo-R² of the RF models trained with cell density and environmental data decreased from 0.89 at a 4 hour lag to 0.74 at a lag of 1 month (Fig. 3). The model using only cell density as a predictor had a lower R² at all time lags. As time lag increased, including environmental data led to larger improvements in predictability: the difference in R² between the two models was 0.03 at a 4 h lag, and 10 times higher (0.30) at a 1 month lag (Fig. 3). In the long-term dataset, R² of the model trained with cell density and environmental data decreased from 0.46 at a 1 month time lag to 0.35 at 6 months, after which it remained relatively stable (Fig. 3). R² in the density-only model was lower at all time lags.
Apart from cell density, which was the strongest predictor
at all time lags in our high-frequency dataset, the most impor-
tant predictors were light, temperature and thermocline depth,
the last of which is itself an indirect effect of temperature
(Fig. 4). Light and temperature were also most important in
our long-term dataset on time-scales of months (Fig. 4). At
time-scales of years, dissolved phosphorus and zooplankton
density become more important.
Cyanobacteria were more predictable than eukaryotes at all time-scales, in both high-frequency and long-term datasets (Fig. 3). Model R² for cyanobacteria was consistently higher than that for eukaryotes by approximately 5–20 percentage points (Fig. 3), in both types of models (cell density only and cell density with environmental data).
To motivate the development of process-based models of
phytoplankton competition, we also examined the partial
effects of environmental factors on the growth rate of
cyanobacteria and eukaryotic algae (Fig. 5). Temperature and
light, the strongest predictors in our dataset, showed biologi-
cally realistic nonlinear patterns. Additionally, in both cases,
each group dominated a region of parameter space, suggesting
the presence of trade-offs in performance.
DISCUSSION
Assessing the predictability of natural communities is crucial if we are to develop forecasts of how ecosystems will be altered by environmental change (Mouquet et al. 2015; Petchey et al. 2015; Houlahan et al. 2017). However, our ability to predict community dynamics has been limited by our understanding of environmental dependencies and biotic interactions (McGill et al. 2006). Our results suggest that lake phytoplankton communities are highly predictable over time-scales of hours to months (approximately 10⁻¹ to 10² generations), and possibly longer (Figs 3, S5). Our approach quantifies the decline in predictability with increasing time lag, identifies the predictors that contribute to predictive power and points towards realistic trade-offs and parameterisations through the examination of partial effects. Together, these can inform the development of process-based models, set targets for their forecasts and help identify a forecast horizon for adaptive management strategies. This is especially true in the case of cyanobacteria, many of which are a threat to human health and aquatic ecosystem services because of toxin production. This group is also believed to be hard to forecast (Chorus & Bartram 1999; Paerl & Huisman 2009; Paerl et al. 2011). In contrast, we find cyanobacterial densities to be consistently more predictable than those of eukaryotes (Fig. 3).
[Figure 3 image: two panels; x-axis: Time (lags of 4 to 20 hours, 1 to 3 days, 1 to 4 weeks, 1 to 6 months and 1 to 10 years); y-axis: R² (0.0 to 1.0); legends: dataset (high-frequency, long-term), variables (cell density + environmental variables, cell density only), taxon (cyanobacteria, eukaryotes).]
Figure 3 Decline in predictability of the phytoplankton community (whole community in black, cyanobacteria in blue, eukaryotes in green) with time, characterised by the random forest pseudo-R². The predictive contribution of environmental information increased with increasing time lag (i.e. distance between solid and dashed lines increases). Cyanobacteria were consistently more predictable than eukaryotes. Despite overlap between high-frequency and long-term datasets at a time lag of 1 month, there is a sharp decline in predictability. This is likely driven by the lack of depth resolution in plankton and environmental data in the long-term dataset. The spike in R² of the ‘cell density only’ models at 1 year reflects strong annual cycles in density. Note that pseudo-R² values can go negative (see Methods), and we rounded a few slightly negative values up to zero. We present the same results in terms of change in Mean Absolute Error with increasing time lag in Fig. S5.
As our understanding of ecological processes improves, the
limits to predictability of ecological systems will be deter-
mined by more fundamental constraints such as stochasticity
and sensitive dependence on initial conditions. Despite these,
we find strong, ecologically important environmental forcing
in a natural system across a range of time-scales (Figs 3–5).
Consequently, we believe that process-based models are very
likely to provide us with useful predictions over the medium-
to-long term. In other words, we believe that despite the
complexity of phytoplankton communities, the ecological
forecast horizon (Petchey et al. 2015) is sufficiently distant for
ecologists to provide useful input into adaptive management
strategies. Note that we do not quantify a specific horizon
here because this requires the specification of an (arbitrary)
forecast threshold; readers may choose these thresholds for
themselves and identify the resulting forecast horizon using
Fig. 3 and Fig. S5.
Light and temperature were strongly predictive of phyto-
plankton dynamics across time-scales (Fig. 4), consistent with
existing ecological understanding (Litchman & Klausmeier
2008). We also found that zooplankton density and dissolved
phosphorus concentrations become highly predictive on time-
scales longer than a year, consistent with an ongoing, multi-
decadal decrease in phosphorus and biomass in Greifensee
(B
urgi et al. 2003). This identification of variables that are
known to play a major role in phytoplankton ecology
strengthens our confidence in the relationships underlying our
metric of predictability. However, we note that predictor
importance while a useful tool is sensitive to missing data
patterns. Our estimates therefore understate the importance of
two major groups of predictors in our high-frequency data:
nutrients and zooplankton density. Unlike most other predic-
tors that were measured every four hours, these were mea-
sured weekly in 2014 and twice a week in 2015 (Table S1). To
partially correct for this difference, we also assessed the rela-
tive importance of all predictors when these were interpolated
[Figure 4 panels. X-axis: time lag. First panel (high-frequency dataset; lags of 4, 8, 12, 16 and 20 hours, 1, 2 and 3 days, and 1, 2 and 4 weeks): log(cell density), thermocline depth, irradiance, surface irradiance, temperature, pH, conductivity, oxygen, air pressure, inflow temperature, rain duration. Second panel (long-term dataset; lags of 1, 2, 3, 6 and 12 months, and 5 and 10 years): log(cell density), irradiance, temperature max, conductivity max, conductivity min, oxygen min, oxygen max, zooplankton density (small), zooplankton density (all sizes), zooplankton density (large), dissolved phosphate max.]
Figure 4 The most important predictors of phytoplankton cell density at different time lags, ordered by descending rank. Light and temperature (directly,
or indirectly through thermocline depth) were important predictors at most time-scales. In the long-term dataset, phosphorus and zooplankton density
become highly important predictors at time-scales of >1 year. Only the most important predictors are shown here, for legibility (the top five predictors
contribute >80% of the predictive power in most cases). In the high-frequency dataset, only predictors that are in the top 10 most important for at least
one time lag are shown, while in the long-term dataset, we show only predictors that appear in the top five most important at least once. See Tables S4
and S5 for the importance score of all predictors tested.
©2018 John Wiley & Sons Ltd/CNRS
using generalised additive models (GAMs) (Fig. S6). Models
with interpolated nutrients and zooplankton predictors had
marginally higher R² values and these predictors rose considerably
in importance, especially dissolved nitrogen (Fig. S6).
We believe that these results are noteworthy and highlight the
potential value of high-frequency monitoring of these vari-
ables. However, we choose not to focus on them here because
we are unable to validate the interpolated estimates.
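
To make the resampling step concrete, the sketch below places sparse weekly measurements on a 4-hourly grid. It uses plain linear interpolation as a simple stand-in for the GAM-based interpolation described above, so it is not the method used here; the function name and the phosphate values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def interpolate(sample_times: Sequence[float],
                sample_values: Sequence[float],
                target_times: Sequence[float]) -> List[float]:
    """Linearly interpolate sparse samples onto `target_times`.

    `sample_times` must be strictly increasing. Targets outside the
    sampled range take the value of the nearest sampled endpoint.
    """
    if len(sample_times) != len(sample_values) or len(sample_times) < 2:
        raise ValueError("need at least two matching samples")
    out = []
    for t in target_times:
        if t <= sample_times[0]:
            out.append(sample_values[0])
        elif t >= sample_times[-1]:
            out.append(sample_values[-1])
        else:
            # Index j satisfies sample_times[j-1] <= t < sample_times[j].
            j = bisect_right(sample_times, t)
            t0, t1 = sample_times[j - 1], sample_times[j]
            v0, v1 = sample_values[j - 1], sample_values[j]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


# Hypothetical weekly phosphate samples (hours since start, arbitrary units).
weekly_hours = [0.0, 168.0, 336.0]
weekly_phosphate = [10.0, 16.0, 13.0]

# Target grid: one value every four hours across the two weeks.
four_hourly = [4.0 * k for k in range(85)]
filled = interpolate(weekly_hours, weekly_phosphate, four_hourly)
print(len(filled), filled[21])
```

Whatever interpolation method is used, the filled-in values are estimates rather than observations, which is the reason for caution about validation stated above.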
Importantly, the predictive power of environmental factors in
our models arises from nonlinear dependencies that are consis-
tent with causal relationships established through laboratory
studies (Fig. 5; Litchman & Klausmeier 2008). Light, one of the
most important predictors, has a partial effect on growth that is
a saturating function for cyanobacteria and a right-skewed uni-
modal function for eukaryotes (Fig. 5); these are the only
shapes consistent with laboratory measurements of light-dependent
growth (Eilers & Peeters 1988; Edwards et al. 2015). The
partial effect of temperature is an increasing function and possibly
a left-skewed unimodal curve, consistent with prior ecophysiological
findings, including in phytoplankton (Kingsolver 2009; Thomas et al.
2012, 2016). This concordance between
controlled laboratory studies and ML-derived field patterns
increases our confidence in this ML approach, and suggests that
the relationships we have uncovered will be useful in guiding
process-based model creation. Furthermore, it suggests that
ML approaches may be used to discover novel ecological pat-
terns. This is particularly important in the case of interactions
between factors, which presently require labour-intensive and
expensive multifactorial experiments to understand. However,
we note that at present, these partial effects contain uncertainty
in the intercept because of our lower frequency measurements
of predator-driven mortality (Fig. 5). In other words, a con-
stant value may need to be added to each curve to achieve
quantitative accuracy; this would have the effect of moving
curves a small amount up or down the Y-axis. But importantly,
this would not influence estimates of important parameters
from these curves, such as the optimal temperature and optimal
irradiance for growth, and alpha (the slope of the growth-irra-
diance curve at low irradiance). These parameters are crucial to
understanding and modelling competitive dynamics.
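
The claim that an intercept shift leaves these parameters unchanged can be checked with a short sketch. It assumes a partial-effect curve sampled on a grid, takes the optimum as the grid point with the highest value and alpha as the slope between the first two points; the function name and the curve values are hypothetical, not read from Fig. 5.

```python
from typing import Sequence, Tuple


def curve_traits(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (x at the maximum of y, slope between the first two points)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two matching points")
    i_max = max(range(len(y)), key=lambda i: y[i])
    optimum = x[i_max]
    alpha = (y[1] - y[0]) / (x[1] - x[0])
    return optimum, alpha


# Hypothetical partial effect of irradiance on specific growth rate (per day).
irradiance = [0.0, 100.0, 200.0, 300.0, 400.0, 500.0]
growth = [-0.20, -0.10, -0.05, -0.04, -0.06, -0.10]

# Adding a constant moves the whole curve up the Y-axis.
offset = 0.15
shifted = [g + offset for g in growth]

traits_original = curve_traits(irradiance, growth)
traits_shifted = curve_traits(irradiance, shifted)
print(traits_original, traits_shifted)
```

Both calls return the same optimal irradiance and, up to floating-point rounding, the same initial slope, because a constant offset changes neither the location of the maximum nor any difference between points.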
The partial effects we show here (Fig. 5) point towards
trade-offs that could enable the co-existence of cyanobacteria
and eukaryotes. Cyanobacteria appear to benefit from high
light intensity and high temperature, while eukaryotes have a
growth advantage in the converse conditions. Therefore, tem-
poral heterogeneity in one or both of these dimensions could
allow for the maintenance of both these groups (Chesson
2000). Cyanobacteria do possess higher optimal temperatures
for growth than eukaryotic phytoplankton at temperate lati-
tudes (Thomas et al. 2016), consistent with the temperature-
dependence we see (Fig. 5). The apparent trade-off between
growth at high and low light intensities was not seen in a syn-
thesis of laboratory-measured light traits (Schwaderer et al.
2011), but at present, lab measurements are available only
from a few species and may be influenced by interactions with
other factors. Laboratory data on a broader range of species
[Figure 5 panels. Y-axis: specific growth rate (per day), shown for cyanobacteria and eukaryotes. X-axes (environmental factors): irradiance (µE m⁻² s⁻¹, 0 to 1000) and temperature (°C, 10 to 25).]
Figure 5 Partial effects of important environmental factors on the population growth rates of cyanobacteria and eukaryotes, based on an RF model with a
4-hour time lag. The patterns reveal environmental dependencies consistent with laboratory experiments and suggest trade-offs with important ecological
implications. Cyanobacteria appear to benefit from high light and high temperature. The predominantly negative partial effects reflect the fact that in 2015,
the community was decreasing through the majority of the monitoring campaign leading to imprecise estimates of the drivers of mortality (such as
predation). Therefore, differences in shape and magnitude are highly informative, but the absolute estimates are only indicative. Better estimates of the
drivers of mortality would likely change the intercept on these curves, moving the entire curves up or down the Y-axis. Additionally, interactions between
factors are captured by the complete forest prediction, but are not visible in single-dimension partial effects plots. Note that Y-axes are on different scales.
and under a greater range of conditions will help resolve this
discrepancy. If true, the ML-derived field pattern we identify
here also suggests an explanation for the formation of surface
scums by cyanobacteria through buoyancy regulation (Paerl
et al. 2011; Carey et al. 2012). Scum formation, which is important
because of its negative impact on lake ecosystem services, is
consistent with a cyanobacterial benefit from higher irradiance.
In contrast, eukaryotes appear to have a lower optimal
irradiance and might experience photo-degradation from sur-
face growth. These observations offer an example of the
insights that may be gained through a combination of high-
frequency monitoring and machine learning.
Our models may understate the long-term predictability of
the phytoplankton community. The difference in predictability
between high-frequency and long-term datasets at a time lag of
1 month suggests that if a similar methodology was followed in
the long-term dataset, reasonably high R² values may have
been obtained over time-scales of years. This difference is dri-
ven by several factors: (1) our high-frequency dataset includes
measurements of both phytoplankton and environmental fac-
tors at specific depths, as opposed to integrated values across
the water column as in the long-term dataset, (2) the high-fre-
quency dataset has more than an order of magnitude more
data points (7161 vs. 383) with which to train the ML algo-
rithm, and (3) the long-term dataset explores a greater range of
parameter space in temperature, nutrient concentration, zoo-
plankton density and unmeasured factors. Of the three, we
believe depth-specific sampling may be the most important
determinant, as the difference in model R² at a lag of one
month is <15% in the case of the cell density-only models, and
30% in the models with both cell density and environmental
factors (Fig. 3). However, we also note that in more complex
systems where migration is important, such as coastal and
open-ocean communities, predictability may be lower unless
physical circulation patterns are highly predictable as well.
It is important to note that although pseudo-R² provides
estimates of predictability that are robust (Breiman 1999), we
have not assessed a true forecast, in which error is allowed to
compound through time. Our approach can in principle be
used to make a forecast, but we chose not to do so because of
the long winter-to-spring break in our high-frequency dataset,
and the large changes in environmental conditions towards
the end of the two monitoring seasons. Attempting to forecast
would require us to predict in conditions well outside those
that the model was trained on, where it will inevitably per-
form poorly. Despite this limitation, we believe that out-of-
bag error is a useful proxy for forecast error: the realistic
environmental dependencies (Fig. 5) highlight that we are
uncovering the mechanisms underpinning ecological dynamics.
In the future, a broader sampling of parameter space and a
longer time series (through year-round monitoring) should
allow us to make forecasts and test true forecast skill.
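
For readers unfamiliar with the metric, the sketch below computes a pseudo-R² from out-of-bag predictions. It assumes the common random forest definition, one minus the out-of-bag mean squared error divided by the variance of the observations; the exact formula used by a given package may differ slightly, and the function name and data are hypothetical.

```python
from typing import Sequence


def pseudo_r2(observed: Sequence[float], oob_predicted: Sequence[float]) -> float:
    """Return 1 - MSE(out-of-bag predictions) / Var(observed)."""
    n = len(observed)
    if n != len(oob_predicted) or n < 2:
        raise ValueError("need at least two matching observations")
    mean_obs = sum(observed) / n
    mse = sum((o - p) ** 2 for o, p in zip(observed, oob_predicted)) / n
    var = sum((o - mean_obs) ** 2 for o in observed) / n
    if var == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - mse / var


# Hypothetical observations and their out-of-bag predictions, i.e. each
# prediction comes only from trees that did not see that observation.
observed = [1.0, 2.0, 3.0, 4.0]
oob_predicted = [1.5, 1.5, 3.5, 3.5]

score = pseudo_r2(observed, oob_predicted)
print(score)
```

Under this definition a model that always predicts the mean scores 0 and a perfect model scores 1. Because each out-of-bag prediction is made one step from observed data, the score does not capture the compounding error of a true multi-step forecast.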
We have shown that high-frequency environmental monitor-
ing and machine learning approaches can be employed to iden-
tify patterns in complex ecological communities, to assess their
predictability, and to uncover dependencies that can be incor-
porated into process-based models of communities and ecosys-
tems. This can help us address fundamental questions in
ecology: What are the drivers of ecological processes and how
do these change through time? How large an effect do
environmental and demographic stochasticity have on communities?
What are the dominant trade-offs that maintain diversity
in natural systems and how do they operate in dynamic envi-
ronments? But perhaps more importantly, it can help us to
improve our forecasts of ecological systems, fulfilling a funda-
mental obligation that ecology owes to society.
ACKNOWLEDGEMENTS
This work was funded by Swiss National Science Foundation
grants CRSII2_147654 and 31003A_144053. We thank: the
Office of Waste, Water, Energy and Air (AWEL) of Canton
Zürich for providing permission for in situ monitoring and
data on water inflow into Greifensee; the laboratory groups of
H. R. Bürgi and P. Spaak for data collection and access to the
long-term dataset; Esther Keller for assistance with micro-
scopy; Dany Steiner, Hannele Penson and Christian Ebi for
help with field work and equipment maintenance; Idronaut
and Cytobuoy for support during monitoring campaigns; and
Colin T. Kremer for helpful discussions and comments on the
manuscript.
DATA ACCESSIBILITY STATEMENT
Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.r4454
AUTHORSHIP
MKT and FP conceived the study. MKT, SF, MR and FP
collected the data. MKT analysed the data. MKT wrote the
manuscript with substantial input from FP, SF, MK and MR.
REFERENCES
Axelsen, J.B., Yaari, R., Grenfell, B.T. & Stone, L. (2014). Multiannual
forecasting of seasonal influenza dynamics reveals climatic and
evolutionary drivers. PNAS, 111, 9538–9542.
Breiman, L. (1999). Random forests. Mach. Learn., 45, 1–35.
Bürgi, H.R., Bührer, H. & Keller, B. (2003). Long-term changes in
functional properties and biodiversity of plankton in Lake Greifensee
(Switzerland) in response to phosphorus reduction. Aquat. Ecosyst.
Health Manage., 6, 147–158.
Carey, C.C., Ibelings, B.W., Hoffmann, E.P., Hamilton, D.P. & Brookes,
J.D. (2012). Eco-physiological adaptations that favour freshwater
cyanobacteria in a changing climate. Water Res., 46, 1394–1407.
Chesson, P. (2000). Mechanisms of maintenance of species diversity.
Annual Review of Ecology and Systematics, 31, 343–366.
Chorus, I. & Bartram, J. (1999). Toxic Cyanobacteria in Water: A Guide to
Their Public Health Consequences, Monitoring, and Management. E &
FN Spon, 416 pp.
Demers, S., Roy, S., Gagnon, R. & Vignault, C. (1991). Rapid light-
induced changes in cell fluorescence and in xanthophyll-cycle pigments
of Alexandrium excavatum (Dinophyceae) and Thalassiosira
pseudonana (Bacillariophyceae): a photo-protection mechanism. Mar.
Ecol. Prog. Ser., 76, 185–193.
Edwards, K.F., Thomas, M.K., Klausmeier, C.A. & Litchman, E. (2015).
Light and growth in marine phytoplankton: allometric, taxonomic, and
environmental variation. Limnol. Oceanogr., 60, 540–552.
Edwards, K.F., Thomas, M.K., Klausmeier, C.A. & Litchman, E. (2016).
Phytoplankton growth and the interaction of light and temperature: a
synthesis at the species and community level. Limnol. Oceanogr., 61,
1232–1244.
Eilers, P.H.C. & Peeters, J.C.H. (1988). A model for the relationship
between light intensity and the rate of photosynthesis in
phytoplankton. Ecol. Model., 42, 199–215.
Ewers, R.M., Andrade, A., Laurance, S.G., Camargo, J.L., Lovejoy, T.E.
& Laurance, W.F. (2017). Predicted trajectories of tree community
change in Amazonian rainforest fragments. Ecography, 40, 26–35.
Falkowski, P.G., Barber, R.T. & Smetacek, V. (1998). Biogeochemical
controls and feedbacks on ocean primary production. Science, 281,
200–206.
Field, C.B., Behrenfeld, M.J., Randerson, J.T. & Falkowski, P.G. (1998).
Primary production of the biosphere: integrating terrestrial and oceanic
components. Science, 281, 237–240.
Goldman, J.C. & Glibert, P.M. (1982). Comparative rapid ammonium
uptake by four species of marine phytoplankton. Limnol. Oceanogr.,
27, 814–827.
Hemme, D., Veyel, D., Mühlhaus, T. et al. (2014). Systems-wide analysis
of acclimation responses to long-term heat stress and recovery in the
photosynthetic model organism Chlamydomonas reinhardtii. Plant
Cell, 26, 4270–4297.
Houlahan, J.E., McKinney, S.T., Anderson, T.M. & McGill, B.J. (2017).
The priority of prediction in ecological understanding. Oikos, 126, 1–7.
Hunter-Cevera, K.R., Neubert, M.G., Solow, A.R., Olson, R.J., Shalapyonok,
A. & Sosik, H.M. (2014). Diel size distributions reveal seasonal growth
dynamics of a coastal phytoplankter. PNAS, 111, 9852–9857.
Hunter-Cevera, K.R., Neubert, M.G., Olson, R.J., Solow, A.R., Shalapyonok,
A. & Sosik, H.M. (2016). Physiological and ecological drivers of early spring
blooms of a coastal phytoplankter. Science, 354, 326–329.
Ishwaran, H. & Kogalur, U.B. (2017). Random Forests for Survival,
Regression and Classification (RF-SRC), R package version 2.4.2.
Ishwaran, H. & Kogalur, U.B. (2007). Random survival forests for R. R
News, 7(2), 25–31.
Jochimsen, M.C., Kümmerlin, R. & Straile, D. (2013). Compensatory
dynamics and the stability of phytoplankton biomass during four
decades of eutrophication and oligotrophication. Ecol. Lett., 16, 81–89.
Kehoe, M., O’Brien, K., Grinham, A., Rissik, D., Ahern, K.S. & Maxwell,
P. (2012). Random forest algorithm yields accurate quantitative
prediction models of benthic light at intertidal sites affected by toxic
Lyngbya majuscula blooms. Harmful Algae, 19, 46–52.
Kehoe, M.J., Chun, K.P. & Baulch, H.M. (2015). Who smells?
Forecasting taste and odor in a drinking water reservoir. Environ. Sci.
Technol., 49, 10984–10992.
Kremer, C.T., Williams, A.K., Finiguerra, M. et al. (2016). Realizing the
potential of trait-based aquatic ecology: new tools and collaborative
approaches. Limnol. Oceanogr., 62, 253–271.
Litchman, E. & Klausmeier, C.A. (2008). Trait-based community ecology
of phytoplankton. Annu. Rev. Ecol. Evol. Syst., 39, 615–639.
Magurran, A.E., Baillie, S.R., Buckland, S.T. et al. (2010). Long-term
datasets in biodiversity research and monitoring: assessing change in
ecological communities through time. Trends Ecol. Evol., 25, 574–582.
McGill, B.J., Enquist, B.J., Weiher, E. & Westoby, M. (2006). Rebuilding
community ecology from functional traits. Trends Ecol. Evol., 21, 178–185.
Mouquet, N., Lagadeuc, Y., Devictor, V. et al. (2015). Predictive ecology
in a changing world. J. Appl. Ecol., 52, 1293–1310.
Müller, R., Pfeifroth, U., Träger-Chatterjee, C., Cremer, R., Trentmann,
J. & Hollmann, R. (2015). Surface Solar Radiation Data Set -
Heliosat (SARAH) - Edition 1.
Paerl, H.W. & Huisman, J. (2009). Climate change: a catalyst for global
expansion of harmful cyanobacterial blooms. Environ. Microbiol. Rep.,
1, 27–37.
Paerl, H.W., Hall, N.S. & Calandrino, E.S. (2011). Controlling
harmful cyanobacterial blooms in a world experiencing
anthropogenic and climatic-induced change. Sci. Total Environ., 409,
1739–1745.
Petchey, O.L., Pontarp, M., Massie, T.M. et al. (2015). The ecological
forecast horizon, and examples of its uses and determinants. Ecol.
Lett., 18, 597–611.
Pomati, F., Jokela, J., Simona, M., Veronesi, M. & Ibelings, B.W. (2011).
An automated platform for phytoplankton ecology and aquatic
ecosystem monitoring. Environ. Sci. Technol., 45, 9658–9665.
Pomati, F., Matthews, B., Jokela, J., Schildknecht, A. & Ibelings, B.W.
(2012). Effects of re-oligotrophication and climate warming on
plankton richness and community stability in a deep mesotrophic lake.
Oikos, 121, 1317–1327.
Pomati, F., Kraft, N.J.B., Posch, T., Eugster, B., Jokela, J. & Ibelings,
B.W. (2013). Individual cell based traits obtained by scanning flow-
cytometry show selection by biotic and abiotic environmental factors
during a phytoplankton spring bloom. PLoS ONE, 8, e71677.
R Core Team (2017). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
https://www.R-project.org/.
Rice, E.W., Baird, R.B., Eaton, A.D. & Clesceri, L.S. (eds.) (2012)
Standard Methods for the Examination of Water and Wastewater,
22nd edn. American Water Works Association/American Public Works
Association/Water Environment Federation.
Rivero-Calle, S., Gnanadesikan, A., Del Castillo, C.E., Balch, W.M. &
Guikema, S.D. (2015). Multidecadal increase in North Atlantic
coccolithophores and the potential role of rising CO2. Science, 350,
1533–1537.
Schulz, J., Albert, P., Behr, H.-D. et al. (2008). Operational climate
monitoring from space: the EUMETSAT satellite application facility on
climate monitoring (CM-SAF). Atmos. Chem. Phys. Discuss., 8, 8517–8563.
Schwaderer, A.S., Yoshiyama, K., de Tezanos Pinto, P., Swenson, N.G.,
Klausmeier, C.A. & Litchman, E. (2011). Eco-evolutionary differences
in light utilization traits and distributions of freshwater phytoplankton.
Limnol. Oceanogr., 56, 589–598.
Thomas, M.K., Kremer, C.T., Klausmeier, C.A. & Litchman, E. (2012).
A global pattern of thermal adaptation in marine phytoplankton.
Science, 338, 1085–1088.
Thomas, M.K., Kremer, C.T. & Litchman, E. (2016). Environment and
evolutionary history determine the global biogeography of
phytoplankton temperature traits. Glob. Ecol. Biogeogr., 25, 75–86.
Thomas, M.K., Aranguren-Gassis, M., Kremer, C.T., Gould, M.R.,
Anderson, K., Klausmeier, C.A. et al. (2017). Temperature-nutrient
interactions exacerbate sensitivity to warming in phytoplankton. Glob.
Change Biol., 23, 3269–3280.
Williams, J.W., Jackson, S.T. & Kutzbach, J.E. (2007). Projected
distributions of novel and disappearing climates by 2100 AD. PNAS,
104, 5738–5742.
Zhu, K., Chiariello, N.R., Tobeck, T., Fukami, T. & Field, C.B. (2016).
Nonlinear, interacting responses to climate limit grassland production
under global change. PNAS, 113, 10589–10594.
Zimmer, A., Katzir, I., Dekel, E., Mayo, A.E. & Alon, U. (2016).
Prediction of multidimensional drug dose responses based on
measurements of drug pairs. PNAS, 113, 10442–10447.
SUPPORTING INFORMATION
Additional Supporting Information may be found online in
the supporting information tab for this article.
Editor, Tim Coulson
Manuscript received 24 November 2017
First decision made 23 December 2017
Manuscript accepted 5 January 2018
... In the end, the authors had to deal with around 30 environmental variables and a large dataset that would have been hardly tractable only a few years ago. Using machine learning, they were able to generate models capable of predicting with a high degree of confidence phytoplankton temporal community dynamics [23]. ...
... The phytoplankton example described above refers to a complex system strongly controlled by chemical and physical variables that can be precisely identified and measured. Thus, the study's main result, the high predictability of phytoplankton cell density over hours to months [23], although remarkable, does not come as a total surprise. In some way, we can draw analogies between the phytoplankton study and other ones dealing with a non-mainstream AI technique called "symbolic regression", which provides a method for the automatic discovery of equations explaining complex processes. ...
Chapter
The rapid technological development is generating unprecedented opportunities for scientific research. In particular, the development of artificial intelligence (AI) techniques, with computers becoming increasingly independent in performing tasks and—to a certain degree—taking decisions, is opening new frontiers. Machine learning techniques are now mainstream in many fields of research. For instance, they provide exceptional aid in automatic image classification of remote sensing imagery and look very promising as tools for identifying non-trivial relationships in large, multivariate datasets. However, as AI spreads across multiple scientific disciplines, awareness is increasing about the potential risk of relying on the “black box”, trusting predictions with a limited understanding of how these had been achieved. Studies focusing on the application of AI to specific contexts (image classification) have revealed that such risks are real: even when an AI is accurately trained, its thought processes might take unexpected directions. Furthermore, although AI has proven very efficient in specific settings, its applicability to complex ecological scenarios (e.g. as a tool to identify mechanisms and drivers of ecosystem stability and species diversity, or inform conservation decisions) seems still quite distant. Considering the issue from a much broader perspective and looking at the potential role of AI in the future of Earth’s natural systems, important philosophical considerations emerge. Regardless of the fate of humanity, averting the environmental crisis is necessary to reduce the so-called existential risk, which entails both the risk of human extinction and the risk that humanity will not reach technological maturity or will reach it in an ill way. Ensuring that future AIs will align with this goal should be therefore considered as a global priority.
... Our data set with almost weekly sampling at 2-4 depth intervals for 12 consecutive years is at the upper limit of what is operationally feasible using manually sampling and microscope counting which delivers a taxonomically detailed picture of the community. Higher spatio-temporal resolution requires automatic probing and image analysis systems which are increasingly becoming available (Thomas et al., 2018) but at least so far at the cost of taxonomical resolution as far as ciliates are concerned.. ...
Thesis
Plankton food webs are the basis of marine and limnetic ecosystems. Especially aquatic ecosystems of high biodiversity provide important ecosystem services for humankind as providers of food, coastal protection, climate regulation, and tourism. Understanding the dynamics of biomass and coexistence in these food webs is a first step to understanding the ecosystems. It also lays the foundation for the development of management strategies for the maintenance of the marine and freshwater biodiversity despite anthropogenic influences. Natural food webs are highly complex, and thus often equally complex methods are needed to analyse and understand them well. Models can help to do so as they depict simplified parts of reality. In the attempt to get a broader understanding of the complex food webs, diverse methods are used to investigate different questions. In my first project, we compared the energetics of a food chain in two versions of an allometric trophic network model. In particular, we solved the problem of unrealistically high trophic transfer efficiencies (up to 70%) by accounting for both basal respiration and activity respiration, which decreased the trophic transfer efficiency to realistic values of ≤30%. Next in my second project I turned to plankton food webs and especially phytoplankton traits. Investigating a long-term data set from Lake Constance we found evidence for a trade-off between defence and growth rate in this natural phytoplankton community. I continued working with this data set in my third project focusing on ciliates, the main grazer of phytoplankton in spring. Boosted regression trees revealed that temperature and predators have the highest influence on net growth rates of ciliates. 
We finally investigated in my fourth project a food web model inspired by ciliates to explore the coexistence of plastic competitors and to study the new concept of maladaptive switching, which revealed some drawbacks of plasticity: faster adaptation led to higher maladaptive switching towards undefended phenotypes which reduced autotroph biomass and coexistence and increased consumer biomass. It became obvious that even well-established models should be critically questioned as it is important not to forget reality on the way to a simplistic model. The results showed furthermore that long-term data sets are necessary as they can help to disentangle complex natural processes. Last, one should keep in mind that the interplay between models and experiments/ field data can deliver fruitful insights about our complex world.
... There have been very few studies reporting on the correlation between cyanobacterial and phytoplankton absolute abundances. The findings of Thomas et al. (2018) that eukaryotic phytoplankton densities had remained relatively stable in Lake Greifensee (Switzerland) for over 30 years and cyanobacterial densities had changed 100-fold is consistent with our results. Carey et al. (2014) also reported that Gloeotrichia increased the biomass of Bacillariophyta (diatoms) and Chlorophyta (green algae) and high densities of Gloeotrichia might stimulate the other phytoplankton by leaking nitrogen and phosphorus. ...
Article
Full-text available
Cyanobacterial blooms are global threats to freshwater ecosystem functioning, human health, and ecoservices. We assessed impacts of cyanobacterial bloom intensity on plankton ecosystem functioning using eukaryotic phytoplankton and zooplankton indicators and associated key physicochemical data collected from four seasons of two years at 24 evenly distributed sites in Lake Taihu that has year-around cyanobacterial blooms. Our analyses involved comparison of four site-groups with different bloom intensities and analyzing all sampling sites together using comparison, hierarchical partitioning analysis, generalized additive mixed model, and structural equation model. We found that cyanobacterial abundance positively associated with TP and temperature (negatively with TN:TP), while phytoplankton positively associated with TN. There was an inverse relation trend between relative abundances of phytoplankton and cyanobacteria, but there was no clear trend between absolute abundances of phytoplankton and cyanobacteria. Rotifers were most dominant when cyanobacteria were unabundant, while cladocerans presented higher abundance when cyanobacteria were in high abundance. Phytoplankton functional richness and species richness negatively and zooplankton functional richness and species richness positively associated with cyanobacterial bloom intensity. Cyanobacterial bloom intensity negatively associated with resource use efficiencies (RUEs) of phytoplankton and rotifers, and positively associated with RUE of cladocerans. Our analytical approach of integrating comparison of site-groups and analyzing all sites together uncovered how cyanobacterial bloom intensity shifted and altered physicochemical and biological conditions and plankton ecosystem functioning, and identified the mechanism and strength of the interactive linkages among physicochemical and biological indicators. 
Although our results may be different from oligotrophic lakes or reservoirs, our findings provide new insights in understanding the impacts of cyanobacterial bloom intensity on the dynamics of plankton communities and ecosystem functioning for polymictic eutrophic lakes, which may have broad application in enhancing the knowledge of this subject and provides the science base for managing polymictic eutrophic lake water quality and ecosystem functioning.
... For example, remote sensing has been proposed as a promising technology for the continuous and high-frequency monitoring of biodiversity and ecosystem functions at large spatial scales (Pettorelli et al. 2018). Also, automated and continuous monitoring of phytoplankton communities through in situ scanning-flow cytometry increases our ability to predict future changes in biodiversity and the functions it sustains (Thomas et al. 2018). ...
Article
Ramon Margalef was a pioneering scientist who introduced an interdisciplinary approach to ecological studies. His studies were among the first to incorporate various concepts in the literature of aquatic ecology, covering topics such as organisms, ecosystem interactions and evolution. To bring Margalef’s approach into current scientific studies, in this review we explore his vision of aquatic ecology within four interrelated fields of study: ecological theory, microbial diversity, biogeochemical cycles and global environmental changes. Taking inspiration from his studies, we analyse current scientific challenges and propose an integrated approach, considering the unifying concept of Margalef’s Mandala with the aim of improving future studies on aquatic microbial ecology.
... The advantage of machine-learning methods is in their ability to mimic linear relationships between dependent and independent variables. Machine learning models are also able to assess associations and quantify predictability in the absence of the knowledge needed for developing process-based models (Thomas et al., 2018). Within the field of microbial water quality, several researchers have used ML algorithms to develop models which could be used to make rapid water quality determinations in rivers, streams, Great Lakes beaches, groundwater, drinking water wells, and water distribution systems (Brooks et al., 2016;Mohammed et al., 2018Mohammed et al., , 2021Panidhapu et al., 2020;Tousi et al., 2021;Weller et al., 2021;White et al., 2021). ...
Article
Full-text available
The microbial quality of irrigation water is an important issue as the use of contaminated waters has been linked to several foodborne outbreaks. To expedite microbial water quality determinations, many researchers estimate concentrations of the microbial contamination indicator Escherichia coli (E. coli) from the concentrations of physiochemical water quality parameters. However, these relationships are often non-linear and exhibit changes above or below certain threshold values. Machine learning (ML) algorithms have been shown to make accurate predictions in datasets with complex relationships. The purpose of this work was to evaluate several ML models for the prediction of E. coli in agricultural pond waters. Two ponds in Maryland were monitored from 2016 to 2018 during the irrigation season. E. coli concentrations along with 12 other water quality parameters were measured in water samples. The resulting datasets were used to predict E. coli using stochastic gradient boosting (SGB) machines, random forest (RF), support vector machines (SVM), and k-nearest neighbor (kNN) algorithms. The RF model provided the lowest RMSE value for predicted E. coli concentrations in both ponds in individual years and over consecutive years in almost all cases. For individual years, the RMSE of the predicted E. coli concentrations (log10 CFU 100 ml−1) ranged from 0.244 to 0.346 and 0.304 to 0.418 for Pond 1 and 2, respectively. For the 3-year datasets, these values were 0.334 and 0.381 for Pond 1 and 2, respectively. In most cases there was no significant difference (P > 0.05) between the RMSE of RF and other ML models when these RMSE were treated as statistics derived from 10-fold cross-validation performed with five repeats. Important E. coli predictors were turbidity, dissolved organic matter content, specific conductance, chlorophyll concentration, and temperature. Model predictive performance did not significantly differ when 5 predictors were used vs. 8 or 12, indicating that more tedious and costly measurements provide no substantial improvement in the predictive accuracy of the evaluated algorithms.
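The evaluation scheme named in this abstract, 10-fold cross-validation with five repeats scored by RMSE, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the single predictor called `turbidity` and the helper names `knn_predict` and `repeated_kfold_rmse` are hypothetical, and a tiny one-feature kNN regressor stands in for the models compared in the study.

```python
import math
import random


def knn_predict(train_x, train_y, x, k=3):
    """Predict y for x as the mean target of its k nearest training points (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def repeated_kfold_rmse(xs, ys, n_splits=10, n_repeats=5, seed=0):
    """Return one RMSE per held-out fold: n_splits * n_repeats values in total."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for fold in folds:
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            tx = [xs[i] for i in train]
            ty = [ys[i] for i in train]
            sq_err = [(knn_predict(tx, ty, xs[i]) - ys[i]) ** 2 for i in fold]
            scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores


# Synthetic stand-in data (assumption): a noisy linear link between one
# predictor and log10 E. coli concentration. Not the Maryland pond data.
data_rng = random.Random(42)
turbidity = [data_rng.uniform(0.0, 10.0) for _ in range(100)]
log_ecoli = [0.2 * t + data_rng.gauss(0.0, 0.3) for t in turbidity]

scores = repeated_kfold_rmse(turbidity, log_ecoli)
mean_rmse = sum(scores) / len(scores)
print(f"{len(scores)} fold scores, mean RMSE = {mean_rmse:.3f}")
```

Treating the 50 per-fold RMSE values as a sample is what makes it possible to compare two models statistically, as the abstract describes, rather than comparing two single numbers.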
... Previous studies mainly focused on the causes and consequences of HABs, but no persuasive conclusions have been drawn based on traditional algae monitoring campaigns and statistical analysis (Asnaghi et al., 2017; Jochimsen et al., 2013). Therefore, there remains a need for fast, cost-effective, and online monitoring, together with advanced modelling approaches to disentangle the underlying trends from the complicated pattern of river algal bloom events (Thomas et al., 2018; Thomson-Laing et al., 2020). ...
Article
Many dammed rivers throughout the world have experienced frequent harmful algal blooms (HABs) in the context of climate change and anthropogenic activities. Accurate forecasting of algal parameters (i.e., algal cell density and microcystin concentration) has great practical significance for taking precautions against HAB risks. Long short-term memory (LSTM) networks have recently shown potential in predicting water quality parameters. However, little is known about the robustness of the LSTM in forecasting highly time-resolved measurements of algal parameters. This study developed a hybrid deep-learning architecture (XG-LSTM) composed of one XGBoost module and two parallel LSTM models to predict algal cell density and microcystin concentration in the Three Gorges Reservoir (TGR). The proposed model was validated by in situ multi-sensor-system monitoring data at four bloom-impacted tributaries in the TGR. Each modelling process utilized the antecedent information of the algal parameters and the corresponding environmental variables as inputs for forecasting the algal parameters for the coming hours and days. As expected, the presented model achieved better performance than those without special feature extraction procedures, indicating that the use of selected environmental parameters can improve LSTM performance. In addition, the hybrid XG-LSTM model successfully captured the time-series patterns of both algal cell density and microcystin concentration compared with other data-driven models, further suggesting the reliable utilization of this model in early warnings of bloom toxicity. Thus, the results presented demonstrate the potential of deep learning technology for real-time prediction of algal parameters in the TGR, and possibly for rapid detection of developing HABs in other aquatic ecosystems.
Chapter
Simulating progressive, multiple species extinctions in ecological networks while keeping track of the network’s response to subsequent species loss can be an informative exercise. However, the nature of the information emerging from such an exercise strongly depends on the identity of the nodes (species) removed from the network and the order in which we remove them. For example, by replicating the experiment of removing nodes one after another in random order in different ecological networks, one can get an idea of how the networks compare in robustness against a generic form of perturbation. Yet, informative criteria can be used instead of random node removal. For example, one could remove species in decreasing order of their current vulnerability to extinction as assessed by IUCN. Similarly, one could combine ecological niche and global circulation models to estimate relative species vulnerability to future climate change and then use those predictions to identify extinction sequences. Although these “informed” approaches can provide a more realistic picture of the potential paths diversity loss might take in the near future compared to random simulations, identifying boundaries (i.e. best- and worst-case scenarios) for the many potential trajectories of collapse in a given network is essential to put specific predicted patterns into perspective. Knowledge developed in the physics context offers sophisticated and efficient techniques to identify either the most or the least detrimental sequences of species extinction, that is, sequences either maximizing or minimizing the speed at which a given network approaches collapse following progressive species removal. In doing that, those techniques also offer novel, largely overlooked tools for conservation. 
Determining which nodes are crucial for network persistence (at any stage of network dismantling) appears as a straightforward criterion for identifying conservation targets which has received very little consideration to date.
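The removal experiment described in this abstract can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the chapter's method: the four-consumer network `diet`, the rule that a consumer is lost once all of its resources are gone, and the helper names `extinction_curve` and `robustness` are all hypothetical. Best- and worst-case sequences are found here by brute force over every removal order, which is only feasible for tiny networks; the physics-derived techniques the chapter refers to are not reproduced.

```python
import random
from itertools import permutations


def extinction_curve(diet, removal_order):
    """Remove resources one by one; return the number of surviving consumers after each step.

    diet maps each consumer to the set of resources it depends on. A consumer
    counts as lost once every one of its resources has been removed.
    """
    removed = set()
    curve = []
    for resource in removal_order:
        removed.add(resource)
        alive = sum(1 for resources in diet.values() if resources - removed)
        curve.append(alive)
    return curve


def robustness(curve, n_consumers):
    """Mean fraction of consumers surviving across the whole removal sequence."""
    return sum(curve) / (len(curve) * n_consumers)


# Hypothetical toy network: consumers c1..c4 feeding on resources r1..r4.
diet = {
    "c1": {"r1", "r2"},
    "c2": {"r2"},
    "c3": {"r3", "r4"},
    "c4": {"r1", "r4"},
}
resources = ["r1", "r2", "r3", "r4"]

# One fixed removal order.
fixed_curve = extinction_curve(diet, resources)
fixed_score = robustness(fixed_curve, len(diet))

# Replicated random removal orders give a baseline for generic perturbation.
rng = random.Random(1)
random_scores = []
for _ in range(200):
    order = resources[:]
    rng.shuffle(order)
    random_scores.append(robustness(extinction_curve(diet, order), len(diet)))
mean_random = sum(random_scores) / len(random_scores)

# Brute-force bounds: the least and most detrimental removal sequences.
all_scores = [
    robustness(extinction_curve(diet, order), len(diet))
    for order in permutations(resources)
]
worst, best = min(all_scores), max(all_scores)
print(fixed_curve, fixed_score, round(mean_random, 3), worst, best)
```

The gap between `worst` and `best` is the envelope within which any informed or random sequence must fall, which is the sense in which the abstract speaks of putting predicted patterns into perspective.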
Book
This book provides, for the first time, a comprehensive overview of the fundamental roles that ecological interactions play in extinction processes, bringing to light an underground of hidden pathways leading to the same dark place: biodiversity loss. We are in the midst of the sixth mass extinction. We see species declining and vanishing one after another. Poached rhinos, dolphins and whales slaughtered, pandas surviving only in captivity are strong emotional testimonials of what is happening. Yet, the main threat to natural communities may be overshadowed by the disappearance of large species, with most extinctions happening unnoticed and involving less eye-catching organisms, such as parasites and pollinators. Ecosystems hide countless, invisible wires connecting organisms in dense networks of ecological interactions. Through these networks, perturbations can propagate from one species to another, producing unpredictable effects. In worst case scenarios, the loss of one species might doom many others to extinction. Ecologists now consider such mechanisms as a fundamental – and still poorly understood - driver of the ongoing biodiversity crisis. Hidden Pathways to Extinction makes the invisible links connecting the fates of species and organisms evident, exploring why complexity can enhance ecosystem stability and yet accelerate species loss. Page after page, Strona provides convincing evidence that we are primarily responsible for the fall in biodiversity, that we are falling too, and that we need to redouble our conservation efforts now, or it won't be long before we hit the ground.
Article
The catch of Corbicula japonica is one of the top three in Japan's inland water fisheries. Most of the Japanese fishing stock of this bivalve comes from Lake Shinji. Since the 1980s, the catch of this species has declined drastically, with strong inter-annual variation. New recruitment and poor growth of the clams were thought to be the main reasons for this decline. This bivalve has poor mobility after settlement. Therefore, it is also important to know its suitable habitat, which is affected by multiple interacting water quality and sediment parameters. In order to get a better understanding of its habitat, we used machine learning models to test abiotic limiting factors. The database that we used in our study included 337 sampling stations with 7 physicochemical variables and Corbicula japonica counts, which were divided into two size categories: small (shell length < 4 mm) and large (shell length ≥ 4 mm). Due to the clams' low mobility, their survival is primarily influenced by the site environment, such as water quality and sediment conditions. To extract physicochemical thresholds that directly impact these clam populations, we applied a coupled methodology based on a random forest model and partial dependence plots. We highlighted that the use of three different population groups for each size category allowed us to significantly increase the accuracy of our model to predict groups. Moreover, our results show that the three most influential predictors were, in order of importance, the water depth, ignition loss of the bottom sediments and silt clay (diameter ≤ 0.063 mm) content of the bottom sediments; with, respectively, the following physicochemical limitations for the preferable habitat: 4 m, 10% and 45%. These limitations are based on the thresholds that we extracted from the coupling between a Random Forest model and the partial dependence plots. 
Our findings emphasize that this coupled methodology has the advantage of being able to manage the zero values of population and give graphical interpretation of the limiting environmental factors. Consequently, the results that we presented here clearly demonstrate a further step towards comprehension of the inter-annual variations in the fishery resources of Corbicula japonica after accounting for the impact of new recruitment. In addition, this kind of numerical tool could help the local government to manage the clam's population by restoring and conserving the water quality and sediment conditions of the lake.
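The coupling of a fitted model with partial dependence, used in this abstract to read off habitat thresholds, rests on a simple computation: force one predictor to each value on a grid for every observation, average the model's predictions, and look for the grid interval where the averaged response changes most. The sketch below shows that computation only. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for a fitted random forest with a 4 m depth step built in by construction, and the four rows and the names `depth_m` and `silt_clay_frac` are invented, not the Lake Shinji data.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted model: abundance drops beyond 4 m depth."""
    base = 100.0 if row["depth_m"] <= 4.0 else 20.0
    return base * (1.0 - row["silt_clay_frac"])


# Invented observations (assumption), one dict per sampling station.
rows = [
    {"depth_m": 2.0, "silt_clay_frac": 0.10},
    {"depth_m": 3.5, "silt_clay_frac": 0.30},
    {"depth_m": 5.0, "silt_clay_frac": 0.50},
    {"depth_m": 6.0, "silt_clay_frac": 0.70},
]
grid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
curve = partial_dependence(toy_model, rows, "depth_m", grid)

# Locate the adjacent pair of grid points with the largest drop in response.
drops = [curve[i] - curve[i + 1] for i in range(len(curve) - 1)]
i_max = max(range(len(drops)), key=drops.__getitem__)
threshold_interval = (grid[i_max], grid[i_max + 1])
print(curve, threshold_interval)
```

Because the averaging runs over every station, including those with zero counts, the curve stays defined even where the population is absent, which is consistent with the advantage the abstract claims for this approach.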
Article
Statistical and climate models are frequently used for biodiversity projections under future climatic changes, but their predictive capacity for freshwater plankton may vary among different species and community metrics. Here, we used random forests to model plankton species and community metrics as a function of biological, climatic, physical, and chemical data from long‐term (2000–2017) monitoring data collected from Lake Müggelsee Berlin, Germany. We (1) compared the predictability of well‐known lake plankton metric types (biomass, abundance, taxonomic diversity, Shannon diversity, Simpson diversity, evenness, taxonomic distinctness, and taxonomic richness) and (2) assessed how the relative influence of different environmental drivers varies across lake plankton metric models. Overall, the metric predictability was highest for biomass and abundance followed by taxonomic richness. The biomass of dominant phytoplankton taxonomic groups such as cyanobacteria (adjusted‐R2 = 0.53) and the abundance of dominant zooplankton taxonomic groups such as rotifers (adjusted‐R2 = 0.59) and daphnids (adjusted‐R2 = 0.51) were more predictable than other metric types. The plankton metric predictability increased when grouping phytoplankton species according to their functional traits (adjusted‐R2 = 0.37 ± 0.14, mean ± SD, n = 36 functional groups) compared to higher taxonomic units (adjusted‐R2 = 0.25 ± 0.15, n = 22 taxonomic groups). Light, nutrients, water temperature, and seasonality for phytoplankton and food resources for zooplankton were the main drivers of both taxonomic and functional groups, giving confidence that our models captured the expected major environmental drivers. Our quantitative analyses highlight the multidimensionality of lake planktonic responses to environmental drivers and have implications for our capacity to select appropriate metrics for forecasting the future of lake ecosystems under global change scenarios.
Chapter
Random survival forests (RSF) is a flexible nonparametric tree‐ensemble method for the analysis of right‐censored survival data. In this article, we provide a short overview of RSF. We review survival splitting rules for growing random survival trees, in‐bag and out‐of‐bag (OOB) ensemble estimators, prediction performance, variable importance, and partial plots. We also briefly describe the extension of RSF to competing risks.
Article
Temperature and nutrients are fundamental, highly nonlinear drivers of biological processes, but we know little about how they interact to influence growth. This has hampered attempts to model population growth and competition in dynamic environments, which is critical in forecasting species distributions, as well as the diversity and productivity of communities. To address this, we propose a new model of population growth that includes the temperature-nutrient interaction and test a novel prediction: that a species’ optimum temperature for growth, Topt, is a saturating function of nutrient concentration. We find strong support for this prediction in experiments with a marine diatom, Thalassiosira pseudonana: Topt decreases by 3-6°C at low nitrogen and phosphorus concentrations. This interaction implies that species are more vulnerable to hot, low nutrient conditions than previous models account for. The interaction dramatically alters species’ range limits in the ocean, projected based on current temperature and nitrate levels as well as those forecast for the future. Ranges are smaller not only than projections based on the individual variables, but also than projected ranges based on a simpler model of temperature-nutrient interactions. Nutrient deprivation is therefore likely to exacerbate environmental warming's effects on communities.
Article
Trait-based ecology, which focuses on using the traits of species and individuals to understand ecology (from populations to ecosystems), is becoming an increasingly productive and widely employed paradigm. To date, trait-based approaches have been used to study taxa from microbes to megafauna in every major area of aquatic ecology yielding exciting results. However, this promising field faces a number of significant obstacles, including: (1) identifying and measuring ecologically relevant traits, (2) integrating inter- and intra-specific trait variation, (3) detecting and quantifying trait correlations and trade-offs, and (4) accounting for the context dependency of traits. These issues are often particularly acute for specific taxa or systems. This paper highlights these looming challenges, as well as ways to address them. Proposed solutions center around using new technologies to collect trait data, coordinating research efforts, and curating and sharing data. Throughout, we take an interdisciplinary approach, sharing examples spanning a wide range of aquatic taxa and systems. It is our hope that this paper will stimulate frank discussions and help the growing field of trait-based aquatic ecology maximize its potential.
Article
The objective of science is to understand the natural world; we argue that prediction is the only way to demonstrate scientific understanding, implying that prediction should be a fundamental aspect of all scientific disciplines. Reproducibility is an essential requirement of good science and arises from the ability to develop models that make accurate predictions on new data. Ecology, however, with a few exceptions, has abandoned prediction as a central focus and faces its own crisis of reproducibility. Models are where ecological understanding is stored and they are the source of all predictions – no prediction is possible without a model of the world. Models can be improved in three ways: model variables, functional relationships among dependent and independent variables, and in parameter estimates. Ecologists rarely test to assess whether new models have made advances by identifying new and important variables, elucidating functional relationships, or improving parameter estimates. Without these tests it is difficult to know if we understand more today than we did yesterday. A new commitment to prediction in ecology would lead to, among other things, more mature (i.e. quantitative) hypotheses, prioritization of modeling techniques that are more appropriate for prediction (e.g. using continuous independent variables rather than categorical) and, ultimately, advancement towards a more general understanding of the natural world.
Article
A great challenge for ecologists is predicting how communities in fragmented tropical landscapes will change in the future. Available evidence suggests that fragmented tropical tree communities are progressing along a trajectory of ‘retrogressive succession’, in which the community shifts towards an early or mid-successional state that will persist indefinitely. Here, we investigate the potential endpoint of retrogressive succession, examining whether it will eventually lead to the highly depauperate communities that characterise recently abandoned agricultural lands. We tested this hypothesis by using neural networks to construct an empirical model of Amazonian rainforest-tree-community responses to experimental habitat fragmentation. The strongest predictor of tree-community composition in the future was its composition in the present, modified by variables like the composition of the surrounding habitat matrix and distance to forest edge. We extrapolated network predictions over a 100 year period and quantified trajectories of forest communities in multidimensional ordination space. We found no evidence that forest communities, including those near forest edges, were converging strongly towards a composition dominated by just one or two early successional genera. Retrogressive succession may well be stronger in fragmented landscapes altered by chronic disturbances, such as edge-related fires, selective logging, or intense windstorms, but in this experimental landscape in which other human disturbances are very limited, it is unlikely that forest edge communities will fully revert to the species poor assemblages observed in very early successional landscapes.
Article
Global changes in climate, atmospheric composition, and pollutants are altering ecosystems and the goods and services they provide. Among approaches for predicting ecosystem responses, long-term observations and manipulative experiments can be powerful approaches for resolving single-factor and interactive effects of global changes on key metrics such as net primary production (NPP). Here we combine both approaches, developing multidimensional response surfaces for NPP based on the longest-running, best-replicated, most-multifactor global-change experiment at the ecosystem scale: a 17-year study of California grassland exposed to full-factorial warming, added precipitation, elevated CO2, and nitrogen deposition. Single-factor and interactive effects were not time-dependent, enabling us to analyze each year as a separate realization of the experiment and extract NPP as a continuous function of global-change factors. We found a ridge-shaped response surface in which NPP is humped (unimodal) in response to temperature and precipitation when CO2 and nitrogen are ambient, with peak NPP rising under elevated CO2 or nitrogen but also shifting to lower temperatures. Our results suggest that future climate change will push this ecosystem away from conditions that maximize NPP, but with large year-to-year variability.
Article
Finding potent multidrug combinations against cancer and infections is a pressing therapeutic challenge; however, screening all combinations is difficult because the number of experiments grows exponentially with the number of drugs and doses. To address this, we present a mathematical model that predicts the effects of three or more antibiotics or anticancer drugs at all doses based only on measurements of drug pairs at a few doses, without need for mechanistic information. The model provides accurate predictions on available data for antibiotic combinations, and on experiments presented here on the response matrix of three cancer drugs at eight doses per drug. This approach offers a way to search for effective multidrug combinations using a small number of experiments.
Article
Temperature strongly affects phytoplankton growth rates, but its effect on communities and ecosystem processes is debated. Because phytoplankton are often limited by light, temperature should change community structure if it affects the traits that determine competition for light. Furthermore, the aggregate response of phytoplankton communities to temperature will depend on how changes in community structure scale up to bulk rates. Here, we synthesize experiments on 57 phytoplankton species to analyze how the growth-irradiance relationship changes with temperature. We find that light-limited growth, light-saturated growth, and the optimal irradiance for growth are all highly sensitive to temperature. Within a species, these traits are co-adapted to similar temperature optima, but light-limitation reduces a species' temperature optimum by ∼5°C, which may be an adaptation to how light and temperature covary with depth or reflect underlying physiological correlations. Importantly, the maximum achievable growth rate increases with temperature under light saturation, but not under strong light limitation. This implies that light limitation diminishes the temperature sensitivity of bulk phytoplankton growth, even though community structure will be temperature-sensitive. Using a database of primary production incubations, we show that this prediction is consistent with estimates of bulk phytoplankton growth across gradients of temperature and irradiance in the ocean. These results indicate that interactions between temperature and resource limitation will be fundamental for explaining how phytoplankton communities and biogeochemical processes vary across temperature gradients and respond to global change.
Article
Climate affects the timing and magnitude of phytoplankton blooms that fuel marine food webs and influence global biogeochemical cycles. Changes in bloom timing have been detected in some cases, but the underlying mechanisms remain elusive, contributing to uncertainty in long-term predictions of climate change impacts. Here we describe a 13-year hourly time series from the New England shelf of data on the coastal phytoplankter Synechococcus, during which the timing of its spring bloom varied by 4 weeks. We show that multiyear trends are due to temperature-induced changes in cell division rate, with earlier blooms driven by warmer spring water temperatures. Synechococcus loss rates shift in tandem with division rates, suggesting a balance between growth and loss that has persisted despite phenological shifts and environmental change.