Cascaded neural networks improving fish species prediction accuracy: The role of the biotic information

Species distribution is the result of complex interactions that involve environmental parameters as well as biotic factors. However, methodological approaches that use biotic variables during the prediction process are still largely lacking. Here, a cascaded Artificial Neural Networks (ANN) approach is proposed in order to increase the accuracy of fish species occurrence estimates, and a case study for Leucos aula in NE Italy is presented as a demonstration. Potentially useful biotic information (i.e. occurrence of other species) was selected by means of tetrachoric correlation analysis and on the basis of the improvements it yielded relative to models based on environmental variables only. The prediction accuracy of the L. aula model based on environmental variables only was improved by the addition of occurrence data for A. arborella and S. erythrophthalmus. While biotic information was needed to train the ANNs, the final cascaded ANN model was able to predict L. aula better than a conventional ANN using environmental variables only as inputs. Results highlighted that biotic information provided by occurrence estimates for non-target species whose distribution can be more easily and accurately modeled may play a very useful role, providing additional predictive variables to target species distribution models.
Scientific Reports | (2018) 8:4581 | DOI:10.1038/s41598-018-22761-4
Simone Franceschini1, Emanuele Gandola1,2, Marco Martinoli1, Lorenzo Tancioni1 & Michele Scardi
Developments in Machine Learning (ML) have led to an increasingly wide use of these methods in ecological and environmental modeling1–3, owing to their ability to handle non-linear relationships and to provide accurate results in simulations. In particular, in a context of global climate change and increasing anthropic disturbance, the use of ML methods for assessing species occurrence and distribution has become a very important means of detecting changes in environmental health4,5.
Artificial Neural Networks (ANNs) are increasingly used by scientists and policy makers to support water management strategies and environmental policies6. In particular, predicting the structure and diversity of fish assemblages under natural and anthropic disturbance and understanding which environmental factors are the most relevant to species distribution are fundamental aspects of conservation and management activities aimed at preserving freshwater ecosystems or restoring them to their optimal ecological status7–9.
Several studies used ANNs to elucidate the role of the main environmental variables involved in fish occurrence prediction10–12. Moreover, most of the studies were focused on the role of purely environmental factors in affecting species distribution and on the relationships between them13–15. The use of biotic information has only rarely been taken into account as a complementary source of input variables16,17.
It is well known in ecology that fish species distribution is affected both by environmental variables and by biotic interactions such as interspecific competition or predation18. Biotic relationships therefore also affect fish community structure, limiting the number of fish species combinations that may actually exist. In fact, given a fish species assemblage containing n species, the theoretical number of combinations of fish species occurrences should be 2^n, while ecological studies have shown that they are far fewer19. While in most cases the reason for recurring fish assemblages may depend on species that share similar responses to environmental conditions, in some cases correlations in species distributions may highlight potential biotic interactions.
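The gap between theoretical and observed combinations can be illustrated with a minimal Python sketch, which counts the distinct presence/absence patterns in a set of records and compares that count with 2^n. The records and the function name `count_assemblages` are illustrative assumptions, not data or code from this study.

```python
def count_assemblages(records):
    """Return (observed, theoretical) numbers of species combinations.

    records: list of equal-length tuples of 0/1 presence-absence values,
    one tuple per sample and one position per species.
    """
    n_species = len(records[0])
    observed = len(set(records))      # distinct presence/absence patterns
    theoretical = 2 ** n_species      # each species either present or absent
    return observed, theoretical


# Hypothetical records for 4 species at 6 sites (not data from the paper).
toy_records = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
    (0, 0, 0, 0),
]
observed, theoretical = count_assemblages(toy_records)
print(observed, theoretical)  # 4 16
```

For the 23 species retained for modeling in this study (Table 1), 2^23 = 8,388,608 combinations are theoretically possible, whereas 264 samples can contain at most 264 distinct ones.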
1Department of Biology, University of Rome Tor Vergata, via della Ricerca Scientifica 1, 00133, Rome, Italy.
2Department of Mathematics, University of Rome Tor Vergata, via della Ricerca Scientifica 1, 00133, Rome, Italy.
Correspondence and requests for materials should be addressed to S.F. (email:
Received: 19 July 2017
Accepted: 15 February 2018
Published: xx xx xxxx
Content courtesy of Springer Nature, terms of use apply. Rights reserved
As combinations of fish species occurrences are not infinite and biotic interactions may affect fish species distribution, these relationships, even when they are only the outcome of similar responses to environmental conditions, can be exploited in order to obtain better predictions of fish species occurrence. Several papers deal with methods aimed at investigating ecological interactions between fish species in freshwater ecosystems, e.g. using generalized linear models20 or mechanistic models, as proposed by Olden & Poff21. Here, we present an approach aimed at exploiting the information conveyed by potential ecological interactions between freshwater fish species, thus improving the accuracy of species distribution models. Highlighting potential ecological interactions between fish species may be considered a valuable secondary outcome of the proposed method, since they can be inferred on the basis of the gain in accuracy of an ANN model when predicted occurrences of other species, which can be more easily modeled, are used as additional input variables. To demonstrate this approach, we tested several models based on the addition of occurrence data for other correlated species to a species distribution model aimed at Leucos aula, a thermophilic species characterized by an omnivorous diet (invertebrates, algae and aquatic macrophytes) that mainly occurs in streams and lakes with slow current and plentiful benthic vegetation22. The selection of L. aula as the target species for demonstrating this new modeling approach was independent of conservation issues and based only on the good level of knowledge about its ecology and on the even balance between its presence and absence records, which made this species a good candidate for species distribution modeling.
Obviously, once the co-predictor species had been selected on the basis of the available field data, only their predicted occurrences were passed as inputs to the model aimed at predicting L. aula. As the output from one or more ANNs here becomes the input to another, this methodology can be referred to as cascaded neural networks and it has already been used in other ecological applications23. The main goal of this procedure was to select and exploit suitable biotic information, either causal or correlative, that is already available in any set of fish assemblage records that also includes the target species. Needless to say, information obtained from field work is only needed to train the cascaded ANNs, as estimated occurrences for co-predictor species are obtained from dedicated ANNs and passed at run time to the ANN aimed at predicting the occurrence of L. aula.
Materials and Methods
Data collection and sampling sites. Data were obtained from 264 samples collected from 1991 to 1995 and published in reports about the fish fauna of the Veneto region (north-eastern Italy, Fig. 1) by Zanetti et al.24 and Salviati et al.25. Seasonal sampling activities at the same sites were stored in the database as different records to represent the local inter-annual variability of both environmental variables and fish fauna. Fish assemblage composition was recorded as binary presence/absence data for 34 fish species (Table 1). The values of 20 environmental variables (Table 2) were also recorded during fish sampling. Most of these variables had already been considered in previous studies26–29.
Elevation data were obtained by cartographic or in situ GPS measurements. Mean depth was measured with a graduated pole. All the percentages describing the mesohabitat characteristics (runs, pools, riffles) and the particle size of sediment (boulders, rocks and pebbles, gravel, sand, silt and clay) were visually estimated by the operator. Stream velocity was measured by hydrometric paddle-wheels and converted to semi-quantitative values (0 = still waters; 1 = 5–6 cm/s; 2 = 7–30 cm/s; 3 = 35–50 cm/s; 4 = 55–100 cm/s; 5 = >100 cm/s). Vegetation cover (i.e. the percentage of the stream channel covered by aquatic macrophytes) as well as shade were visually estimated by the operator. Anthropic disturbance takes into account hydromorphological alterations of the rivers due to increasing anthropic impacts (channel shape, urbanization, etc.) and was visually estimated by the operator. Conductivity and pH values were measured with handheld instruments.
Figure 1. Sampling sites. Veneto river basins, NE Italy. (a) Elevation map of the river basins. Black dots mark the position of the sampling sites. (b) L. aula occurrence in the river basins. Green dots mark presence, red dots both presence and absence (same site, different times), black dots absence. Images were obtained using QGIS software51. The original image was generated by Michele Scardi and then processed by Emanuele Gandola using Adobe Photoshop CS6 (Version 13.0).
Although geographical coordinates can be regarded as proxies for other variables that are not explicitly included in the data set in any kind of empirical model, including those based on Machine Learning techniques, they were not used, in order to avoid biases related to spatial autocorrelation.
Fish were sampled using a standard electro-fishing shoulder-bag unit (4 kW, 0.3–6 A, 150–600 V) and all available habitats were sampled along a stream channel 40–70 m long (the transect length was about 10 times the width of the wetted channel).
Fish sampling met all relevant ethical safeguards and all captured fish were anesthetized with 0.035% MS 222
solution (Tricaine 92 Methanesulfonate) and photographed before release.
Data set processing. To reduce biases in model development, eight taxa with low occurrence were excluded (<10 samples, marked with an asterisk in Table 1). In fact, the difficulty ANNs have in identifying distribution patterns of rare species could easily lead to incorrect predictions30.
Moreover, Oncorhynchus mykiss, Salmo trutta and Salmo (trutta) hybr. trutta/marmoratus were excluded regardless of their rarity, since their occurrence does not depend on environmental conditions alone. Indeed, the distribution of both O. mykiss and S. trutta is strictly related to the artificial release of reared juveniles, while the distribution of Salmo (trutta) hybr. trutta/marmoratus is partly correlated to the occurrence of the two parental species.
N    Scientific name                                     English name
1    Leucos aula (Bonaparte, 1841)                       (Triotto)
2    Padogobius bonelli (Bonaparte, 1846)                Padanian Goby
3    Scardinius erythrophthalmus (Linnaeus, 1758)        Rudd
4    Esox lucius (Linnaeus, 1758)                        European Pike
5    Squalius cephalus (Linnaeus, 1758)                  Chub
6    Alburnus arborella (Bonaparte, 1841)                Bleak
7    Cottus gobio (Linnaeus, 1758)                       Bullhead
8    Tinca tinca (Linnaeus, 1758)                        Tench
9    Cobitis taenia (Linnaeus, 1758)                     Spined Loach
10   Phoxinus phoxinus (Linnaeus, 1758)                  Minnow
11   Anguilla anguilla (Linnaeus, 1758)                  European Eel
12   Knipowitschia punctatissima (Canestrini, 1864)      Italian Spring Goby
13   Salmo marmoratus (Cuvier, 1817)                     Marble Trout
14   Sabanejewia larvata (DeFilippi, 1859)               Italian Loach
15   Ameiurus melas (Rafinesque, 1820)                   Black Bullhead
16   Lepomis gibbosus (Linnaeus, 1758)                   Pumpkinseed
17   Barbus plebejus (Bonaparte, 1839)                   Italian Barbel
18   Protochondrostoma genei (Bonaparte, 1839)           South Europe Nase
19   Gasterosteus aculeatus (Linnaeus, 1758)             Three-spined Stickleback
20   Carassius carassius (Linnaeus, 1758)                Crucian Carp
21   Gobio gobio (Linnaeus, 1758)                        Gudgeon
22   Telestes souffia (Risso, 1827)                      Blageon
23   Thymallus thymallus (Linnaeus, 1758)                Grayling
24   Lampetra planeri (Bloch, 1784)*                     Po Brook Lamprey
25   Gambusia holbrooki (Girard, 1859)*                  Eastern Mosquitofish
26   Barbus meridionalis (Risso, 1827)*                  Mediterranean Barbel
27   Micropterus salmoides (Lacepède, 1802)*             Large-Mouthed Bass
28   Perca fluviatilis (Linnaeus, 1758)*                 Perch
29   Abramis brama (Linnaeus, 1758)*                     Common Bream
30   Cyprinus carpio (Linnaeus, 1758)*                   Common Carp
31   Salvelinus fontinalis (Mitchill, 1814)*             Brook Char
32   Salmo trutta (Linnaeus, 1758)**                     Sea Trout
33   Oncorhynchus mykiss (Walbaum, 1792)**               Rainbow Trout
34   Salmo (trutta) hybr. trutta/marmoratus**            Sea Trout-Marble Trout hybrid
Table 1. List of the fish species in the Veneto data set. Taxa on white background were used in the models while grey background highlights the excluded species. Scientific names were revised according to the current classification. The Italian name is shown in brackets for the only species with no English name. *Taxa excluded since their presence records were <10. **Taxa excluded regardless of their rarity because their occurrence depends on stocking programmes.
Fish fauna occurrence was coded as binary values (0 or 1, i.e. absence or presence respectively), while quantitative or semi-quantitative environmental variables were normalized to the [0, 1] interval.
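The normalization formula is not given in the text. A minimal sketch, assuming plain min-max rescaling (an assumption of this example, as is the function name `min_max_normalize`), could look as follows:

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Minimum, median and maximum elevation from Table 2 (13, 260 and 1785 m).
scaled = min_max_normalize([13.0, 260.0, 1785.0])
print(scaled)
```

With this scheme the median elevation of 260 m maps to 247/1772, i.e. about 0.14, which shows how strongly skewed variables end up concentrated near the lower end of the interval.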
Species correlation. The tetrachoric correlation coefficient, which is analogous to the Pearson correlation coefficient but aimed at binary data, was computed between L. aula and the other fish species in R31 with the package psych32.
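For readers who want a feel for the statistic without R, the sketch below uses the classical cosine-pi approximation to the tetrachoric correlation. This is a swapped-in shortcut and is not the estimation procedure used by the psych package; the function name `tetrachoric_approx` and the handling of empty cells are assumptions of this example.

```python
import math


def tetrachoric_approx(x, y):
    """Approximate tetrachoric correlation between two 0/1 sequences.

    Uses the cosine-pi approximation
        r = cos(pi / (1 + sqrt(a*d / (b*c))))
    where a and d count concordant pairs (1,1) and (0,0), and b and c
    count discordant pairs (1,0) and (0,1).
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    if b * c == 0:
        # Empty discordant cell: the odds ratio is unbounded or undefined.
        # Returning 1.0 or 0.0 here is a simplification of this sketch.
        return 1.0 if a * d > 0 else 0.0
    return math.cos(math.pi / (1.0 + math.sqrt(a * d / (b * c))))


# Hypothetical presence/absence records for two species at 8 sites.
species_1 = [1, 1, 1, 0, 0, 0, 1, 0]
species_2 = [1, 1, 1, 0, 0, 0, 0, 1]
print(round(tetrachoric_approx(species_1, species_2), 3))  # 0.707
```

Here a = d = 3 and b = c = 1, so the ratio ad/bc is 9 and the approximation gives cos(π/4) ≈ 0.707.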
Artificial Neural Network models. Models architecture. In this study, several models based on ANNs were developed and optimized to predict L. aula occurrence. ANNs were trained and tested using the nnet33 function of R, considering three-layered feed-forward neural networks with bias. The performance of different networks (with 1 to 15 hidden neurons) was compared in order to choose the best network configuration. A sigmoid transfer function was used for both the hidden and the output layer, thus enabling the network to learn non-linear relationships between input and output vectors. The ability to easily handle non-linear relationships34 is a very useful feature of ANNs, especially when dealing with highly complex data sets.
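The architecture described above (inputs, one hidden layer, one output, sigmoid transfer functions and bias terms) can be sketched as a forward pass in plain Python. This is not the nnet implementation; the weights below are hypothetical and untrained, and the function names are assumptions of this example.

```python
import math


def sigmoid(z):
    """Logistic transfer function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of an input -> hidden -> single-output network.

    hidden_weights[h][i] is the weight from input i to hidden neuron h;
    output_weights[h] is the weight from hidden neuron h to the output.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights_h, inputs)) + bias_h)
        for weights_h, bias_h in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical, untrained weights: 3 inputs, 2 hidden neurons, 1 output.
p = forward(
    inputs=[0.2, 0.5, 0.9],
    hidden_weights=[[0.4, -0.6, 0.1], [-0.3, 0.8, 0.5]],
    hidden_biases=[0.0, -0.1],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
print(p)  # a value in (0, 1), read as a probability of presence
```

Because the output neuron is also sigmoid, the network output always lies between 0 and 1, which is why a cut-off is later needed to turn it into a presence/absence prediction.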
Model development. The ANN model development was based on the following general procedure (Fig. 2):
(1) an ANN aimed at predicting the target species occurrence is trained with n environmental variables as inputs and its output is analyzed to establish the baseline performance level;
(2) p ANNs predicting the target species are trained with the same n environmental variables and with an additional input based on occurrence records for each one of the p remaining species, one at a time (this step is aimed at finding out the potential contribution of known biotic information, i.e. species occurrence, to the target species predictions, thus identifying the species whose addition as co-predictor provides the largest improvement relative to step 1);
(3) an ANN aimed at assessing the expected occurrence of the most effective co-predictor species, according to step 2, is trained using the n environmental variables only as inputs;
(4) a cascaded ANN model aimed at predicting the target species occurrence is obtained by combining the best ANN from step 2 and the one from step 3.
If more than a single co-predictor species may play a useful role, the procedure can be modified in order to exploit the biotic information they contribute to the model. This requires training one more ANN at step 2, using all the k co-predictor species as k additional inputs to the same ANN, and k ANNs at step 3, one for each co-predictor species. The final cascaded ANN model will comprise the ANN with k co-predictor species as additional inputs and the k ANNs aimed at predicting the occurrence of each co-predictor species on the basis of environmental inputs only. The latter ANNs pass their output to the input layer of the former, thus allowing the resulting model to predict the target species occurrence on the basis of environmental input variables only.
The cascaded ANNs approach will be demonstrated here using two co-predictor species.
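The way the sub-models are chained at run time can be sketched as a small composition function. The three "models" below are hypothetical stand-in formulas, not trained ANNs, and all names (`make_cascade`, `co_species_a`, `co_species_b`, `target`) are assumptions made for illustration only.

```python
def make_cascade(target_model, co_predictor_models):
    """Combine sub-models into one model that needs environmental inputs only.

    target_model: callable taking a list of n + k inputs
        (n environmental values followed by k co-predictor probabilities).
    co_predictor_models: list of k callables, each taking the n
        environmental values and returning an occurrence probability.
    """
    def cascaded(env):
        # Step 3 models: predict each co-predictor from the environment.
        biotic = [model(env) for model in co_predictor_models]
        # Step 2 model: environment plus predicted biotic information.
        return target_model(list(env) + biotic)
    return cascaded


# Hypothetical stand-ins for trained ANNs (simple formulas, not real models).
def co_species_a(env):
    return min(1.0, max(0.0, 1.0 - env[0]))   # likelier at low elevation


def co_species_b(env):
    return min(1.0, max(0.0, env[1]))         # likelier with more vegetation


def target(inputs):
    env, biotic = inputs[:2], inputs[2:]
    return 0.2 * (1.0 - env[0]) + 0.4 * biotic[0] + 0.4 * biotic[1]


predict_target = make_cascade(target, [co_species_a, co_species_b])
print(predict_target([0.25, 0.5]))  # only environmental values are supplied
```

The design point is that observed occurrences of the co-predictor species are needed only while training the target model; once the cascade is assembled, the caller supplies environmental variables alone.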
Variable                              Min      Max       Mean     Median
Elevation (m)                         13.00    1785.00   400.92   260.00
Mean depth (m)                        0.01     1.46      0.45     0.40
Runs (area, %)                        0.00     100.00    55.14    55.00
Pools (area, %)                       0.00     90.00     14.79    5.56
Riffles (area, %)                     0.00     100.00    30.00    22.03
Mean width (m)                        1.00     80.00     9.32     6.00
Boulders (area, %)                    0.00     100.00    17.01    10.00
Rocks and pebbles (area, %)           0.00     100.00    29.97    30.00
Gravel (area, %)                      0.00     96.00     21.48    15.00
Sand (area, %)                        0.00     80.00     7.99     4.50
Silt and clay (area, %)               0.00     100.00    23.44    0.00
Stream velocity (score, 0–5)          0.00     5.00      0.00     0.00
Vegetation cover (area, %)            0.00     100.00    10.85    0.00
Shade (%)                             0.00     100.00    37.86    40.00
Anthropic disturbance (score, 0–4)    0.00     4.00      1.45     1.60
pH                                    5.63     9.33      7.75     7.76
Conductivity (µS cm^-1)               11.00    1851.00   406.63   390.00
Gradient (%)                          0.02     41.60     4.38     1.38
Catchment area (km^2)                 0.34     3274.01   169.82   19.71
Distance from source (km)             0.33     119.27    16.79    7.14
Table 2. Environmental descriptors used as input (i.e. predictive) variables.
Post-processing of model outputs. Model optimization was performed using Receiver Operating Characteristic (ROC) curves35,36. Ideally, the neutral cut-off to discriminate presence/absence predictions, i.e. to binarize output from ANNs, should be 0.5. However, unbalanced numbers of presence and absence cases in training data often lead to output values whose distribution can be better binarized by a different threshold value, thus minimizing false positive (FP) and false negative (FN) results37.
ROC curve analysis was performed to define the best threshold value for each model, taking into account the test set. The evaluation of the ROC curves was performed using the R package pROC38.
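The criterion used to pick the best point on the ROC curve is not stated in the text. As a minimal sketch, the example below assumes Youden's J statistic (sensitivity + specificity − 1), which is one common choice; the data and the function name `best_threshold` are hypothetical.

```python
def best_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    y_true: list of observed 0/1 values; scores: predicted probabilities.
    A case is predicted present when its score is >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1.0
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best[1:]


# Hypothetical test-set outputs (not data from the paper).
observed = [0, 0, 0, 1, 0, 1, 1, 1, 1]
predicted = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9, 0.95]
print(best_threshold(observed, predicted))  # (0.6, 0.8, 1.0)
```

In this toy case the selected cut-off is 0.6 rather than the neutral 0.5: one absence case scores 0.55, so raising the threshold removes a false positive at the cost of one false negative.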
Model validation. Model performance was evaluated using five-fold cross-validation (CV)39. A confusion matrix was computed for each model to show true positive (TP), false positive (FP), false negative (FN) and true negative (TN) predicted cases.
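A five-fold split of the 264 samples can be sketched as follows. How the folds were actually drawn (shuffling, stratification, grouping of repeated sites) is not described in the text, so the random shuffle, the seed and the function name `k_fold_indices` are assumptions of this example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k shuffled, non-overlapping test folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(264, k=5)   # 264 samples, as in the data set
print([len(test) for _, test in splits])  # [53, 53, 53, 53, 52]
```

Test folds of 53 cases are consistent with the totals of the confusion matrices reported in Tables 3–5.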
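The prediction error of the models was assessed by Cohen's kappa (K) coefficient40, which measures the deviation of model predictions from those of a random process:
$$K = \frac{(TP + TN) - \left[\left((TP + FN)(TP + FP) + (FP + TN)(FN + TN)\right)/N\right]}{N - \left[\left((TP + FN)(TP + FP) + (FP + TN)(FN + TN)\right)/N\right]}$$
where N = TP + FP + FN + TN is the total number of cases.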
While the deviation from random predictions may be formally tested, Kappa values can also be interpreted heuristically using the scale proposed by Landis and Koch41.
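Cohen's kappa can be computed directly from the four cells of a confusion matrix. The short sketch below (the function name `cohen_kappa` is an assumption) applies the standard definition to the counts reported in Table 4.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a binary confusion matrix."""
    n = tp + fp + fn + tn
    # Agreement expected by chance, expressed as a number of cases.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n
    return ((tp + tn) - expected) / (n - expected)


# Counts from Table 4 (A. arborella observed occurrence as additional input).
print(round(cohen_kappa(tp=13, fp=4, fn=0, tn=36), 3))  # 0.815
```

The result matches the K value of 0.815 reported for the model that uses observed A. arborella occurrence as co-predictor.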
Figure 2. Model development. The general procedure for training a cascaded ANN model involves four steps: 1) an ANN aimed at predicting the target species (y) is trained with n environmental variables (x) only as inputs and its output is analyzed to establish the baseline performance level; 2) p ANNs are trained to predict the same target species, using the same n environmental variables and an additional input based on the occurrence records for each one of the p remaining species, one at a time, thus identifying the species whose addition as co-predictor provides the largest performance improvement relative to step 1; 3) an ANN aimed at assessing the expected occurrence of the most effective co-predictor species, according to step 2, is trained using the n environmental variables only as inputs; 4) a cascaded ANN model aimed at predicting the target species is obtained by combining the best ANN from step 2 and the one from step 3. The cascaded ANN model needs observed data for the environmental variables only, while biotic information is provided through sub-model predictions and therefore is not needed to run the model. Green ANN input nodes require field data, while pink ANN nodes provide or require predicted values. Only a single co-predictor species is shown, but a very similar procedure can be applied if more co-predictor species are used.
Sensitivity analysis and perturbation method. In order to assess the contribution of each input variable to the ANNs estimation process, three methods were chosen:
• A sensitivity analysis was carried out according to the “profile” method proposed by Lek10,42. The scale (i.e. the number of intervals into which each variable is divided) was set to 50, while all other variables were set at their minimum, first quartile, median, third quartile and maximum values.
• The “perturbation” method was applied following the approach proposed by Scardi and Harding43. White noise in the [−0.3, 0.3] range was added to each input variable while keeping the values for all the others unchanged.
• The “weights” method, proposed by Olden et al.13, was also applied. This method calculates the importance of each variable as the product of the raw input-hidden and hidden-output connection weights linking each input neuron to the output neuron, summed across all hidden neurons. The sign of the contribution shows whether increasing values of the predictive variable are positively or negatively correlated to the expected probability of species presence.
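Of the three methods listed above, the “weights” method is the simplest to write down. The sketch below computes it for a single-output network; the weights are hypothetical and the function name `connection_weights` is an assumption of this example.

```python
def connection_weights(input_hidden, hidden_output):
    """Connection-weights importance for a single-output network.

    input_hidden[i][h]: raw weight from input i to hidden neuron h.
    hidden_output[h]: raw weight from hidden neuron h to the output.
    Returns one signed importance value per input variable.
    """
    return [
        sum(w_ih * w_ho for w_ih, w_ho in zip(weights_i, hidden_output))
        for weights_i in input_hidden
    ]


# Hypothetical weights for 3 inputs and 2 hidden neurons.
importance = connection_weights(
    input_hidden=[[0.5, -1.0], [2.0, 1.0], [-1.5, 0.5]],
    hidden_output=[1.0, -2.0],
)
print(importance)  # [2.5, 0.0, -2.5]
```

In this toy network the first input has a positive overall contribution, the third a negative one of equal size, and the second cancels out because its two paths through the hidden layer have opposite effects.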
Through these methods we wanted to highlight variables that play a major role in the prediction process. This result can be useful to infer potential causal relationships or to select variables that are good candidates for developing a simpler model44. While applying these methodologies to reduce the number of input variables is a typical goal in ANN modeling45, to demonstrate the cascaded neural network approach we decided to keep the entire set of variables, which includes those that are more commonly included in freshwater fish community modeling10,27–29. In fact, the a posteriori selection of an effective subset of predictors was certainly possible, but it was not relevant to the goal of this study, which is to show that predictions about the occurrence of a species can be improved by using predictions about other (easier to predict) species. Therefore, in order to demonstrate this modeling strategy, keeping the same set of input variables for each model was much more convenient and made it possible to obtain fully comparable results from different options. Obviously, had the curse of dimensionality46 impaired the training procedure, which was not the case with our data, selecting a subset of input variables could have been necessary.
Data availability. All data generated or analysed during this study are included in these published articles:
Zanetti et al.24 and Salviati et al.25.
L. aula prediction. The first model generated for L. aula prediction was built with environmental variables only as inputs to the ANN. ROC curve analysis showed that the optimal cut-off value for binarizing the ANN output was 0.548. The model prediction on test set data improved from a K value of 0.582 to 0.627 (confidence interval: 0.410–0.805) using the ROC curve cut-off value. The confusion matrix shown in Table 3 is the one associated with the median K value obtained by 5-fold cross-validation.
The ranking of K values obtained from models trained by adding occurrence information for an additional species to the ANN inputs is shown in Fig. 3. No improvements in model performance were observed when species whose occurrence was loosely correlated to L. aula records were added as co-predictors. In fact, addition of species with null to weak tetrachoric correlation to L. aula (i.e. with r ranging between 0.04 and 0.54) did not provide better K values than the original model with no biotic co-predictors. By contrast, using species whose correlation to L. aula (in absolute value) was higher than 0.54 as additional ANN inputs improved model performance, although the resulting K values were not strictly proportional to the value of the tetrachoric correlation coefficient (Fig. 3). The largest increase in model accuracy was obtained by the addition of A. arborella and S. erythrophthalmus occurrence information to the model, reaching K values of 0.815 and 0.809 respectively, exceeding in both cases the upper limit of the K confidence interval obtained for the first model (0.410–0.805). Confusion matrices derived by the addition of each one of those species are shown in Tables 4 and 5.
L. aula prediction via cascaded ANNs. Expected probabilities of occurrence of A. arborella and S. erythrophthalmus were then used to improve the learning process of the model for L. aula via the cascaded ANNs approach.
The model for A. arborella occurrence prediction showed a K value of 0.708 relative to the test set, while the K value for the S. erythrophthalmus model was 0.659. Both K values were obtained from the optimized model using the binarization cut-off values from ROC curves, i.e. 0.603 and 0.571 for A. arborella and S. erythrophthalmus, respectively.
L. aula prediction models were improved by using the predicted occurrence probabilities of the two species as co-predictors, i.e. as new input variables in secondary ANNs. Results (Tables 6 and 7) showed K values of 0.729 and 0.697 for the L. aula model using predicted A. arborella and S. erythrophthalmus presence probabilities as additional inputs, respectively.
Variables importance. The results of the “profile” method are shown in Fig. 4. Graphs illustrate the responses of the ANN to variations of each input variable. Results showed that modeled occurrence probabilities for both co-predictor species positively contributed to the estimation of L. aula occurrence.
In Fig. 5 the results obtained with the “perturbation” method are shown. For each variable, increasing white noise additions caused increasing mean square error values in the output. While the expected probabilities of occurrence were only the output of an ANN, i.e. they were not real values, they proved to be the most influential predictive variables in the estimation process for L. aula.
The relative contributions of the input variables to the prediction of L. aula according to the “weights” method are shown in Fig. 6. The predictive variables with the highest positive relative contributions were distance from source, A. arborella, S. erythrophthalmus and conductivity. Elevation and anthropic disturbance made a large negative contribution to the occurrence estimation of L. aula (i.e. a low probability of L. aula presence was expected for increasing values of these predictive variables).
Our results showed how an ANN model aimed at predicting L. aula occurrence achieved different levels of accuracy depending on the addition of correlated species as biotic co-predictors. Taxa that show a high positive correlation with L. aula share its main ecological features, e.g. tolerance to low oxygen and habitat preference for slow current22, and therefore respond in a similar way to environmental conditions. However, some of them were easier to predict than L. aula, while their distribution could be regarded as a proxy for complex environmental features that in turn may implicitly play a role in driving the distribution of L. aula. Therefore, they can be useful as co-predictors, even when their occurrence is unknown, because their modeled distribution is reliable enough to be used instead of field data.
                          Observed
                          Absence (0)   Presence (1)
Predicted  Absence (0)    35            3
           Presence (1)   5             10
Table 3. Confusion matrix obtained by L. aula prediction on testing set.
Figure 3. Results obtained by addition of correlated species. K values of models obtained by the addition of a co-predictor species relative to their tetrachoric correlation coefficient with L. aula. Species whose addition significantly increased the K value, i.e. above the upper limit of the confidence interval of the model based on environmental variables only, i.e. [0.410, 0.805], are marked in bold. Grey dots represent results from each fold in the 5-fold cross-validation. Image was obtained using R software31.
                          Observed
                          Absence (0)   Presence (1)
Predicted  Absence (0)    36            0
           Presence (1)   4             13
Table 4. Confusion matrix obtained by the addition of A. arborella observed occurrence as input variable.
                          Observed
                          Absence (0)   Presence (1)
Predicted  Absence (0)    37            1
           Presence (1)   3             12
Table 5. Confusion matrix obtained by the addition of S. erythrophthalmus observed occurrence as input variable.
ANN models which included as co-predictors observed data about the occurrence of the species most correlated to L. aula, i.e. A. arborella and S. erythrophthalmus (r = 0.91 and r = 0.90 respectively), showed the highest accuracy. K was equal to 0.815 for A. arborella and 0.809 for S. erythrophthalmus, in both cases exceeding the upper limit of the K confidence interval of the L. aula model based on environmental variables only (Fig. 3). Using the occurrence of the two most correlated species as additional input information, the model performance improved from “good” to “very good” according to the scale of K agreement by Landis and Koch41. In this case, model improvements depended on the biotic information conveyed by strongly correlated species, which indirectly suggested where environmental conditions were potentially suited for L. aula presence. Indeed, the presence of A. arborella and S. erythrophthalmus could be regarded as an indicator of the river traits where L. aula is more likely to occur. On the other hand, the addition of species like S. cephalus and C. gobio as co-predictors also improved the accuracy of the model regardless of the strength of their correlation to L. aula (Fig. 3). This suggests that in some cases the improvement in model performance is not due to co-occurrence factors, while higher order relationships may play a role in affecting the learning process of the model. In fact, complex ecological relationships between species can be easily exploited thanks to the ability of ANNs to handle non-linear relationships between input variables42. This could be an important issue from an ecological perspective, because ANN models could point out relationships between fish species that in some instances are independent of co-occurrence factors.
However, in most cases the performance of ANN models was not increased through the use of weakly correlated species. In fact, species with tetrachoric correlation between 0.36 and 0.54 provided no improvement, with the exception of P. genei. These species may indeed seem to share part of the distribution of L. aula, but they are usually found in a transition zone characterized by fast water current where L. aula is absent22. Finally, weakly correlated species (e.g. G. aculeatus) only added noise that could induce a decrease in prediction accuracy. In this case model accuracy was even lower than when using only the standard set of environmental variables as inputs (K = 0.609). In fact, G. aculeatus presence is strictly correlated to spring-fed pools47 and therefore its co-occurrence with L. aula is completely random.
At the same time, the introduction of strongly negatively correlated species improved model performance as much as the positively correlated ones did. In this case, the improvement in model predictions was obtained through exclusion factors, as the presence of S. marmoratus and C. gobio indicated different features of the stream ecosystem, making these species good predictors of L. aula absence.
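As a rough illustration of how a 2x2 co-occurrence table translates into a correlation of the kind discussed above, the sketch below uses the classical cosine approximation to the tetrachoric correlation. It is not the estimator used to obtain the correlations reported here, the function name is hypothetical, and the counts in the usage lines are invented.

```python
import math

def tetrachoric_approx(both, only_a, only_b, neither):
    """Cosine approximation to the tetrachoric correlation of two species.

    Arguments are site counts from a 2x2 co-occurrence table:
    both present, only species A, only species B, neither present.
    """
    margins = (both + only_a, only_b + neither, both + only_b, only_a + neither)
    if min(margins) == 0:
        raise ValueError("each species must be present at some sites and absent at others")
    discordant = only_a * only_b
    if discordant == 0:
        return 1.0  # at least one discordant cell is empty: the approximation saturates at +1
    odds_ratio = (both * neither) / discordant
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))

# Hypothetical counts, for illustration only (not data from this study).
print(round(tetrachoric_approx(20, 2, 3, 28), 2))   # species that mostly co-occur
print(round(tetrachoric_approx(2, 18, 15, 18), 2))  # species that mostly exclude each other
```

Strongly positive values point to candidate co-predictors through co-occurrence, and strongly negative values point to candidates through exclusion, which mirrors the two cases described in the text.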
Using predicted probabilities of occurrence of either A. arborella or S. erythrophthalmus as additional inputs improved estimates of L. aula presence (K = 0.729 and K = 0.697, respectively) relative to the L. aula ANN model based on environmental variables only. These K values were lower than those obtained by using observed data for the same co-predictors (see Fig. 3). This was a logical outcome, as the predicted probabilities of occurrence, although quite accurate, could not entirely match the real species distributions, and therefore their effectiveness as co-predictors was partly reduced.
The addition of predicted probabilities of occurrence for both A. arborella and S. erythrophthalmus provided a further improvement in the L. aula model accuracy (K = 0.765). This result showed that combinations of two or more co-predictor species may further improve the accuracy of cascaded ANN models, which can obviously be used by passing them only data about environmental variables.
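The data flow of such a cascade can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the three "models" are toy logistic functions with invented coefficients standing in for trained ANNs.

```python
import math
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]

def cascade_predict(env: Sequence[float],
                    first_stage: Sequence[Model],
                    second_stage: Model) -> float:
    """Predict the target species from environmental inputs only."""
    # Stage 1: each co-predictor model maps environmental inputs to a probability.
    biotic = [model(env) for model in first_stage]
    # Stage 2: the target model receives the environmental inputs plus those estimates.
    return second_stage(list(env) + biotic)

def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy stand-ins for trained networks; all coefficients are invented.
def toy_copredictor_a(env):
    return logistic(3.0 * env[0] - 1.5)

def toy_copredictor_b(env):
    return logistic(2.0 * env[1] - 1.0)

def toy_target(inputs):
    *env, p_a, p_b = inputs
    return logistic(-3.0 + 1.0 * env[0] + 2.5 * p_a + 2.5 * p_b)

site = [0.8, 0.6]  # two normalized environmental variables (hypothetical)
print(cascade_predict(site, [toy_copredictor_a, toy_copredictor_b], toy_target))
```

The point of the design is that the caller supplies environmental variables only: the biotic inputs of the second stage are generated internally by the first-stage models.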
                          Observed
                          Absence (0)    Presence (1)
Predicted  Absence (0)    35             1
           Presence (1)   5              12
Table 6. Confusion matrix obtained by the addition of A. arborella predicted occurrence as input variable.

                          Observed
                          Absence (0)    Presence (1)
Predicted  Absence (0)    36             2
           Presence (1)   4              11
Table 7. Confusion matrix obtained by the addition of S. erythrophthalmus predicted occurrence as input variable.

Finally, another model was trained using both species presence probabilities as co-predictors, thus obtaining a K value of 0.765 (Table 8). The “profile”, “perturbation” and “weights” sensitivity analyses were performed on this model.

                          Observed
                          Absence (0)    Presence (1)
Predicted  Absence (0)    36             1
           Presence (1)   4              12
Table 8. Confusion matrix obtained by the addition of both species predicted occurrences as input variables.
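For readers who want to relate a confusion matrix to K, a minimal sketch of Cohen's kappa40 is given below, applied to the counts of Table 8. The function names are hypothetical. The band labels follow Landis and Koch's original wording; the bands called “good” and “very good” in the text appear to correspond to “substantial” (0.61–0.80) and “almost perfect” (0.81–1.00). K computed from the tabulated counts need not match exactly the K values quoted in the text, which may have been obtained differently.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the products of the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)

def landis_koch_band(kappa):
    """Agreement band using Landis and Koch's original labels."""
    if kappa < 0.0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"

# Counts of Table 8 (rows: predicted absence/presence; columns: absence/presence).
table_8 = [[36, 1], [4, 12]]
k = cohen_kappa(table_8)
print(round(k, 3), landis_koch_band(k))
```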
Lek’s “profiles” in Fig. 4 pointed out that both co-predictor species significantly contributed to the estimation of L. aula occurrence by the ANN model. In particular, as the occurrence probabilities for the two species increased, an increase in the probability of L. aula occurrence was also expected.
These results provided a useful insight into the cascaded ANNs. In fact, as the A. arborella and S. erythrophthalmus occurrence probabilities are the output of independent ANNs, their values are the results of specific environmental patterns which can be indirectly passed to the second ANN21, which is aimed at predicting L. aula. For this reason, their predicted probabilities of occurrence enhance the presence or absence estimation for L. aula at any given site.
Figure 4. Lek’s “profile” method for sensitivity analysis. Expected probability of occurrence of L. aula (“Response”) at increasing values of each input variable, keeping all the other normalized inputs at five fixed levels ranging from 0 to 1 with a 0.25 step, thus generating five response curves. Images were obtained by using R software.
It is therefore not surprising that the results obtained from the “perturbation” sensitivity analysis (Fig. 5) showed that the ANN model was more sensitive to variations in co-predictor species occurrence probabilities than to any environmental variable. In fact, as soon as the biotic information was added to the ANNs as co-predictor variables, most environmental variables seemed to play a less important role in estimating L. aula occurrence.
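A minimal sketch of a perturbation analysis of this kind is given below. It assumes uniformly distributed noise and measures the percent increase in mean square error against observed targets; the 0.3 amplitude follows the Fig. 5 caption, while the function names, the toy model and the data are invented for illustration.

```python
import random

def mean_squared_error(model, patterns, targets):
    errors = [(model(x) - t) ** 2 for x, t in zip(patterns, targets)]
    return sum(errors) / len(errors)

def perturbation_sensitivity(model, patterns, targets, amplitude, seed=0):
    """Percent increase in MSE when noise is added to one input at a time."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, patterns, targets)
    if baseline == 0.0:
        raise ValueError("baseline MSE is zero; a percent increase is undefined")
    increases = []
    for j in range(len(patterns[0])):
        noisy = [list(x) for x in patterns]
        for x in noisy:
            x[j] += rng.uniform(-amplitude, amplitude)  # only input j is perturbed
        perturbed = mean_squared_error(model, noisy, targets)
        increases.append(100.0 * (perturbed - baseline) / baseline)
    return increases

# Toy model and data, invented for illustration; the model ignores its second input.
def toy_model(x):
    return min(1.0, max(0.0, 0.2 + 0.7 * x[0]))

patterns = [[0.1, 0.5], [0.9, 0.2], [0.4, 0.8], [0.7, 0.3]]
targets = [0, 1, 0, 1]
print(perturbation_sensitivity(toy_model, patterns, targets, amplitude=0.3))
```

An input the model does not use produces no change in error, whereas inputs the model relies on change the error when perturbed; ranking the inputs by that change gives the kind of ordering shown in Fig. 5.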
“Weights” method (Fig.6) showed that presence probabilities for A. arborella and S. erythrophthalmus are pos-
itively correlated with the L. aula presence probability. is result conrmed that high probabilities of presence of
both species provide valuable information about the environmental conditions at any given site where L. aula is
to be predicted. From an ecological point of view, these results explain how the occurrence of A. arborella and S.
erythrophthalmus could convey ecosystem information that could not be inferred from any single environmental
variable. e information added by the predicted probabilities of occurrence of the two species became an impor-
tant input signal to the ANN because their potential occurrence reinforces the eect of suitable environmental
conditions for L. aula presence. e cascaded ANNs approach signicantly improved L. aula prediction by 22% of
the K value (K = 0.765 against K = 0.627 for the rst model). Other approaches considered the biotic information
as additional input variable in predictive models, in particular ANN16,23. Despite the good results that have been
obtained by similar modeling procedures, biotic information has almost ever been used in the form of observed
values. e use of predictions from other independent ANNs as additional input signals allowed to apply the L.
aula ANN model even at sites where no biotic information was directly available, but where it could be estimated
on the basis of environmental variables.
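The text does not state which weight-based index underlies the “weights” method of Fig. 6. Since signed contributions are reported, the sketch below assumes a connection-weight product of the kind described by Olden and Jackson45 for a network with one hidden layer and a single output. The function name, the input names and all weights are invented for illustration.

```python
def connection_weight_importance(input_hidden, hidden_output):
    """Signed importance of each input from the weights of a one-hidden-layer net.

    input_hidden[i][j] is the weight from input i to hidden unit j;
    hidden_output[j] is the weight from hidden unit j to the single output.
    """
    return [sum(w * v for w, v in zip(row, hidden_output)) for row in input_hidden]

# Invented weights for three hypothetical inputs.
names = ["co-predictor probability", "temperature", "elevation"]
input_hidden = [[0.8, 0.6], [0.3, -0.1], [-0.9, -0.4]]
hidden_output = [1.2, 0.8]
for name, score in zip(names, connection_weight_importance(input_hidden, hidden_output)):
    print(f"{name}: {score:+.2f}")
```

A positive score means that increasing the input tends to raise the predicted probability, and a negative score means the opposite, which is how the sign of the contributions in Fig. 6 is read.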
Figure 5. “Perturbation” method for sensitivity analysis. Percent increase in mean square error of the ANN output obtained by perturbation of the test set data patterns. White noise in the [−0.3, 0.3] range was added to each value of each input variable, while keeping all the other inputs at their original values. Image was obtained by using R software.
Figure 6. “Weights” method for variable importance. Relative importance of input variables is assessed on the basis of ANN weights. Negative contributions of input variables imply negative correlation between predictive variables and L. aula occurrence (e.g. probability of presence is expected to be low at high elevation sites). Image was obtained by using R software.
Several authors explained how complex dynamics occur in predicting species distribution, since it is the result of complex relationships involving physical, chemical and biotic factors. Identifying biotic interactions between fish species can be very difficult, since indirect or high-order relationships can be present. The main goal of this study was to show that better prediction of a species (here L. aula was used to demonstrate the approach) can be obtained by adding predicted probabilities of occurrence of correlated species as additional inputs. Moreover, our results suggest that potential interactions between species can be highlighted by analyzing model performances. Indeed, changes in ANN accuracy induced by additional co-predictor species suggest different levels of potential interaction between L. aula and other taxa, which in some cases are independent of co-occurrence factors, since model prediction improvements occur even with intermediate correlations between species, as in the case of S. cephalus. From this perspective, improvements in an ANN model may be regarded as a clue to the existence of ecological interactions between fish species, which obviously have to be further analyzed and eventually confirmed by more specific approaches.
The methodological framework proposed here provided higher predictive accuracy than conventional ANN models on the basis of the selection of correlated species as co-predictors. The most relevant co-predictor species were chosen on the basis of significant improvements in K values. This made it possible to apply a selection criterion which provided only useful input information to the cascaded ANN without overly increasing its complexity. Using expected probabilities of fish occurrence as additional input variables implies that estimated biotic information, rather than observed biotic information, can be added to the learning process of ANN models; relying on observed biotic information would severely limit the practical value of the models.
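The selection criterion and the reported gain can be sketched in a few lines. The K values are those quoted in the text; the function names are hypothetical, and the upper confidence limit used in the usage lines is an assumed placeholder, since the actual limit is not stated in the text.

```python
def relative_gain(k_new, k_baseline):
    """Relative improvement of K over a baseline model."""
    return (k_new - k_baseline) / k_baseline

def select_copredictors(kappa_by_species, upper_ci_limit):
    """Keep species whose model K exceeds the baseline's upper confidence limit."""
    kept = [s for s, k in kappa_by_species.items() if k > upper_ci_limit]
    return sorted(kept, key=lambda s: kappa_by_species[s], reverse=True)

# K values quoted in the text for models using observed co-predictor occurrence.
kappa_observed = {
    "A. arborella": 0.815,
    "S. erythrophthalmus": 0.809,
    "G. aculeatus": 0.609,
}
ASSUMED_UPPER_CI_LIMIT = 0.75  # hypothetical: the actual limit is not stated in the text

print(select_copredictors(kappa_observed, ASSUMED_UPPER_CI_LIMIT))
print(f"{100 * relative_gain(0.765, 0.627):.0f}% gain in K")
```

With these inputs the two strongly correlated species are retained and G. aculeatus is dropped, and the gain of the cascaded model over the environmental-only model (K = 0.765 against K = 0.627) is the 22% quoted in the text.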
As Scardi et al.19 also showed, the use of ANNs or related models to obtain more accurate predictions of fish species distribution cannot be really effective without the incorporation of approaches with an ecological perspective. In fact, conventional modeling methods may be unable to explain the complexity of biotic systems and their interactions18. The direct or indirect relationships between species are relevant factors which significantly affect fish assemblage composition, so the incorporation of biotic knowledge should be considered a focal point in species distribution modeling48. Of course, there is clear evidence that biotic interactions between species can change among different ecosystems49,50. Moreover, different selection criteria can be applied in order to choose which species may be relevant to the prediction process in ANNs. On this basis, several approaches may be considered in future in order to improve cascaded ANN predictions by considering even more sources of biotic information.
1. Le, S., Guégan, J. F. (Eds). Articial Neuronal Networs. Springer Berlin Heidelberg, Berlin, Heidelberg (2000).
2. Olden, J. D., Lawler, J. J. & Po, N. L. Machine Learning Methods Without Tears: A Primer for Ecologists. Q. ev. Biol. 83, 171–193, (2008).
3. Armitage, D. W. & Ober, H. . A comparison of supervised learning techniques in the classication of bat echolocation calls. Ecol.
Inform. 5, 465–473, (2010).
4. Cheng, L., Le, S., Le-Ang, S. & Li, Z. Predicting sh assemblages and diversity in shallow laes in the Yangtze iver basin.
Limnologica - Ecology and Management of Inland Waters 42, 127–136, (2012).
5. Jia, Y. T. & Chen, Y. F. iver health assessment in a large river: Bioindicators of sh population. Ecol. Indic. 26, 24–32, https://doi.
org/10.1016/j.ecolind.2012.10.011 (2013).
6. Le, S. et al (Eds). Modelling Community Structure in Freshwater Ecosystems. Springer Berlin Heidelberg, Berlin, Heidelberg (2005).
7. Scardi, M., Cataudella, S., Di Dato, P., Fresi, E. & Tancioni, L. An expert system based on sh assemblages for evaluating the
ecological quality of streams and rivers. Ecol. Inform. 3, 55–63, (2008).
8. uaro, ., Gubiani, É. A., Cunico, A. M., Moretto, Y. & Piana, P. A. Comparison of sh and macroinvertebrates as bioindicators of
Neotropical streams. Environ. Monit. Assess. 188, 1–13, (2015).
9. Vaseem, H. & Banerjee, T. . Evaluation of pollution of Ganga iver water using sh as bioindicator. Environ. Monit. Assess. 188,
1–9, (2016).
10. Le, S., Belaud, A., Baran, P., Dimopoulos, I. & Delacoste, M. ole of some environmental variables in trout abundance models using
neural networs. Aquat. Living esour. 9, 23–29, (1996).
11. Ibarra, A. A., Gevrey, M., Par, Y.-S., Lim, P. & Le, S. Modelling the factors that inuence sh guilds composition using a bac-
propagation networ: assessment of metrics for indices of biotic integrity. Ecol. Model 160, 281–290 (2003).
12. Giam, X. & Olden, J. D. A new 2-based metric to shed greater insight on variable importance in articial neural networs. Ecol.
Model. 313, 307–313, (2015).
13. Olden, J. D., Joy, M. . & Death, . G. An accurate comparison of methods for quantifying variable importance in articial neural
networs using simulated data. Ecol. Model. 178, 389–397, (2004).
14. Maravelias, C. D., Haralabous, J. & Papaconstantinou, C. Predicting demersal sh species distributions in the Mediterranean Sea
using articial neural networs. Mar. Ecol. Prog. Ser. 255, 249–258, (2003).
15. onan, . F. et al. Predicting factors that inuence sh guild composition in four coastal rivers (southest ivory coast) using articial
neural networs. Croatian Journal of Fisheries 73, 48–57, (2015).
16. Muñoz-Mas, ., Martínez-Capel, F., Alcaraz-Hernández, J. D. & Mouton, A. M. Can multilayer perceptron ensembles model the
ecological niche of freshwater sh species? Ecol. Model. 309–310, 72–81, (2015).
17. Olaya-Marín, E. J., Martínez-Capel, F., García-Bartual, . & Vezza, P. Modelling critical factors aecting the distribution of the
vulnerable endemic Eastern Iberian barbel (Luciobarbus guiraonis) in Mediterranean rivers. Mediterr. Mar. Sci. 17, https://doi.
org/10.12681/mms.1351 (2015).
18. Guisan, A. & uiller, W. Predicting species distribution: oering more than simple habitat models. Ecol. Lett. 8, 993–1009, https:// (2005).
19. Scardi, M. et al. Optimisation of articial neural networs for predicting sh assemblages in rivers, in: Modelling Community
Structure in Freshwater Ecosystems. Springer, Berlin, Heidelberg, pp. 114–129. (2005).
20. Le athwic, J. ., Elith, J. & Hastie, T. Comparative performance of generalized additive models and multivariate adaptive regression
splines for statistical modelling of species distributions. Ecol. Model., Predicting Species Distributions 199, 188–196, https://doi.
org/10.1016/j.ecolmodel.2006.05.022 (2006).
21. Olden, J. D. & Po, N. L. Ecological Processes Driving Biotic Homogenization: Testing a Mechanistic Model Using Fish Faunas.
Ecology 85, 1867–1875, (2004).
22. ottelat, M. and Freyhof, J. Handboo of European Freshwater Fishes. ottelat, Cornol and Freyhof, Berlin (2007).
23. Watts, M. J. & Worner, S. P. Comparing ensemble and cascaded neural networs that combine biotic and abiotic variables to predict
insect species distribution. Ecol. Inform. 3, 354–366, (2008).
24. Zanetti, M., Loro, ., Turin, P. & ussino, G. (Eds). Carta Ittica – Indagine idrologica, chimico-sica e biologica delle acque uenti
bellunesi. Provincia di Belluno e Bioprogramm s.c.r.l. - Amministrazione Provinciale di Belluno, Assessorato Caccia e Pesca (1993).
25. Salviati, S., Marconato, E., Maio, G., Perini, V. & Marconato, A. (Eds). La Carta Ittica della Provincia di Vicenza - Amministrazione
Provinciale di Vicenza (1997).
26. Olden, J. D. & Jackson, D. A. Fish–habitat relationships in lakes: gaining predictive and explanatory insight by using artificial neural networks. T. Am. Fish. Soc. 130, 878–897 (2001).
27. Joy, M. K. & Death, R. G. Predictive modelling of freshwater fish as a biomonitoring tool in New Zealand. Freshwater Biol. 47, 2261–2275 (2002).
28. Joy, M. K. & Death, R. G. Predictive modelling and spatial mapping of freshwater fish and decapod assemblages using GIS and neural networks. Freshwater Biol. 49, 1036–1052 (2004).
29. Olden, J. D., Joy, M. K. & Death, R. G. Rediscovering the species in community-wide predictive modeling. Ecol. Appl. 16, 1449–1460.
30. Özesmi, S. L., Tan, C. O. & Özesmi, U. Methodological issues in building, training, and testing artificial neural networks in ecological applications. Ecol. Model. 195, 83–93 (2006).
31. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria ISBN 3-900051-07-0, http://www. (2008).
32. Revelle, W. psych: Procedures for Personality and Psychological Research, http://CAN.age=psych. Version=1.6.6 (2006).
33. Venables, W. N., Ripley, B. D. Modern Applied Statistics with S. Fourth Edition. Springer, New York ISBN 0-387-95457-0 (2002).
34. Lek, S. & Guégan, J. F. Artificial neural networks as a tool in ecological modelling, an introduction. Ecol. Model. 120, 65–73, https:// (1999).
35. Hand, D. J. Construction and assessment of classification rules, Wiley series in probability and statistics. Wiley, Chichester; New York.
36. Dlamini, W. M. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Modell. Softw. 25, 199–208,.2009.08.002 (2010).
37. Peel, A. J. et al. Use of cross-reactive serological assays for detecting novel pathogens in wildlife: Assessing an appropriate cutoff for henipavirus assays in African bats. J. Virol. Methods 193, 295–303 (2013).
38. Robin, X. Display and Analyze ROC Curves,OC/. Version 1.8. (2011).
39. Borra, S. & Di Ciaccio, A. Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Comput. Stat. Data An. 54, 2976–2989 (2010).
40. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46, https://doi.org/10.1177/001316446002000104 (1960).
41. Landis, J. R. & Koch, G. G. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 159–174, https://doi.org/10.2307/2529310 (1977).
42. Lek, S. et al. Application of neural networks to modelling nonlinear relationships in ecology. Ecol. Model. 90, 39–52, https://doi.org/10.1016/0304-3800(95)00142-5 (1996).
43. Scardi, M. & Harding, L. W. Jr. Developing an empirical model of phytoplankton primary production: a neural network case study. Ecol. Model. 120, 213–223 (1999).
44. Gevrey, M., Dimopoulos, I. & Lek, S. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol. Model., Modelling the structure of aquatic communities: concepts, methods and problems. 160, 249–264, https:// (2003).
45. Olden, J. D. & Jackson, D. A. Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol. Model. 154, 135–150 (2002).
46. Bengio, S. & Bengio, Y. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks 11, 550–557 (2000).
47. Clavero, M., Pou-Rovira, Q. & Zamora, L. Biology and habitat use of three-spined stickleback (Gasterosteus aculeatus) in intermittent Mediterranean streams. Ecol. Freshw. Fish. 18, 550–559 (2009).
48. Araújo, M. B. & Luoto, M. The importance of biotic interactions for modelling species distributions under climate change. Global Ecol. Biogeogr. 16, 743–753 (2007).
49. Hayden, B. et al. Interactions between invading benthivorous fish and native whitefish in subarctic lakes. Freshwater Biol. 58, 1234–1250 (2013).
50. Franssen, N. R. & Durst, S. L. Prey and non-native fish predict the distribution of Colorado pikeminnow (Ptychocheilus lucius) in a south-western river in North America. Ecol. Freshw. Fish. 23, 395–404,.12093 (2014).
51. Quantum GIS Development Team. Quantum GIS Geographic Information System. Open Source Geospatial Foundation Project URL (2009).
We thank Martin Bennett (University of Rome ‘Tor Vergata’, IT) for English revision.
Author Contributions
S.F. developed, trained and tested the ANN models with help from E.G.; L.T. provided his expert knowledge about
fish assemblage structure; M.M. helped with the acquisition of environmental data; S.F. wrote the manuscript with
help and suggestions of M.S.; M.S. conceived and designed the research. All authors discussed the results and
commented on the manuscript.
Additional Information
Competing Interests: The authors declare no competing interests.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional aliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit
© The Author(s) 2018
... Table 1 shows the environmental variables that were measured during fish collections in the research area. Most of these variables had been already reported in previous studies (Joy and Death 2002;Olden et al. 2006;Murase et al. 2009;Franceschini et al. 2018). ...
... Thus, fish distribution data can be obtained from their relationships with environment variables, as also shown in other studies (e.g., Brind'Amour et al. 2011;Murase et al. 2009). Our findings also confirm that fish distributions cannot be predicted using only a single environmental variable, as presented in other studies (Franceschini et al. 2018). Some variabilities between the training dataset and the test datasets (Table 4; Figs. 4, 5, 6, 7) were probably due to sub-data differences (30% for test) in the datasets, which leads to differences in predictions in the training process. ...
... As shown previously, the combined environmental variables improved the performance of estimating the fish distribution. ANN is considered a powerful modeling tool in ecology (Cohen and Wallén 1980;Brosse et al. 1999a; Thessen 2016; Odigie and Olomukoro 2020) and ANN models have been applied to predict fish distributions in several studies (Brosse et al. 1999b;Pittman and Brown 2011;Franceschini et al. 2018). Furthermore, the high performance of ANN models based on temporal and spatial environmental relationships indicates that this method is more effective than traditional models for predicting the occurrence of fish (Guisan et al. 2002;Leathwick et al. 2005;Muñoz et al. 2013). ...
Full-text available
The early stages of fish during their life cycle, including larvae and juveniles, are sensitive to the environment. Determining the occurrences of fish larvae and juvenile relative to their associated environments is essential for conservation and fisheries management. Computer-based modeling has rarely been applied for forecasting the distribution patterns of the early fish stages in dynamic systems such as estuaries. In the present study, we applied novel modeling techniques to fish larval and juvenile samples collected in May, September, November, and December during 2019 along the Ba Lat estuary of the Red River, northern Vietnam. The results showed that the occurrences of freshwater and marine fish larvae and juveniles were inversely related to environmental factors (electrical conductivity, temperature, pH, depth, shore distance and turbidity) with a high square of multiple correlation coefficients. The occurrences of the two fish groups were strongly related to temporal and spatial changes in the estuary, and these correlations could be utilized for machine learning processing. Linear regression, Gaussian process models, ensemble regression, and artificial neural network (ANN) models were applied to elucidate the distributions of fish larvae and juveniles. It shows that ANN models obtained the highest R² (> 0.63). In addition, the spatial distribution prediction of fish larvae and juveniles using ANN models was similar to the field measurement. Thus, we suggest utilizing ANN models to predict the occurrences of early fish stages in estuaries in tropical regions such as Vietnam. Recommendations for further applications of ANN models are also given in this study.
... Each neuron needs to cover the input, weight, threshold, and activation functions. In addition, the process of adjusting the weight to produce the target output is actually the "training" [59]. Xi = (X1, X2, X3… Xn) represents the input parameters of BPNN, while Wij = (Wi1, Wi2… Win) represents the corresponding weight of each input. ...
... Each neuron needs to cover the input, weight, threshold, and activation functions. In addition, the process of adjusting the weight to produce the target output is actually the "training" [59]. X i = (X 1 , X 2 , X 3 . . . ...
Full-text available
Recycled aggregate concrete (RAC), due to its high porosity and the residual cement and mortar on its surface, exhibits weaker strength than common concrete. To guarantee the safe use of RAC, a compressive strength prediction model based on artificial neural network (ANN) was built in this paper, which can be applied to predict the RAC compressive strength for 28 days. A data set containing 88 data points was obtained by relative tests with different mix proportion designs. The data set was used to develop an ANN, whose optimal structure was determined using the trial-and-error method by taking cement content (C), sand content (S), natural coarse aggregate content (NCA), recycled coarse aggregate content (RCA), water content (W), water–colloid ratio (WCR), sand content rate (SR), and replacement rate of recycled aggregate (RRCA) as input parameters. On the basis of different numbers of hidden layers, numbers of hidden layer neurons, and transfer functions, a total of 840 different back propagation neural network (BPNN) models were developed using MATLAB software, which were then sorted according to the correlation coefficient R2. In addition, the optimal BPNN structure was finally determined to be 8–12–8–1. For the training set, the correlation coefficient R2 = 0.97233 and RMSE = 2.01, and for the testing set, the correlation coefficient R2 = 0.96650 and RMSE = 2.42. The model prediction deviations of the two were both less than 15%, and the results show that the ANN achieved pretty accurate prediction on the compressive strength of RAC. Finally, a sensitivity analysis was carried out, through which the impact of the input parameters on the predicted compressive strength of the RAC was obtained.
... The disentangling of these two behaviours enables to explicitly define the decision rule of shad during the final choice of spawning. This decision rule was simulated with a machine learning tools (Artificial neural networks; ANN), which is documented to handle non-linear relationships and to provide accurate results in simulations(Lek et al. 1996;Olden et al. 2008;Franceschini et al. 2018).Artificial neural networks (ANNs) are used in ecology to predict the impact of climate change but also to evaluate the most important factors controlling biological processes(Maravelias et al. 2003;Franceschini et al. 2018). The relative influences of environmental factors on freshwater fish distribution were notably assessed by ANNs(Lek et al. 1996;Maravelias et al. 2003;Ibarra et al. 2003;Konan et al. 2015;Olaya-Marin et al. 2015;Muñoz-Mas et al. 2015;Giam and Olden 2015). ...
... The disentangling of these two behaviours enables to explicitly define the decision rule of shad during the final choice of spawning. This decision rule was simulated with a machine learning tools (Artificial neural networks; ANN), which is documented to handle non-linear relationships and to provide accurate results in simulations(Lek et al. 1996;Olden et al. 2008;Franceschini et al. 2018).Artificial neural networks (ANNs) are used in ecology to predict the impact of climate change but also to evaluate the most important factors controlling biological processes(Maravelias et al. 2003;Franceschini et al. 2018). The relative influences of environmental factors on freshwater fish distribution were notably assessed by ANNs(Lek et al. 1996;Maravelias et al. 2003;Ibarra et al. 2003;Konan et al. 2015;Olaya-Marin et al. 2015;Muñoz-Mas et al. 2015;Giam and Olden 2015). ...
This PhD thesis takes place in a context of climate change (IPCC 2018) and a general decline in fish species. Its objective was to define the environmental control over the reproduction of the allis shad. Using four main studies with several modelling tools (Manly index, BRT model, HoOS model and flirtyShadBrain model), we studied this environmental control and assessed the future impact of climate change. The first step in assessing the impact of habitat changes was to test the influence of environmental factors on shad reproduction (paper #1, paper #2 and the flirtyShadBrain model). We first explored the influence of temperature, then tested several environmental factors on shad reproduction. In practice, we conclude that the shad is a photoperiodic species. Day length may be the seasonal cue that triggers migration, while temperature and flow are used for short-term decisions (the final choice to reproduce, with social benchmarks). We used this knowledge to explore the potential impact of climate change. According to our multifactorial projections, allis shad spawners would not be affected by future global warming under the RCP 2.6 scenario, and even in the worst-case scenario (RCP 8.5) habitat favourability should increase, although with an earlier favourable period. Thus, climate change does not appear to be a major threat to this species.
... K-fold cross-validation was used when selecting the number of neurons in the hidden layer, as well as when determining the number of times (epochs) the training data are passed through the neural network. The Connection-Weight-Approach (CWA) analysis [34][35][36] was used, once the model had been fitted, to determine the effect of each component or ratio on the 7-day compressive strength of UHPC. In addition, an experimental validation was carried out to evaluate the accuracy of the model and its predictive ability in a real case. ...
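The use of k-fold cross-validation to choose the hidden-layer size can be sketched as follows. This is a generic outline, not the cited study's procedure: `score_fold` is a hypothetical callback that would train a network with `n_hidden` neurons on the training indices and return its validation error, and the toy scoring function in the usage lines exists only so that the example runs.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

def select_hidden_neurons(candidates, n_samples, k, score_fold):
    """Return the candidate with the lowest mean validation error over k folds."""
    best, best_error = None, float("inf")
    for n_hidden in candidates:
        errors = [score_fold(n_hidden, train, validation)
                  for train, validation in k_fold_indices(n_samples, k)]
        mean_error = sum(errors) / len(errors)
        if mean_error < best_error:
            best, best_error = n_hidden, mean_error
    return best

# Toy stand-in for "train and validate": error is smallest at 6 neurons.
toy_score = lambda n_hidden, train, validation: abs(n_hidden - 6)
chosen = select_hidden_neurons([2, 6, 10], n_samples=523, k=5, score_fold=toy_score)
print(chosen)
```

In a real application the sample order would normally be shuffled once before splitting, and the same loop could be repeated over candidate epoch counts.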
ABSTRACT Ultra-high-performance concrete (UHPC) is a high-tech kind of concrete which exhibits superb mechanical properties and improved durability. In recent years, the use of supplementary cementitious materials (SCM) as partial substitutes for cement and silica fume has attracted great interest from the scientific community as a way to reduce the high costs and carbon footprint of UHPC. Some of the more promising applications of this type of special concrete, such as the seismic retrofitting of existing non-ductile structures, require the development of high or ultra-high early strength. However, replacing cement and silica fume can modify some properties, such as the early strength of UHPC, which usually needs large amounts of cement and silica fume. In addition, the use of several SCM produces a highly complex material, which makes it more difficult to understand the effect of each component and their interactions on early strength. This study aims to develop an artificial neural network (ANN) approach to predict the seven-day compressive strength of UHPC that can incorporate several SCM such as silica fume, fly ash, ground granulated blast furnace slag, recycled glass powder, rice husk ash, fluid catalytic cracking catalyst residue, metakaolin and limestone powder, in addition to mineral filler such as quartz powder. A total of 523 data points from previously published works were used to train the one-hidden-layer ANN model. The model was also validated through experimental work. In addition, the Connection-Weight-Approach (CWA) algorithm was used to analyse the relationships between the UHPC's components and the seven-day compressive strength. The results indicated that the ANN model is efficient at predicting the seven-day compressive strength of UHPC, even when SCM are incorporated.
Keywords: k-fold cross-validation, ANN, UHPC, supplementary cementitious materials, seven-day compressive strength.
... The performance of feedforward NN has already been tested in various disciplines, including ecology and biology (Frey and Rusch, 2013; David et al., 2019; Joseph, 2020). Researchers increasingly use NNs to explore species distributions in fishery applications, such as Leucos aula occurrence estimates in the Veneto river basins in north-eastern Italy, and feeding habitat characteristics of skipjack tuna in the west-central Pacific Ocean (Franceschini et al., 2018; Wang et al., 2018). The effects of spatiotemporal scale on commercial fishery abundance index suitability. Figure 5. Median habitat AI distribution (derived from the synthetic dataset, upper layer) and median CV of habitat abundance distribution (derived from the original datasets, lower layer) characterized by scaled catch for each month (July-October) and each spatial scale (. ...
With consideration of sophisticated modern commercial fisheries, the commonly used metric catch per unit effort (CPUE) may not be a reasonable proxy for generating abundance indices (AIs) for all species. Presumably, spatiotemporal scale is a critical factor that affects the accuracy of local/aggregated AIs derived from spatial modelling approaches, thus it is necessary to evaluate how scale affects scientific estimates of abundance. We explored three commonly utilized AI proxies, including aggregated catch (CatchAI), aggregated effort (EffortAI), and CPUEAI from the perspective of accuracy and spatial representational ability using a neural network (NN) model at different spatiotemporal scales. As a case example, we grouped the Chinese fleet's Northwest Pacific neon flying squid (Ommastrephes bartramii) fishery dataset (2009–2018) at four spatial scales (0.25° × 0.25°, 0.5° × 0.5°, 1° × 1°, 2° × 2°) to construct monthly and annual resolution models. The results showed that for both simulated and real datasets, AIs based on catch data had better accuracy, consistency, and spatial representational ability compared to CPUE and effort-dependent AI models at all spatial scales. Relative to the finest spatial scale, only results from the model with 0.5° × 0.5° resolution preserved enough distributional detail to reflect the known migration route for O. bartramii. Model results exhibited large variation dependent on spatial scale, particularly amongst CPUEAI scenarios. We suggest that scale comparisons among potential proxies should be conducted prior to AIs being used for applications such as population trends in stock assessment.
... Several post-hoc methods have been used to assess the relative importance of predictor variables in SDMs, e.g., connection weight (Franceschini et al., 2018; Olden et al., 2006), split count (Elith et al., 2008; Yu et al., 2020), and permutation (Bradter et al., 2013; Smith and Santos, 2020). Connection weight and split count are specific to neural networks and tree-based models, respectively, whereas permutation is model-agnostic (applicable to any kind of ML). ...
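Because permutation importance only needs predictions, it can be sketched without reference to any particular model type. The example below is a minimal illustration under stated assumptions: `model` is any callable that maps a row of predictors to a predicted class, accuracy is used as the score, and the toy rule and data are invented for the demonstration.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model reproduces the observed label."""
    hits = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return hits / len(labels)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each predictor column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between predictor j and the response
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical occurrence model: uses predictor 0 only, ignores predictor 1.
toy_model = lambda row: 1 if row[0] > 0.5 else 0
rows = [[0.1, 5], [0.9, 3], [0.2, 8], [0.8, 1], [0.3, 7], [0.7, 2]]
labels = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

A predictor the model never uses gets an importance of zero, since shuffling it cannot change any prediction.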
Species distribution models (SDMs), in which species occurrences are related to a suite of environmental variables, have been used as a decision-making tool in ecosystem management. Complex machine learning (ML) algorithms that lack interpretability may hinder the use of SDMs for ecological explanations, possibly limiting the role of SDMs as a decision-support tool. To meet the growing demand for explainable ML, several interpretable ML methods have recently been proposed. Among these methods, SHapley Additive exPlanations (SHAP) has drawn attention for its robust theoretical justification and analytical gains. In this study, the utility of SHAP was demonstrated by applying it to SDMs of four benthic macroinvertebrate species. In addition to species responses, the dataset contained 22 environmental variables monitored at 436 sites across five major rivers of South Korea. A range of ML algorithms was employed for model development. Each ML model was trained and optimized using 10-fold cross-validation. Model evaluation based on the test dataset indicated strong model performance, with an accuracy of ≥0.7 in all evaluation metrics for all ML algorithms and species. However, only the random forest algorithm showed a behavior consistent with the known ecology of the investigated species. SHAP presents an integrated framework in which local interpretations that incorporate local interaction effects are combined to represent the global model structure. Consequently, this framework offered a novel opportunity to assess the importance of variables in predicting species occurrence, not only across sites, but also for individual sites. Furthermore, removing interaction effects from variable importance values (SHAP values) clearly revealed non-linear species responses to variations in environmental variables, indicating the existence of ecological thresholds. This study provides guidelines for the use of a new interpretable method supporting ecosystem management.
... [40] For example, connection weights that change sign (e.g., positive to negative) between the input-hidden and hidden-output layers would have a canceling effect [41]. The CWA method was used in this study to evaluate the importance of every variable for the 1-day compressive strength. The results of CWA are compared with previous research and the experimental outcome. ...
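The canceling effect mentioned above can be made concrete with a small sketch of the connection weight calculation for a network with one hidden layer and one output, as the method is commonly described: the importance of an input is the sum, over hidden neurons, of the input-hidden weight multiplied by the hidden-output weight. The weight values are invented for the illustration and do not come from the cited study.

```python
def connection_weight_importance(input_hidden, hidden_output):
    """Connection weight importance for one hidden layer and one output.

    input_hidden[i][h] is the weight from input i to hidden neuron h;
    hidden_output[h] is the weight from hidden neuron h to the output.
    """
    return [sum(w_ih * w_ho for w_ih, w_ho in zip(row, hidden_output))
            for row in input_hidden]

# Hypothetical weights: two inputs, two hidden neurons, one output.
input_hidden = [
    [1.0, 1.0],   # input A: same sign into both hidden neurons
    [1.0, -1.0],  # input B: opposite signs into the two hidden neurons
]
hidden_output = [2.0, -2.0]

print(connection_weight_importance(input_hidden, hidden_output))  # [0.0, 4.0]
```

For input A the two paths contribute +2 and -2 and cancel, while for input B both products are positive and add up, which is why the signed products are summed rather than their absolute values.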
This paper analyzes the application of artificial neural networks (ANN) to predict the 1-day compressive strength of ultra-high-performance concrete (UHPC) made with any combination of powders and supplementary cementitious materials (SCM) such as silica fume (SF), fly ash (FA), ground granulated blast slag furnace (GGBSF), recycled glass powder (GP), rice husk ash (RHA), fluid catalytic cracking catalyst residue (FC3R), metakaolin (MK), limestone powder (LP), and quartz powder (QP). A total of 604 data points from the scientific literature were used to train the one-hidden-layer ANN model using the k-fold validation procedure. Furthermore, 90 UHPC mixtures were prepared experimentally to validate the proposed ANN model. The performance of the model was assessed using several statistical performance indexes: ratio of the root mean square error to the standard deviation of measured data (RSR), root mean square error (RMSE), normalized mean bias error (NMBE), Nash-Sutcliffe efficiency, and coefficient of multiple determination (R²). The connection weight approach (CWA) algorithm was used to analyze the relationships between the UHPC components and the 1-day compressive strength. The results indicated that the ANN is an efficient model for predicting the early strength (1-day compressive strength) of UHPC, achieving R² values of 0.88 and 0.86 on the test data and experimental data, respectively, even when the experimental dosages included combinations of components that were not found in the training data. The results of the CWA analysis indicated that SCM such as MK, FC3R, SF, and LP, as well as other factors such as virtual packing density, improved the early strength of UHPC, whereas FA, GP, and RHA were identified as harmful for the 1-day compressive strength. In conclusion, the ANN model could be helpful in developing UHPC with early strength requirements by preselecting the combinations of available SCM and powders that give better results in the model at lower cost.
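Several of the performance indexes listed above are closely related, which a short sketch makes visible. The definitions used here are common ones and are an assumption, since the abstract does not give formulas: RSR is taken as the root of the squared-error sum divided by the root of the squared-deviation sum of the observations, and NMBE as the mean of predicted minus observed divided by the observed mean, in percent (sign conventions for NMBE vary between authors). The strength values are invented.

```python
import math

def regression_metrics(observed, predicted):
    """RMSE, RSR, Nash-Sutcliffe efficiency and NMBE for paired values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sq_dev = sum((o - mean_obs) ** 2 for o in observed)
    bias = sum(p - o for o, p in zip(observed, predicted))
    return {
        "rmse": math.sqrt(sq_err / n),
        "rsr": math.sqrt(sq_err) / math.sqrt(sq_dev),
        "nse": 1.0 - sq_err / sq_dev,
        "nmbe_percent": 100.0 * bias / (n * mean_obs),
    }

# Hypothetical measured and predicted strengths (MPa).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

With these definitions RSR squared equals one minus the Nash-Sutcliffe efficiency, so the two indexes carry the same information on different scales.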
Various methodologies including genetic analyses, morphometrics, proteomics, lipidomics, metabolomics, etc. are now used or being developed to authenticate fish and seafood. Such techniques usually lead to the generation of enormous amounts of data. The analysis and interpretation of this information can be particularly challenging. Statistical techniques are therefore commonly used to assist in analyzing these data, visualizing trends and differences and extracting conclusions. This review article aims at presenting and discussing statistical methods used in studies on fish and seafood authenticity and adulteration, allowing researchers to consider their options based on previous successes/failures but also offering some recommendations about the future of such techniques. Techniques such as PCA, AMOVA and FST statistics, that allow the differentiation of genetic groups, or techniques such as MANOVA that allow large data sets of morphometric characteristics or elemental differences to be analyzed are discussed. Furthermore, methods such as cluster analysis, DFA, CVA, CDA and heatmaps/Circos plots that allow samples to be differentiated based on their geographical origin are also reviewed and their advantages and disadvantages as found in past studies are given. Finally, mathematical simulations and modeling are presented in a detailed review of studies using them, together with their advantages and limitations.
The water content and its purity depend entirely on the biotic species present in the environment. The forecasted weight of fish indicates fishing success, and fish weight forecasting remains a challenge. Against this background, this paper makes the following contributions. First, the Fish weight dataset from the Kaggle repository is subjected to data pre-processing. Second, the regression relationship between each feature and the target fish weight is explored through visualization of the raw dataset. Third, an ANOVA test is applied to identify features with PR(>F) < 0.05, which strongly influence the target. Fourth, the raw dataset and the feature-scaled dataset are fitted with all the linear and ensemble regression models, and their performance is analysed. Fifth, feature correlation is examined; the results show that diagonal length, vertical length and cross length have a correlation of 1.0, which raises a multicollinearity issue that leads to undesirable predictions. These features are therefore removed, the linear and ensemble regression models are refitted, and their performance is analysed. Sixth, outliers in the features are detected and removed by IQR analysis, and the linear and ensemble regression models are then refitted and evaluated with intercept, EVS, MAE, MSE and R2 score. Experimental results show that polynomial regression achieves an accuracy of 97% before and after feature scaling and outlier removal. Among the ensemble models, gradient boosting provides an accuracy of 98% before and after feature scaling and outlier removal.
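The IQR-based outlier step can be sketched as below. The 1.5 × IQR fences are the conventional choice and an assumption here, since the abstract does not state the multiplier; quartiles are computed with Python's `statistics.quantiles` using its default method, and other quartile conventions can give slightly different fences. The weights are invented.

```python
import statistics

def iqr_bounds(values, multiplier=1.5):
    """Lower and upper fences based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def remove_outliers(values, multiplier=1.5):
    """Keep only the values that lie within the IQR fences."""
    lower, upper = iqr_bounds(values, multiplier)
    return [v for v in values if lower <= v <= upper]

# Hypothetical fish weights (g) with one implausible record.
weights = [10, 12, 11, 13, 12, 11, 100]
filtered = remove_outliers(weights)
print(iqr_bounds(weights), filtered)
```

On a table of several features the same filter would be applied column by column, dropping any row that falls outside the fences of at least one feature.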
The Ganga River, the lifeline of millions of people, has become heavily polluted due to uncontrolled anthropogenic activities. To monitor the effect of river pollution on aquatic life, a field study was conducted by analyzing different biomarker enzymes and biochemical parameters in various tissues (muscles, liver, gills, kidney, brain, and skin) of the Indian major carp Labeo rohita collected from the River Ganga at different study sites in Varanasi district. Activities of antioxidant enzymes (e.g., superoxide dismutase and catalase) and the level of lipid peroxidation were found to be higher in the fish collected from the river, indicating pollutant-induced oxidative stress. The disturbed health status of the river fish was also manifested by increased activities of aspartate amino transferase, alanine amino transferase, and alkaline phosphatase. The concentration of nutritionally important biomolecules (proteins, lipids, and moisture) and the energy value were also found to be significantly lower in the tissues of the river fish, indicating decreased nutritional value due to oxidative stress caused by different pollutants.
Luciobarbus guiraonis (Eastern Iberian barbel) is an endemic fish species restricted to Spain, mainly distributed in the Júcar River Basin District. Its study is important because there is little knowledge about its biology and ecology. To improve the knowledge about the species distribution and habitat requirements, nonlinear modelling was carried out to predict the presence/absence and density of the Eastern Iberian barbel, based on 155 sampling sites distributed throughout the Júcar River Basin District (Eastern Iberian Peninsula). We used multilayer feed-forward artificial neural networks (ANN) to represent nonlinear relationships between L. guiraonis descriptors and variables regarding the physical habitat and biological components (macroinvertebrates, fish, riparian forest). The gradient descent algorithm was implemented to find the optimal model parameters; the importance of the ANN's input variables was determined by the partial derivatives method. The predictive power of the model was evaluated with Cohen's kappa (k), the correctly classified instances (CCI), and the area under the curve (AUC) of the receiver operator characteristic (ROC) plots. The best model predicted presence/absence with a high performance (k = 0.66, CCI = 87% and AUC = 0.85); the prediction of density was moderate (CCI = 62%, AUC = 0.71 and k = 0.43). The fundamental variables describing the presence/absence were: solar radiation (the highest contribution was observed between 2000 and 4200 WH/m2), drainage area (with the strongest influence between 3000 and 5000 km2), and the proportion of exotic fish species (with relevant contribution between 50 and 100%). In the density model, the most important variables were the coefficient of variation of mean annual flows (relative importance of 50.5%) and the proportion of exotic fish species (24.4%). The models provide important information about the relation of L. guiraonis with biotic and abiotic variables; this new knowledge can help develop future studies and management plans for the conservation of this species in the Júcar River Basin District and, potentially, for the conservation of other endemic fish species of Barbus and Luciobarbus in Mediterranean rivers.
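Two of the evaluation measures named above, CCI and Cohen's kappa, can be computed directly from a presence/absence confusion matrix. The sketch below uses the standard definitions; the observed and predicted vectors are invented and do not reproduce the study's data.

```python
def confusion_counts(observed, predicted):
    """Return (tp, fp, fn, tn) for binary presence (1) / absence (0) data."""
    tp = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 1)
    fp = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 1)
    fn = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 0)
    tn = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 0)
    return tp, fp, fn, tn

def cci(observed, predicted):
    """Correctly classified instances, as a fraction of all sites."""
    tp, fp, fn, tn = confusion_counts(observed, predicted)
    return (tp + tn) / (tp + fp + fn + tn)

def cohen_kappa(observed, predicted):
    """Agreement beyond what is expected by chance from the marginal totals."""
    tp, fp, fn, tn = confusion_counts(observed, predicted)
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical sites: 4 presences, 6 absences, 2 misclassified.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(cci(observed, predicted), cohen_kappa(observed, predicted))
```

Kappa is lower than CCI here because part of the raw agreement would be expected by chance alone given how many presences and absences each vector contains.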
The biomonitoring of aquatic ecosystems in developing countries faces several limitations, especially related to gathering resources. The present study aimed at comparing the responses of fish and benthic macroinvertebrates to environmental change, to identify which group best indicates the differences between reference and impacted streams in southern Brazil. We determined reference and impacted sites based on physical and chemical variables of the water. For the analysis and comparison of biological responses, we calculated 22 metrics and submitted them to a discriminant analysis. We selected from this analysis only six metrics, which showed that the two studied assemblages respond differently to environmental change. A larger number of metrics were selected for macroinvertebrates than for fish in the separate analysis. The metrics selected for macroinvertebrates in the pooled analysis (i.e., fish and macroinvertebrates together) were different from those selected in the separate analysis for macroinvertebrates alone. However, the metrics selected for fish in the pooled analysis were the same selected in the separate analysis for fish alone. The macroinvertebrate assemblage was more effective for distinguishing reference from impacted sites. We suggest the use of macroinvertebrates as bioindicators of Neotropical streams, especially in situations in which time and money are short.
A guide to using S environments to perform statistical analyses providing both an introduction to the use of S and a course in modern statistical methods. The emphasis is on presenting practical problems and full analyses of real data sets.