Scientic Reports | (2022) 12:6128 |
Large‑scale forecasting
of Heracleum sosnowskyi habitat
suitability under the climate
change on publicly available data
Diana Koldasbayeva1*, Polina Tregubova2, Dmitrii Shadrin2,3, Mikhail Gasanov2 &
Maria Pukalchik4
This research aims to establish the possible habitat suitability of Heracleum sosnowskyi (HS), one of
the most aggressive invasive plants, in current and future climate conditions across the territory of
the European part of Russia. We utilised a species distribution modelling framework using publicly
available data of plant occurrence collected in citizen science projects (CSP). Climatic variables and
soil characteristics were considered to follow possible dependencies with environmental factors. We
applied Random Forest to classify the study area. We addressed the problem of sampling bias in CSP
data by optimising the sampling size and implementing a spatial cross‑validation scheme. According
to the Random Forest model built on the finally selected data shape, more than half of the studied
territory in the current climate corresponds to a suitability prediction score higher than 0.25. The
forecast of habitat suitability in future climate was highly similar for all climate models. Almost the
whole studied territory showed the possibility for spread with an average suitability score of 0.4.
The mean temperature of the wettest quarter and precipitation of wettest month demonstrated the
highest influence on the HS distribution. Thus, currently, the whole study area, excluding the north,
may be considered as a territory with a high risk of HS spreading, while in the future suitable locations
for the HS habitat will include high latitudes. We showed that chosen geodata pre‑processing,
and cross‑validation based on geospatial blocks reduced significantly the sampling bias. Obtained
predictions could help to assess the risks accompanying the studied plant invasion capturing the
patterns of the spread, and can be used for the conservation actions planning.
The relocation and introduction of alien species into new habitats are recognised as one of the major drivers of
global biodiversity loss1–3. Invasive alien (non-indigenous) species (IAS) tend to spread rapidly and pose a serious
threat to endemic species due to, e.g., competition for resources, allelopathy, and the toxicity
of IAS4,5. Thus, the emergence of IAS can dramatically change the functioning of natural communities and
the overall ecosystem structure6–8.
Such common occurrences as the expansion of human-inhabited territory, the globalization of transport, and changes
in land-use types favor species invasion. At the same time, the estimated costs of eliminating IAS are usually
quite high. The specifics of individual IAS limit the implementation of such practices. Other constraints are
the size of the territory that needs to be treated, possible negative side effects of chemical
and biological control agents, and the development of the invasion process9–11. IAS disproportionately affect
the most vulnerable communities in poor areas, at the locations of abandoned and disturbed lands. Thus, their
spread clearly hinders the achievement of the Sustainable Development Goals12.
Heracleum sosnowskyi Manden (Hogweed, HS) is an example of an extremely dangerous invasive species.
The natural habitat of HS is the central and eastern Caucasus area and adjacent regions, the Transcaucasia region
and Turkey13. Large biomass and the ability to live and develop in cold climates made HS a popular crop in
agriculture in the middle of the 20th century14. However, soon the unpleasant odor of milk and meat of animals
1Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russian Federation 121205. 2RAIC,
Skolkovo Institute of Science and Technology, Moscow, Russian Federation 121205. 3Irkutsk National Research
Technical University, Irkutsk, Russian Federation 664074. 4Digital Agriculture Laboratory, Skolkovo Institute of
Science and Technology, Moscow, Russian Federation 121205. *email:
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Scientic Reports | (2022) 12:6128 |
that were fed with HS fodder and the phototoxic effect of above-ground parts of HS were revealed, and as a
result, the cultivation was abandoned.
The need to forecast the potential extinction of different species in different spatial and temporal contexts
has led to the development of Species Distribution Modelling (SDM). The SDM framework is based on the ecological
concept assuming that the distribution of species is explained by a set of factors, such as environmental requirements
and interactions with other living organisms, physiological characteristics, and evolutionary history15,16. The general
workflow of correlative SDM consists of (1) obtaining data on occurrences of the studied species (presence-only
data, presence/absence data, or abundance data) and on environmental characteristics, sometimes considering
biotic interactions as well, (2) searching for interconnections between these data, and (3) building the map of
predicted distributions across the region of interest. The SDM framework is implemented in a variety of packages
and libraries in the most common programming languages, such as R or Python, and allows the use of several different
statistical or machine learning (ML) models, e.g., generalized linear models, classification and regression trees,
random forest (RF), support vector machines, artificial neural networks, and others, as well as ensembles of them17–21.
In terms of data availability, these models mostly differ from each other in their requirements for occurrence
records, i.e., whether the occurrence data should be represented by two classes, presence and absence, or can be
presence-only22. The choice of the appropriate modelling method significantly affects outcomes and depends
on multiple factors: the size of the study territory, the type of environment considering its changing dynamics,
the characteristics of the modelled species, and data availability. Forecasting with ensembles across methods
has become more popular23. However, there are no strict directives on how to implement the ensemble, e.g. should
one estimate an average prediction or a weighted average prediction; thus, this solution is not as straightforward
as basic modelling methods24. Some studies demonstrate higher performance of a particular
model over others for specific cases. For example, it has been shown that the RF approach is highly suitable for
forecasting on large territories with a limited amount of data25, while for marine environments, ensemble models
are recommended26.
Correlative SDM has a conceptual limitation: it is assumed to capture the realized ecological niche, which
is problematic when an IAS is the object of the study27. Another difficulty is the quality of the data used, specifically
the occurrence and absence records of the species. It is stated that pseudo-absence data should be field corrected,
otherwise they show strong bias, decreasing the species prediction performance28. In reality, such correction is almost
impossible for large territories and requires significant collections of remote sensing data with appropriate resolution.
The issue is even more controversial when the spread of an IAS is considered. Even with a sufficient number of verified
absence points of the studied IAS, a question remains whether the location is indeed unsuitable for the selected
IAS or the IAS has simply not reached it yet29. However, despite all the mentioned limitations, correlative SDM is still
the primary tool for IAS distribution modelling30. Another possibility is to use mechanistic SDM, which is
based on a process-based approach, e.g. a phenology model31, but such models require calibration of many
internal parameters.
While it is extremely difficult to eliminate all growing populations of invasive species, including HS, the
information from habitat suitability modelling can aid in prioritizing the management of invaded areas.
Specifically, it can help to mark out the territories where the possible development of rapidly growing populations
poses the largest threat to native species, agriculture, and populated areas. In this context, the use of
data from CSP is of particular interest; however, it may have its limitations. In this work, large-scale HS distribution
modelling is performed. We estimated habitat suitability for the current climate, as the average from 2000 to 2018,
and for the possible future climate from 2040 to 2060 according to three climate models (BCC-CSM2-MR, CanESM5,
and CNRM-CM6-1) in two scenarios, the worst and the best in terms of greenhouse gas emissions (Fig. 1).
e general workow of the presented research included the following steps: (i) collection of the required
data from public sources, (ii) data pre-processing, (iii) feature selection, (iv) model training and validation, (v)
receiving of the outputs of the best model, (vi) building the maps, showing the spatial distribution of the occur-
rence probability (habitat suitability) across the territory of the study, expressed in the range from 0 to 1, for
current and possible future climate conditions. Presented methodology and results of HS spread modelling can
be used for invasion risk assessment.
Optimisation of the occurrence data distribution based on the thinning procedure. Ideally,
thinning removes the optimal number of records to substantially reduce the effects of sampling bias (as in our
case, where most of the locations are concentrated in a few places) while simultaneously retaining most of the
valuable information. Figure 2 demonstrates the results of model prediction for (1) the initial dataset with data
collected from all available sources; (2) the dataset with thinning distance 4 km; (3) the dataset with thinning
distance 7 km; (4) the dataset with thinning distance 10 km. It is also important to know how the predictors’
distribution changes at the different thinning intervals. In our case, there were no significant differences between
the shapes of the distributions of environmental features that corresponded to the different thinned data (Fig. S1).
The outputs of the models vary significantly depending on the number of points in the different input datasets. The
ROC-AUC scores of the models built on the complete data and on the datasets at 10, 7, and 4 km thinning distances are
0.877, 0.83, 0.85, and 0.82, respectively (Fig. S4). Modelling results obtained from the complete dataset
represent the study territory as mostly unsuitable for HS spread: 84% of the territory is characterised by
a prediction value of less than 0.25. In the most contrasting variant, the model built on the data at 10 km thinning
distance, the suitability rose considerably: the percentage of the territory where the prediction value is above 0.5
increased to 22% compared to 3% in the case of the full data. The area of the territory with prediction
values below 0.25 decreased to 31% and 44% at thinning distances of 10 and 7 km, respectively. We further need to choose
which model to use for the next step of future prediction by finding a reasonable output.
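Area shares of this kind can be derived from a gridded suitability map. The following is a minimal sketch, assuming the prediction is available as per-cell scores with the latitude of each cell; the function name and the toy values are hypothetical and are not taken from the study. On a regular latitude/longitude grid the true cell area shrinks with the cosine of latitude, so the sketch weights cells accordingly; whether the reported percentages were area-weighted is not stated in the text.

```python
import math

def area_share(scores, latitudes, lower=None, upper=None):
    """Fraction of the total area whose score lies in [lower, upper).

    Cells of a regular lat/lon grid are weighted by cos(latitude),
    which is proportional to their true surface area.
    """
    total = 0.0
    selected = 0.0
    for score, lat in zip(scores, latitudes):
        weight = math.cos(math.radians(lat))
        total += weight
        above = lower is None or score >= lower
        below = upper is None or score < upper
        if above and below:
            selected += weight
    return selected / total if total else 0.0

# Toy example (made-up numbers, not the study's data).
scores = [0.10, 0.20, 0.30, 0.60, 0.80]
lats = [45.0, 50.0, 55.0, 60.0, 65.0]
unsuitable = area_share(scores, lats, upper=0.25)  # share with score < 0.25
suitable = area_share(scores, lats, lower=0.5)     # share with score >= 0.5
```

Comparing such shares across thinning distances gives the kind of summary reported above without relying on the evaluation score alone.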
From the results visualised as maps, we can see that the model built on the full dataset is overfitted, does
not cover northern latitudes, and poorly represents the original habitat located in the Caucasus area. It mostly
reproduces the points of observation; thus, the possible distribution of habitat suitability of HS obtained from the
full dataset is built in a learn-by-heart manner. In contrast, the model built on the dataset obtained by
thinning at the distance of 7 km seems to be the most suitable in terms of both prediction results and keeping as
much information as possible. Additionally, since we cannot rely on the evaluation scores to support this conclusion,
we estimated the variability of prediction values across the study territory. It was the highest for
the datasets obtained at 10 and 7 km thinning distances. The outputs of the model built on the 7 km distance data
were more diverse within the 100 km blocks that were used for spatial cross-validation (Figs. S2, S3).
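The idea of block-based spatial cross-validation can be sketched as follows. This is only an illustration under stated assumptions: the block size of 100 km comes from the text, while the equirectangular approximation, the reference latitude, the round-robin fold assignment and all function names are hypothetical choices made for the example, not the procedure used in the study.

```python
import math
from collections import defaultdict

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def block_id(lat, lon, block_km=100.0, ref_lat=55.0):
    """Index of the block_km x block_km block containing a point.

    Uses a simple equirectangular approximation around ref_lat (an
    assumption; an equal-area projection would be more accurate).
    """
    y_km = lat * KM_PER_DEGREE
    x_km = lon * KM_PER_DEGREE * math.cos(math.radians(ref_lat))
    return (math.floor(x_km / block_km), math.floor(y_km / block_km))

def spatial_folds(points, n_folds=5, block_km=100.0):
    """Assign each point a fold so that a block is never split across folds."""
    blocks = defaultdict(list)
    for idx, (lat, lon) in enumerate(points):
        blocks[block_id(lat, lon, block_km)].append(idx)
    folds = [None] * len(points)
    # Round-robin over sorted block ids keeps the assignment deterministic.
    for k, key in enumerate(sorted(blocks)):
        for idx in blocks[key]:
            folds[idx] = k % n_folds
    return folds

# Toy coordinates (lat, lon); the first two fall into the same block.
points = [(55.75, 37.62), (55.76, 37.63), (59.94, 30.31), (43.60, 39.73)]
folds = spatial_folds(points, n_folds=3)
```

Because neighbouring points always share a fold, a model is never validated on records that lie next to its training records, which is what makes the score less sensitive to spatial clustering.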
Features selected for modelling. To avoid over-fitting caused by redundant variables, an important
part of the SDM procedure was choosing the most meaningful set of them, corresponding to the observed
HS occurrence. To do this, several approaches were combined: a search for highly correlated features and estimation
of the importance of features by the Mean Decrease Gini (MDG) and the Mean Decrease Accuracy (MDA)
scores. Thus, the general workflow consisted of 3 steps: (1) generation of the correlation matrix; (2) estimation
of MDG and MDA scores; (3) picking highly important non-correlated features and, among correlated features,
choosing those that demonstrate higher importance according to both MDG and MDA scores. The first
step of selection includes a search for highly correlated features and the formation of sets of mutually exclusive
covariates with an absolute value of the Pearson correlation coefficient greater than 0.8. The correlations are
shown in Fig. S5. From the group of bioclimatic variables, the following subsets of features demonstrated
high correlation coefficients with each other:
BIO1, BIO6, BIO9, BIO11;
BIO4, BIO7, BIO16;
BIO5, BIO10;
BIO16, BIO13;
BIO13, BIO14, BIO18, BIO17, BIO12.
Then, based on the variable importance results obtained by MDA and MDG, the most important features were
selected and included in the core list for the predictions: BIO8, BIO10, BIO13, BIO15, BIO19. Additionally,
the BIO1 and BIO9 features demonstrated approximately equal importance in the corresponding forecasts. Thus,
we built different RF models with the core list of features, including only BIO9 in the first variant and only BIO1
in the second one. In a comparison of the modelling results, BIO1 demonstrated higher importance, so it was
included in the final list of features.
Using the same approach, the selection of soil properties was performed. According to the
correlation matrix (Fig. S5), soil properties do not have correlation coefficients equal to 0.8 or more in absolute
Figure1. Flowchart of the approach.
values with bioclimatic variables. However, SOC and Sand content demonstrated a fairly high correlation. CF,
Silt and Sand did not show high importance in the corresponding analyses. Thus, from the soil features, the final
list included only CEC and SOC. Therefore, the following list of features was used to train the algorithm: SOC,
CEC, BIO1, BIO8, BIO10, BIO13, BIO15 and BIO19 (Fig. S6).
According to Fig. S6, BIO13 and BIO8 demonstrated the highest importance in predicting HS distribution.
Soil properties are more important according to MDA than according to MDG. BIO1 and BIO10
demonstrated less importance according to MDA, whereas CEC and BIO19 showed the same pattern for MDG.
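The combination of a correlation filter with importance ranking can be illustrated by a short sketch. It is a simplified stand-in rather than the procedure used in the study: it assumes a single importance score per feature (the study combined MDG and MDA from Random Forest), applies the 0.8 threshold greedily, and uses made-up toy values and hypothetical function names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(columns, importance, threshold=0.8):
    """Greedy filter: visit features from most to least important and keep
    one only if |r| <= threshold with every feature already kept."""
    kept = []
    for name in sorted(columns, key=lambda f: importance[f], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (made up): BIO5 is almost a linear function of BIO10.
columns = {
    "BIO10": [15.0, 17.0, 19.0, 21.0, 23.0],
    "BIO5": [21.0, 23.5, 25.0, 27.5, 29.0],
    "BIO15": [30.0, 12.0, 25.0, 18.0, 22.0],
}
importance = {"BIO10": 0.30, "BIO5": 0.10, "BIO15": 0.20}  # hypothetical scores
selected = select_features(columns, importance)
```

In this toy case the less important member of the correlated pair is dropped, while the weakly correlated feature is retained.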
Possible habitat suitability in the future. Using the set of environmental predictors obtained at the
feature selection stage, we modelled the possible future spread of HS across the study territory. To do this,
we estimated the distribution of bioclimatic variables according to the available global climate models. From the
obtained results, we see that CNRM-CM6-1 and BCC-CSM2-MR show almost identical results in general, as
well as between the chosen SSPs (Fig. 3, Fig. S7). According to them, only 6–8% of the study territory in the future
is characterised by prediction values less than 0.25. The CanESM5 results show a slightly different picture: the
percentage of the likely unsuitable territory drops even further, to 3% under the better scenario, SSP126, and to
0.7% under the worst atmospheric concentration scenario, SSP585.
Figure2. Maps of prediction of possible distribution of HS in current climate conditions using dierent
thinning distances and, consequently, amounts of input points. e quality of prediction varies signicantly,
while the model built on the full dataset is obviously overtted.
Sampling bias is a frequent problem in SDM because presence-only data are collected predominantly in more
accessible locations32. The spatial sampling bias also leads to environmental bias in the data set33. Ignoring the
problem leads to inaccurate forecasting and can be followed by an inappropriate risk assessment of invasive
species and mistakes in conservation planning. The number of methods to tackle the problem of sampling bias
is limited. For rare records, a new Poisson point process model has been proposed34, where the environmental effect
and sampling bias are explicitly modelled in separate clusters in a framework of quasi-linear modelling. Another
example of solving this problem is the incorporation of background (pseudo-absence) points with the same
sampling bias as the distribution of occurrence points33. If the number of records is sufficient, spatial filtering is
commonly used, in which some part of the occurrence points is discarded. We combined two methods to overcome the
sampling bias of HS occurrence points: spatial filtering and the creation of the same effect of sampling bias
among pseudo-absence points.
The main limitation of the data thinning method is that we have to assume a particular level of sampling
bias to proceed to the following modelling, while in reality it is unknown. This is why the search for
the most appropriate number of discarded occurrence records is not straightforward. Therefore, we
used three distances, 4 km, 7 km and 10 km, to compare with the initial dataset. The estimates for the current
climate (Fig. 2) demonstrate significant differences in predicted habitat suitability that are highly dependent on
the thinning process. At the same time, using the initial dataset in forecasting leads to inaccurate results, which
demonstrate over-fitting. Our results show that spatial filtering improves forecasting: the higher the
number of discarded points, the higher the number of locations that are predicted as more suitable for HS.
Nevertheless, the large distance of 10 km excludes the largest number of points, which leads to an increase in the
uncertainty of the model. Therefore, the distance of 7 km was considered optimal in our study. However, it must not
be forgotten that the unknown level of sampling bias is one limitation of our implementation.
Figure3. Maps of prediction for possible distribution in future climate conditions on the example of selected
global climate models CNRM-CM6-1, CanESM5, BCC-CSM2-MR, in two scenarios of Shared Socioeconomic
Pathways—SSP126 and SSP585.
e characterization ofthe possible future expansion of invasive species might be a powerful tool for the
highlighting of the importance of conservation management. Research results allow considering possible shis
in the ecosystem components on dierent levels without actions to be taken.
According to the modelling results, the main change between current and possible future distributions is
the noticeable increase of the average possible suitability across the territory from 0.28 in current climate
conditions to 0.37–0.4 at SSP126 and to 0.37–0.43 at SSP585 in future climate, considering the results of the selected
models (Table S1). Moreover, in the future conditions, the study territory is covered more or less uniformly
by prediction values higher than 0.25: on average, 89% is potentially suitable for HS spread, which is twice
as much as in the current climate predictions based on the selected data shape. At the same time,
the maximum prediction value in future climate decreases from 1 to 0.7. The territory covered by prediction
values above 0.25 includes the high latitudes above 60°, especially in the SSP585 scenario according to the
CanESM5 model.
According to established knowledge, light availability in combination with disturbed upper soil cover is
necessary for the development of HS plants at the early stages35. Thus, the conditions of long daylight in northern
latitudes during the period of active vegetation might especially favour the plants’ growth in well-lit
locations with a suitable wetness regime. Therefore, HS in future climate might pose a serious danger to the
territories that are currently characterised by low biodiversity and productivity. It can additionally be considered
that the active spread of HS can influence the carbon cycle by reducing biodiversity and changing the structure
of the soil profile. HS is reported not to form litter36, so the structure of the microorganisms’ community should
shift, while soil properties are expected to degrade due to a decrease of carbon stock37.
Among the popular algorithms, we chose the Random Forest model as the most suitable for our case. The data
required for predictions can be divided into plant occurrence records and environmental features. Bioclimatic
variables and soil properties were selected as the main environmental features. All of the data were obtained
from open sources.
Heracleum sosnowskyi plant description. Heracleum sosnowskyi is a monocarpic perennial plant of
the Apiaceae family. The height is up to 3–5 m with a straight stem up to 12 cm in diameter. HS compound stem
leaves can reach 150 cm in both length and width38. The blooming period starts in July and continues until the end of
September. Plant reproduction is performed by seeds only. The seeds’ depth of germination is reported as mainly
in the upper 5 cm, down to 15 cm of soil. One plant can produce 10,000–20,000 seeds39,40. Seeds germinate in the early
spring, while some have reported that a period of cold stratification for the dormancy break is obligatory for
germination. Suitable conditions for HS include a temperate climate with warm humid summers and
cold winters, while it is probably not drought resistant. HS plants prefer neutral soils with a pH range from 6
to 7, rich in nutrients, and are reported as nitrophilous, so the eutrophication of the environment favours HS
development. HS plants do not tolerate shade conditions in the first growing period.
HS is mostly spread in artificial and semi-natural habitats, including grasslands, pastures, parks, roadsides,
agricultural fields, riverbanks or canal sides, and other disturbed habitats. Currently, the main pathways of
spread include an involuntary entry with soil on vehicles, machinery, footwear or the use of soil as a commodity
(as the growing medium rich in organic matter)39.
Study area. The area for modelling extends from approximately 41° to 70° N and from 27° to 60° E and includes the
Kaliningrad region; it covers approximately 4 mln km2 (Fig. 4).
The European part of Russia is the most inhabited part of the country, and it is home to approximately
80% of the total population of Russia. It includes the East European Plain, the Caucasus mountains and the Ural
mountains, with the predominance of the East European Plain. Environmental characteristics across the study
territory vary significantly. The climate changes from semi-arid in the south to subarctic in the north, including
humid continental climate conditions. Natural vegetation is represented by almost all types of biomes with the
prevalence of different types of forests: broadleaf and mixed forests, coniferous forests, and boreal forests (taiga),
while the area of arable lands is reported to be approximately 650,000 km2 41,42. The territory is subject to
constant changes in land-use types and land cover due to urbanization and changes in the status of arable lands, i.e.
reduction of croplands and development of fallows and forests, and, vice versa, the return of some of them into
cultivation43. The soil cover is represented by groups with contrasting physicochemical properties:
in the northern part by Luvisols, Podzols and Histosols, and in the southern part by Chernozems, Kastanozems,
Collection of the input data. Plant occurrence data. Plant occurrence coordinates were collected from
several publicly available sources related to citizen science projects: the Global Biodiversity Information Facility
database45, the iNaturalist database46, and the database of the “Antiborschevik” community47. Records were
documented by human observation and collected from 2000 to 2021. The overall number of initial occurrence points
from the combined sources is 7637.
Environmental predictors. Climate data. Modelling was performed for current and future climate conditions, the latter
under two scenarios; the selected year ranges were 2000–2018 and 2040–2060, respectively.
Climatic variables were collected from the WorldClim database48, which contains average seasonal information
relevant to the physiological characteristics of species and is available at different resolutions. We chose the
10 arc-minute spatial resolution, taking into account the size of the studied area. Table 1 provides a short description
ofthe used bioclimatic features, and we refer the reader to the Worldclim project for detailed information on
the variables’ calculation.
For the future climate scenarios, we used two Shared Socioeconomic Pathways (SSPs)49, 1-2.6 and 5-8.5,
corresponding to the lowest (keeping the global mean temperature increase below 2 °C) and the highest
(population growth without technological change) predicted future greenhouse gas emission scenarios.
For these data, we took the same resolution (10 arc-minutes) as discussed above.
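As a rough worked example of what a 10 arc-minute cell means on the ground: a cell spans 1/6 of a degree, its north-south extent is nearly constant, and its east-west extent shrinks with the cosine of latitude. The sketch below assumes a spherical Earth with about 111.32 km per degree; the function name is hypothetical.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree on a spherical Earth

def cell_size_km(lat_deg, arc_minutes=10.0):
    """Approximate (east-west, north-south) size in km of a grid cell."""
    degrees = arc_minutes / 60.0
    north_south = degrees * KM_PER_DEGREE
    east_west = north_south * math.cos(math.radians(lat_deg))
    return east_west, north_south

# Cell sizes near the southern edge, the middle and the northern edge of the study area.
for lat in (41.0, 55.0, 70.0):
    ew, ns = cell_size_km(lat)
    print(f"{lat:4.0f} deg N: {ew:5.1f} km x {ns:4.1f} km")
```

The north-south extent is about 18.6 km at every latitude, while the east-west extent is roughly half of that at 60° N.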
Figure4. Map of the study area: white colour represents the territory used for prediction, red points
correspond to the dataset of HS occurrence, collected from the available sources.
Table 1. Description of used bioclimatic variables.
Parameter   Full name
BIO1        Annual mean temperature
BIO2        Mean diurnal range
BIO3        Isothermality
BIO4        Temperature seasonality
BIO5        Max temperature of warmest month
BIO6        Min temperature of coldest month
BIO7        Temperature annual range
BIO8        Mean temperature of wettest quarter
BIO9        Mean temperature of driest quarter
BIO10       Mean temperature of warmest quarter
BIO11       Mean temperature of coldest quarter
BIO12       Annual precipitation
BIO13       Precipitation of wettest month
BIO14       Precipitation of driest month
BIO15       Precipitation seasonality
BIO16       Precipitation of wettest quarter
BIO17       Precipitation of driest quarter
BIO18       Precipitation of warmest quarter
BIO19       Precipitation of coldest quarter
We used the equilibrium climate sensitivity to select the climate models for modelling the future HS distribution.
Equilibrium climate sensitivity (ECS) is defined as the global mean surface air temperature change due to a rapid
doubling of carbon dioxide concentrations once the associated ocean–atmosphere–sea ice system reaches
equilibrium. As the ECS value increases, the model’s sensitivity to the CO2 concentration in the atmosphere
increases. We have chosen the CanESM5 model (ECS 5.6), the CNRM-CM6-1 model (ECS 4.3) and the BCC-CSM2-MR
model (ECS 3.0)50.
For the future climate scenarios we selected three climate models:
BCC-CSM2-MR: Beijing Climate Center climate system model developed at the Beijing Climate Center, China
Meteorological Administration51. Horizontal resolution 1.125° by 1.125°.
CanESM5: Canadian Earth System Model version 5 developed at the Canadian Center for Climate Modelling
and Analysis, Canada52. Horizontal resolution 2.81° by 2.81°.
CNRM-CM6-1: climate model developed at the National Center of Meteorological Research, France53. Horizontal
resolution 1.4° by 1.4°.
Authors of the WorldClim project prepared historical and future climate data to a uniform spatial (10 arc-
minutes) and temporal resolution.
Soil data. Soil data were downloaded from the SoilGrids database54, a system for global digital soil mapping.
SoilGrids provides continuous data at several depths on the spatial distribution of soil properties across
the globe at a selected resolution. It uses a machine learning approach to reconstruct continuous data from
230,000 soil profile observations from the WoSIS (the World Soil Information Service) database and a series of
environmental covariates.
From the whole set of data provided by SoilGrids, several properties were chosen for the forecasting: the relative
percentage of silt (Silt, %) and sand (Sand, %), the volumetric fraction of coarse fragments (CF, %), cation exchange
capacity (CEC) and soil organic carbon (SOC, g/kg) at the depth of 5–15 cm, where the HS seeds are
assumed to be located. These variables are expected to be more stable over time than bioclimatic predictors; thus,
the chosen soil properties could be used for the future period unchanged from the present.
Data pre‑processing. All the data were transformed to the ASCII format by an R script and using the software
DIVA-GIS, following the tutorial for the preparation of WorldClim files for use in SDM
(http://www.lep-net.org/wp-content/uploads/2016/08/WorldClim_to_MaxEnt_Tutorial.pdf), with unified selected resolution 340
Optimization of the occurrence points amount. The general problem in using data collected from
the databases of citizen science projects is that the observation points are distributed non-uniformly. For
instance, the frequency of the records depends directly on population density. Spatial filtering of
the data (reducing the number of points) can be performed to reduce the sampling bias55. We prepared three
datasets by filtering the initial dataset with a distance between points of 4, 7 and 10 km, which yielded 2402,
1846 and 1504 occurrence points, respectively. For the thinning step, the thin() function from the R package
spThin was used with 100 iterations for each of the chosen thinning distances. To understand how much data we
could lose, we analysed the feature distributions and evaluated the general fairness of the model performance.
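The thinning step described above can be illustrated with a minimal Python sketch. It is not the spThin implementation: it assumes a simplified greedy rule (visit records in random order and keep a record only if it is far enough from all records already kept) and repeats the randomised pass, returning the largest retained subset. The function names and the toy coordinates are hypothetical.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def thin_once(points, min_dist_km, rng):
    """One randomised greedy pass: keep a point only if it lies at least
    min_dist_km away from every point that has already been kept."""
    order = points[:]
    rng.shuffle(order)
    kept = []
    for p in order:
        if all(haversine_km(p, k) >= min_dist_km for k in kept):
            kept.append(p)
    return kept


def thin(points, min_dist_km, reps=100, seed=0):
    """Repeat the randomised pass `reps` times and return the largest subset found."""
    rng = random.Random(seed)
    best = []
    for _ in range(reps):
        candidate = thin_once(points, min_dist_km, rng)
        if len(candidate) > len(best):
            best = candidate
    return best


# Toy occurrence records (lat, lon): three close together, one far away.
records = [(55.75, 37.62), (55.76, 37.63), (55.80, 37.60), (59.93, 30.33)]
thinned = thin(records, min_dist_km=10.0)
print(len(thinned), "of", len(records), "records kept")
```

Repeating the pass matters because the retained subset depends on the visiting order; keeping the largest result limits the loss of occurrence points at a given thinning distance.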
Pseudo-absence generation. Because only presence points are available, absence points must be generated
before the selected algorithm can be applied. Although the generation of pseudo-absence points in SDM
research is a widespread solution, a closer look at the literature reveals several gaps and
shortcomings. Since the raw dataset of the HS distribution demonstrates strong sampling bias, generating
pseudo-absence points with the usual ‘random’ strategy can aggravate the sampling bias problem. Thus, a
combination of the ‘disk’ and ‘random’ strategies was applied to generate the pseudo-absence points
using the biomod R package17.
The ‘disk’ strategy uses geographic distance to separate true presence points from possible absence
points. The optimal geographic distance for HS was chosen as 25 km. This distance was chosen
empirically by trial and error. We started with 18 km (because the size of the cell is 9–18 km depending on
location) and finished with 50 km. Using distances of 30–50 km led to a positive spatial autocorrelation.
Thus, we decided to set 25 km, which finally provided both optimal model performance and reduced
spatial autocorrelation.
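As a rough illustration of the distance rule behind the ‘disk’ strategy, the following Python sketch draws random candidate points inside a bounding box and keeps only those at least 25 km from every presence point. It is a simplified stand-in, not the biomod routine: the function names, the bounding box and the retry limit are assumptions made for the example.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def disk_pseudo_absences(presences, bbox, n, min_dist_km=25.0, seed=0,
                         max_tries=100_000):
    """Sample up to n pseudo-absence points that are at least min_dist_km
    away from every presence point.

    bbox is (lat_min, lat_max, lon_min, lon_max). Candidates are drawn
    uniformly in latitude/longitude, which is not uniform in area; this is
    acceptable for a sketch but not for a real study.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    rng = random.Random(seed)
    absences = []
    tries = 0
    while len(absences) < n and tries < max_tries:
        tries += 1
        cand = (rng.uniform(lat_min, lat_max), rng.uniform(lon_min, lon_max))
        if all(haversine_km(cand, p) >= min_dist_km for p in presences):
            absences.append(cand)
    return absences


# Hypothetical presence records and study window.
presences = [(55.75, 37.62), (56.10, 38.00)]
absences = disk_pseudo_absences(presences, bbox=(54.0, 58.0, 35.0, 40.0), n=10)
print(len(absences), "pseudo-absence points generated")
```

Raising `min_dist_km` pushes pseudo-absences further from known occurrences, which is the parameter the text reports tuning between 18 and 50 km.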
e second part ofthe generation was based onthe ‘random’ strategy with ltration: according to the dier-
ent range of climate conditions on the territory of Russia, there are several places where HS is not detected,
thus not growing. e selection of unsuitable places for HS related to the north of Russia, where it is might
be too cold for plant species. From all amount of randomly generated generated points we selected points
with condition latitude
, according to tundra board line.
Features selection procedure. To avoid over-fitting and to choose the most suitable set of parameters
for final modelling, two approaches were combined. We searched for features that are not correlated with others,
using a selected threshold of 0.8 in absolute value56, and estimated variable importance using the Mean
Decrease Gini (MDG) and the Mean Decrease Accuracy (MDA) obtained by modelling on enumerated
combinations of parameters. The MDG score is related to the homogeneity of the nodes and leaves. The higher
the MDG score, the more important the corresponding feature. MDA describes how much
accuracy decreases when the feature is removed. We selected the most important features as those with the highest
values of both the MDG and MDA scores, using a sequential search from an initial set of variables.
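The correlation part of this procedure can be sketched in Python as follows. This is an illustrative assumption about how the two criteria might be combined, not the authors' code: features are visited in order of importance (for example, a ranking derived from MDG or MDA, supplied here by hand), and a feature is kept only if its absolute Pearson correlation with every feature already kept does not exceed 0.8. The feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences.
    Assumes neither sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, ranked_names, threshold=0.8):
    """Greedy filter over features ordered from most to least important.

    features: dict mapping feature name -> list of values.
    ranked_names: feature names sorted by decreasing importance.
    """
    kept = []
    for name in ranked_names:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictors: temp_b is almost a linear copy of temp_a.
features = {
    "temp_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp_b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "silt": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(features, ["temp_a", "temp_b", "silt"]))
```

Visiting features in importance order means that, within a correlated pair, the more important feature is the one retained.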
Modelling approach. Random forest. Choosing an appropriate method for building an accurate SDM tool
is crucial because the overall performance can vary dramatically depending on the selected model and
the particular use case. There is a limited number of machine learning methods suitable for SDM.
Several popular methods have demonstrated high performance in modelling over large areas: GBM, RF, and GLM.
In particular, GLM and RF have been used for modelling and predicting the potential distribution of invasive
species57. We decided to use RF because this model has been successfully applied to a variety of tasks, such
as predicting animal and plant distributions, and has also been used for making predictions over a large territory58.
Another important advantage is the straightforward interpretability of RF, which makes it possible
to evaluate the impact of each environmental parameter on the occurrence of the invasive species.
Approach to the cross-validation of the model. A dedicated approach to model calibration is needed to reduce
spatial autocorrelation caused by the absence of a strict sampling design. In our case, the data were split into
training and testing folds using the spatial blocks technique in a 13-fold cross-validation scheme. Random
spatial splitting was performed 20 times to calibrate the model, with the distance between blocks set to 100 km.
We used the spatial blocks approach with the random type from the R package blockCV.
Evaluation of the model performance. To evaluate the performance of the model, a classic approach in ecology
was used: the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), which belongs to the
threshold-independent techniques16. The method is based on the standard confusion matrix, where rows and columns
represent actual and predicted classes. The construction of the ROC curve uses all possible thresholds to obtain
different confusion matrices, which yields a curve in two-dimensional space: (1) the y-axis is the True Positive
Rate (sensitivity, recall); (2) the x-axis is the False Positive Rate (equal to 1 − specificity). In our
case, the true positive (TP) rate, or sensitivity, is the share of places where HS actually grows that the model
predicts as presence. Similarly, the true negative (TN) rate, or specificity, is the share of absence locations
that are classified correctly. In contrast, places that the model incorrectly predicts as presence points are
False Positives (FP), and places where HS is absent according to the model, while it is actually present, are
False Negatives (FN).
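The construction of the ROC curve from confusion-matrix counts can be made concrete with a short Python sketch. It sweeps the threshold over all distinct predicted scores, records the (FPR, TPR) pair at each threshold, and integrates the curve with the trapezoidal rule. The labels and scores are invented for illustration, and both classes are assumed to be present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold over all
    distinct scores; a point is predicted as presence when score >= threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# 1 = presence, 0 = pseudo-absence; scores are hypothetical suitability values.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
curve = roc_points(y_true, scores)
print("AUC =", round(auc(curve), 4))
```

In this toy case 8 of the 9 presence/absence pairs are ranked correctly, so the AUC equals 8/9, which matches the interpretation of AUC as the probability that a presence point receives a higher score than an absence point.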
Code availability
For the data, preprocessing and modelling details needed to reproduce the calculations, we refer the reader to the
repository of the project: https://github.com/heracleumsosnowskyi/Heracleum_Sosnowskyi.
Received: 11 February 2022; Accepted: 30 March 2022
1. Pyšek, P. & Richardson, D. M. Invasive species, environmental change and management, and health. Annu. Rev. Environ. Resour.
35, 25–55 (2010).
2. Gren, M., Campos, M. & Gustafsson, L. Economic development, institutions, and biodiversity loss at the global scale. Reg. Environ.
Change 16, 445–457 (2016).
3. Duenas, M.-A., Hemming, D. J., Roberts, A. & Diaz-Soltero, H. The threat of invasive species to IUCN-listed critically endangered
species: A systematic review. Glob. Ecol. Conserv. 26, e01476 (2021).
4. Charles, H. & Dukes, J. S. Impacts of invasive species on ecosystem services. In Biological Invasions (eds Charles, H. & Dukes, J.
S.) 217–237 (Springer, 2008).
5. R ani, F. et al. From nucleotides to satellite imagery: Approaches to identify and manage the invasive pathogen Xylella fastidiosa
and its insect vectors in europe. Sustainability 12, 4508 (2020).
6. Gallardo, B., Clavero, M., Sánchez, M. I. & Vilà, M. Global ecological impacts of invasive species in aquatic ecosystems. Glob.
Change Biol. 22, 151–163 (2016).
7. Wardle, D. A. & Peltzer, D. A. Impacts of invasive biota in forest ecosystems in an aboveground-belowground context. Biol. Invas.
19, 3301–3316 (2017).
8. Linders, T. E. W. et al. Direct and indirect effects of invasive species: Biodiversity loss is a major mechanism by which an invasive
tree affects ecosystem functioning. J. Ecol. 107, 2660–2672 (2019).
9. Mehta, S. V., Haight, R. G., Homans, F. R., Polasky, S. & Venette, R. C. Optimal detection and control strategies for invasive species
management. Ecol. Econ. 61, 237–245 (2007).
10. Crowley, S. L., Hinchliffe, S. & McDonald, R. A. Conflict in invasive species management. Front. Ecol. Environ. 15, 133–141 (2017).
11. Epanchin-Niell, R. S. Economics of invasive species policy and management. Biol. Invas. 19, 3333–3354 (2017).
12. Day, R. et al. Invasive Species: The Hidden Threat to Sustainable Development (CABI Publications, 2018).
13. Baležentienė, L., Stankevičienė, A. & Snieškienė, V. Heracleum sosnowskyi (Apiaceae) seed productivity and establishment in
different habitats of central Lithuania. Ekologija 59, 123 (2013).
14. Nielsen, C. et al. The giant hogweed best practice manual guidelines for the management and control of an invasive weed in Europe.
For. Landsc. Denmark Hoersholm 44, 44 (2005).
15. Williams, M. I. & Dumroese, R. K. Preparing for climate change: Forestry and assisted migration. J. For. 111, 287–297 (2013).
16. Pecchi, M. et al. Reviewing climatic traits for the main forest tree species in Italy. iForest-Biogeosci. For. 12, 173 (2019).
17. Thuiller, W., Lafourcade, B., Engler, R. & Araújo, M. B. BIOMOD – A platform for ensemble forecasting of species distributions.
Ecography 32, 369–373 (2009).
18. Naimi, B. & Araújo, M. B. sdm: A reproducible and extensible R platform for species distribution modelling. Ecography 39, 368–375
19. Brown, J. L., Bennett, J. R. & French, C. M. SDMtoolbox2.0: The next generation Python-based GIS toolkit for landscape genetic,
biogeographic and species distribution model analyses. PeerJ 5, e4095 (2017).
20. Golding, N. et al. The zoon R package for reproducible and shareable species distribution modelling. Methods Ecol. Evol. 9, 260–268
21. Tikhonov, G. et al. Joint species distribution modelling with the r-package Hmsc. Methods Ecol. Evol. 11, 442–447 (2020).
22. Elith, J. & Graham, C. H. Do they? How do they? Why do they differ? On finding reasons for differing performances of species
distribution models. Ecography 32, 66–77 (2009).
23. Guo, Y., Li, X., Zhao, Z. & Nawaz, Z. Predicting the impacts of climate change, soils and vegetation types on the geographic dis-
tribution of Polyporus umbellatus in China. Sci. Total Environ. 648, 1–11 (2019).
24. Hao, T., Elith, J., Guillera-Arroita, G. & Lahoz-Monfort, J. J. A review of evidence about use and performance of species distribu-
tion modelling ensembles like BIOMOD. Divers. Distrib. 25, 839–852 (2019).
25. Mi, C., Huettmann, F., Guo, Y., Han, X. & Wen, L. Why choose Random Forest to predict rare species distribution with few samples
in large undersampled areas? Three Asian crane species models provide supporting evidence. PeerJ 5, e2849 (2017).
26. Robinson, N. M., Nelson, W. A., Costello, M. J., Sutherland, J. E. & Lundquist, C. J. A systematic review of marine-based species
distribution models (SDMs) with recommendations for best practice. Front. Mar. Sci. 4, 421 (2017).
27. Jarvie, S. & Svenning, J.-C. Using species distribution modelling to determine opportunities for trophic rewilding under future
scenarios of climate change. Philos. Trans. R. Soc. B Biol. Sci. 373, 20170446 (2018).
28. Hertzog, L. R., Besnard, A. & Jay-Robert, P. Field validation shows bias-corrected pseudo-absence selection is the best method for
predictive species-distribution modelling. Divers. Distrib. 20, 1403–1413 (2014).
29. Hattab, T. et al. A unified framework to model the potential and realized distributions of invasive species within the invaded range.
Divers. Distrib. 23, 806–819 (2017).
30. Bertolino, S. et al. Spatially explicit models as tools for implementing effective management strategies for invasive alien mammals.
Mamm. Rev. 50, 187–199 (2020).
31. Chapman, D. S., Scalone, R., Štefanić, E. & Bullock, J. M. Mechanistic species distribution modeling reveals a niche shift during
invasion. Ecology 98, 1671–1680 (2017).
32. Reddy, S. & Dávalos, L. M. Geographical sampling bias and its implications for conservation priorities in Africa. J. Biogeogr. 30,
1719–1727 (2003).
33. Phillips, S. J. et al. Sample selection bias and presence-only distribution models: Implications for background and pseudo-absence
data. Ecol. Appl. 19, 181–197 (2009).
34. Komori, O., Eguchi, S., Saigusa, Y., Kusumoto, B. & Kubota, Y. Sampling bias correction in species distribution models by quasi-
linear Poisson point process. Ecol. Inform. 55, 101015 (2020).
35. Ozerova, N., Shirokova, V., Krivosheina, M. & Petrosyan, V. The spatial distribution of Sosnowsky’s hogweed (Heracleum sosnowskyi)
in the valleys of big and medium rivers of the East European Plain (on materials of field studies 2008–2016). Russ. J. Biol.
Invas. 8, 327–346 (2017).
36. Glushakova, A., Kachalkin, A. & Chernov, I. Y. Soil yeast communities under the aggressive invasion of Sosnowsky’s hogweed
(Heracleum sosnowskyi). Eurasian Soil Sci. 48, 201–207 (2015).
37. Zhu, H., Gong, L., Ding, Z. & Li, Y. Effects of litter and root manipulations on soil carbon and nitrogen in a Schrenk’s spruce (Picea
schrenkiana) forest. PLoS ONE 16, e0247725 (2021).
38. Dalke, I. V. et al. Traits of Heracleum sosnowskyi plants in monostand on invaded area. PLoS ONE 10, e0142833 (2015).
39. EPPO. Report of a Pest Risk Analysis, 09-15075 (2009).
40. Jakubowicz, O. et al. Heracleum sosnowskyi Manden. Ann. Agric. Environ. Med. 19, 327 (2012).
41. Maltsev, K. & Yermolaev, O. Potential soil loss from erosion on arable lands in the European part of Russia. Eurasian Soil Sci. 52,
1588–1597 (2019).
42. Blinnikov, M. S. A Geography of Russia and Its Neighbors (Guilford Publications, 2021).
43. Gusarov, A. V. Land-use/-cover changes and their effect on soil erosion and river suspended sediment load in different landscape
zones of European Russia during 1970–2017. Water 13, 1631 (2021).
44. WRB. World Reference Base for Soil Resources 2014, Update 2015 International Soil Classification System for Naming Soils and
Creating Legends for Soil Maps 192 (WRB, 2015).
45. www.GBIF.org. (Accessed 17 August 2021).
46. www.iNaturalist.org. (Accessed 17 August 2021).
47. http://Antiborschevik.info. (Accessed 17 August 2021).
48. Fick, S. E. & Hijmans, R. J. WorldClim 2: New 1-km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 37,
4302–4315 (2017).
49. O’Neill, B. C. et al. The scenario model intercomparison project (ScenarioMIP) for CMIP6. Geosci. Model Dev. 9, 3461–3482.
https://doi.org/10.5194/gmd-9-3461-2016 (2016).
50. Meehl, G. A. et al. Context for interpreting equilibrium climate sensitivity and transient climate response from the CMIP6 Earth
system models. Sci. Adv. 6, eaba1981. https://doi.org/10.1126/sciadv.aba1981 (2020).
51. Wu, T. et al. The Beijing climate center climate system model (BCC-CSM): The main progress from CMIP5 to CMIP6. Geosci.
Model Dev. 12, 1573–1600. https://doi.org/10.5194/gmd-12-1573-2019 (2019).
52. Swart, N. C. et al. The Canadian earth system model version 5 (CanESM5.0.3). Geosci. Model Dev. 12, 4823–4873 (2019).
53. Voldoire, A. et al. Evaluation of CMIP6 DECK experiments with CNRM-CM6-1. J. Adv. Model. Earth Syst. 11, 2177–2213.
https://doi.org/10.1029/2019MS001683 (2019).
54. Hengl, T. et al. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 12, e0169748 (2017).
55. Faurby, S. & Araújo, M. B. Anthropogenic range contractions bias species climate change forecasts. Nat. Clim. Change 8, 252–256
56. Dormann, C. F. et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance.
Ecography 36, 27–46 (2013).
57. Adhikari, A., Rew, L. J., Mainali, K. P., Adhikari, S. & Maxwell, B. D. Future distribution of invasive weed species across the major
road network in the state of Montana, USA. Reg. Environ. Change 20, 60 (2020).
58. Marx, M. & Quillfeldt, P. Species distribution models of European Turtle Doves in Germany are more reliable with presence only
rather than presence absence data. Sci. Rep. 8, 1–13 (2018).
The work was supported by the Analytical center under the RF Government (subsidy agreement
000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). We thank the “Antiborschevik” community,
Maria Popova and Nikolay Petrov personally, for help with territorial research and providing the occurrence data.
Author contributions
M.P., P.T., D.S., M.G. contributed to the original concept of the study, D.K., M.G., P.T. collected the data, D.K.
developed the modelling design, performed calculations, D.K., P.T., D.S., M.G. reviewed and interpreted modelling
output, D.K., P.T., M.G. visualised the results. D.K., P.T., D.S., M.G. wrote original draft.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1038/s41598-022-09953-9.
Correspondence and requests for materials should be addressed to D.K.
Reprints and permissions information is available at
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2022
... A practical example of ML implementation in the environmental context is [31], in which the authors investigate the distribution of the plant Heracleum Sosnowskyi, which is considered an invasive species in northern and central Russia and some parts of Europe. Originally, the plant habitat was subalpine meadows and highland forests in the Caucasus region and northern parts of the Middle East, but in the 1970s, it was introduced in Europe as a silage plant and quickly spread in rural areas [32]. ...
... In [31], the authors proposed several prediction models based on publicly available environmental data to estimate whether hogweed will grow in certain areas by the year 2060. They implemented several climate models with the worst and best scenarios in terms of carbon emissions [34] and found that the most influential climatic parameters were the mean rainfall of the wettest months as well as the mean temperature of the wettest year quarter. ...
... They implemented several climate models with the worst and best scenarios in terms of carbon emissions [34] and found that the most influential climatic parameters were the mean rainfall of the wettest months as well as the mean temperature of the wettest year quarter. The findings are alarming: There are significant chances of rapid plant propagation to the north of 60 latitude, where it does not currently grow (as of 2022) due to the cold climate [31]. Nevertheless, the spread of hogweed can be monitored and controlled by assessing the flora images obtained from UAVs. ...
Full-text available
Artificial intelligence (AI) is a rapidly advancing area of research that encompasses numerical methods to solve various prediction, optimization, and classification/clustering problems. Recently, AI tools were proposed to address the environmental, social, and governance (ESG) challenges associated with sustainable business development. While many publications discuss the potential of AI, few focus on practical cases in the three ESG domains altogether, and even fewer highlight the challenges that AI may pose in terms of ESG. The current paper fills this gap by reviewing practical AI applications with a main focus on IT and engineering implementations. The considered cases are based on almost one hundred publicly available research manuscripts and reports obtained via online search engines. This review involves the study of typical business and production problems associated with each ESG domain, gives background details on several selected cases (such as carbon neutrality, land management, and ESG scoring), and lists challenges that the smart algorithms can pose (such as fake news generation and increased electricity consumption). Overall, it is concluded that, while many practical cases already exist, AI in ESG is still very far away from reaching its full potential; however, one should always remember that AI itself can lead to some ESG risks.
... In certain fields, particularly in environmental modeling, this challenge is not merely theoretical but manifests itself in tangible ways [7]. Several research studies illustrate how distribution bias can distort models, resulting in erroneous predictions and potentially misguided decisions [14]. ...
Full-text available
In machine learning models, the estimation of errors is often complex due to distribution bias, particularly in spatial data such as those found in environmental studies. We introduce an approach based on the ideas of importance sampling to obtain an unbiased estimate of the target error. By taking into account difference between desirable error and available data, our method reweights errors at each sample point and neutralizes the shift. Importance sampling technique and kernel density estimation were used for reweighteing. We validate the effectiveness of our approach using artificial data that resemble real-world spatial datasets. Our findings demonstrate advantages of the proposed approach for the estimation of the target error, offering a solution to a distribution shift problem. Overall error of predictions dropped from 7% to just 2% and it gets smaller for larger samples.
... The 'battle' against giant hogweed has become ingrained in both everyday life and politics over the course of the post-soviet period, being reinforced by language and narratives of warfare in the official discourse. In May 2022, scientists from the Skolkovo Institute of Science and Technology used Artificial Intelligence to predict that by 2040, there may be no areas left in the European part of Russia that are not overrun by hogweed (Koldasbayeva et al. 2022). While the language of the original report was rather neutral, the media was mostly using verbs like 'conquered', 'colonised', 'invaded'. ...
Full-text available
Human intervention in ecosystems has led to the accelerated dissemination of species deemed ‘invasive’, often accompanied by a perception of inherent malevolence. In Russia, the rampant spread of the giant hogweed has emerged as one of the most debated ecological issues in recent decades. The giant hogweed ( Heracleum ), a herbaceous monocarpic plant first discovered in the Caucasus in 1944, now proliferates throughout the country, from Sochi to Yamal and from the Arctic to downtown Moscow. It is estimated that the giant hogweed occupies over 10% of continental Europe within Russia, with projections suggesting an increase to nearly 100% within the next 30 years. Frequently, the plant is likened to a botanical emblem of Russia or a symbol of Putin's regime, reflecting the social tensions that oscillate between apathy and antipathy toward the hogweed in media and activism spheres.
... A random forest is now widely used in SDMs (e.g., Mi et al., 2017;Watt et al., 2021;Koldasbayeva et al., 2022). "A random forest is an ensemble of trees, where diversity among the predictors is obtained by bagging, and additionally by changing the feature set during learning. ...
Full-text available
Brown spot needle blight (BSNB), caused by Lecanosticta acicola (Thüm.) Syd., is an emerging forest disease of Pinus species originating from North America and introduced to Europe and Asia. Severity and spread of the disease has increased in the last two decades in North America and Europe as a response to climate change. No modeling work on spread, severity, climatic suitability, or potential distribution has been done for this important emerging pathogen. This study utilizes a global dataset of 2,970 independent observations of L. acicola presence and absence from the geodatabase, together with Pinus spp. distribution data and 44 independent climatic and environmental variables. The objectives were to (1) identify which bioclimatic and environmental variables are most influential in the distribution of L. acicola ; (2) compare four modeling approaches to determine which modeling method best fits the data; (3) examine the realized distribution of the pathogen under climatic conditions in the reference period (1971–2000); and (4) predict the potential future global distribution of the pathogen under various climate change scenarios. These objectives were achieved using a species distribution modeling. Four modeling approaches were tested: regression-based model, individual classification trees, bagging with three different base learners, and random forest. Altogether, eight models were developed. An ensemble of the three best models was used to make predictions for the potential distribution of L. acicola : bagging with random tree, bagging with logistic model trees, and random forest. Performance of the model ensemble was very good, with high precision (0.87) and very high AUC (0.94). The potential distribution of L. acicola was computed for five global climate models (GCM) and three combined pathways of Shared Socioeconomic Pathway (SSP) and Representative Concentration Pathway (SSP-RCP): SSP1-RCP2.6, SSP2-RCP4.5, and SSP5-RCP8.5. 
The results of the five GCMs were averaged on combined SSP-RCP (median) per 30-year period. Eight of 44 studied factors determined as most important in explaining L. acicola distribution were included in the models: mean diurnal temperature range, mean temperature of wettest quarter, precipitation of warmest quarter, precipitation seasonality, moisture in upper portion of soil column of wettest quarter, surface downwelling longwave radiation of driest quarter, surface downwelling shortwave radiation of warmest quarter and elevation. The actual distribution of L. acicola in the reference period 1971–2000 covered 5.9% of Pinus spp. area globally. However, the model ensemble predicted potential distribution of L. acicola to cover an average of 58.2% of Pinus species global cover in the reference period. Different climate change scenarios (five GCMs, three SSP-RCPs) showed a positive trend in possible range expansion of L. acicola for the period 1971–2100. The average model predictions toward the end of the century showed the potential distribution of L. acicola rising to 62.2, 61.9, 60.3% of Pinus spp. area for SSP1-RCP2.6, SSP2-RCP4.5, SSP5-RCP8.5, respectively. However, the 95% confidence interval encompassed 35.7–82.3% of global Pinus spp. area in the period 1971–2000 and 33.6–85.8% in the period 2071–2100. It was found that SSP-RCPs had a little effect on variability of BSNB potential distribution (60.3–62.2% in the period 2071–2100 for medium prediction). In contrast, GCMs had vast impact on the potential distribution of L. acicola (33.6–85.8% of global pines area). The maps of potential distribution of BSNB will assist forest managers in considering the risk of BSNB. The results will allow practitioners and policymakers to focus surveillance methods and implement appropriate management plans.
... [25,[37][38][39], to forecasting of natural processes and time series, e.g. [25,37,38,40]. The usage of the ANN is similar to that of DMD or POD: the general structure of ANN consists from timestamp as an input and prediction of the approximate solution of the studied system of equations as an output. ...
Full-text available
Two approaches for the data-driven modeling of aggregation kinetics, described by Smoluchowski equations, are analyzed for binary and ternary coagulation. The first approach uses the dynamic mode decomposition (DMD) and the second one is based on the artificial neural networks (ANN). We obtain the numerical solution of the Smoluchowski equations and compare it with the predictions yielding by DMD and ANN methods. To construct the forecast for the solution, the initial stage of the system evolution was used. We demonstrate that the DMD approach can accurately predict the size distribution of the aggregates up to the time, five times larger than the training time. In contrast, the straightforward application of the ANN approach fails to provide an accurate forecast. Hence we conclude that the DMD approach is an effective tool for modeling aggregation kinetics, even for complex aggregation events. At the same time the application of the ANN approach requires its further adaptation for the studied system, perhaps by implementation of physically-informed ANN and specially tailored loss functions.
... Statistical or data-driven machine learning approaches, on the other hand, are more flexible and lightweight. That makes them popular in various research applications, even those distant from explicit computer simulations [24][25][26]. At the same time, when dealing with weather and sea ice modeling, they do not need a complex physical model of the processes occurring in the ocean and atmosphere to work. ...
Global warming has made the Arctic increasingly available for marine operations and created a demand for reliable operational sea ice forecasts to increase safety. Because ocean-ice numerical models are highly computationally intensive, relatively lightweight ML-based methods may be more efficient for sea ice forecasting. Many studies have exploited different deep learning models alongside classical approaches for predicting sea ice concentration in the Arctic. However, only a few focus on daily operational forecasts and consider the real-time availability of data needed for marine operations. In this article, we aim to close this gap and investigate the performance of the U-Net model trained in two regimes for predicting sea ice for up to the next 10 days. We show that this deep learning model can outperform simple baselines by a significant margin, and we can improve the model’s quality by using additional weather data and training on multiple regions to ensure its generalization abilities. As a practical outcome, we build a fast and flexible tool that produces operational sea ice forecasts in the Barents Sea, the Labrador Sea, and the Laptev Sea regions.
... For instance, such species as Heracleum (aka hogweed) and their distribution are of particular concern. Hogweeds are undesirable invaders, whose spread is controlled by demography, dispersal and variation in invasibility [14]. Leaves and fruits of hogweed are rich in essential oils and contain furanocoumarins [15], which cause skin burns when the affected skin is exposed to ultraviolet radiation [16]. ...
Sodium-ion battery technology is developing rapidly in the post-lithium-ion landscape. Among the variety of studied anode materials, hard carbons appear to be realistic candidates because of their electrochemical performance and relative ease of production. This class of materials can be obtained from a variety of precursors, and the most ecologically important and interesting route is the synthesis from biomass. In the present work, for the first time, hard carbons were obtained from Heracleum sosnowskyi, a highly invasive plant, which is dangerous for humans and can cause skin burns but produces a large amount of green biomass in a short time. We proposed a simple synthesis method that includes a pretreatment stage and further carbonization at 1300 °C. The effect of the pretreatment of giant hogweed on the hard carbon electrochemical properties was studied. The obtained materials demonstrate a discharge capacity of >220 mAh g⁻¹, high values of the initial Coulombic efficiency reaching 87% and capacity retention of 95% after 100 charge-discharge cycles in sodium half-cells. Key parameters of the materials were examined by means of different analytical, spectroscopic and microscopic techniques. The possibility of using the giant hogweed-based hard carbons in real batteries is demonstrated with full sodium-ion cells with NASICON-type Na3V2(PO4)3 cathode material.
Background: As an invasive pest from North America, grey squirrels (GS, Sciurus carolinensis Gmelin) are displacing native squirrels in Europe. However, the climatic niche and range dynamics of GS in Europe remain largely unknown. Through niche and range dynamic models, we investigated climatic niche and range shifts between introduced GS in Europe and native GS in North America. Results: GS in North America can survive in more variable climatic conditions, and have much wider climatic niche breadth than do GS in Europe. Based on climate, the potential range of GS in Europe included primarily Britain, Ireland, and Italy, whereas the potential range of GS in North America included vast regions of western and southern Europe. If GS in Europe could occupy the same climatic niche space and potential range as GS in North America, they would occupy an area ca. 2.45 times the size of their current range. The unfilling ranges of GS in Europe relative to those of GS in North America were primarily in France, Italy, Spain, Croatia, and Portugal. Conclusion: Our observations implied that GS in Europe have significant invasion potential, and that range projections based on their occurrence records in Europe may underestimate their invasion risk. Given that small niche shifts between GS in Europe and in North America could lead to large range shifts, niche shifts could be a sensitive indicator in invasion risk assessment. The identified unfilling ranges of the GS in Europe should be prioritized in combating GS invasions in the future.
Sosnovsky’s hogweed is an invasive species that suppresses natural meadow biocenoses, but at the same time it can be a source of various biological substances (raw materials). Hogweed can be processed to produce cellulose. The obvious advantage of cellulose from Sosnovsky’s hogweed is the unsuitability of the raw material for other uses: the valuable resources now used to produce cellulose can be saved, while the stems of Sosnovsky’s hogweed are waste left over from eradicating the plant. Despite this, including hogweed in the production chain remains a practical problem. To solve this problem, business models can be built that are aimed at using the by-products of processing hogweed. It is important that business models not only reflect the process of producing added value but also solve the main problem of processing weed plants: the finiteness of the specified resource. Specifically, entrepreneurs starting such a business should not get into a situation where they destroy their only resource. This article is focused on a comparison of business models according to the following criteria: feasibility, profitability, and environmental impact. Business models that involve constructing a processing plant, using mobile laboratories, and industrial symbiosis models are presented. The overall result of this work is a business model that meets the specified criteria. Similar business models can be used for other plants with the possibility of obtaining valuable raw materials. The research shows how Sosnovsky’s hogweed can be processed into bioethanol or cellulose.
Contemporary trends in cultivated land and their influence on soil/gully erosion and river suspended sediment load were analyzed by various landscape zones within the most populated and agriculturally developed part of European Russia, covering 2,222,390 km². Based on official statistics from the Russian Federation and the former Soviet Union, this study showed that after the collapse of the Soviet Union in 1991, there was a steady downward trend in cultivated land throughout the study region. From 1970–1987 to 2005–2017, the region lost about 39% of its croplands. Moreover, the most significant relative reduction in cultivated land was noted in the forest zone (south taiga, mixed and broadleaf forests) and the dry steppes and the semi-desert of the Caspian Lowland: about 53% and 65%, respectively. These territories are characterized by climatically risky agriculture and less fertile soils. There was also a widespread reduction in agricultural machinery on croplands and livestock on pastures of the region. A decrease in soil/gully erosion rates over the past decades was also revealed based on state hydrological monitoring data on river suspended sediment load as one of the indicators of the temporal variability of erosion intensity in river basins and the published results of some field research in various parts of the studied landscape zones. The most significant reduction in the intensity of erosion and the load of river suspended sediment was found in European Russia’s forest-steppe zone. This was presumably due to a favorable combination of the above changes in land cover/use and climate change.
Plant detritus represents the major source of soil carbon (C) and nitrogen (N), and changes in its quantity can influence below-ground biogeochemical processes in forests. However, we lack a mechanistic understanding of how above- and belowground detrital inputs affect soil C and N in mountain forests in an arid land. Here, we explored the effects of litter and root manipulations (control (CK), doubled litter input (DL), removal of litter (NL), root exclusion (NR), and a combination of litter removal and root exclusion (NI)) on soil C and N concentrations, enzyme activity and microbial biomass during a 2-year field experiment. We found that DL had no significant effect on soil total organic carbon (SOC) and total nitrogen (TN) but significantly increased soil dissolved organic carbon (DOC), microbial biomass C, N and inorganic N as well as soil cellulase, phosphatase and peroxidase activities. Conversely, NL and NR reduced soil C and N concentrations and enzyme activities. We also found an increase in the biomass of soil bacteria, fungi and actinomycetes in the DL treatment, while NL reduced the biomass of gram-positive bacteria, gram-negative bacteria and fungi by 5.15%, 17.50% and 14.17%, respectively. The NR decreased the biomass of these three taxonomic groups by 8.97%, 22.11% and 21.36%, respectively. Correlation analysis showed that soil biotic factors (enzyme activity and microbial biomass) and abiotic factors (soil moisture content) significantly controlled the change in soil C and N concentrations (P < 0.01). In brief, we found that the short-term input of plant detritus could markedly affect the concentrations and biological characteristics of the C and N fractions in soil. The removal experiment indicated that the contribution of roots to soil nutrients is greater than that of the litter.
We conducted a comprehensive review of the research literature on the impacts of invasive species on critically endangered species that are included in the International Union for Conservation of Nature (IUCN) Red List. This review reveals that, globally, invasive species threaten 14% (28% on islands) of all critically endangered terrestrial vertebrate species (birds, mammals and reptiles), which are threatened predominantly by a few invasive mammal predators (mainly rodents and feral cats). The chytrid fungal pathogen is the main threat for amphibians. The control and management of the invasive species identified in this study should be a high priority for global biological conservation, thereby contributing towards the achievement of the goals of the Post-2020 Framework of the Convention on Biological Diversity. Further research on the impacts of invasive species and interactions with other drivers will be essential for the conservation of highly threatened species.
For the current generation of earth system models participating in the Coupled Model Intercomparison Project Phase 6 (CMIP6), the range of equilibrium climate sensitivity (ECS, a hypothetical value of global warming at equilibrium for a doubling of CO₂) is 1.8°C to 5.6°C, the largest of any generation of models dating to the 1990s. Meanwhile, the range of transient climate response (TCR, the surface temperature warming around the time of CO₂ doubling in a 1% per year CO₂ increase simulation) for the CMIP6 models of 1.7°C (1.3°C to 3.0°C) is only slightly larger than for the CMIP3 and CMIP5 models. Here we review and synthesize the latest developments in ECS and TCR values in CMIP, compile possible reasons for the current values as supplied by the modeling groups, and highlight future directions. Cloud feedbacks and cloud-aerosol interactions are the most likely contributors to the high values and increased range of ECS in CMIP6.
Biological invasions represent some of the most severe threats to local communities and ecosystems. Among invasive species, the vector-borne pathogen Xylella fastidiosa is responsible for a wide variety of plant diseases and has profound environmental, social and economic impacts. Once restricted to the Americas, it has recently invaded Europe, where multiple dramatic outbreaks have highlighted critical challenges for its management. Here, we review the most recent advances on the identification, distribution and management of X. fastidiosa and its insect vectors in Europe through genetic and spatial ecology methodologies. We underline the most important theoretical and technological gaps that remain to be bridged. Challenges and future research directions are discussed in the light of improving our understanding of this invasive species, its vectors and host-pathogen interactions. We highlight the need to include different, complementary outlooks in integrated frameworks to substantially improve our knowledge of invasive processes and optimize resource allocation. We provide an overview of genetic, spatial ecology and integrated approaches that will aid successful and sustainable management of one of the most dangerous threats to European agriculture and ecosystems.
Invasive plant species are a significant global problem, with the potential to alter structure and function of ecosystems and cause economic damage to managed landscapes. An effective course of action to reduce the spread of invasive plant species is to identify potential habitat incorporating changing climate scenarios. In this study, we used a suite of species distribution models (SDMs) to project habitat suitability of the eleven most abundant invasive weed species across road networks of Montana, USA, under current (2005) conditions and future (2040) projected climates. We found high agreement between different model predictions for most species. Among the environmental predictors, February minimum temperature, monthly precipitation, solar radiation, and December vapor pressure deficit accounted for the most variation in projecting habitat suitability for most of the invasive weed species. The model projected that habitat suitability along roadsides would expand for seven species, ranging from +5 to +647%, and decline for four species, ranging from −11 to −88%, under the high representative concentration pathway (RCP 8.5) greenhouse gas (GHG) trajectory. When compared with current distribution, the ensemble model projected the highest expansion of habitat suitability, a six-fold increase, for St. John’s Wort (Hypericum perforatum), whereas habitat suitability of leafy spurge (Euphorbia esula) was reduced by 88%. Our study highlights the roadside areas that are currently most invaded by our eleven target species across 55 counties of Montana, and how this will change with climate. We conclude that the projected range shift of invasive weeds challenges the status quo, and requires greater investment in detection and monitoring to prevent expansion. Though our study focuses on road networks of a specific region, we expect our approach will be globally applicable as the predictions reflect fundamental ecological processes.
Quantitative assessment of the potential soil loss on arable lands in the European part of Russia (4 million km²) was performed at the regional level of generalization corresponding to the scale of 1 : 500 000. Mathematical modeling based on the use of the equation for calculating potential soil loss from erosion developed in the research Laboratory of Soil Erosion and Channel Processes of Moscow State University was applied for this purpose in combination with geoinformation technologies. The calculations were performed for a raster model of representation of spatial data, including slope gradients, slope lengths, soil properties, rainstorm precipitation, layer of snowmelt runoff, intra-annual distribution of rainfall, and land use types. New data were obtained on the erosional soil loss during the periods of snowmelt runoff and rainstorm runoff and on the total annual loss. An electronic map of erosional soil loss on the arable lands of the European part of Russia was developed. The average soil losses reach 11 t/ha per year under black fallow and 3 t/ha per year under crops with their soil-protective capacity. About a half of the territory is located under conditions of the potential soil loss of less than 0.5 t/ha per year, whereas soil loss of 10 to 15 t/ha per year predominates on the rest of the territory. The rates of soil erosion on arable lands in the European part of Russia decrease from the taiga-forest to the steppe landscape zones. The belt of maximum erosion intensity extends in the sublatitudinal direction within the subzone of mixed and broadleaved forests with a very high percent of plowed land. In addition, potential soil loss from water erosion was determined for 50 subjects of the Russian Federation in the studied area for black fallow and agrocenoses.
Joint Species Distribution Modelling (JSDM) is becoming an increasingly popular statistical method for analysing data in community ecology. Hierarchical Modelling of Species Communities (HMSC) is a general and flexible framework for fitting JSDMs. HMSC allows the integration of community ecology data with data on environmental covariates, species traits, phylogenetic relationships and the spatio‐temporal context of the study, providing predictive insights into community assembly processes from non‐manipulative observational data of species communities. The full range of functionality of HMSC has remained restricted to Matlab users only. To make HMSC accessible to the wider community of ecologists, we introduce Hmsc 3.0, a user‐friendly R implementation. We illustrate the use of the package by applying Hmsc 3.0 to a range of case studies on real and simulated data. The real data consist of bird counts in a spatio‐temporally structured dataset, environmental covariates, species traits and phylogenetic relationships. Vignettes on simulated data involve single‐species models, models of small communities, models of large species communities and models for large spatial data. We demonstrate the estimation of species responses to environmental covariates and how these depend on species traits, as well as the estimation of residual species associations. We demonstrate how to construct and fit models with different types of random effects, how to examine MCMC convergence, how to examine the explanatory and predictive powers of the models, how to assess parameter estimates and how to make predictions. We further demonstrate how Hmsc 3.0 can be applied to normally distributed data, count data and presence–absence data. The package, along with the extended vignettes, makes JSDM fitting and post‐processing easily accessible to ecologists familiar with R.
Invasive alien species are major drivers of global change that can have severe impacts on biodiversity and human well-being. Management strategies implemented to mitigate these impacts are based on a hierarchical approach, from prevention of invasion, via early warning and rapid response, to invasive species management. We evaluated how different classes of spatially explicit models have been used as predictive tools to improve the effectiveness of management strategies. A review of literature published between 2000 and 2019 was undertaken to retrieve studies addressing alien mammal species through these models. We collected 62 studies, dealing with 70 (26.8%) of the 261 mammal species that are considered to be introduced worldwide. Most of the studies dealt with species from the orders Rodentia (34%) and Artiodactyla and Carnivora (both 24%); the most commonly studied families were Sciuridae (13%) and Muridae (12%). Most of the studies (73%) provided spatial predictions of potential species spread, while only ca. 15% of the studies included evaluations of management options. About 29% of the studies were considered useful in risk assessment procedures, but only because they presented climatic suitability predictions worldwide, while studies modelling suitability before a species was introduced locally are still lacking for mammals. With some exceptions, spatially explicit population models are still little used, probably because of the perceived need for detailed information on life history parameters. Spatially explicit models have been used in relatively few studies dealing with invasive mammals, and most of them covered a restricted pool of species. Most of the studies used climate matching to evaluate the suitability of geographic areas worldwide or the possibility of species that were already established spreading further. 
Modelling procedures could be a useful tool to assess the risk of establishment for species not yet present in an area but likely to arrive; however, such studies are lacking for mammals.