Scientific Reports | (2022) 12:6128 | https://doi.org/10.1038/s41598-022-09953-9
www.nature.com/scientificreports
Large‑scale forecasting
of Heracleum sosnowskyi habitat
suitability under the climate
change on publicly available data
Diana Koldasbayeva1*, Polina Tregubova2, Dmitrii Shadrin2,3, Mikhail Gasanov2 &
Maria Pukalchik4
This research aims to establish the possible habitat suitability of Heracleum sosnowskyi (HS), one of
the most aggressive invasive plants, in current and future climate conditions across the territory of
the European part of Russia. We utilised a species distribution modelling framework using publicly
available plant occurrence data collected in citizen science projects (CSP). Climatic variables and
soil characteristics were considered in order to trace possible dependencies on environmental factors. We
applied Random Forest to classify the study area. We addressed the problem of sampling bias in CSP
data by optimising the sampling size and implementing a spatial cross-validation scheme. According
to the Random Forest model built on the finally selected data shape, more than half of the studied
territory in the current climate corresponds to a suitability prediction score higher than 0.25. The
forecast of habitat suitability in future climate was highly similar for all climate models. Almost the
whole studied territory showed the possibility for spread, with an average suitability score of 0.4.
The mean temperature of the wettest quarter and precipitation of the wettest month demonstrated the
highest influence on the HS distribution. Thus, currently, the whole study area, excluding the north,
may be considered a territory with a high risk of HS spreading, while in the future suitable locations
for the HS habitat will include high latitudes. We showed that the chosen geodata pre-processing
and cross-validation based on geospatial blocks significantly reduced the sampling bias. The obtained
predictions could help to assess the risks accompanying the studied plant invasion by capturing the
patterns of the spread, and can be used for planning conservation actions.
The relocation and introduction of alien species into new habitats are recognised as one of the major drivers of
global biodiversity loss1–3. Invasive alien (non-indigenous) species (IAS) tend to spread rapidly and pose a
serious threat to endemic species due to, e.g., competition for resources, allelopathy, and the toxicity
of IAS4,5. Thus, the emergence of IAS can dramatically change the functioning of natural communities and
the overall ecosystem structure6–8.
Such common occurrences as the expansion of human settlement, globalization of transport, and changes
in land-use types favor species invasion. At the same time, the estimated costs of eliminating IAS are usually
quite high. The specifics of individual IAS limit the implementation of such practices. Other constraints are
the size of the territory that needs to be treated, the possibility of negative side effects from the use of
chemical and biological control agents, and the development of the invasion process9–11. IAS disproportionately
affect the most vulnerable communities in poor areas, at the locations of abandoned and disturbed lands. Thus,
their spread clearly hinders the achievement of the Sustainable Development Goals12.
Heracleum sosnowskyi Manden (Hogweed, HS) is one example of an extremely dangerous invasive species.
The natural habitat of HS is the central and eastern Caucasus area and adjacent regions, the Transcaucasia region
and Turkey13. Its large biomass and ability to live and develop in cold climates made HS a popular crop in
agriculture in the middle of the 20th century14. However, the unpleasant odor of milk and meat of animals
that were fed with HS fodder and the phototoxic effect of above-ground parts of HS were soon revealed, and as a
result, the cultivation was abandoned.
1Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russian Federation 121205. 2RAIC,
Skolkovo Institute of Science and Technology, Moscow, Russian Federation 121205. 3Irkutsk National Research
Technical University, Irkutsk, Russian Federation 664074. 4Digital Agriculture Laboratory, Skolkovo Institute of
Science and Technology, Moscow, Russian Federation 121205. *email: Diana.Koldasbayeva@skoltech.ru
Content courtesy of Springer Nature, terms of use apply. Rights reserved
The need to forecast the potential extinction of different species in different spatial and temporal contexts
has led to the development of Species Distribution Modelling (SDM). The SDM framework is based on the ecological
concept that the distribution of a species is explained by a set of factors, such as environmental
requirements and interactions with other living organisms, physiological characteristics, and evolutionary history15,16.
The general workflow of correlative SDM consists of (1) obtaining data on the occurrence of the studied species
(presence-only data, presence/absence data, or abundance data) and on environmental characteristics, sometimes
considering biotic interactions as well, (2) searching for interconnections between these data, and (3) building
a map of predicted distributions across the region of interest. The SDM framework is implemented in a variety of
packages and libraries in the most common programming languages, such as R or Python, and allows the use of several
different statistical or machine learning (ML) models, e.g., generalized linear models, classification and regression
trees, random forest (RF), support vector machines, artificial neural networks, and others, as well as ensembles of them17–21.
In terms of data availability, these models mostly differ from each other in their requirements for occurrence
records, i.e., whether the occurrence data should be represented by two classes, presence and absence, or can be
presence-only data22. The choice of an appropriate modelling method significantly affects outcomes and depends
on multiple factors: the size of the study territory, the type of environment considering its changing dynamics,
the characteristics of the modelled species, and data availability; meanwhile, ensemble forecasting across
methods has become more popular23. However, there are no strict directives on how to implement the ensemble, e.g.,
whether one should estimate an average prediction or a weighted average prediction; thus, this solution is not as
straightforward as the basic modelling methods24. Some studies demonstrate higher performance of a particular
model over others for specific cases. For example, it has been shown that the RF approach is highly suitable for
forecasting on large territories with a limited amount of data25, while for marine environments, ensemble models
are recommended26.
Correlative SDM has a conceptual limitation: it is assumed to capture the realized ecological niche, which
is problematic when an IAS is the object of the study27. Another difficulty is the quality of the data used,
specifically the occurrence and absence records of the species. It is stated that pseudo-absence data should be
field corrected; otherwise they show strong bias, decreasing the species prediction performance28. In reality, such
correction is almost impossible for large territories and requires significant collections of remote sensing data
with appropriate resolution. The issue is even more controversial when the spread of an IAS is considered. Even with
a sufficient number of verified absence points for the studied IAS, a question remains: is this location indeed
unsuitable for the selected IAS, or has the IAS not reached it yet29? However, despite all the mentioned limitations,
correlative SDM is still the primary tool for IAS distribution modelling30. Another possibility is to use mechanistic
SDM, which is developed on a process-based approach, e.g., a phenology model31, but such models require calibration
of many internal parameters.
While it is extremely difficult to eliminate all growing populations of invasive species, HS included, the
information from habitat suitability modelling can aid in prioritizing the management of invaded areas.
Specifically, it can help to mark out the territories where the possible development of rapidly growing
populations poses the largest threat to native species, agriculture, and populated areas. In this context, the use of
data from CSP is of particular interest; however, it may have its limitations. In this work, large-scale HS
distribution modelling is performed. We estimated habitat suitability for the current climate, as an average from
2000 to 2018, and for the possible future climate from 2040 to 2060 according to three climate models (BCC-CSM2-MR,
CanESM5, and CNRM-CM6-1) in two scenarios, the worst and the best in terms of greenhouse gas emissions (Fig. 1).
The general workflow of the presented research included the following steps: (i) collection of the required
data from public sources, (ii) data pre-processing, (iii) feature selection, (iv) model training and validation, (v)
obtaining the outputs of the best model, (vi) building maps showing the spatial distribution of the occurrence
probability (habitat suitability) across the study territory, expressed in the range from 0 to 1, for
current and possible future climate conditions. The presented methodology and results of HS spread modelling can
be used for invasion risk assessment.
Results
Optimisation of the occurrence data distribution based on the thinning procedure. Ideally,
thinning removes the optimal number of records to substantially reduce the effects of sampling bias (as in our
case, where most of the locations are concentrated in a few places) while simultaneously retaining most of the
valuable information. Figure 2 demonstrates the results of model prediction for (1) the initial dataset with data
collected from all available sources; (2) the dataset with a thinning distance of 4 km; (3) the dataset with a
thinning distance of 7 km; (4) the dataset with a thinning distance of 10 km. It is also important to know how the
predictors' distribution changes at the different thinning intervals. In our case, there were no significant
differences between the shapes of the distributions of environmental features corresponding to the different
thinned datasets (Fig. S1).
The outputs of the models vary significantly depending on the number of points in the different input datasets. The
ROC-AUC scores of the models built on the complete data and on the datasets at 10, 7, and 4 km thinning distances are
0.877, 0.83, 0.85, and 0.82, respectively (Fig. S4). Modelling results obtained from the complete dataset
represent the study territory as mostly unsuitable for HS spread: 84% of the territory is characterised by
a prediction value of less than 0.25. In the most contrasting variant, the model built on the data at 10 km
thinning distance, the suitability rose considerably: the percentage of territory where the prediction value is
above 0.5 increased to 22% compared to 3% in the case of the full data. The area of territory with prediction
values of less than 0.25 decreased to 31% and 44% at thinning distances of 10 and 7 km, respectively. We then needed
to choose which model to use for the next step of future prediction by finding a reasonable output.
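For reference, the ROC-AUC score reported above can be read as the probability that a randomly chosen presence point receives a higher prediction than a randomly chosen pseudo-absence point. The following Python sketch computes it by direct pairwise comparison. The function name roc_auc and the coding of labels as 1 (presence) and 0 (pseudo-absence) are illustrative assumptions and are not taken from the original analysis.

```python
def roc_auc(labels, scores):
    """ROC-AUC by pairwise comparison of presence and pseudo-absence scores.

    labels: 1 for presence, 0 for pseudo-absence (assumed coding).
    scores: model predictions in the range 0..1.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # presence ranked above pseudo-absence
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of points, which is acceptable for a few thousand records; a rank-based formula gives the same value faster.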
From the results visualised as maps, we can notice that the model built on the full dataset is overfitted, does
not cover northern latitudes, and poorly represents the original habitat located in the Caucasus area. It mostly
repeats the points of observation; thus, the possible distribution of habitat suitability of HS obtained from the
full dataset is built in a learn-by-heart manner. On the contrary, the model built on the dataset obtained by
thinning at a distance of 7 km seems to be the most suitable in terms of both prediction results and keeping as
much information as possible. Additionally, since we cannot rely on the evaluation scores to support this
conclusion, we estimated the variability of prediction values across the study territory. It was the highest for
the datasets obtained at 10 and 7 km thinning distances. The outputs of the model built on the 7 km distance data
were more diverse within the 100 km blocks that were used for spatial cross-validation (Figs. S2, S3).
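To illustrate the idea of spatial blocks, the Python sketch below assigns occurrence points to 100 km blocks and then assigns whole blocks to cross-validation folds, so that nearby points never end up in both the training and the test part. It is a simplified illustration under stated assumptions: the equirectangular approximation around a reference latitude, the round-robin fold assignment, and the names block_id and assign_folds are hypothetical and do not reproduce the authors' implementation.

```python
import math

KM_PER_DEG = 111.32  # approximate length of one degree of latitude in km


def block_id(lat, lon, block_km=100.0, ref_lat=55.0):
    """Index of the square block containing a point.

    Uses an equirectangular approximation around ref_lat (an assumption),
    which is adequate only for illustration on a regional extent.
    """
    x = lon * KM_PER_DEG * math.cos(math.radians(ref_lat))
    y = lat * KM_PER_DEG
    return (math.floor(x / block_km), math.floor(y / block_km))


def assign_folds(points, k=5, block_km=100.0):
    """Return a fold index per point; all points of a block share one fold."""
    blocks = sorted({block_id(lat, lon, block_km) for lat, lon in points})
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}  # round-robin
    return [fold_of_block[block_id(lat, lon, block_km)] for lat, lon in points]
```

Holding out entire blocks makes the test score reflect prediction at unsampled locations rather than interpolation between neighbouring records.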
Features selected for modelling. To avoid over-fitting caused by redundant variables, an important
part of the SDM procedure was choosing the most meaningful set of them, corresponding to the observed
HS occurrence. To do this, several approaches were combined: a search for highly correlated features and
estimation of feature importance by the Mean Decrease Gini (MDG) and the Mean Decrease Accuracy (MDA)
scores. Thus, the general workflow consisted of three steps: (1) generation of the correlation matrix; (2)
estimation of MDG and MDA scores; (3) picking highly important non-correlated features and, among correlated
features, choosing those that demonstrate higher importance according to both MDG and MDA scores. The first
step of selection includes a search for highly correlated features and the formation of sets of mutually exclusive
covariates, defined by an absolute value of the Pearson correlation coefficient greater than 0.8. The correlations
are demonstrated in Fig. S5. From the group of bioclimatic variables, the following subsets of features demonstrated
high correlation coefficients between each other:
• BIO1, BIO6, BIO9, BIO11;
• BIO6, BIO4, BIO7;
• BIO4, BIO7, BIO16;
• BIO5, BIO10;
• BIO16, BIO13;
• BIO13, BIO14, BIO18, BIO17, BIO12.
Then, based on the variable importance results obtained by MDA and MDG, the most important features were
selected and included in the core list for the predictions: BIO8, BIO10, BIO13, BIO15, BIO19. Additionally,
the BIO1 and BIO9 features demonstrated approximately equal importance in the corresponding forecasts. Thus,
we built different RF models with the core list of features, including only BIO9 in the first variant and only BIO1
in the second. Comparing the modelling results, BIO1 demonstrated higher importance, so it was
included in the final list of features.
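The combination of a correlation threshold with importance scores can be sketched in Python as a greedy filter: features are visited from the most to the least important, and a feature is kept only if its absolute Pearson correlation with every feature kept so far stays below 0.8. This is a simplified stand-in for the manual comparison of correlated groups described here; the function names and the single importance score per feature are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(columns, importance, threshold=0.8):
    """Greedy selection of important, mutually weakly correlated features.

    columns: dict feature name -> list of values at the sample points.
    importance: dict feature name -> importance score (e.g. MDA or MDG).
    A feature is dropped if |r| >= threshold with an already kept feature.
    """
    kept = []
    for name in sorted(columns, key=lambda k: importance[k], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

With two importance measures, as in the text, the same filter could be run once per measure and the results compared.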
Using the same approach described above, the selection of soil properties was performed. According to the
correlation matrix (Fig. S5), soil properties do not have correlation coefficients equal to 0.8 or more in absolute
values with bioclimatic variables. However, SOC and Sand content demonstrated a high enough correlation. CF,
Silt and Sand did not show high importance in the corresponding analyses. Thus, from the soil features, the final
list included only CEC and SOC. Therefore, the following list of features was used to train the algorithm: SOC,
CEC, BIO1, BIO8, BIO10, BIO13, BIO15 and BIO19 (Fig. S6).
Figure 1. Flowchart of the approach.
According to Fig. S6, BIO13 and BIO8 demonstrated the highest importance in predicting HS distribution.
Soil properties are considered more important according to MDA than according to MDG. BIO1 and BIO10
demonstrated less importance according to MDA, whereas CEC and BIO19 showed the same pattern according to MDG.
Possible habitat suitability in the future. Using the set of environmental predictors obtained at the
feature selection stage, we modelled the possible future spread of HS across the study territory. To do this,
we estimated the distribution of bioclimatic variables according to the available global climate models. From
the obtained results, we see that CNRM-CM6-1 and BCC-CSM2-MR show almost identical results in general, as
well as between the chosen SSPs (Fig. 3, Fig. S7). According to them, only 6–8% of the study territory in the future
is characterised by prediction values of less than 0.25. The CanESM5 results show a slightly different picture: the
percentage of likely unsuitable territory drops even further, to 3% under the better scenario, SSP126, and to 0.7%
under the worst scenario of atmospheric CO2 concentration, SSP585.
Figure 2. Maps of the predicted possible distribution of HS in current climate conditions using different
thinning distances and, consequently, different numbers of input points. The quality of prediction varies
significantly, while the model built on the full dataset is obviously overfitted.
Discussion
Sampling bias is a frequent problem in SDM because presence-only data are collected predominantly in more
accessible locations32. The spatial sampling bias also leads to environmental bias in the data set33. Ignoring the
problem leads to inaccurate forecasting and can be followed by an inappropriate risk assessment of invasive
species and mistakes in conservation planning. The number of methods to tackle the problem of sampling bias
is limited. For rare records, a new Poisson point process model has been proposed34, where the environmental effect
and sampling bias are explicitly modelled in separate clusters in a framework of quasi-linear modelling. Another
example of solving this problem is the incorporation of background (pseudo-absence) points with the same
sampling bias as the distribution of occurrence points33. If the number of records is sufficient, spatial filtering
is commonly used, in which some of the occurrence points are discarded. We combined two methods to overcome the
sampling bias of HS occurrence points: spatial filtering and the creation of the same sampling bias effect
among pseudo-absence points.
The main limitation of the data thinning method is that we have to assume a particular level of sampling
bias to proceed to the subsequent modelling, while in reality it is unknown. This is why the search for
the most appropriate number of occurrence records to discard is not straightforward. Therefore, we
used three distances (4 km, 7 km and 10 km) to compare with the initial dataset. The estimates for the current
climate (Fig. 2) demonstrate significant differences in predicted habitat suitability that are highly dependent on
the thinning process. At the same time, using the initial dataset in forecasting leads to inaccurate results, which
demonstrate over-fitting. Our results show that spatial filtering improves forecasting: the higher the
number of discarded points, the higher the number of locations predicted as more suitable for HS. Nevertheless, the
large distance of 10 km excludes the largest number of points, which leads to an increase in the uncertainty of the
model. Therefore, the distance of 7 km was considered optimal in our study. However, it must not be forgotten
that the unknown level of sampling bias is one limitation of our implementation.
Figure 3. Maps of the predicted possible distribution in future climate conditions for the selected
global climate models CNRM-CM6-1, CanESM5 and BCC-CSM2-MR, in two Shared Socioeconomic
Pathways scenarios: SSP126 and SSP585.
The characterization of the possible future expansion of invasive species might be a powerful tool for
highlighting the importance of conservation management. Research results allow considering possible shifts
in the ecosystem components on different levels if no actions are taken.
According to the modelling results, the main change between the current and possible future distributions is
the noticeable increase of the average possible suitability across the territory from 0.28 in current climate
conditions to 0.37–0.4 at SSP126 and to 0.37–0.43 at SSP585 in future climate, considering the results of the selected
models (Table S1). Moreover, in future conditions, the study territory is covered more or less uniformly
by prediction values higher than 0.25: on average, 89% is potentially suitable for HS spread, which is twice
as much as in the current climate predictions based on the selected data shape. At the same time,
the maximum prediction value in the future climate decreases from 1 to 0.7. The territory covered by prediction
values of more than 0.25 includes the high latitudes above 60°, especially in the SSP585 scenario according to the
CanESM5 model.
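Percentages of territory above a suitability threshold, such as those quoted here, can be derived from a prediction grid. On a regular latitude-longitude grid the cells shrink towards the pole, so one reasonable option is to weight each cell by the cosine of its latitude. The Python sketch below shows this calculation; the source does not state whether the reported percentages are area-weighted, so the weighting and the function name suitable_share are assumptions.

```python
import math


def suitable_share(cells, threshold=0.25):
    """Area-weighted fraction of cells with a prediction above threshold.

    cells: iterable of (latitude in degrees, prediction score) pairs taken
    from a regular latitude-longitude grid. Each cell is weighted by
    cos(latitude), which is proportional to its area (an assumption here).
    """
    total = 0.0
    above = 0.0
    for lat, score in cells:
        weight = math.cos(math.radians(lat))
        total += weight
        if score > threshold:
            above += weight
    return above / total
```

Without the weighting, cells north of 60° would count about twice as much per unit of area as cells near 41°.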
According to established knowledge, light availability in combination with disturbed upper soil cover is
necessary for the development of HS plants at the early stages35. Thus, the conditions of long daylight in northern
latitudes during the period of active vegetation might especially favour the plants' growth in well-lit
locations with a suitable moisture regime. Therefore, HS in the future climate might pose a serious danger to
territories that are currently characterised by low biodiversity and productivity. It can additionally be considered
that the active spread of HS can influence the carbon cycle by reducing biodiversity and changing the structure
of the soil profile. HS is reported not to form litter36, so the structure of the microbial community should shift,
while soil properties are expected to degrade due to a decrease in carbon stock37.
Methods
From the popular algorithms, we chose the Random Forest model as the most suitable for our case. The data
required for predictions can be divided into plant occurrence records and environmental features. Bioclimatic
variables and soil properties were selected as the main environmental features. All of the data were obtained
from open sources.
Heracleum sosnowskyi plant description. Heracleum sosnowskyi is a monocarpic perennial plant of
the Apiaceae family. Its height is up to 3–5 m, with a straight stem up to 12 cm in diameter. HS compound stem
leaves can reach 150 cm in both length and width38. The blooming period starts in July and continues until the end of
September. Plant reproduction is performed by seeds only. The seeds' depth of germination is reported as mainly
in the upper 5 cm, down to 15 cm of soil. One plant can produce 10–20,000 seeds39,40. Seeds germinate in the early
spring, and some authors have reported that a period of cold stratification to break dormancy is obligatory for
germination. Suitable conditions for HS include a temperate climate with warm humid summers and
cold winters, while it is probably not drought resistant. HS plants prefer neutral soils with a pH range from 6
to 7 that are rich in nutrients, and are reported to be nitrophilous, so eutrophication of the environment favours HS
development. HS plants do not tolerate shade in the first growing period.
HS is mostly spread in artificial and semi-natural habitats, including grasslands, pastures, parks, roadsides,
agricultural fields, riverbanks or canal sides, and other disturbed habitats. Currently, the main pathways of
spread include involuntary entry with soil on vehicles, machinery and footwear, or the use of soil as a commodity
(as a growing medium rich in organic matter)39.
Study area. The area for modelling extends from approximately 41° to 70° N and from 27° to 60° E, and also
includes the Kaliningrad region; it covers approximately 4 million km² (Fig. 4).
The European part of Russia is the most inhabited part of the country and is home to approximately
80% of the total population of Russia. It includes the East European Plain, the Caucasus Mountains and the Ural
Mountains, with the predominance of the East European Plain. Environmental characteristics vary significantly
across the study territory. The climate changes from semi-arid in the south to subarctic in the north, including
humid continental climate conditions. Natural vegetation is represented by almost all types of biomes, with the
prevalence of different types of forests: broadleaf and mixed forests, coniferous forests, and boreal forests (taiga),
while the area of arable lands is reported to be approximately 650,000 km² (refs. 41, 42). The territory is subject
to constant changes in land-use types and cover due to urbanization and changes in the status of arable lands, i.e.,
reduction of croplands and development of fallows and forests and, vice versa, the return of some of them to
cultivation43. The soil cover is represented by groups with contrasting physicochemical properties:
Luvisols, Podzols and Histosols in the northern part, and Chernozems, Kastanozems and Solonetz in the southern
part44.
Collection of the input data. Plant occurrence data. Plant occurrence coordinates were collected from
several publicly available sources related to citizen science projects: the Global Biodiversity Information Facility
database45, the iNaturalist database46, and the database of the “Antiborschevik” community47. Records were
documented by human observation and collected from 2000 to 2021. The overall number of initial occurrence points
from the combined sources is 7637.
Environmental predictors. Climate data. Modelling was performed for current climate conditions and for future
climate conditions under two scenarios; the selected year ranges were 2000–2018 and 2040–2060, respectively.
Climatic variables were collected from the WorldClim database48, which contains average seasonal information
relevant to the physiological characteristics of species and is available at different resolutions. We chose a
spatial resolution of 10 arc-minutes, taking into account the size of the studied area. Table 1 provides a short
description of the bioclimatic features used, and we refer the reader to the WorldClim project for detailed
information on the calculation of the variables.
For the future climate scenarios, we used two Shared Socioeconomic Pathways (SSPs)49, 1-2.6 and 5-8.5,
corresponding to the lowest (keeping the global mean temperature increase below 2 °C) and the highest (population
growth without technological change) predicted future greenhouse gas emission scenarios.
For these data, we took the same resolution (10 arc-minutes) as discussed above.
Figure 4. Map of the study area: white colour represents the territory used for prediction, red points
correspond to the dataset of HS occurrence, collected from the available sources.
Table 1. Description of used bioclimatic variables.
Parameter Full name
BIO1 Annual mean temperature
BIO2 Mean diurnal range
BIO3 Isothermality
BIO4 Temperature seasonality
BIO5 Max temperature of warmest month
BIO6 Min temperature of coldest month
BIO7 Temperature annual range
BIO8 Mean temperature of wettest quarter
BIO9 Mean temperature of driest quarter
BIO10 Mean temperature of warmest quarter
BIO11 Mean temperature of coldest quarter
BIO12 Annual precipitation
BIO13 Precipitation of wettest month
BIO14 Precipitation of driest month
BIO15 Precipitation seasonality
BIO16 Precipitation of wettest quarter
BIO17 Precipitation of driest quarter
BIO18 Precipitation of warmest quarter
BIO19 Precipitation of coldest quarter
We used the equilibrium climate sensitivity (ECS) to select the climate models for modelling the future HS
distribution. ECS is defined as the global mean surface air temperature change due to a rapid
doubling of the carbon dioxide concentration once the associated ocean-atmosphere-sea ice system reaches
equilibrium. As the ECS value increases, the model's sensitivity to the CO2 concentration in the atmosphere
increases. We chose the CanESM5 model (ECS 5.6), the CNRM-CM6-1 model (ECS 4.3) and the BCC-CSM2-MR
model (ECS 3.0)50.
For the future climate scenarios we selected three climate models:
• BCC-CSM2-MR: Beijing Climate Center climate system model, developed at the Beijing Climate Center, China
Meteorological Administration51. The model has a horizontal resolution of 1.125° by 1.125°.
• CanESM5: Canadian Earth System Model version 5, developed at the Canadian Center for Climate Modelling
and Analysis, Canada52. Horizontal resolution: 2.81° by 2.81°.
• CNRM-CM6-1: climate model developed at the National Center of Meteorological Research, France53. Horizontal
resolution: 1.4° by 1.4°.
Authors of the WorldClim project prepared historical and future climate data to a uniform spatial (10 arc-
minutes) and temporal resolution.
Soil data. Soil data were downloaded from the SoilGrids database54, a system for global digital soil mapping.
SoilGrids provides continuous data on the spatial distribution of soil properties across
the globe at several depths and at a selected resolution. It uses a machine learning approach to reconstruct
continuous data from 230,000 soil profile observations from the WoSIS (the World Soil Information Service) database
and a series of environmental covariates.
From the whole set of data provided by SoilGrids, several properties were chosen for the forecasting: relative
percentage of silt (Silt, %) and sand (Sand, %), volumetric fraction of coarse fragments (CF, %), cation exchange
capacity (CEC, cmol(c)/kg) and soil organic carbon (SOC, g/kg) at the depth of 5–15 cm, where the HS seeds are
assumed to be located. These variables are expected to be more stable over time than bioclimatic predictors; thus,
the chosen soil properties could be used for the future period with the same values as in the present.
Data pre-processing. All the data were transformed to the ASCII format by an R script and the DIVA-GIS software,
following the tutorial for the preparation of WorldClim files for use in SDM
(http://www.lep-net.org/wp-content/uploads/2016/08/WorldClim_to_MaxEnt_Tutorial.pdf), with a unified selected
resolution of 340 sq. km.
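As a rough consistency check, the size of a 10 arc-minute grid cell can be estimated from spherical geometry: its north-south extent is constant, while its east-west extent shrinks with the cosine of latitude. The Python sketch below assumes a spherical Earth with a radius of 6371 km; the function name cell_size_km is hypothetical. Under these assumptions a cell is about 18.5 km on a side at the equator, which corresponds to roughly 340 sq. km, and about 9 km wide at 60° N.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def cell_size_km(lat_deg, arc_minutes=10.0):
    """Approximate (width, height) in km of a grid cell at a given latitude."""
    step = math.radians(arc_minutes / 60.0)           # cell size in radians
    height = EARTH_RADIUS_KM * step                   # north-south extent
    width = height * math.cos(math.radians(lat_deg))  # east-west extent
    return width, height
```

For example, cell_size_km(0.0) and cell_size_km(60.0) give the two ends of the range of cell widths over the study area and further south.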
Optimization of the number of occurrence points. The general problem in using the available data collected from
the databases of citizen science projects is that the points of observation are distributed non-uniformly. For
instance, the frequency of the records depends directly on the population density. Spatial filtering of
the data (reducing the number of points) can be performed to reduce the sampling bias55. By filtering the initial
dataset, we prepared three datasets with a distance between points of 4, 7 and 10 km, containing 2402, 1846 and 1504
occurrence points, respectively. For the thinning step, the thin() function from the R package
spThin was used, with 100 iterations for each of the chosen thinning distances. To understand how much data we
could lose, we analysed the feature distributions and evaluated the general fairness of the model performance.
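The idea of distance-based thinning can be illustrated with a short Python sketch. It uses a simple greedy pass over randomly ordered points, repeated several times, and keeps the largest set in which all points are at least the chosen distance apart. This is not the algorithm of the spThin package, only a simplified stand-in; the function names, the haversine distance on a spherical Earth and the default of 100 repetitions (chosen to mirror the number of iterations mentioned in the text) are assumptions.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def thin_once(points, min_dist_km, rng):
    """One greedy pass: keep a point only if it is far enough from all kept."""
    order = points[:]
    rng.shuffle(order)
    kept = []
    for p in order:
        if all(haversine_km(p, q) >= min_dist_km for q in kept):
            kept.append(p)
    return kept


def thin(points, min_dist_km, reps=100, seed=0):
    """Repeat the greedy pass and return the largest thinned set found."""
    rng = random.Random(seed)
    best = []
    for _ in range(reps):
        candidate = thin_once(points, min_dist_km, rng)
        if len(candidate) > len(best):
            best = candidate
    return best
```

Repeating the randomised pass matters because the number of retained points depends on the visiting order; keeping the largest result retains as much information as possible for a given distance.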
Pseudo-absence generation. Since only presence points are available, it is necessary to generate
absence points for further implementation of the selected algorithm. Although the generation of pseudo-absence
points in SDM research is a widespread solution, a closer look at the literature reveals several gaps and
shortcomings. Since the raw dataset of the HS distribution demonstrates strong sampling bias, the generation
of pseudo-absence points using the usual ‘random’ strategy can aggravate the sampling bias problem. Thus, a
combination of the ‘disk’ and ‘random’ strategies was applied for the generation of the pseudo-absence points
using the biomod R package17.
• The ‘disk’ strategy uses geographic distance to separate true presence points from possible absence points.
The optimal geographic distance for HS was chosen as 25 km. This distance was chosen empirically by trial and
error. We started with 18 km (because the size of the cell is 9–18 km depending on location) and finished
with 50 km. Distances of 30–50 km led to positive spatial autocorrelation. Thus, we set the distance to 25 km,
which provided both optimal model performance and reduced spatial autocorrelation.
• The second part of the generation was based on the ‘random’ strategy with filtration: owing to the different
ranges of climate conditions across the territory of Russia, there are several places where HS has not been
detected and is thus assumed not to grow. The unsuitable places selected for HS are in the north of Russia,
where it might be too cold for the species. From all randomly generated points we selected those with
latitude > 64°, according to the tundra border line.
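The two filters described in the list above can be sketched in Python as follows. This is a simplified illustration and not the biomod code: the ‘disk’ step is reduced to a minimum-distance rule of 25 km, candidates are drawn uniformly in latitude and longitude (which is not uniform in area), and the function names, bounding box and presence coordinates are hypothetical.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def random_candidates(n, lat_range, lon_range, seed=0):
    """Draw n candidate points uniformly in lat/lon (assumption: a simple box)."""
    rng = random.Random(seed)
    return [(rng.uniform(*lat_range), rng.uniform(*lon_range)) for _ in range(n)]


def disk_pseudo_absences(presences, candidates, min_dist_km=25.0):
    """Keep candidates lying at least min_dist_km from every presence point."""
    return [c for c in candidates
            if all(haversine_km(c, p) >= min_dist_km for p in presences)]


def northern_pseudo_absences(candidates, min_lat=64.0):
    """Keep random candidates north of the latitude threshold (> 64 degrees)."""
    return [c for c in candidates if c[0] > min_lat]


# Hypothetical inputs: one presence record and a box over the study region.
presences = [(55.75, 37.62)]
candidates = random_candidates(200, lat_range=(50.0, 70.0), lon_range=(30.0, 60.0), seed=1)

disk_part = disk_pseudo_absences(presences, candidates)
north_part = northern_pseudo_absences(candidates)
pseudo_absences = disk_part + [c for c in north_part if c not in disk_part]
```

Combining the two parts keeps pseudo-absences away from known presences while adding points from the region assumed to be climatically unsuitable.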
Features selection procedure. To avoid over-fitting and to choose the most suitable set of parameters
for final modelling, two approaches were combined. We searched for features that are not correlated with
others, using a threshold of 0.8 in absolute value56, and estimated variable importance using the Mean
Decrease Gini (MDG) and the Mean Decrease Accuracy (MDA) obtained from modelling on enumerated
combinations of parameters. The MDG score is a coefficient related to the homogeneity of the nodes and leaves.
The higher the MDG score, the more important the corresponding feature. MDA describes how much
accuracy decreases when the feature is removed. We selected the most important features, i.e. those with the
highest values of both the MDG and MDA scores, using a sequential search from an initial set of variables.
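The correlation part of this procedure can be illustrated with a minimal Python sketch. It assumes that the features are supplied in order of priority (for example, by decreasing importance) and keeps a feature only if its absolute Pearson correlation with every feature already kept is below 0.8. The function names and the toy values are hypothetical, the variable labels are used only as examples, and constant features (zero variance) are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(features, threshold=0.8):
    """Greedy filter: iterate in insertion order and keep a feature only if
    |r| with all previously kept features is below the threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values (made up): bio10 is almost a linear copy of bio8, bio13 is not.
features = {
    "bio8": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bio10": [2.0, 4.0, 6.0, 8.0, 10.5],
    "bio13": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_correlated(features, threshold=0.8)
print(selected)  # ['bio8', 'bio13']
```

The retained variables would then be ranked by MDG and MDA in the sequential search described above.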
Modelling approach. Random forest. Choosing an appropriate method for accurate SDM is crucial because
the overall performance can vary dramatically, depending on the selected model and the particular use case.
There is a limited number of machine learning methods suitable for SDM. Several popular methods have
demonstrated high performance in modelling over large areas: GBM, RF, and GLM. In particular, GLM and RF
have been used for modelling and predicting the potential distribution of invasive species57. We decided to use
RF because this model has been successfully applied to a variety of tasks, such as predicting animal and plant
distributions, and has also been used for making predictions over a large territory58. Another important
advantage is the straightforward interpretability of RF, which makes it possible to evaluate the impact of each
environmental parameter on the occurrence of the invasive species.
Approach to the cross-validation of the model. A dedicated approach to model calibration is needed to reduce
the effect of spatial autocorrelation caused by the absence of a strict sampling design. In our case, the data
were split into training and testing folds using the spatial blocks technique in a 13-fold cross-validation
scheme. Random spatial splitting was performed 20 times to calibrate the model, with the distance between
blocks set to 100 km. We used the spatial blocks approach with the random type from the R package blockCV.
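The principle of spatially blocked folds can be sketched in Python as follows. This is not the blockCV implementation: it assumes that the 100 km value is the side length of square blocks, converts coordinates to kilometres with a crude equirectangular approximation around a reference latitude, and assigns whole blocks to folds at random. The function names, the reference latitude and the sample coordinates are hypothetical.

```python
import math
import random

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def block_id(lat, lon, block_km=100.0, ref_lat=55.0):
    """Index of the square block containing a point (equirectangular approximation)."""
    y_km = lat * KM_PER_DEG_LAT
    x_km = lon * KM_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    return (math.floor(x_km / block_km), math.floor(y_km / block_km))


def spatial_folds(points, k=13, block_km=100.0, seed=0):
    """Assign each (lat, lon) point a fold index so that all points of one
    block share a fold; blocks are distributed over the k folds at random."""
    rng = random.Random(seed)
    blocks = sorted({block_id(lat, lon, block_km) for lat, lon in points})
    rng.shuffle(blocks)
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}
    return [fold_of_block[block_id(lat, lon, block_km)] for lat, lon in points]


# Two neighbouring records fall into the same block; the third is far away.
points = [(55.00, 37.00), (55.01, 37.01), (60.00, 45.00)]
folds = spatial_folds(points, k=2, seed=0)

# Repeating the random assignment, as in the 20 repetitions described above.
repeated = [spatial_folds(points, k=2, seed=s) for s in range(20)]
```

Because nearby records always land in the same fold, a test fold never contains points that are immediate neighbours of training points, which is the purpose of blocking.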
Evaluation of the model performance. To evaluate the performance of the model, a classic approach in ecology
was used: the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), which belongs to the
threshold-independent techniques16. The method is based on the standard confusion matrix, where rows and
columns represent actual and predicted classes. The ROC curve is constructed by applying all possible
thresholds to obtain different confusion matrices, which yields a curve in a two-dimensional space: (1) the
y-axis is the True Positive Rate (sensitivity, recall); (2) the x-axis is the False Positive Rate (equal to
1 − specificity). In our case, the true positive rate (sensitivity) is the share of actual HS locations that
the model correctly predicts as presence. Similarly, the true negative rate (specificity) is the share of
absence points that are correctly classified. In contrast, locations that the model predicts as presence
although HS is absent are False Positives (FP), and locations that the model predicts as absence although HS
is present are False Negatives (FN).
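The construction of the ROC curve and its AUC can be made concrete with a small Python sketch. The labels and suitability scores below are made up, and the function names are hypothetical; each distinct score is used as a threshold, a confusion matrix is computed for it, and the area is obtained with the trapezoidal rule. Both classes are assumed to be present in the labels.

```python
def confusion(y_true, scores, threshold):
    """Return (TP, FP, TN, FN) when scores >= threshold are predicted as presence."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true, scores):
    """(FPR, TPR) pairs for every distinct score used as a threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(y_true, scores, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as ordered (x, y) pairs."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up example: 1 = presence, 0 = pseudo-absence.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(round(area, 4))  # 0.8889
```

In this toy case eight of the nine presence/absence pairs are ranked correctly, which is why the area equals 8/9.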
Code availability
For the data, preprocessing and modelling details needed to reproduce the calculations, we refer the reader to
the repository of the project: https://github.com/heracleumsosnowskyi/Heracleum_Sosnowskyi.
Received: 11 February 2022; Accepted: 30 March 2022
References
1. Pyšek, P. & Richardson, D. M. Invasive species, environmental change and management, and health. Annu. Rev. Environ. Resour.
35, 25–55 (2010).
2. Gren, M., Campos, M. & Gustafsson, L. Economic development, institutions, and biodiversity loss at the global scale. Reg. Environ.
Change 16, 445–457 (2016).
3. Duenas, M.-A., Hemming, D. J., Roberts, A. & Diaz-Soltero, H. The threat of invasive species to IUCN-listed critically endangered
species: A systematic review. Glob. Ecol. Conserv. 26, e01476 (2021).
4. Charles, H. & Dukes, J. S. Impacts of invasive species on ecosystem services. In Biological Invasions (eds Charles, H. & Dukes, J.
S.) 217–237 (Springer, 2008).
5. Raffini, F. et al. From nucleotides to satellite imagery: Approaches to identify and manage the invasive pathogen Xylella fastidiosa
and its insect vectors in Europe. Sustainability 12, 4508 (2020).
6. Gallardo, B., Clavero, M., Sánchez, M. I. & Vilà, M. Global ecological impacts of invasive species in aquatic ecosystems. Glob.
Change Biol. 22, 151–163 (2016).
7. Wardle, D. A. & Peltzer, D. A. Impacts of invasive biota in forest ecosystems in an aboveground-belowground context. Biol. Invas.
19, 3301–3316 (2017).
8. Linders, T. E. W. et al. Direct and indirect effects of invasive species: Biodiversity loss is a major mechanism by which an invasive
tree affects ecosystem functioning. J. Ecol. 107, 2660–2672 (2019).
9. Mehta, S. V., Haight, R. G., Homans, F. R., Polasky, S. & Venette, R. C. Optimal detection and control strategies for invasive species
management. Ecol. Econ. 61, 237–245 (2007).
10. Crowley, S. L., Hinchliffe, S. & McDonald, R. A. Conflict in invasive species management. Front. Ecol. Environ. 15, 133–141 (2017).
11. Epanchin-Niell, R. S. Economics of invasive species policy and management. Biol. Invas. 19, 3333–3354 (2017).
12. Day, R. et al. Invasive Species: The Hidden Threat to Sustainable Development (CABI Publications, 2018).
13. Baležentienė, L., Stankevičienė, A. & Snieškienė, V. Heracleum sosnowskyi (Apiaceae) seed productivity and establishment in
different habitats of central Lithuania. Ekologija 59, 123 (2013).
14. Nielsen, C. et al. The giant hogweed best practice manual guidelines for the management and control of an invasive weed in Europe.
For. Landsc. Denmark Hoersholm 44, 44 (2005).
15. Williams, M. I. & Dumroese, R. K. Preparing for climate change: Forestry and assisted migration. J. For. 111, 287–297 (2013).
16. Pecchi, M. et al. Reviewing climatic traits for the main forest tree species in Italy. iForest-Biogeosci. For. 12, 173 (2019).
17. Thuiller, W., Lafourcade, B., Engler, R. & Araújo, M. B. BIOMOD—A platform for ensemble forecasting of species distributions.
Ecography 32, 369–373 (2009).
18. Naimi, B. & Araújo, M. B. sdm: A reproducible and extensible R platform for species distribution modelling. Ecography 39, 368–375
(2016).
19. Brown, J. L., Bennett, J. R. & French, C. M. SDMtoolbox2.0: The next generation Python-based GIS toolkit for landscape genetic,
biogeographic and species distribution model analyses. PeerJ 5, e4095 (2017).
20. Golding, N. et al. The zoon R package for reproducible and shareable species distribution modelling. Methods Ecol. Evol. 9, 260–268
(2018).
21. Tikhonov, G. et al. Joint species distribution modelling with the r-package Hmsc. Methods Ecol. Evol. 11, 442–447 (2020).
22. Elith, J. & Graham, C. H. Do they? How do they? Why do they differ? On finding reasons for differing performances of species
distribution models. Ecography 32, 66–77 (2009).
23. Guo, Y., Li, X., Zhao, Z. & Nawaz, Z. Predicting the impacts of climate change, soils and vegetation types on the geographic dis-
tribution of Polyporus umbellatus in China. Sci. Total Environ. 648, 1–11 (2019).
24. Hao, T., Elith, J., Guillera-Arroita, G. & Lahoz-Monfort, J. J. A review of evidence about use and performance of species distribu-
tion modelling ensembles like BIOMOD. Divers. Distrib. 25, 839–852 (2019).
25. Mi, C., Huettmann, F., Guo, Y., Han, X. & Wen, L. Why choose Random Forest to predict rare species distribution with few samples
in large undersampled areas? Three Asian crane species models provide supporting evidence. PeerJ 5, e2849 (2017).
26. Robinson, N. M., Nelson, W. A., Costello, M. J., Sutherland, J. E. & Lundquist, C. J. A systematic review of marine-based species
distribution models (SDMs) with recommendations for best practice. Front. Mar. Sci. 4, 421 (2017).
27. Jarvie, S. & Svenning, J.-C. Using species distribution modelling to determine opportunities for trophic rewilding under future
scenarios of climate change. Philos. Trans. R. Soc. B Biol. Sci. 373, 20170446 (2018).
28. Hertzog, L. R., Besnard, A. & Jay-Robert, P. Field validation shows bias-corrected pseudo-absence selection is the best method for
predictive species-distribution modelling. Divers. Distrib. 20, 1403–1413 (2014).
29. Hattab, T. et al. A unified framework to model the potential and realized distributions of invasive species within the invaded range.
Divers. Distrib. 23, 806–819 (2017).
30. Bertolino, S. et al. Spatially explicit models as tools for implementing effective management strategies for invasive alien mammals.
Mamm. Rev. 50, 187–199 (2020).
31. Chapman, D. S., Scalone, R., Štefanić, E. & Bullock, J. M. Mechanistic species distribution modeling reveals a niche shift during
invasion. Ecology 98, 1671–1680 (2017).
32. Reddy, S. & Dávalos, L. M. Geographical sampling bias and its implications for conservation priorities in Africa. J. Biogeogr. 30,
1719–1727 (2003).
33. Phillips, S. J. et al. Sample selection bias and presence-only distribution models: Implications for background and pseudo-absence
data. Ecol. Appl. 19, 181–197 (2009).
34. Komori, O., Eguchi, S., Saigusa, Y., Kusumoto, B. & Kubota, Y. Sampling bias correction in species distribution models by quasi-
linear Poisson point process. Ecol. Inform. 55, 101015 (2020).
35. Ozerova, N., Shirokova, V., Krivosheina, M. & Petrosyan, V. The spatial distribution of Sosnowsky’s hogweed (Heracleum sosnowskyi)
in the valleys of big and medium rivers of the East European Plain (on materials of field studies 2008–2016). Russ. J. Biol.
Invas. 8, 327–346 (2017).
36. Glushakova, A., Kachalkin, A. & Chernov, I. Y. Soil yeast communities under the aggressive invasion of Sosnowsky’s hogweed
(Heracleum sosnowskyi). Eurasian Soil Sci. 48, 201–207 (2015).
37. Zhu, H., Gong, L., Ding, Z. & Li, Y. Effects of litter and root manipulations on soil carbon and nitrogen in a Schrenk’s spruce (Picea
schrenkiana) forest. PLoS ONE 16, e0247725 (2021).
38. Dalke, I. V. et al. Traits of Heracleum sosnowskyi plants in monostand on invaded area. PLoS ONE 10, e0142833 (2015).
39. EPPO. Report of a Pest Risk Analysis, 09-15075 (2009).
40. Jakubowicz, O. et al. Heracleum sosnowskyi Manden.. Ann. Agric. Environ. Med. 19, 327 (2012).
41. Maltsev, K. & Yermolaev, O. Potential soil loss from erosion on arable lands in the European part of Russia. Eurasian Soil Sci. 52,
1588–1597 (2019).
42. Blinnikov, M. S. A Geography of Russia and Its Neighbors (Guilford Publications, 2021).
43. Gusarov, A. V. Land-use/-cover changes and their effect on soil erosion and river suspended sediment load in different landscape
zones of European Russia during 1970–2017. Water 13, 1631 (2021).
44. WRB. World Reference Base for Soil Resources 2014, Update 2015 International Soil Classification System for Naming Soils and
Creating Legends for Soil Maps 192 (WRB, 2015).
45. www.GBIF.org. (Accessed 17 August 2021).
46. www.iNaturalist.org. (Accessed 17 August 2021).
47. http://Antiborschevik.info. (Accessed 17 August 2021).
48. Fick, S. E. & Hijmans, R. J. WorldClim 2: New 1-km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 37,
4302–4315 (2017).
49. O’Neill, B. C. et al. The scenario model intercomparison project (ScenarioMIP) for CMIP6. Geosci. Model Dev. 9, 3461–3482.
https://doi.org/10.5194/gmd-9-3461-2016 (2016).
50. Meehl, G. A. et al. Context for interpreting equilibrium climate sensitivity and transient climate response from the CMIP6 Earth
system models. Sci. Adv. 6, eaba1981. https://doi.org/10.1126/sciadv.aba1981 (2020).
51. Wu, T. et al. The Beijing climate center climate system model (BCC-CSM): The main progress from CMIP5 to CMIP6. Geosci.
Model Dev. 12, 1573–1600. https://doi.org/10.5194/gmd-12-1573-2019 (2019).
52. Swart, N. C. et al. The Canadian earth system model version 5 (CanESM5.0.3). Geosci. Model Dev. 12, 4823–4873 (2019).
53. Voldoire, A. et al. Evaluation of CMIP6 DECK experiments with CNRM-CM6-1. J. Adv. Model. Earth Syst. 11, 2177–2213.
https://doi.org/10.1029/2019MS001683 (2019).
54. Hengl, T. et al. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 12, e0169748 (2017).
55. Faurby, S. & Araújo, M. B. Anthropogenic range contractions bias species climate change forecasts. Nat. Clim. Change 8, 252–256
(2018).
56. Dormann, C. F. et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance.
Ecography 36, 27–46 (2013).
57. Adhikari, A., Rew, L. J., Mainali, K. P., Adhikari, S. & Maxwell, B. D. Future distribution of invasive weed species across the major
road network in the state of Montana, USA. Reg. Environ. Change 20, 60 (2020).
58. Marx, M. & Quillfeldt, P. Species distribution models of European Turtle Doves in Germany are more reliable with presence only
rather than presence absence data. Sci. Rep. 8, 1–13 (2018).
Acknowledgements
The work was supported by the Analytical center under the RF Government (subsidy agreement
000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). We thank the “Antiborschevik” community,
Maria Popova and Nikolay Petrov personally, for help with territorial research and providing the occurrence data.
Author contributions
M.P., P.T., D.S., M.G. contributed to the original concept of the study, D.K., M.G., P.T. collected the data, D.K.
developed the modelling design, performed calculations, D.K., P.T., D.S., M.G. reviewed and interpreted modelling
output, D.K., P.T., M.G. visualised the results. D.K., P.T., D.S., M.G. wrote original draft.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1038/s41598-022-09953-9.
Correspondence and requests for materials should be addressed to D.K.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2022