Smart Agricultural Technology 3 (2023) 100106
Available online 10 August 2022
2772-3755/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (
Optimized modelling of countrywide soil organic carbon levels via an
interpretable decision tree
Ndiye M. Kebonye, Prince C. Agyeman, James K.M. Biney
Department of Geosciences, Chair of Soil Science and Geomorphology, University of Tübingen, Rümelinstr. 19-23, Tübingen, Germany
DFG Cluster of Excellence Machine Learning: New Perspectives for Science, University of Tübingen, AI Research Building, Maria-von-Linden-Str. 6, Tübingen 72076, Germany
Department of Soil Science and Soil Protection, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 129, Prague, Suchdol 165 00, Czech Republic
Department of Landscape Ecology, The Silva Tarouca Research Institute for Landscape and Ornamental Gardening, Lidická 25/27, Brno 602 00, Czech Republic
Keywords: Intelligible models; Model parsimony; Czech Republic; Digital soil mapping (DSM)
Abstract
There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are least applied despite being more intelligible than the former. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoils (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Regarding the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling: cLHS) and 7 feature selection methods were varied. Meanwhile, overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results based on SPlit has a root mean square error (RMSE) of 17.30 g/kg and a coefficient of determination (R²) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART).
* Corresponding author at: Department of Geosciences, Chair of Soil Science and Geomorphology, University of Tübingen, Rümelinstr. 19-23, Tübingen, Germany.
E-mail address: (N.M. Kebonye).
Received 13 June 2022; Received in revised form 7 August 2022; Accepted 9 August 2022
1. Introduction
Preliminary large-scale decision-making depends on reliable and accurate digital soil property maps [1–4]. Although all soil properties are important, SOC, a crucial indicator of soil health and quality, has been widely studied across distinct spatial extents. Machine learning algorithms (MLAs) [e.g. Random Forest (RF), Support Vector Machines (SVM), Neural Networks (NN), etc.] are popular for mapping SOC and have produced very promising results [3,5,6]. Thus, the benefit of MLAs in digital soil mapping (DSM) cannot be overlooked.
However, fear and scepticism remain that most MLAs are black-box models that lack intuitiveness, intelligibility and interpretability, and contribute only marginally to our understanding of pedogenic functions, processes and/or activities. Black-box models get this name from the fact that they provide minimal knowledge and understanding of how the model executes predictions or classifications. Nonetheless, some MLAs, such as decision trees, allow for reasonable interpretability and intelligibility. Decision trees have a clear history and track record [7]. In DSM, some studies, including those by Taghizadeh-Mehrjardi et al. [8], Henderson et al. [9] and Sun et al. [10], have explored decision trees for various purposes. Yet decision trees are seldom applied in DSM because researchers most often resort to well-known, more powerful black-box models (e.g. RF, SVM, etc.) at the cost of interpretability. Moreover, few if any studies have explicitly explored decision trees with the aim of improving their performance or accuracy while appraising distinct combinations of data splitting and feature selection methods. These simple and basic MLA tactics may be key to improving the performance of decision trees for DSM.
Here, we applied a Classification and Regression Tree (CART) [11] with the main question: can we improve SOC prediction via a decision tree while varying data splitting strategies and feature selection approaches? This way, interpretability is prioritized while some improvement in model performance is also expected. We mainly aim to: (i) compare SOC predictions for the Czech Republic while varying CART models based on data splitting strategies [i.e. Random, SPlit and Conditional Latin Hypercube Sampling (cLHS)] and feature selection methods [e.g. Boruta, LASSO, Genetic Algorithm (GA)] and (ii) map the spatial distribution of SOC across the Czech Republic based on promising CART model results. Appraising different combinations of data splitting strategies and feature selection methods while using CART should improve SOC predictions. Specifically, CART models with fewer estimators are intuitively expected to yield better overall performance while still rationally explaining soil relationships. Given the need to understand soil processes and activities while making inferences on variable interactions, interpretable models seem appropriate for complex associations in DSM [12]. Hence, for this study, we focus only on a decision tree (CART) as an example of an interpretable model.
Fig. 1. Important maps in the study include a) soil organic carbon (SOC) sampling points distributed across the Czech Republic and b) the environmental covariates applied in the prediction of soil organic carbon (SOC), all resampled or aggregated to approximately 30 m × 30 m. [Note: Covariate names in full are: Aspect (aspect), Elevation (elevation), Enhanced Vegetation Index (evi), Two-band Enhanced Vegetation Index (evi2), Flow direction (flowdirection), Modified Soil Adjusted Vegetation Index (msavi), Normalized Difference Vegetation Index (ndvi), Normalized Difference Water Index (ndwi), Roughness (roughness), Soil Adjusted Vegetation Index (savi), Slope (slope), Topographic Position Index (tpi), Terrain Ruggedness Index (tri), X-Coordinates/Longitude (x-axis) and Y-Coordinates/Latitude (y-axis)].
2. Materials and methods
2.1. Soil samples and organic carbon measurement
The Czech Republic 2015 LUCAS topsoil (0–20 cm) dataset of n = 440 SOC measurements was used in this study. The samples were distributed across the country following distinct land cover classes including cropland, bareland, grassland, woodland, artificial land, water, wetland and shrubland. Issues of sample heterogeneity are therefore catered for through the LUCAS methodology and sampling design. Although relatively few samples were collected, they reasonably represent most of the landscape being evaluated (Fig. 1A). For these SOC measurements, a dry combustion elemental analysis method was used.
2.2. Modelling and feature selection
As previously mentioned, we applied a decision tree, specifically the CART model [11]. This model imitates how people make decisions, making it simple to comprehend. Moreover, according to Wu et al. [13], CART ranks among the top 10 algorithms in data mining. With CART models we can assess non-linear associations between variables while benefiting from the model's intuitiveness. A CART model often takes the form of a tree with nodes (i.e. features), links (i.e. rules) and individual leaves (i.e. final decisions) [14]. We applied terrain derivatives, spectral indices and position features as the environmental covariates for the models. A 30 m digital elevation model (DEM) of the area was obtained from the USGS Earth Explorer and used to generate the terrain derivatives in the software R. The USGS Earth Engine Data catalogue (m/earth-engine/datasets/catalogue/landsat-8) was used to source the atmospherically corrected Landsat 8 data for the area. For the Landsat 8 data, we averaged the images from 2014 to 2018. The final image was then imported into R to extract the individual bands and compute each of the spectral indices used in this study (Fig. 1B).
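To make the node, rule and leaf structure concrete, the following minimal sketch shows how a regression tree can choose a single rule (a threshold on one covariate) by minimising the summed squared error of the two resulting child nodes. It is written in Python purely for illustration, whereas the study itself was carried out in R; the function names, the toy elevation and SOC values, and the `minbucket` argument (named after the hyperparameter mentioned in this study) are assumptions and not the authors' code.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y, minbucket=1):
    """Return (threshold, sse_after_split) for one numeric covariate.

    Every midpoint between consecutive distinct x values is tried and the one
    that minimises the summed SSE of the two child nodes is kept. `minbucket`
    (an assumption mirroring the hyperparameter of the same name) is the
    smallest number of samples allowed in a child node. If no split is
    allowed or helpful, the threshold is None and the parent SSE is returned.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(list(y)))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical covariate values
        if i < minbucket or len(pairs) - i < minbucket:
            continue  # a child node would be too small
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, total)
    return best


# Toy values invented for illustration (not taken from the LUCAS dataset).
elevation = [200, 250, 300, 700, 800, 900]
soc = [12.0, 14.0, 13.0, 40.0, 42.0, 44.0]
threshold, err = best_split(elevation, soc)
print(f"rule: elevation < {threshold} (SSE after split = {err})")
```

A full CART model repeats this search over all covariates and recursively within each child node; the leaves then predict the mean SOC of the samples they contain.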
The CART models were built from a stack of 15 covariates that were all resampled or aggregated to approximately 30 m × 30 m (Fig. 1B). For the first CART models, all covariates were used to predict SOC, with the data split into calibration (80%) and validation (20%) sets based on the Random, novel SPlit [15] or cLHS [16] sampling strategies, respectively. Using the calibration set in each case, 10-fold cross-validation was repeated five times to identify the optimal value of the hyperparameter cp (the complexity parameter). The relevant R packages used were caret [17], raster [18], SPlit [19] and clhs [20].
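The resampling scheme described above (an 80/20 calibration/validation split followed by 10-fold cross-validation repeated five times on the calibration set) can be sketched as follows. This is an illustrative Python outline of the simple Random strategy only, not the R workflow used in the study; the function names and the seed are assumptions. A plain 80% cut of 440 samples gives 352/88, the sizes listed for SPlit and cLHS in the tables, whereas the Random strategy in the study reports 355/85, so the authors' random partitioning evidently differs in detail from this sketch.

```python
import random


def random_split(n, cal_fraction=0.8, seed=42):
    """Shuffle sample indices and cut them into calibration and validation sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_cal = round(n * cal_fraction)
    return idx[:n_cal], idx[n_cal:]


def repeated_kfold(indices, k=10, repeats=5, seed=42):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train, test


cal, val = random_split(440)
resamples = list(repeated_kfold(cal))
print(len(cal), len(val), len(resamples))
```

Each candidate cp value would be scored on all 50 resamples, and the value with the lowest average error retained before the final fit on the full calibration set.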
Subsequent CART models involved systematically selecting covariates based on 7 different feature selection methods: Boruta (B), LASSO Regression (LR), Random Forest (RF), Recursive Feature Elimination (RFE), Stepwise Regression (SWR), Relief Based Feature Selection (RBAs) and Genetic Algorithm (GA). The Boruta method centres on the RF algorithm in that it executes several iterations to eliminate unnecessary features [21]. The regularization method LASSO imposes a penalty that causes the coefficients to shrink. The best choice of features then follows from the shrinkage in coefficients [22]. Random Forest is an ensemble model composed of many decision trees that make predictions separately to obtain an average prediction [23]. During this process, the relative importance of each feature can be quantified.
In the Recursive Feature Elimination (RFE) method, features are ordered recursively based on their relevance [24]. For this study, we implemented a Random Forest selection function within the RFE method. SWR simply involves fitting a simple regression model. Based on the output of the regression model, each feature is ranked according to its significance [25]. The RBAs method employs a filter-based approach to selecting important features in the modelling procedure [26]. A Genetic Algorithm selects important features in a model by following biological principles; in general, features are chosen via a natural selection procedure [27]. For the feature selection methods, the R packages used were Boruta [28], GA [29], caret [17], mlbench [30], FSelector [31], MASS [32] and leaps [33]. The feature selection results are summarised in Fig. 2. For further information regarding the feature selection plots and statistics, we refer readers to the supplementary data associated with this study.
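As an intuition for how coefficient shrinkage turns into feature selection in LASSO, consider the simplified case of standardised, mutually uncorrelated (orthonormal) covariates. There, each LASSO coefficient is the ordinary least-squares coefficient pulled towards zero by the penalty and set exactly to zero once its magnitude falls below the penalty. The Python sketch below illustrates only this special case; the coefficient values and the penalty are invented for illustration and are not results from this study.

```python
def soft_threshold(beta, lam):
    """LASSO solution for one coefficient under an orthonormal design."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


# Hypothetical least-squares coefficients for four covariates.
ols = {"elevation": 0.90, "ndvi": 0.35, "slope": -0.12, "aspect": 0.04}
lam = 0.15  # hypothetical penalty strength

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
selected = [name for name, b in lasso.items() if b != 0.0]
print(selected)
```

Covariates whose coefficients are shrunk to exactly zero drop out of the model, which is why the penalty acts as a selection step as well as a regularizer.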
It should be noted that position covariates (i.e. x-axis and y-axis) were not always included in the feature selection procedures but were always added during the modelling process. This ensured that the CART models had spatial features and characteristics, which some MLA studies may ignore. All of the above steps were then repeated, this time using cp plus other hyperparameters, namely minsplit, minbucket and maxdepth at optimal values of 10, 5 and 15 respectively, to further prune the CART models. For the various CART models, the model with the lowest root mean square error (RMSE) was considered promising, although other accuracy indicators were also used, including the coefficient of determination (R²), Lin's concordance correlation coefficient (CCC), mean square error (MSE) and bias [Eqs. (1) to (5)]. The symbols n, ρ, σ² and σ denote the number of observations, the correlation coefficient between the predicted and the observed values, the variance and the standard deviation, respectively. All procedures were performed in the software R.
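$$R^2 = 1 - \frac{\sum_{i=1}^{n} (\mathrm{predicted}_i - \mathrm{observed}_i)^2}{\sum_{i=1}^{n} (\mathrm{observed}_i - \overline{\mathrm{observed}})^2} \tag{1}$$
$$\mathrm{CCC} = \frac{2\rho\,\sigma_{\mathrm{predicted}}\,\sigma_{\mathrm{observed}}}{\sigma^2_{\mathrm{predicted}} + \sigma^2_{\mathrm{observed}} + (\overline{\mathrm{predicted}} - \overline{\mathrm{observed}})^2} \tag{2}$$
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\mathrm{observed}_i - \mathrm{predicted}_i)^2 \tag{3}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\mathrm{observed}_i - \mathrm{predicted}_i)^2} \tag{4}$$
$$\mathrm{bias} = \frac{1}{n} \sum_{i=1}^{n} (\mathrm{observed}_i - \mathrm{predicted}_i) \tag{5}$$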
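The five accuracy indicators can be computed from paired observed and predicted values as in the short Python sketch below. It is illustrative only (the study used R): the function name and the example values are assumptions, R² is taken as one minus the ratio of the residual to the total sum of squares, and variances are computed with a 1/n divisor, choices that may differ from the software defaults behind the reported results.

```python
from math import sqrt
from statistics import mean


def evaluate(observed, predicted):
    """Return R2, Lin's CCC, MSE, RMSE and bias for paired values."""
    n = len(observed)
    mo, mp = mean(observed), mean(predicted)
    var_o = sum((o - mo) ** 2 for o in observed) / n
    var_p = sum((p - mp) ** 2 for p in predicted) / n
    # rho * sigma_predicted * sigma_observed equals the covariance.
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted)) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    mse = sse / n
    return {
        "R2": 1 - sse / (var_o * n),
        "CCC": 2 * cov / (var_o + var_p + (mp - mo) ** 2),
        "MSE": mse,
        "RMSE": sqrt(mse),
        "bias": sum(o - p for o, p in zip(observed, predicted)) / n,
    }


# Hypothetical SOC values in g/kg, for illustration only.
metrics = evaluate([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
print(metrics)
```

Because RMSE is the square root of MSE, the two always rank models identically; CCC additionally penalises systematic shifts between predictions and observations that a correlation alone would miss.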
3. Results and discussion
3.1. Soil organic carbon predictions
The SOC levels ranged between 3.5 and 215.7 g/kg, a higher range than the 0.8–39.9 g/kg reported by Žížala et al. [5]. This was reasonable since Žížala et al. [5] only evaluated SOC levels in agricultural soils of the Czech Republic. The higher SOC levels correspond to forested areas, where most SOC stocks are expected (Fig. 1) [34]. The mean SOC level was 26.12 g/kg, which is lower than the 43.93 g/kg reported in [3] and higher than the 13.00 g/kg reported in [6] for other large-scale monitoring studies in Europe. Žížala et al. [5] obtained 1.46% SOC at 0–30 cm in Czech agricultural soils. The various CART SOC model performance results while applying different splitting strategies and feature selection methods are documented in Tables 1 and 2. Generally, different data splitting strategies, feature selection methods and hyperparameter combinations resulted in varying outcomes. Validation RMSE results obtained through the Random and SPlit strategies were more consistent than those from cLHS, regardless of the feature selection method applied. However, some of the lowest RMSE estimates were observed with the cLHS strategy, an effect attributed to its robust sampling approach [16]. In both Tables 1 and 2, SWR, RFE and GA yielded more parsimonious models that showed consistently better overall results across the respective data splitting strategies. This study supports the computational efficiency of these feature selection methods. Also, applying fewer, optimally selected estimators can improve CART model performance compared with using all covariates. Feature selection is therefore a necessary step in a machine learning pipeline and ultimately for deriving reliable digital soil maps.
3.2. Soil organic carbon spatial characteristics and patterns
The SOC prediction maps for the Czech Republic (0–20 cm, 30 m × 30 m resolution) based on promising CART models are shown in Fig. 3. Generally, the spatial distribution patterns of SOC corroborate those already observed by Žížala et al. [5] and Borůvka et al. [34]. High SOC predictions are found consistently in forested areas of the Czech Republic [34]. Moreover, these areas have the highest altitude (Fig. 1B), and Borůvka et al. [34] have already demonstrated the importance of elevation as a key covariate for SOC prediction. Some studies also show consistent effects of elevation on SOC spatial distribution at a large scale in temperate zones [35,36]. Elsewhere across the Czech Republic, SOC is consistently well below 40 g/kg and is associated with other soil types (i.e. agricultural soils). Further assessing the SOC maps, particularly the SWR.Random maps (Fig. 3), we observe some unrealistic orthogonal artefacts [37] introduced by the CART models. In future, a solution may be to apply the oblique geographic coordinates proposed by Møller et al. [37].
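The orthogonal artefacts arise because a tree can only split on the x and y coordinates separately, which produces axis-parallel boundaries. One way to picture the oblique-coordinate idea is to add the position of each location along several rotated axes, so that splits can also follow tilted directions. The Python sketch below is a minimal illustration of that geometric idea and is not the exact formulation of Møller et al. [37]; the function name, the number of angles and the example point are assumptions.

```python
from math import cos, pi, sin


def oblique_coordinates(x, y, n_angles=4):
    """Project the point (x, y) onto n_angles axes evenly spaced over [0, pi)."""
    return [
        x * cos(k * pi / n_angles) + y * sin(k * pi / n_angles)
        for k in range(n_angles)
    ]


# Hypothetical point; angle 0 reproduces x and angle pi/2 reproduces y.
coords = oblique_coordinates(3.0, 4.0)
print([round(c, 3) for c in coords])
```

Each projected coordinate would be supplied to the tree as an extra covariate alongside the original x-axis and y-axis features.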
3.3. Future outlooks
Although the current study has demonstrated promising results, there is still a need to explore other perspectives on the problem. Future work could evaluate the effects of (1) sample size and spatial extent variation, (2) other data splitting strategies not currently used, (3) possible methodologies and designs used for soil sampling and (4) both global and local influencing factors. For instance, the majority of covariates applied in this work were of global rather than local influence. Another option may be to (5) include other covariates such as soil information from SoilGrids 2 (e.g. clay, silt, sand, soil type, etc.) and climatic variables [38,39]. Digital soil mapping products are insufficient without quantified uncertainty estimates; therefore these ought to be included in future studies as well. In general, since much of machine learning is data-driven, it should be emphasized that soil expert knowledge remains crucial in the DSM exercise, particularly when it comes to understanding soil functions and processes.
Fig. 2. Schematic showing the covariate combinations concluded by each feature selection method.
Table 1
Model validation results for the different splitting strategies versus feature selection method while using cp as a hyperparameter.
Model assessment metrics: cp as the only tuning parameter
Selection  Random sampling strategy                        | SPlit sampling strategy                         | cLHS sampling strategy
criteria   Data           R²    CCC   MSE     RMSE   bias  | Data           R²    CCC   MSE     RMSE   bias  | Data           R²    CCC   MSE     RMSE   bias
ALL        Cal (n = 355)  0.50  0.66  314.19  17.73  0.00  | Cal (n = 352)  0.50  0.66  257.52  16.05  0.00  | Cal (n = 352)  0.44  0.61  297.67  17.25  0.00
ALL        Val (n = 85)   0.11  0.31  417.74  20.44  4.04  | Val (n = 88)   0.17  0.38  615.68  24.81  0.19  | Val (n = 88)   0.10  0.27  603.19  24.56  2.27
B          Cal (n = 355)  0.49  0.65  320.76  17.91  0.00  | Cal (n = 352)  0.51  0.68  256.86  16.03  0.00  | Cal (n = 352)  0.40  0.57  303.89  17.43  0.00
B          Val (n = 85)   0.11  0.31  405.22  20.13  3.71  | Val (n = 88)   0.10  0.30  644.59  25.39  0.10  | Val (n = 88)   0.24  0.39  538.12  23.20  1.56
LR         Cal (n = 355)  0.49  0.65  318.57  17.85  0.00  | Cal (n = 352)  0.38  0.55  330.41  18.18  0.00  | Cal (n = 352)  0.42  0.59  364.29  19.09  0.00
LR         Val (n = 85)   0.11  0.31  407.69  20.19  3.25  | Val (n = 88)   0.28  0.48  429.25  20.72  1.81  | Val (n = 88)   0.35  0.56  181.56  13.47  4.18
RF         Cal (n = 355)  0.50  0.66  314.19  17.73  0.00  | Cal (n = 352)  0.49  0.66  258.84  16.09  0.00  | Cal (n = 352)  0.42  0.59  371.86  19.28  0.00
RF         Val (n = 85)   0.11  0.31  417.74  20.44  4.04  | Val (n = 88)   0.15  0.36  662.05  25.73  0.48  | Val (n = 88)   0.42  0.58  146.33  12.10  5.26
RFE        Cal (n = 355)  0.49  0.65  318.69  17.85  0.00  | Cal (n = 352)  0.40  0.57  317.48  17.82  0.00  | Cal (n = 352)  0.43  0.60  303.03  17.41  0.00
RFE        Val (n = 85)   0.11  0.31  406.11  20.15  3.17  | Val (n = 88)   0.33  0.50  410.24  20.25  0.99  | Val (n = 88)   0.32  0.49  404.25  20.11  2.78
SWR        Cal (n = 355)  0.48  0.65  321.05  17.92  0.00  | Cal (n = 352)  0.48  0.65  271.91  16.49  0.00  | Cal (n = 352)  0.43  0.60  292.39  17.10  0.00
SWR        Val (n = 85)   0.11  0.31  403.41  20.09  3.67  | Val (n = 88)   0.16  0.38  587.56  24.24  0.78  | Val (n = 88)   0.30  0.44  470.92  21.70  0.10
RBAs       Cal (n = 355)  0.41  0.58  367.90  19.18  0.00  | Cal (n = 352)  0.37  0.54  316.25  17.78  0.00  | Cal (n = 352)  0.47  0.64  337.88  18.38  0.00
RBAs       Val (n = 85)   0.03  0.16  484.71  22.02  4.89  | Val (n = 88)   0.13  0.26  628.46  25.07  2.44  | Val (n = 88)   0.12  0.32  157.42  12.55  3.40
GA         Cal (n = 355)  0.50  0.67  310.93  17.63  0.00  | Cal (n = 352)  0.47  0.64  269.91  16.43  0.00  | Cal (n = 352)  0.43  0.60  296.21  17.21  0.00
GA         Val (n = 85)   0.11  0.31  423.66  20.58  3.36  | Val (n = 88)   0.13  0.33  640.66  25.31  0.47  | Val (n = 88)   0.43  0.48  396.97  19.92  0.26
Note: By selection criteria, we refer to the feature selection method. ALL = All covariates, B = Boruta selected covariates, LR = LASSO Regression selected covariates, RF = Random Forest selected covariates, RFE = Recursive Feature Elimination selected covariates, SWR = Stepwise Regression selected covariates, RBAs = Relief Based Feature Selection covariates and GA = Genetic Algorithm covariates. Cal = Calibration dataset and Val = Validation dataset.
Table 2
Model validation results for the different splitting strategies versus feature selection method while using cp and other hyperparameters.
Model assessment metrics: cp, minsplit (10), minbucket (5) and maxdepth (15) as tuning parameters
Selection  Random sampling strategy                        | SPlit sampling strategy                         | cLHS sampling strategy
criteria   Data           R²    CCC   MSE     RMSE   bias  | Data           R²    CCC   MSE     RMSE   bias  | Data           R²    CCC   MSE     RMSE   bias
ALL        Cal (n = 355)  0.55  0.71  282.19  16.80  0.00  | Cal (n = 352)  0.53  0.69  238.29  15.44  0.00  | Cal (n = 352)  0.56  0.72  229.74  15.16  0.00
ALL        Val (n = 85)   0.11  0.32  383.26  19.58  2.73  | Val (n = 88)   0.17  0.37  609.76  24.69  0.46  | Val (n = 88)   0.03  0.17  725.05  26.93  2.25
B          Cal (n = 355)  0.56  0.72  273.64  16.54  0.00  | Cal (n = 352)  0.58  0.74  219.32  14.81  0.00  | Cal (n = 352)  0.56  0.72  223.36  14.95  0.00
B          Val (n = 85)   0.10  0.30  380.96  19.52  2.14  | Val (n = 88)   0.12  0.32  645.24  25.40  0.14  | Val (n = 88)   0.15  0.31  611.67  24.73  0.68
LR         Cal (n = 355)  0.56  0.72  273.64  16.54  0.00  | Cal (n = 352)  0.57  0.73  228.23  15.11  0.00  | Cal (n = 352)  0.53  0.69  297.13  17.24  0.00
LR         Val (n = 85)   0.10  0.30  380.96  19.52  2.14  | Val (n = 88)   0.48  0.67  326.43  18.07  2.82  | Val (n = 88)   0.47  0.65  130.14  11.41  3.70
RF         Cal (n = 355)  0.55  0.71  282.19  16.80  0.00  | Cal (n = 352)  0.53  0.69  238.15  15.43  0.00  | Cal (n = 352)  0.56  0.72  279.45  16.72  0.00
RF         Val (n = 85)   0.11  0.32  383.26  19.58  2.73  | Val (n = 88)   0.11  0.31  727.13  26.97  0.47  | Val (n = 88)   0.38  0.49  277.89  16.67  6.63
RFE        Cal (n = 355)  0.55  0.71  279.36  16.71  0.00  | Cal (n = 352)  0.60  0.75  213.33  14.61  0.00  | Cal (n = 352)  0.58  0.73  225.48  15.02  0.00
RFE        Val (n = 85)   0.09  0.29  381.07  19.52  2.03  | Val (n = 88)   0.52  0.65  299.24  17.30  1.59  | Val (n = 88)   0.45  0.64  333.63  18.27  3.06
SWR        Cal (n = 355)  0.56  0.72  273.64  16.54  0.00  | Cal (n = 352)  0.52  0.69  249.75  15.80  0.00  | Cal (n = 352)  0.46  0.63  276.39  16.63  0.00
SWR        Val (n = 85)   0.10  0.30  380.96  19.52  2.14  | Val (n = 88)   0.16  0.37  578.12  24.04  0.19  | Val (n = 88)   0.33  0.44  457.90  21.40  0.15
RBAs       Cal (n = 355)  0.57  0.72  268.66  16.39  0.00  | Cal (n = 352)  0.43  0.59  289.91  17.03  0.00  | Cal (n = 352)  0.51  0.68  313.49  17.71  0.00
RBAs       Val (n = 85)   0.00  0.04  573.37  23.95  4.38  | Val (n = 88)   0.07  0.20  702.23  26.50  1.56  | Val (n = 88)   0.04  0.20  187.10  13.68  2.80
GA         Cal (n = 355)  0.56  0.71  276.47  16.63  0.00  | Cal (n = 352)  0.54  0.70  237.40  15.41  0.00  | Cal (n = 352)  0.55  0.71  233.37  15.28  0.00
GA         Val (n = 85)   0.12  0.33  383.15  19.57  2.84  | Val (n = 88)   0.12  0.32  677.61  26.03  0.62  | Val (n = 88)   0.36  0.39  445.98  21.12  0.17
Note: By selection criteria, we refer to the feature selection method. ALL = All covariates, B = Boruta selected covariates, LR = LASSO Regression selected covariates, RF = Random Forest selected covariates, RFE = Recursive Feature Elimination selected covariates, SWR = Stepwise Regression selected covariates, RBAs = Relief Based Feature Selection covariates and GA = Genetic Algorithm covariates. Cal = Calibration dataset and Val = Validation dataset.
4. Conclusions
This study demonstrates how the performance of a decision tree (CART) can be improved by simply varying data splitting strategies and feature selection methods for DSM. A total of 15 global covariates (30 m × 30 m) were used to predict surface SOC across the Czech Republic via CART models under different conditions, namely varying data splitting strategies and feature selection methods. The cLHS data splitting strategy showed some of the lowest RMSE estimates in the study and also had the best overall performance when cp was used as a hyperparameter. Also, RMSE estimates tended to decrease after integrating the tree pruning hyperparameters (minsplit, minbucket and maxdepth). Of the feature selection methods applied, we observed better model performance with SWR, RFE and GA. Soil organic carbon levels were higher in high-altitude forested areas, and altogether the spatial distribution patterns corroborate previous studies. We can benefit from interpretability while improving CART model performance based on simple conditions such as data splitting and feature selection.
Data and code availability
Data used in this work can be requested free from the European
Commission Joint Research Centre. The codes used in this study can also
be freely requested from the corresponding author.
CRediT authorship contribution statement
Ndiye M. Kebonye: Conceptualization, Methodology, Visualization, Formal analysis, Investigation, Software, Data curation, Writing – original draft, Writing – review & editing. Prince C. Agyeman: Methodology, Conceptualization, Data curation, Writing – review & editing. James K.M. Biney: Methodology, Conceptualization, Data curation, Writing – review & editing.
Fig. 3. Soil organic carbon spatial distribution maps at 30 m × 30 m based on the best decision tree (CART) models derived using the different tuning parameters. [Note: SWR.Random is the Stepwise Regression based on a Random split, RFE.SPlit is the Recursive Feature Elimination based on SPlit and GA.cLHS is the Genetic Algorithm based on Conditional Latin Hypercube Sampling. Maps in the first column were generated using cp alone and those in the second column using cp coupled with the other tuning parameters: minsplit (10), minbucket (5) and maxdepth (15)].
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The first author acknowledges the funding provided through the Central Office of the Cluster of Excellence Machine Learning: New Perspectives for Science, University of Tübingen. Moreover, the authors greatly appreciate the European Commission, Joint Research Centre for allowing us to use the LUCAS 2015 topsoil dataset for the current study.
Supplementary materials
Supplementary material associated with this article can be found, in
the online version, at doi:10.1016/j.atech.2022.100106.
References
[1] K. Adhikari, A.E. Hartemink, B. Minasny, R.B. Kheir, M.B. Greve, M.H. Greve, Digital mapping of soil organic carbon contents and stocks in Denmark, PLoS One 9 (2014) e105519.
[2] S. Chen, Z. Liang, R. Webster, G. Zhang, Y. Zhou, H. Teng, B. Hu, D. Arrouays, Z. Shi, A high-resolution map of soil pH in China made by hybrid modelling of sparse soil data and environmental covariates and its implications for pollution, Sci. Total Environ. 655 (2019) 273–283.
[3] T. Zhou, Y. Geng, C. Ji, X. Xu, H. Wang, J. Pan, J. Bumberger, D. Haase, A. Lausch, Prediction of soil organic carbon and the C:N ratio on a national scale using machine learning and satellite data: a comparison between Sentinel-2, Sentinel-3 and Landsat-8 images, Sci. Total Environ. 755 (2021) 142661.
[4] T. Loiseau, D. Arrouays, A.C. Richer-de-Forges, P. Lagacherie, C. Ducommun, B. Minasny, Density of soil observations in digital soil mapping: a study in the Mayenne region, France, Geoderma Reg. 24 (2021) e00358.
[5] D. Žížala, R. Minařík, J. Skála, H. Beitlerová, A. Juřicová, J. Reyes Rojas, V. Penížek, T. Zádorová, High-resolution agriculture soil property maps from digital soil mapping methods, Czech Republic, Catena 212 (2022) 106024.
[6] G. Szatmári, L. Pásztor, G.B.M. Heuvelink, Estimating soil organic carbon stock change at multiple scales using machine learning and multivariate geostatistics, Geoderma 403 (2021) 115356.
[7] J.N. Morgan, J.A. Sonquist, Problems in the analysis of survey data, and a proposal, J. Am. Stat. Assoc. 58 (1963) 415–434.
[8] R. Taghizadeh-Mehrjardi, F. Sarmadian, B. Minasny, J. Triantafilis, M. Omid, Digital mapping of soil classes using decision tree and auxiliary data in the Ardakan Region, Iran, Arid Land Res. Manag. 28 (2014) 147–168.
[9] B.L. Henderson, E.N. Bui, C.J. Moran, D.A.P. Simon, Australia-wide predictions of soil properties using decision trees, Geoderma 124 (2005) 383–398, https://doi.
[10] Y. Sun, X. Li, H. Shi, J. Cui, W. Wang, H. Ma, N. Chen, Modeling salinized wasteland using remote sensing with the integration of decision tree and multiple validation approaches in Hetao irrigation district of China, Catena 209 (2022).
[11] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, 1st ed., Routledge, 2017.
[12] A.M.J.-C. Wadoux, C. Molnar, Beyond prediction: methods for interpreting complex models of soil variation, Geoderma 422 (2022) 115953.
[13] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg, Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 (2008) 1–37.
[14] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (2009).
[15] N.M. Kebonye, Exploring the novel support points-based split method on a soil
dataset, Measurement 186 (2021), 110131,
[16] B. Minasny, A.B. McBratney, A conditioned Latin hypercube method for sampling
in the presence of ancillary information, Comput. Geosci. 32 (2006) 13781388,
[17] M. Kuhn, J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, Z.
Mayer, B. Kenkel, R. Core Team, M. Benesty, R. Lescarbeau, A. Ziem, L. Scrucca, Y.
Tang, C. Candan, T. Hunt, Package ‘caret,2022.
packages/caret/caret.pdf (accessed July 24, 2022).
[18] R.J. Hijmans, Introduction to the raster package (version 2.5-2), 2015. https
[19] A. Vakayil, R. Joseph, S. Mak, Package ‘SPlit,2022. https://cran.revolutionana (accessed July 24, 2022).
[20] P. Roudier, C. Brugnard, D. Beaudette, B. Louis, K. Daust, D. Clifford, Package
‘clhs,2021. (accessed July
24, 2022).
[21] M.B. Kursa, W.R. Rudnicki, Feature selection with the Boruta package, J. Stat.
Softw. 36 (2010),
[22] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. 58
(1996) 267–288,
[23] L. Breiman, Random Forests, Mach. Learn. 45 (2001) 5–32,
[24] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification
using Support Vector Machines, Mach. Learn. 46 (2002) 389–422,
[25] L. Wilkinson, Tests of significance in Stepwise Regression, Psychol. Bull. 86 (1979)
[26] R.J. Urbanowicz, M. Meeker, W. La Cava, R.S. Olson, J.H. Moore, Relief-based
feature selection: introduction and review, J. Biomed. Inform. 85 (2018) 189–203,
[27] J.H. Holland, Genetic Algorithms, Sci. Am. 267 (1992) 66–73.
[28] M.B. Kursa, W.R. Rudnicki, Package ‘Boruta’, 2020.
b/packages/Boruta/Boruta.pdf (accessed July 24, 2022).
[29] L. Scrucca, Package ‘GA’, 2021.
pdf (accessed July 24, 2022).
[30] F. Leisch, E. Dimitriadou, Package ‘mlbench’, 2021.
packages/mlbench/mlbench.pdf (accessed July 24, 2022).
[31] P. Romanski, L. Kotthoff, P. Schratz, Package ‘FSelector’, 2021. https://cran.r-pro (accessed July 24, 2022).
[32] B. Ripley, B. Venables, D.M. Bates, K. Hornik, A. Gebhardt, D. Firth, Package
(accessed July 24, 2022).
[33] T. Lumley, Package ‘leaps’, 2020.
leaps.pdf (accessed July 24, 2022).
[34] L. Borůvka, R. Vašát, V. Šrámek, K. Neudertová Hellebrandová, V. Fadrhonsová,
M. Sáňka, L. Pavlů, O. Sáňka, O. Vacek, K. Němeček, S. Nozari, V.Y. Oppong
Sarkodie, Predictors for digital mapping of forest soil organic carbon stocks in
different types of landscape, Soil Water Res. 17 (2022) 69–79,
[35] Z. Liang, S. Chen, Y. Yang, Y. Zhou, Z. Shi, High-resolution three-dimensional
mapping of soil organic carbon in China: effects of SoilGrids products on national
modeling, Sci. Total Environ. 685 (2019) 480–489,
[36] S. Chen, V.L. Mulder, G.B.M. Heuvelink, L. Poggio, M. Caubet, M. Román Dobarco,
C. Walter, D. Arrouays, Model averaging for mapping topsoil organic carbon in
France, Geoderma 366 (2020), 114237,
[37] A.B. Møller, A.M. Beucher, N. Pouladi, M.H. Greve, Oblique geographic
coordinates as covariates for digital soil mapping, SOIL 6 (2020) 269–289, https://
[38] S. Chen, D. Arrouays, V. Leatitia Mulder, L. Poggio, B. Minasny, P. Roudier,
Z. Libohova, P. Lagacherie, Z. Shi, J. Hannam, J. Meersmans, A.C. Richer-de-
Forges, C. Walter, Digital mapping of GlobalSoilMap soil properties at a broad
scale: a review, Geoderma 409 (2022), 115567,
[39] L. Poggio, L.M. de Sousa, N.H. Batjes, G.B.M. Heuvelink, B. Kempen, E. Ribeiro,
D. Rossiter, SoilGrids 2.0: producing soil information for the globe with quantified
spatial uncertainty, SOIL 7 (2021) 217–240,
The benefits of digital soil maps cannot be overemphasised. For many years, researchers have mapped different soil classes, properties and processes while identifying the spatial associations between soil properties using side-by-side visualization maps. Although this is acceptable, it may be difficult to identify complex spatial associations between the mapped soil properties. For some, the task may be challenging owing to the need to place the maps side by side repeatedly and the possible application of non-user-friendly colour palettes and/or schemes. Innovative tools are proposed for visualizing and identifying spatial associations between digital soil maps (raster layers) using bivariate and trivariate maps. These tools are applied in a case study to identify the spatial interactions between pH and selected macro-nutrients [nitrogen (N) and potassium (K)] of similar locality (Czech Republic), resolution and scale. This study further gives a brief overview of the applicability of bivariate and trivariate maps following the digital soil mapping process. Results show that bivariate and trivariate maps are effective for visualizing complex associations between pH and macro-nutrients. However, precautionary measures should be taken while applying bivariate and trivariate maps to ensure they are self-explanatory and that the legend colour schemes applied are user-friendly. Also, the variables mapped should be related. In this case, pH is a key soil quality indicator that affects macro-nutrient availability in soils.
Although valuable for discovering geographical relationships and spatial statistics, bivariate maps or colour schemes are rarely employed in soil-related studies. For high-resolution mapping of the spatial relationships and patterns between the carbon-to-nitrogen (CN) ratio and its uncertainty throughout the Czech Republic, we assessed the application of a bivariate colour scheme. A random forest (RF) model was used to forecast CN ratio levels derived from the LUCAS topsoil dataset (n = 440 topsoil samples) using a stack of 22 environmental covariates. Of these covariates, Landsat 8 predictors (i.e. b6—SWIR 1 and b2—BLUE) had the highest relative value in the RF model. Additionally, partial dependence plots (PDPs) revealed that the aforementioned predictors had a comparable marginal impact on the CN model prediction. The 30 m × 30 m pixels CN ratio and uncertainty maps (at 0–20 cm) were able to distinguish the level contents evenly across the entire country while displaying distinct spatial features for each map. The uncertainty map and CN ratio prediction were both utilized to logically construct a bivariate colour scheme at 60 m × 60 m pixels, which enabled a once-off visualization of the two maps. The approach was deemed promising and proved generalizable for large-scale geographical evaluations in the focus area. Based on the output of the bivariate map visual, the spatial relationships and patterns between the CN ratio prediction and its uncertainty could be studied.
Understanding the spatial variation of soil properties is central to many sub-disciplines of soil science. Commonly in soil mapping studies, a soil map is constructed through prediction by a statistical or non-statistical model calibrated with measured values of the soil property and environmental covariates of which maps are available. In recent years, the field has gradually shifted attention towards more complex statistical and algorithmic tools from the field of machine learning. These models are particularly useful for their predictive capabilities and are often more accurate than classical models, but they lack interpretability and their functioning cannot be readily visualized. There is a need to understand how these models can be used for purposes other than making accurate prediction and whether it is possible to extract information on the relationships among variables found by the models. In this paper we describe and evaluate a set of methods for the interpretation of complex models of soil variation. An overview is presented of how model-independent methods can serve the purpose of interpreting and visualizing different aspects of the model. We illustrate the methods with the interpretation of two mapping models in a case study mapping topsoil organic carbon in France. We reveal the importance of each driver of soil variation, their interaction, as well as the functional form of the association between environmental covariate and the soil property. Interpretation is also conducted locally for an area and two spatial locations with distinct land use and climate. We show that in all cases important insights can be obtained, both into the overall model functioning and into the decision made by the model for a prediction at a location. This underpins the importance of going beyond accurate prediction in soil mapping studies. 
Interpretation of mapping models reveals how the predictions are made and can help us formulate hypotheses on the underlying soil processes and mechanisms driving soil variation.
Soils are essential for supporting food production and providing ecosystem services but are under pressure due to population growth, higher food demand, and land use competition. Because of the effort to ensure the sustainable use of soil resources, demand for current, updatable soil information capable of supporting decisions across scales is increasing. Digital soil mapping (DSM) addresses the drawbacks of conventional soil mapping and has been increasingly used for delivering soil information in a time- and cost-efficient manner with higher spatial resolution, better map accuracy, and quantified uncertainty estimates. We reviewed 244 articles published between January 2003 and July 2021 and then summarised the progress in broad-scale (spatial extent >10,000 km²) DSM, focusing on the 12 mandatory soil properties for GlobalSoilMap. We observed that DSM publications continued to increase exponentially; however, the majority (74.6%) focused on applications rather than methodology development. China, France, Australia, and the United States were the most active countries, and Africa and South America lacked country-based DSM products. Approximately 78% of articles focused on mapping soil organic matter/carbon content and soil organic carbon stocks because of their significant role in food security and climate regulation. Half the articles focused on soil information in topsoil only (<30 cm), and studies on deep soil (100–200 cm) were less represented (21.7%). Relief, organisms, and climate were the three most frequently used environmental covariates in DSM. Nonlinear models (i.e. machine learning) have been increasingly used in DSM for their capacity to manage complex interactions between soil information and environmental covariates. Soil pH was the best predicted soil property (average R² of 0.60, 0.63, and 0.56 at 0–30, 30–100, and 100–200 cm).
Other relatively well-predicted soil properties were clay, silt, sand, soil organic carbon (SOC), soil organic matter (SOM), SOC stocks, and bulk density, whereas coarse fragments and soil depth were poorly predicted (R² < 0.28). In addition, decreasing model performance with deeper depth intervals was found for most soil properties. Further research should pursue rescuing legacy data; sampling new data guided by well-designed sampling schemas; collecting representative environmental covariates; improving the performance and interpretability of advanced spatial predictive models; relating performance indicators such as accuracy and precision to cost-benefit and risk assessment analysis for improving decision support; moving from static DSM to dynamic DSM; and providing high-quality, fine-resolution digital soil maps to address global challenges related to soil resources.
Many national and international initiatives rely on spatially explicit information on soil organic carbon (SOC) stock change at multiple scales to support policies aiming at land degradation neutrality and climate change mitigation. In this study, we used regression cokriging with random forest and spatial stochastic cosimulation to predict the SOC stock change between two years (i.e. 1992 and 2010) in Hungary at multiple aggregation levels (i.e. point support, 1 × 1 km, 10 × 10 km square blocks, Hungarian counties and entire Hungary). We also quantified the uncertainty associated with these predictions in order to identify and delimit areas with statistically significant SOC stock change. Our study highlighted that prediction of spatial totals and averages with quantified uncertainty requires a geostatistical approach and cannot be solved by machine learning alone, because it does not account for spatial correlation in prediction errors. The total topsoil SOC stock for Hungary was predicted to increase by 14.9 Tg between 1992 and 2010, with lower and upper limits of a 90% prediction interval equal to 11.2 Tg and 18.2 Tg, respectively. Results also showed that both the predictions and uncertainties of the average SOC stock change were smaller for larger spatial supports, while spatial aggregation also made it easier to obtain statistically significant SOC stock changes. The latter is important for carbon accounting studies that need to prove in Measurement, Reporting and Verification protocols that observed SOC stock changes are not only practically but also statistically significant.
SoilGrids produces maps of soil properties for the entire globe at medium spatial resolution (250 m cell size) using state-of-the-art machine learning methods to generate the necessary models. It takes as inputs soil observations from about 240 000 locations worldwide and over 400 global environmental covariates describing vegetation, terrain morphology, climate, geology and hydrology. The aim of this work was the production of global maps of soil properties, with cross-validation, hyper-parameter selection and quantification of spatially explicit uncertainty, as implemented in the SoilGrids version 2.0 product incorporating state-of-the-art practices and adapting them for global digital soil mapping with legacy data. The paper presents the evaluation of the global predictions produced for soil organic carbon content, total nitrogen, coarse fragments, pH (water), cation exchange capacity, bulk density and texture fractions at six standard depths (up to 200 cm). The quantitative evaluation showed metrics in line with previous global, continental and large-region studies. The qualitative evaluation showed that coarse-scale patterns are well reproduced. The spatial uncertainty at global scale highlighted the need for more soil observations, especially in high-latitude regions.
Forest soils have a high potential to store carbon and thus mitigate climate change. The information on spatial distribution of soil organic carbon (SOC) stocks is thus very important. This study aims to analyse the importance of environmental predictors for forest SOC stock prediction at the regional and national scale in the Czech Republic. A big database of forest soil data for more than 7 000 sites was compiled from several surveys. SOC stocks were calculated from SOC content and bulk density for the topsoil mineral layer 0–30 cm. Spatial prediction models were developed separately for individual natural forest areas and for four subsets with different altitude range, using random forest method. The importance of environmental predictors in the models strongly differs between regions and altitudes. At lower altitudes, forest edaphic series and soil classes are strong predictors, while at higher altitudes the predictors related to topography become more important. The importance of soil classes depends on the pedodiversity level and on the difference in SOC stock between the classes. The contribution of forest types as predictors is limited when one (mostly coniferous) type dominates. Better prediction results can be obtained in smaller, but consistent regions, like some natural forest areas.
Detailed maps of soil properties are essential for soil protection planning and management; however, creating very high-resolution maps at the national level with sufficient accuracy is a challenging task and remains unavailable for most countries. For the Czech Republic, very high resolution (20 m·pixel⁻¹) soil property maps (soil organic carbon—SOC, texture, pH, bulk density, soil depth) were created using digital soil mapping methods, combined with a wide database of soil legacy and current samples. The latest approaches were employed for predictive mapping: a quantile random forest model with the determination of prediction intervals, a mosaic of bare soils from Sentinel-2 satellite data, a Gaussian pyramid of terrain attributes, and a buffer distance map. These variables were found to be among the most important in the resulting models. The properties were mapped with an RMSE accuracy of 0.43% SOC, 5.56–11.14% for texture fractions, 0.70 pH, 0.13 g·cm⁻³ bulk density, and 20.03 cm for soil depth, thus providing detailed data on soil cover. Greater levels of inaccuracy were found in areas with extreme values, for which further investigation is necessary either through more detailed sampling based on active learning, or adapted methods for enhanced predictive ability.
The salinized wasteland (SW) is a buffer zone to accumulate salt introduced from irrigation water in salinized irrigation districts. It is crucial to identify the locations and sizes of SW to achieve effective land management and reclamation, as well as sustainable agricultural development. However, the SW is less explored due to the lack of low-cost and effective techniques. In this study, a decision tree-based remote sensing model was proposed to distinguish SW from other land types using the high-resolution Gaofen-1 (GF-1) images in the Hetao Irrigation District of China during 2017–2020. Meanwhile, multiple validation approaches including points (field and random points) and area (small and medium-sized SWs, i.e., SWA, SWB) matching validations were adopted. To establish the model of SW, four spectral indices including Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Normalized Difference Water Index (NDWI), and Salinity Index (SI), as well as a series of multi-period and multi-year remote sensing images were used. The threshold values of the four spectral indices were determined based on the probability density curve. The SW remote sensing model was highly accurate and reliable. The average producer accuracy (PA) and average user accuracy (UA) were respectively 88.0% and 81.6% for field investigation (field points matching validation), and 85.9% and 80.1% for Google Earth image investigation (random points matching validation). Besides, the average overlapping rate and misclassification rate were respectively 95.6% and 2.3% for small-sized typical SWs (area validation), and 92.9% and 4.7% for medium-sized typical SWs during 2017–2020. The model indicated that the SW area in the region was respectively 9969, 9694, 9660, and 9608 ha from 2017 to 2020. The area of small and medium-sized SWs decreased with time, while that of large-sized SWs increased.
Data splitting is an integral step in machine learning that ensures good model generalization. The novel support points-based split method has been evaluated on several datasets (e.g. the Iris dataset) and has been shown to be more promising than conventional methods (e.g. the random data split). However, this method has never been applied in soil-based research. Therefore, the current study compared soil organic carbon (SOC) RMSE prediction results generated through the conventional random split and the novel support points-based split methods. While applying the above-mentioned methods, data were partitioned into train and test sets based on four percentage ratios of 60/40, 70/30, 75/25 and 80/20. Generally, test RMSE results based on the two split methods as well as percentage ratios were comparable. Nonetheless, the novel method is more reliable and robust since it applies iterations to perform the splitting process while utilizing control points to establish an optimal data partition.
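The conventional random split at the four train/test ratios mentioned above can be sketched in plain Python. This is only an illustrative sketch, not the study's implementation: the SOC values are synthetic, the sample count of 440 is arbitrary, the helper names `random_split` and `rmse` are hypothetical, and a mean-of-training baseline stands in for a fitted model. The support points-based split is not reproduced here.

```python
import math
import random


def random_split(values, train_fraction, seed=0):
    """Shuffle a copy of the data and cut it into train and test parts."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic SOC values (g/kg), for illustration only (assumption).
rng = random.Random(42)
soc = [rng.lognormvariate(3.0, 0.5) for _ in range(440)]

results = {}
for train_fraction in (0.60, 0.70, 0.75, 0.80):
    train, test = random_split(soc, train_fraction, seed=1)
    # Baseline "model": predict the training mean for every test sample.
    baseline = sum(train) / len(train)
    test_rmse = rmse(test, [baseline] * len(test))
    results[train_fraction] = (len(train), len(test), test_rmse)

for fraction, (n_train, n_test, err) in results.items():
    print(f"{fraction:.2f}: train={n_train}, test={n_test}, test RMSE={err:.2f}")
```

Replacing the baseline with a fitted model and repeating the loop over several seeds would show how much the test RMSE varies with the partition alone.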
The density of soil observations is a major determinant of digital soil mapping (DSM) prediction accuracy. In this study, we investigated the effect of soil sampling density on the performance of DSM to predict topsoil particle-size distribution in the Mayenne region of France. We tested two prediction algorithms, namely ordinary kriging (OK) and quantile random forest (QRF). The study area is a region of ~5000 km² with the highest density of field soil observations in France (1 profile per 0.64 km²). The number of training sites was progressively reduced (from n = 7500 to n = 400, corresponding to 1 profile per 0.7 km² to 1 profile per 13 km²) to simulate the different density of observations. For OK and QRF, we tested random subsampling for splitting the data into training and testing datasets using k-fold cross validation. For QRF we also tested conditioned Latin hypercube sampling based on the point coordinates or the covariates. The results indicated that, with increasing density of observations, OK performed as well or even better than QRF, depending on the particle-size fraction. For silt prediction, OK was systematically better than QRF. However, the prediction intervals were much larger for OK than for QRF, and OK did not seem to estimate uncertainty correctly. Overall, the performance indicators increased with the density of observations with a threshold at about 1 profile per 2 km² which suggests that the main limitation of DSM prediction accuracy using QRF is the amount of data collected in the field, not the type of calibration sampling strategy. Future DSM activities should focus on gathering more field observations.