Abstract and Figures

Decision tree learning is among the most popular machine learning techniques used for ecological modelling. Decision trees can be used to predict the value of one or several target (dependent) variables. They are hierarchical structures, where each internal node contains a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node gives a prediction for the value of the class variable. Depending on whether we are dealing with a classification (discrete target) or a regression problem (continuous target), the decision tree is called a classification or a regression tree, respectively. The common way to induce decision trees is the so-called Top-Down Induction of Decision Trees (TDIDT). In this chapter, we introduce different types of decision trees, present basic algorithms to learn them, and give an overview of their applications in ecological modelling. The applications include modelling population dynamics and habitat suitability for different organisms (e.g. soil fauna, red deer, brown bears, bark beetles) in different ecosystems (e.g. aquatic, arable and forest ecosystems) exposed to different environmental pressures (e.g. agriculture, forestry, pollution, global warming).
Life Sciences - Ecology | Modelling Complex Ecological Dynamics
Modelling Complex Ecological Dynamics
An Introduction into Ecological Modelling for Students, Teachers & Scientists
Illustrations by Melanie Trexler
Jopp, Fred; Reuter, Hauke; Breckling, Broder (Eds.)
1st Edition., 2011, 400 p. 131 illus., Softcover
ISBN: 978-3-642-05028-2
Due: March 2011
27,99 €
Offers a comprehensive overview of methods, approaches and applications of modelling in ecology
Includes cases from different parts of the world
Leading specialists explore different biomes and the interactions of different types of organisms
Model development is of vital importance for understanding and management of ecological processes. Identifying the
complex relationships between ecological patterns and processes is a crucial task. Ecological modelling—both
qualitatively and quantitatively—plays a vital role in analysing ecological phenomena and for ecological theory. This
textbook provides a unique overview of modelling approaches. Representing the state-of-the-art in modern ecology, it
shows how to construct and work with various different model types. It introduces the background of each approach
and its application in ecology. Differential equations, matrix approaches, individual-based models and many other
relevant modelling techniques are explained and demonstrated with their use. The authors provide links to software
tools and course materials. With chapters written by leading specialists, “Modelling Complex Ecological Dynamics” is
an essential contribution to expand the qualification of students, teachers and scientists alike.
Content Level » Upper undergraduate
Keywords » ABM - Agent-based Models - Cellular Automata - Cellular Automaton - Complex - Complexity -
Data Mining - Dynamical - Dynamics - Ecological - Ecology - Expert-Systems - Fuzzy Logic - Geo-Ecology -
Habitat Suitability Models - IBM - Individual-based Model - Invasion Models - Landscape Ecology - Landscape
Management - Learning how to model - Leslie-Matrices - Model Coupling - Model Validation - Modeling -
Modeling Concepts - Modeling in Ecology - Modelling - Modelling Concepts - Modelling in Ecology - Nature
Conservation - ODE - Ordinary-Differential Equations - PDE - Partial-Differential Equations - Sensitivity
Analysis - Space in Ecology - Spatially-Explicit - Steady-State models - Tree-Decisions Models - Variabilities -
Variability - Variables
Related subjects » Ecology
Introduction; Theoretical backgrounds: Scope and general definitions used in ecological modelling; What are the
general conditions, under which models can be applied?; History of ecological modelling, and modern developments.-
Modelling Techniques and Approaches; Context assessment and Systems analysis; Steady State Models of
Ecological Systems; Ordinary Differential Equations; Partial Differential Equations; Cellular Automata; Leslie-Matrices;
Fuzzy Sets; L-Systems, and Fractals; Agent- and Individual-based Models; Modelling landscape dynamics and
Habitat Suitability; Decision trees and data mining.- Application fields, case studies and examples; Conceptual
and theoretical models (Analysing Landscape Structure using Neutral Models); Case Studies: Application of
ecological models in management (Coral-Algae-Interaction; Stage-structured invasion models; Trophic Cascades
and Food Web Stability in Fish Communities of the Everglades); Case Studies: Model Coupling and Multi-scale
issues (Bio-Physical Models: An Evolving Tool in Marine Ecological Research; Potentials of GIS – Model Coupling;
Modelling the Everglades Ecosystem).- Strategies of model development; Model validation, Parameter estimation,
Sensitivity Analysis.
Modelling Complex Ecological Dynamics—MCED
Jopp, Fred; Reuter, Hauke; Breckling, Broder (Eds.) 1st Edition., 2011, 400 p.
131 illus., Softcover ISBN: 978-3-642-05028-2; DOI: 10.1007/978-3-642-05029-9
Table of Contents
Part I Introduction
1 Backgrounds and Scope of Ecological Modelling – Between Intellectual Adventure
and Scientific Routine
Broder Breckling, Fred Jopp, and Hauke Reuter
2 What Are the General Conditions Under Which Ecological Models Can Be Applied?
Felix Müller, Broder Breckling, Fred Jopp, and Hauke Reuter
3 Historical Background of Ecological Modelling and its Importance for Modern
Broder Breckling, Fred Jopp, and Hauke Reuter
Part II Modelling Techniques and Approaches
4 System Analysis and Context Assessment
Broder Breckling, Fred Jopp, and Hauke Reuter
5 Steady State Models of Ecological Systems – EcoPath Approach to Mass-
Balanced System Descriptions
Matthias Wolff and Marc Taylor
6 Ordinary Differential Equations
Broder Breckling, Fred Jopp, and Hauke Reuter
7 Partial Differential Equations
Michael Sieber and Horst Malchow
8 Cellular Automata in Ecological Modelling
Broder Breckling, Guy Pe'er, and Yiannis G. Matsinos
9 Leslie Matrices
Dagmar Söndgerath
10 Modelling Ecological Processes with Fuzzy Logic Approaches
Agnese Marchini
11 Grammar-Based Models and Fractals
Winfried Kurth and Dirk Lanwert
12 Individual-Based Models
Hauke Reuter, Broder Breckling, and Fred Jopp
13 Modelling Species’ Distributions
Carsten F. Dormann
14 Decision Trees in Ecological Modelling
Marko Debeljak and Sašo Džeroski
Part III Application fields, case studies and examples
15 Neutral Models and the Analysis of Landscape Structure
Robert H. Gardner
16 Stage-Structured Integro-Differential Models: Application to Invasion Ecology
Aurélie Garnier and Jane Lecomte
17 Modelling Resilience and Phase Shifts in Coral Reefs – Application of Different
Modelling Approaches
Andreas Kubicek and Esther Borell
18 Trophic Cascades and Food Web Stability in Fish Communities of the Everglades
Fred Jopp, Donald L. DeAngelis, and Joel C. Trexler
19 Lake Glumsø – Case Study on Modelling a Small Danish Lake
Søren Nors Nielsen and Sven Erik Jørgensen
20 Biophysical Models: An Evolving Tool in Marine Ecological Research
Alejandro Gallego
21 Modelling the Everglades Ecosystem
Fred Jopp and Donald L. DeAngelis
22 Model Integration: Application in Ecology and for Management
Dietmar Kraft
Part IV Integrative Approaches in Ecological Modeling
23 How Valid Are Model Results? Assumptions, Validity Range and Documentation
Hauke Reuter, Fred Jopp, Broder Breckling, Christoph Lange, and Gerd Weigmann
24 Perspectives in Ecological Modelling
Fred Jopp, Broder Breckling, Hauke Reuter, and Donald L. DeAngelis
Subject Index
14 Decision Trees in Ecological Modelling
Marko Debeljak, Sašo Džeroski
Decision tree learning is among the most popular machine learning techniques
used for ecological modeling. Decision trees can be used to predict the value of
one or several target (dependent) variables. They are hierarchical structures, where
each internal node contains a test on an attribute, each branch corresponds to an
outcome of the test, and each leaf node gives a prediction for the value of the class
variable. Depending on whether we are dealing with a classification (discrete
target) or a regression problem (continuous target), the decision tree is called a
classification or a regression tree, respectively. The common way to induce decision
trees is the so-called Top-Down Induction of Decision Trees (TDIDT). In this
chapter, we introduce different types of decision trees, present basic algorithms to
learn them, and give an overview of their applications in ecological modeling. The
applications include modeling population dynamics and habitat suitability for
different organisms (e.g., soil fauna, red deer, brown bears, bark beetles) in different
ecosystems (e.g., aquatic, arable and forest ecosystems) exposed to different
environmental pressures (e.g., agriculture, forestry, pollution, global warming).
14.1 Introduction
Machine learning is one of the essential and most active research areas in the field
of artificial intelligence. In short, it studies computer programs that automatically
improve with experience (Mitchell, 1997). The most investigated type of machine
learning is inductive machine learning, where the experience is given in the form
of learning examples. Supervised inductive machine learning, sometimes also
called predictive modelling, assumes that each learning example includes some
target property, which should be predicted. The final goal is then to learn a
predictive model (such as a decision tree or a set of rules) that accurately predicts this
property.
Machine learning (and in particular predictive modelling) can be used to automate
the construction of certain ecological models, such as models of habitat suitability
and models of population dynamics from measured data. The most popular ma-
chine learning techniques used for ecological modelling include decision tree in-
duction (Breiman et al. 1984), rule induction (Clark and Boswell 1991), and neur-
al networks (Lek and Guegan 1999).
This chapter first introduces the task of predictive modeling. It then describes the
different types of decision trees (classification, regression and multi-target trees)
and presents techniques for learning them. Finally, it gives examples of the use of
decision trees in ecological modeling, including examples of both population dy-
namics and habitat suitability modeling.
14.2 The Machine Learning Task of Predictive Modeling
The input to a machine learning algorithm is most commonly a single flat table
comprising a number of fields (columns) and records (rows). In general, each row
represents an object and each column represents a property (of the object). In ma-
chine learning terminology, rows are called examples and columns are called at-
tributes (or sometimes features). Attributes that have numeric (real) values are
called continuous attributes. Attributes that have nominal values are called dis-
crete attributes.
The tasks of classification and regression are the two most commonly addressed
tasks in machine learning. They deal with predicting the value of one field from
the values of other fields. The target field is called the class (dependent variable in
statistical terminology). The other fields are called attributes (independent vari-
ables in statistical terminology).
If the class is continuous, the task at hand is called regression. If the class is
discrete (it has a finite set of nominal values), the task at hand is called classification.
In both cases, a set of data (dataset) is taken as input, and a predictive model is
generated. This model can then be used to predict values of the class for new data.
The common term predictive modeling refers to both classification and regression.
Given a set of data (a table), only a part of it is typically used to generate (induce,
learn) a predictive model. This part is referred to as the training set. The remaining
(hold-out) part is reserved for evaluating the quality of the learned model and is
called the testing set. The testing set is used to estimate the quality of the model
when applied to unseen data, i.e., the predictive performance of the model.
More reliable estimates of performance on new data (not seen in the process of
learning) are obtained by using cross-validation (Alpaydin 2010). Cross-validation
partitions the entire set of data into k (with k typically set to 10) subsets of roughly
equal size. Each of these subsets is in turn used as a testing set, with all of the re-
maining data used as a training set. The performance figures for each of the testing
sets are averaged to obtain an overall estimate of the performance on unseen data.
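The cross-validation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular system: the function names, the `learn` and `evaluate` callables, the majority-class learner and the toy data are all hypothetical and introduced here only for the example.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Partition the indices 0..n_examples-1 into k folds of roughly equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, learn, evaluate, k=10, seed=0):
    """Average the test-set performance over the k train/test partitions.

    learn(training_set) returns a model; evaluate(model, testing_set) returns
    a performance figure. Both are hypothetical callables supplied by the caller.
    """
    scores = []
    for test_idx in k_fold_indices(len(examples), k, seed):
        held_out = set(test_idx)
        testing_set = [examples[i] for i in test_idx]
        training_set = [ex for i, ex in enumerate(examples) if i not in held_out]
        scores.append(evaluate(learn(training_set), testing_set))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the most frequent class of the training set.
def learn_majority(training_set):
    labels = [label for _, label in training_set]
    return max(set(labels), key=labels.count)

def accuracy(model, testing_set):
    return sum(1 for _, label in testing_set if label == model) / len(testing_set)

# Toy data (made up): 40 examples, one in four labelled "absent".
data = [((i,), "present" if i % 4 else "absent") for i in range(40)]
estimate = cross_validate(data, learn_majority, accuracy, k=10)
```

Each example is used for testing exactly once, so the averaged figure is based on all of the data while no model is ever evaluated on examples it was trained on.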
14.3 Decision Tree Induction
14.3.1 Types of decision trees
Decision trees (Breiman et al. 1984) are hierarchical structures, where each intern-
al node contains a test on an attribute, each branch corresponds to an outcome of
the test, and each leaf (terminal) node gives a prediction for the value of the class
variable. Depending on whether we are dealing with a classification or a regression
problem, the decision tree is called a classification or a regression tree,
respectively.
Classification trees predict the values of a discrete variable with a finite set of
nominal values. An example classification tree modeling the habitat of oilseed
rape by plant abundance is given in Fig. 14.5. The tree has been derived from real-
world data by using decision tree induction (Debeljak et al. 2008).
Regression tree leaves contain constant values as predictions for the class value.
They thus represent piece-wise constant functions. Model trees, a type of regres-
sion trees where leaf nodes can contain linear models predicting the class value,
represent piece-wise linear functions. An example model tree that predicts the
abundance of anecic earthworms is given in Fig. 14.1 (Debeljak et al. 2007).
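The difference between a regression tree and a model tree can be illustrated with a small prediction routine. The sketch below assumes a hypothetical tuple representation of trees; the test `clay > 7.8` is borrowed from the example discussed later in this chapter, while all leaf values and coefficients are made up for illustration and are not those of Fig. 14.1.

```python
def predict(node, x):
    """Predict the class value for example x (a dict of attribute values).

    Hypothetical node layouts:
      ("test", attribute, threshold, left, right)  internal node
      ("leaf", intercept, coefficients)            leaf with a linear model
    A regression tree leaf is the special case with no coefficients.
    """
    while node[0] == "test":
        _, attribute, threshold, left, right = node
        # Examples satisfying the test go right, the others go left.
        node = right if x[attribute] > threshold else left
    _, intercept, coefficients = node
    return intercept + sum(w * x[name] for name, w in coefficients.items())

# Regression tree: constant predictions in the leaves (piece-wise constant function).
regression_tree = ("test", "clay", 7.8,
                   ("leaf", 2.0, {}),
                   ("leaf", 0.5, {}))

# Model tree: the left leaf holds a linear model in pH (piece-wise linear function).
model_tree = ("test", "clay", 7.8,
              ("leaf", 1.0, {"pH": 0.5}),
              ("leaf", 0.5, {}))

example = {"clay": 5.0, "pH": 6.0}
constant_prediction = predict(regression_tree, example)
linear_prediction = predict(model_tree, example)
```

Within one leaf, the regression tree returns the same value for every example, whereas the model tree prediction still varies with the attributes used in the leaf's linear model.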
Multi-target trees (Blockeel et al. 1998), sometimes also called multi-objective
trees (Struyf and Džeroski 2006), generalise decision trees to the prediction of
several target attributes simultaneously. The leaves of a multi-target tree store a
vector of class values, one for each target, instead of storing a single class value for
one target. Each component of this vector is a prediction for one of the target
attributes.
Depending on whether the targets are all discrete-valued or real-valued, we can
talk about multi-target classification trees or multi-objective regression trees. An
example of a multi-objective regression tree, giving predictions for three real-val-
ued targets, is given in Fig. 14.2 (Demšar et al. 2006). The tree predicts three tar-
gets simultaneously: the abundance of mites and Collembola, as well as the biod-
iversity of these in soil.
14.3.2 Learning decision trees
Given a set of training examples, we want to find a decision tree that fits the data
well and is as small (and thus as understandable) as possible. Finding the smallest
decision tree that would fit a given data set is known to be computationally ex-
pensive. Heuristic search is thus employed to build decision trees, guided by
measures of impurity or dispersion of the target attribute. Greedy search, consider-
ing only one test/split at a time, is typically used.
The typical way to induce decision trees is the so-called Top-Down Induction of
Decision Trees (TDIDT, Quinlan 1986). Tree construction proceeds recursively
starting with the entire set of training examples (entire table). At each step, the
algorithm first checks if the stopping criterion is satisfied (e.g., all examples belong
to the same class). If not, an attribute (test) is selected as the root of the (sub-)tree,
the current training set is split into subsets according to the values of the selected
attribute, and the algorithm is called recursively on each of the subsets. The attrib-
ute/test is chosen so that the resulting subsets have as homogeneous values of the
class as possible.
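The recursive procedure just described can be sketched as follows. This is a simplified illustration for discrete attributes only, not the code of any of the systems discussed in this chapter. It uses the Gini index as one possible homogeneity measure; the tree representation, the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_by(examples, attribute):
    """Group (attributes, label) examples by the value of one discrete attribute."""
    subsets = {}
    for attrs, label in examples:
        subsets.setdefault(attrs[attribute], []).append((attrs, label))
    return subsets

def tdidt(examples, attributes):
    """Return a leaf (a class label) or an internal node (attribute, {value: subtree})."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criterion: all examples in one class, or no attribute left to test.
    if len(set(labels)) == 1 or not attributes:
        return majority

    def weighted_impurity(attribute):
        subsets = split_by(examples, attribute).values()
        return sum(len(s) / len(examples) * gini([l for _, l in s]) for s in subsets)

    # Greedy step: pick the attribute whose subsets are most homogeneous.
    best = min(attributes, key=weighted_impurity)
    if weighted_impurity(best) >= gini(labels):
        return majority  # no test improves homogeneity, so stop here
    remaining = [a for a in attributes if a != best]
    children = {value: tdidt(subset, remaining)
                for value, subset in split_by(examples, best).items()}
    return (best, children)

def predict(tree, attrs):
    """Follow the branches matching attrs until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[attrs[attribute]]
    return tree

# Toy training set (made up): abundance class from soil type and tillage.
toy = [
    ({"soil": "clay", "tillage": "yes"}, "low"),
    ({"soil": "clay", "tillage": "no"}, "low"),
    ({"soil": "silt", "tillage": "yes"}, "low"),
    ({"soil": "silt", "tillage": "no"}, "high"),
]
tree = tdidt(toy, ["soil", "tillage"])
```

The sketch makes the greedy character of the search visible: each call commits to one attribute without looking ahead, and each subtree is built only from the examples that reach it. Note that `predict` raises a `KeyError` for attribute values not seen during training; a real system would need a fallback.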
Consider for example the tree in Fig. 14.1. At the root node, the algorithm con-
siders each of the independent variables (incl. silt, clay, pH and time since sow-
ing) and selects one variable (clay) / test (clay > 7.8) that splits the entire set of ex-
amples best, i.e., results in subsets with homogeneous values of the class (as com-
pared to other attributes/tests). The examples are then split into two subsets (those
with clay > 7.8 go down the right branch, the others to the left), and the algorithm
is started again twice, once for the left and once for the right subset. In each of the
two cases, only the examples in the respective branch are used to build the re-
spective subtree (examples going down the left/right branch are used to construct
the left/right subtree).
For discrete attributes, a branch of the tree is typically created for each possible
value of the attribute. For continuous attributes, a threshold is selected and two
branches are created based on that threshold. For the subsets of training examples
in each branch, the tree construction algorithm is called recursively. Tree con-
struction stops when the examples in a node are sufficiently pure (i.e., all are of
the same class) or if some other stopping criterion is satisfied (e.g., there is no
good attribute/test to add at that point). Such (terminal) nodes are called leaves
and are labeled with the corresponding values of the class.
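For a continuous attribute, selecting the threshold amounts to trying the candidate cut points between neighbouring attribute values and keeping the best one. The sketch below assumes a numeric target and scores each candidate by the summed squared deviation of the target within the two branches, which is one possible choice of measure. The function names and the data are hypothetical.

```python
def squared_deviation(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(xs, ys):
    """Return (threshold, score) for the test `x > threshold` that minimises
    the summed within-branch squared deviation of the target values ys.
    The threshold is None if no split improves on the unsplit node."""
    pairs = sorted(zip(xs, ys))
    best = (None, squared_deviation(ys))  # score of the node without a split
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical attribute values cannot be separated
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = squared_deviation(left) + squared_deviation(right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up measurements: clay content and a numeric target.
clay = [5.0, 6.0, 7.0, 9.0, 10.0]
target = [1.0, 1.0, 1.0, 3.0, 3.0]
threshold, score = best_threshold(clay, target)
```

Only midpoints between distinct sorted values need to be considered, because any other cut point between the same two neighbours would produce the same two subsets.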
Different measures can be used to select an attribute in the attribute selection step.
Common to all of them is that they measure the homogeneity (or the opposite, dis-
persion) of the values of the target and its increase (resp. decrease) after selecting
the attribute/test for the current node. They differ for classification and regression
trees (Breiman et al. 1984) and a number of choices exist for each case. For
classification, Quinlan (1986) uses information gain, which is the expected reduction
in entropy (uncertainty) of the class value resulting from knowing the value of the
given attribute and the outcome of the test. Other attribute selection measures,
such as the Gini index, a measure of the statistical dispersion of the target variable
(Breiman et al. 1984), can and have been used in classification tree induction. In
regression tree induction, the expected reduction in the variance (also a measure of
statistical dispersion, but for continuous targets) of the class value can be used.
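As an illustration of the information gain measure, the following sketch computes the entropy of a set of class labels and the expected entropy reduction obtained by a split. The function names and the presence/absence labels are hypothetical; the Gini index and variance reduction follow the same pattern, with the entropy replaced by the respective dispersion measure.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """Expected reduction in entropy when the parent set is split into subsets."""
    n = len(parent_labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - remainder

# Made-up node with four "present" and four "absent" examples.
parent = ["present"] * 4 + ["absent"] * 4

# A test that separates the two classes perfectly.
perfect_gain = information_gain(parent, [["present"] * 4, ["absent"] * 4])

# A test whose subsets have the same class mix as the parent.
useless_gain = information_gain(
    parent, [["present", "present", "absent", "absent"]] * 2)
```

A test that yields pure subsets removes all uncertainty about the class, while a test whose subsets mirror the class distribution of the parent node gains nothing and would not be selected.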
Multi-target trees are constructed with the same recursive partitioning algorithm
as single-target trees. The key difference is in the test selection procedure. For
classification, the heuristic impurity function used for selecting the attribute tests
(that define the internal nodes) is defined as
$$N \cdot \sum_{t=1}^{T} \mathrm{Var}[y_t]$$
with N the number of examples in the node, T the number of target variables, and
Var[y_t] = Entropy[y_t] the entropy of target variable y_t in the node. For regression,
the sum of variance reductions along each of the targets is used to select tests.
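For the regression case, the test selection heuristic can be sketched as the variance reduction summed over all targets. The sketch assumes that the targets are already on comparable scales; in practice they would usually need to be standardised first, which is an assumption added here and not stated in the text. The function names and data are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def multi_target_variance_reduction(parent, subsets):
    """Sum over all targets of the variance reduction achieved by a split.

    parent is a list of target vectors (tuples with one number per target);
    subsets is the partition of parent induced by the candidate test.
    """
    n = len(parent)
    n_targets = len(parent[0])
    total = 0.0
    for t in range(n_targets):
        before = variance([y[t] for y in parent])
        after = sum(len(s) / n * variance([y[t] for y in s]) for s in subsets)
        total += before - after
    return total

# Made-up target vectors with two targets per example.
parent = [(1.0, 10.0), (1.0, 10.0), (3.0, 30.0), (3.0, 30.0)]
good_split = [parent[:2], parent[2:]]
poor_split = [[parent[0], parent[2]], [parent[1], parent[3]]]
good = multi_target_variance_reduction(parent, good_split)
poor = multi_target_variance_reduction(parent, poor_split)
```

A test is preferred when it makes all targets more homogeneous at once; a test that helps only one target contributes only that target's share to the sum.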
Multi-target trees are an instantiation of the predictive clustering trees (PCTs)
framework (Blockeel et al. 1998). In this framework, a tree is viewed as a hier-
archy of clusters: a node corresponds to a cluster. PCTs have been used to handle
different types of targets: multiple target variables, both discrete and continuous
(Struyf and Džeroski 2006, Debeljak et al. 2009), time series (Džeroski et al.
2007) and hierarchies of classes, with multiple class-labels per example (Vens et
al. 2008).
An important mechanism used to improve decision tree performance is tree prun-
ing. Pruning reduces the size of a decision tree by removing sections of the tree
(subtrees) that are unreliable and do not contribute to the predictive performance
of the tree. When a subtree rooted in a certain node of the tree is pruned, it is re-
moved from the tree and the node replaced by a leaf. The dual goal of pruning is
to reduce the complexity of the final tree as well as to achieve better predictive ac-
curacy by the reduction of over-fitting and removal of sections of the tree that may
be based on noisy or erroneous data.
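The replacement of a subtree by a leaf can be illustrated with reduced-error pruning, a simple post-pruning variant that uses a separate set of held-out examples. Note that this is a stand-in chosen for brevity, not the pruning procedure of any of the systems discussed in this chapter. The dictionary-based tree representation, the function names and the data are hypothetical.

```python
def predict(tree, attrs):
    """A leaf is a class label; an internal node is a dict with the keys
    "attribute", "children" and "majority" (most frequent training class)."""
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority class.
        tree = tree["children"].get(attrs[tree["attribute"]], tree["majority"])
    return tree

def errors(tree, examples):
    return sum(1 for attrs, label in examples if predict(tree, attrs) != label)

def prune(tree, examples):
    """Bottom-up pruning: replace a subtree by a leaf whenever the leaf makes
    no more errors than the subtree on the held-out examples reaching it."""
    if not isinstance(tree, dict):
        return tree
    attribute = tree["attribute"]
    pruned_children = {}
    for value, child in tree["children"].items():
        reaching = [(a, l) for a, l in examples if a[attribute] == value]
        pruned_children[value] = prune(child, reaching)
    candidate = dict(tree, children=pruned_children)
    leaf = tree["majority"]
    if errors(leaf, examples) <= errors(candidate, examples):
        return leaf
    return candidate

# Made-up tree: the test on tillage below "silt" reflects noise in the training data.
tree = {"attribute": "soil", "majority": "low", "children": {
    "clay": "high",
    "silt": {"attribute": "tillage", "majority": "low",
             "children": {"yes": "low", "no": "high"}},
}}
# Made-up held-out examples, not used when the tree was built.
pruning_set = [
    ({"soil": "clay", "tillage": "no"}, "high"),
    ({"soil": "clay", "tillage": "yes"}, "high"),
    ({"soil": "silt", "tillage": "no"}, "low"),
    ({"soil": "silt", "tillage": "yes"}, "low"),
]
pruned = prune(tree, pruning_set)
```

In this toy case the subtree testing tillage is replaced by a leaf because it misclassifies a held-out example that the leaf gets right, while the test on soil type is kept because removing it would increase the error.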
There are two major approaches to decision tree pruning. Pruning can be em-
ployed during tree construction (pre-pruning) or after the tree has been constructed
(post-pruning). Typically, a minimum number of examples in branches can be
prescribed for pre-pruning and a confidence level in accuracy estimates for leaves for
post-pruning.
14.3.3 Systems for building decision trees
The CART (Classification And Regression Trees) system (Breiman et al. 1984) is
the first widely known and used system for learning decision trees. It has been sur-
passed in popularity only by the C4.5 system for learning classification trees
(Quinlan 1986), succeeded by C5.0 (RuleQuest 2009). Nowadays, probably the
most commonly used implementation of classification trees is J4.8, the Java reim-
plementation of C4.5 within the WEKA suite (Frank and Witten 2005).
Besides CART, the M5 system (Quinlan 1992) builds regression trees. As com-
pared to CART, the novelty in M5 is that it can also build model trees (with linear
models in the leaves). The commercial successor of M5 is Cubist (RuleQuest
2009), which transcribes the learned regression and model trees into rules (which
are further postprocessed/simplified). The publicly available reimplementation of
M5 is called M5’ and is part of the WEKA suite (Frank and Witten 2005).
The construction of multi-target trees is implemented in the software system
CLUS (Blockeel and Struyf 2002, Struyf and Džeroski 2006, Struyf et al. 2010).
CLUS can build trees predicting a single target or multiple targets. It can also con-
sider discrete and continuous targets, i.e., can build multi-target classification and
regression trees. The system MT-SMOTI (Appice and Džeroski 2007) builds
multi-target model trees, whose leaves can contain multiple linear equations for
predicting the values of each target.
An overview of the different systems for building different types of decision trees
is given in Tab. 14.1.
Tab. 14.1: An overview of decision tree types and systems for learning them, with respect to the
number and type of target variables (targets)
                                    Number of targets
Type of targets (decision trees)    Single-target                   Multi-target
Discrete (classification trees)     C4.5, C5.0, J4.8, CART, CLUS    CLUS
Continuous (regression trees)
Continuous (model trees)
M5, M5’, Cubist
14.4 Modelling Population Dynamics with Decision Tree Approaches
Population dynamics studies changes of the size and structure of populations over
time taking into account environmental and biological processes influencing these
changes. For example, one might study the size of a brown bear population as
affected by its initial size, sex and age structure, reproduction age, fertility and
mortality of different age classes. The modelling formalism most often used by
ecological experts is the formalism of differential equations, which describe the change
of state of a dynamic system over time (see chapters 6, 7, 9). A typical approach
to modelling population dynamics can be as follows: an ecological expert writes a
set of differential equations that capture the most important relationships in the
domain. These are often linear differential equations. The coefficients of these
equations are then determined (calibrated) using measured data.
Relationships among attributes describing internal demographic properties of a
population and the set of external environmental attributes influencing changes of
the population’s parameters can be highly unpredictable and nonlinear. This has
caused a surge of interest in the use of different nonlinear modelling techniques
for modeling population dynamics (see e.g. chapters 8, 10, 12). These techniques
include neural networks (Lek and Guegan 1999, Recknagel et al. 1997, Schleiter
et al. 1999), equation discovery (Džeroski et al. 1999, Todorovski et al. 1998) and
decision trees.
Classification and regression trees can be used for modeling population dynamics
as follows. The task of predictive modeling is to forecast the future state of the
population or the change in the state of the population over a specified time period,
given the current state of the population and the environment. For example, Kompare
and Džeroski (1995) use regression tree induction to model the growth of the
dominant species of algae (Ulva rigida) in the lagoon of Venice in relation to water
temperature, dissolved nitrogen and phosphorus, and dissolved oxygen.
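Casting population dynamics as a predictive modelling task mainly requires turning a time series of measurements into a flat table of examples. The sketch below shows one way to do this, with the change in population size over a given time horizon as the target. The record keys (`population`, `temperature`) and all values are hypothetical.

```python
def make_examples(series, horizon=1):
    """Turn a time-ordered list of records into (attributes, target) examples.

    Each record is a dict holding the population state under the key
    "population" plus any environmental measurements. The attributes are the
    current state and environment; the target is the change in population
    size over `horizon` time steps.
    """
    examples = []
    for t in range(len(series) - horizon):
        attributes = dict(series[t])
        target = series[t + horizon]["population"] - series[t]["population"]
        examples.append((attributes, target))
    return examples

# Made-up series of three consecutive measurements.
series = [
    {"population": 10.0, "temperature": 12.0},
    {"population": 14.0, "temperature": 15.0},
    {"population": 13.0, "temperature": 11.0},
]
examples = make_examples(series)
```

The resulting table has a continuous target, so a regression or model tree learner can be applied to it directly. The last `horizon` records yield no example because their future state is not known.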
In the area of forestry, decision trees have been successfully used to model the
population dynamics of red deer and spruce bark beetles in forest ecosystems.
The red deer study focused on the effects of different meteorological
conditions, habitat properties and hunting regimes on the population
dynamics of red deer (Stankovski et al. 1998, Debeljak
et al. 1999). A highlight of the results of the red deer studies is the discovery of
the strong influence of meteorological parameters on the browsing intensity for
new growth of woody plants (beech and maple) and consequently the body weight
of 1-year-olds, 2-year-olds, and hinds (important parameters of the studied red
deer population). These results challenge previous simplistic approaches, assum-
ing simpler and more direct relationships between the density of the red deer pop-
ulation and its parameters and the browsing rate of forest new growth.
The study of spruce bark beetles (Ogris and Jurc 2010) focused on environmental
conditions that stimulate population growth of the spruce bark beetles Ips
typographus and Pityogenes chalcographus. The results show a strong association
between the appearance of I. typographus and Northeast (NE) expositions, while P.
chalcographus prefers West (W) and North (N) sites. The discovered habitat
preferences of bark beetles confirm the adaptation of spruce to drought conditions at
southern expositions, where its root system penetrates deeper in the soil. At N, NE
and W sites, the individual trees are more sensitive to drought and mechanical
destabilisation due to the shallow root system and thus they are more prone to at-
tack by bark beetles.
Decision trees are also used in agro-ecology. The population dynamics of soil
organisms is affected by changes in different biological and physicochemical
environmental attributes and by agricultural practices. A study of the effects of
Bt-maize cultivation on the abundance of earthworm populations (Oligochaeta)
(Debeljak et al. 2007) used farming practices, soil parameters, the biological
structure of soil communities, and the type and age of the crop at the time of
sampling as attributes to predict the total abundance of three functional groups of
earthworms: epigeic (live and feed on plant litter; Fig. 14.1), endogeic (geophagous,
live in the soil) and anecic (live in the soil but feed on plant litter on the surface).
The highly accurate (r² = 0.83) regression tree model for anecic worms (Fig.
14.1) shows that this functional group of earthworms prefers soil with less clay and
more silt and with medium pH. It has been shown that the seasonal effect
(autumn/spring sampling) has a stronger influence on anecic biomass than
the inter-annual effect (autumn 2002/autumn 2003). Indeed, it is well known
that in temperate arable ecosystems, anecic earthworms reach their minimum in
winter, due to low temperature, and their maximum in autumn, after spring and
summer reproduction and development. Finally, agricultural practices, such as
tillage or maize variety, have no effect on anecic earthworm biomass.
Fig. 14.1: Regression tree for predicting the abundance of anecic earthworms. The additional
information given in each node is the min / mean / max of earthworm biomass. In the leaves, this
information is extended with the number of examples and relative root mean square error
(Debeljak et al. 2007). Upper right: Epigeic earthworm Eisenia fetida (Lumbricidae). Courtesy of
Paul Henning Krogh.
Soil-dwelling populations in arable ecosystems are exposed to various
anthropogenic pressures. To identify attributes influencing the abundance of soil mites and
springtails and the biodiversity of soil microarthropods, a multi-objective regression
tree has been induced from data collected under different crop management
practices (Demšar et al. 2006). Fig. 14.2 shows an example of such a decision tree
predicting the target attributes: the abundance of Acari (r² = 0.653) and Collembola
(r² = 0.675) and the diversity of Collembola (r² = 0.562). The model indicates that
the most important parameters are the soil type, the time (number of months) since
the establishment of the current situation, and the different forms of tillage. Hence,
the model adequately reproduces existing empirical knowledge on this
phenomenon.
Fig. 14.2: The multi-objective regression tree modelling Acari abundance, Collembola
abundance and biodiversity. The numbers in the leaves are the number of Acari individuals divided by
1000, the number of Collembola individuals divided by 1000 and diversity, respectively (Demšar
et al. 2006). Upper right: Two Collembolan species, Protaphorura fimata (Onychiuridae), the
largest white one, and Proisotoma minuta (Isotomidae), the small gray ones. Courtesy of Paul
Henning Krogh and Thomas Larsen.
14.5 Habitat Modelling Using Decision Trees
Habitat modelling typically relates properties of the environment with the pres-
ence, abundance or diversity of organisms (for other detailed examples, see chap.
13 on spatial distribution models). For example, one might study the influence of
soil characteristics, such as soil temperature, water content, and proportion of min-
eral soil on the abundance and species richness of Collembola (springtails; the
most abundant insects in soil (Kampichler et al. 2000)). Habitat modelling can
also be linked with spatial information derived from geographic information systems
(GIS) on the studied area (Debeljak et al. 2001; Jerina et al. 2003) (see also chap.
A number of habitat-suitability modelling applications of other machine learning
methods (e.g. neural networks, genetic algorithms) are surveyed by Fielding
(1999). Lek et al. (1999) use neural networks to build a number of predictive
models for Collembola diversity. Bell (1999) uses decision trees to describe the
winter habitat of pronghorn antelope. Jeffers (1999) uses a genetic algorithm to
discover rules that describe habitat preferences for aquatic species in British
rivers. Rule induction was also used to relate the presence or absence of a number
of species in Slovenian rivers to physical and chemical properties of river water,
such as temperature, dissolved oxygen, pollutant concentrations, chemical oxygen
demand, etc. (Džeroski and Grbović 1995).
Decision trees are applied widely in habitat modelling. Džeroski and Drumm
(2003) have used classification tree models to predict habitat suitability for the sea
cucumber species Holothuria leucospilota on Rarotonga, Cook Islands. Kobler and
Adamič (1999) have used decision tree models to identify locations for the
construction of wildlife bridges across highways in Slovenia. Decision trees were used to
model habitat suitability for red deer in Slovenian forests using GIS data, such as
elevation, slope, and forest composition (Debeljak et al. 2001). Models of potential
and actual habitat for brown bears have been induced from GIS data and data on
brown bear sightings using decision trees (Jerina et al. 2003). Ogris and Jurc
(2007) have applied decision trees to identify potential habitats for different tree
species under varying climate change scenarios. Decision trees are also used in habitat
modelling of soil organisms that are under the influence of different soil
characteristics and crop practices (Kampichler et al. 2000, Debeljak et al. 2007).
Fig. 14.3: A classification tree modelling the presence of oilseed rape feral populations. The per-
centages give the predicted probability of presence of a feral population in 2003 according to the
situation. For instance, this probability on the whole area is 14% (at the root); in the absence of
an adjacent field in 2002 (Field02 = “0”), which is the best attribute to explain the presence of a
feral population in 2003, it is only 11% (left branch); while in the presence of such a field
(Field02 = “1”), it increases to 38% (right branch) (Debeljak et al. 2008).
Habitat modelling is also becoming relevant in agriculture, because crops such as
oilseed rape, sunflower, wheat or sorghum can escape from cultivation and colonise
field margins as feral populations. Habitat models help to control the processes
leading to the formation of new feral populations by identifying the conditions
under which such populations are likely to establish.
Such research has been conducted on a 41 km² production area of winter oilseed
rape in the Loir-et-Cher region, France (Pivard et al. 2008). Based on attributes
describing the locations of all cultivated oilseed rape fields and feral populations and
their demographic properties, a habitat model for feral oilseed rape was
developed (Fig. 14.3). The model predicts the probability of the presence of a feral
population in the studied area.
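The percentages shown in a tree such as the one in Fig. 14.3 are the fractions of positive examples among the examples that reach each node. The following minimal Python sketch illustrates this with hypothetical counts (900 field-margin sites, 100 of them adjacent to an oilseed rape field in 2002). The counts were chosen only so that the fractions match the percentages quoted in the caption; they are not the data of Pivard et al. (2008).

```python
# Hypothetical counts, chosen only to reproduce the percentages of Fig. 14.3.
# Each record: (Field02 value, feral population present in 2003, number of sites)
counts = [
    ("0", True, 88),
    ("0", False, 712),
    ("1", True, 38),
    ("1", False, 62),
]

def presence_probability(rows):
    """Fraction of sites with a feral population among the given rows."""
    total = sum(n for _, _, n in rows)
    present = sum(n for _, is_present, n in rows if is_present)
    return present / total

# Probability at the root and in the two branches of the split on Field02.
root = presence_probability(counts)
left = presence_probability([r for r in counts if r[0] == "0"])
right = presence_probability([r for r in counts if r[0] == "1"])

print(f"root: {root:.0%}, Field02=0: {left:.0%}, Field02=1: {right:.0%}")
# prints: root: 14%, Field02=0: 11%, Field02=1: 38%
```

The split on Field02 is informative because the two branch probabilities (11% and 38%) differ strongly from each other and from the root probability (14%).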
Side effects of the cultivation of oilseed rape (OSR) include volunteer plants that
emerge in the field after the cultivation of OSR and may cause crop impurity or weed
control problems. To understand the conditions that favour the formation of volunteer
populations of OSR, a habitat model predicting the presence and abundance of volunteer
oilseed rape (Brassica napus L.) has been induced from a dataset on the
seedbank at 257 arable fields used for baseline sampling in the British Farm Scale
Evaluations of genetically modified herbicide-tolerant (GMHT) crops (Debeljak et
al. 2008). Volunteer OSR was most likely to be present if a previous OSR crop had
been grown in the same field (Fig. 14.4). However, machine learning also revealed
previously unknown correlations between the abundance of volunteer oilseed
rape, the total seedbank and several other factors, such as the percentage of nitrogen and
carbon in the soil. Once OSR has been cultivated at a site, volunteers are not excluded
specifically from any part of the country or from sites with particular abiotic
characteristics, such as high pH or a low percentage of nitrogen. Moreover, volunteers were
present at 24% of the sites where there had been no OSR crop in the last 8 years,
presumably as a result of an earlier crop (beyond the 8 years recorded) or of seed
imported to the site with farm machinery. Their abundance also varied systematically
with factors that are generally associated with the intensity of farming, notably
total seedbank abundance, species number and plant life history groups (Fig.
14.5), and most consistently with the percentage of nitrogen and carbon in the soil.
All these factors were linked to some extent with geographical region, being smallest
in the arable south-central and south-east and largest in the north and south-west.
****Insert figure 14.4 here****
****Insert figure 14.5 here****
Fig. 14.4: Classification of presence of oilseed rape by crop type (C2-Type: crop type 2 years
before the sampling date; C5-Type: crop type 5 years before the sampling date; types are Oilseed,
Miscellaneous (Misc.), Cereal, Vegetable, grass ley or set aside (Ley)) (correctly classified
instances: 60.7%) (Debeljak et al. 2008).
Fig. 14.5: Classification of presence of oilseed rape by the abundance of plants (m²) of particular
functional groups (SloDet: slow, determinate development; SloOut: slow development living
below the crop canopy; FasIdt: fast indeterminate development) (correctly classified instances:
63.8%) (Debeljak et al. 2008).
14.6 Conclusion
This chapter introduced decision trees as one of the most popular machine learning
techniques used for ecological modelling. It also gave an overview of the use
of decision trees in ecological modelling, with a particular focus on population
dynamics and habitat suitability modelling. We have shown that the applications of
machine learning to population dynamics and habitat suitability modelling can be
grouped along two dimensions. One dimension is the type of environment where
the studied group of organisms lives, e.g. aquatic (river or sea) or terrestrial
(forest or agricultural fields). The other dimension is the type of machine
learning technique applied.
The major advantages of decision tree methods include the ability to capture interactions
between the variables used for modelling, the understandability of the
produced models (trees) and their efficiency. Decision tree learning methods can
learn models quickly from large quantities of data, involving a large number of
records (examples), a large number of columns (variables), or both. Decision
tree models also make predictions very quickly and can be used to classify large
numbers of examples. This is important in the context of pixel-based classification
in geographical information systems, where very large numbers of spatial
units/points need to be classified.
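To illustrate why prediction is cheap, the sketch below stores a small tree as nested Python dictionaries and classifies every cell of a stack of raster layers. Each prediction follows a single root-to-leaf path, so the cost per cell is bounded by the depth of the tree. The attribute names (elevation, slope), the thresholds and the class labels are hypothetical and are not taken from any model cited in this chapter.

```python
# A hypothetical habitat-suitability tree: internal nodes test one attribute
# against a threshold, leaves carry the predicted class.
tree = {
    "attribute": "elevation", "threshold": 800.0,
    "low": {"leaf": "unsuitable"},
    "high": {
        "attribute": "slope", "threshold": 25.0,
        "low": {"leaf": "suitable"},
        "high": {"leaf": "unsuitable"},
    },
}

def classify(node, cell):
    """Follow one root-to-leaf path for a single raster cell."""
    while "leaf" not in node:
        branch = "low" if cell[node["attribute"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["leaf"]

def classify_raster(tree, layers):
    """Classify every cell of a raster given as {attribute: 2-D list}."""
    names = list(layers)
    rows = len(layers[names[0]])
    cols = len(layers[names[0]][0])
    return [
        [classify(tree, {name: layers[name][r][c] for name in names})
         for c in range(cols)]
        for r in range(rows)
    ]

# Two hypothetical 2 x 2 raster layers covering the same area.
layers = {
    "elevation": [[500.0, 900.0], [1200.0, 950.0]],
    "slope": [[10.0, 12.0], [30.0, 25.0]],
}
suitability = classify_raster(tree, layers)
print(suitability)
# prints: [['unsuitable', 'suitable'], ['unsuitable', 'suitable']]
```

For real GIS layers with millions of cells, the same logic would normally be applied with array operations rather than Python loops, but the number of tests per cell stays the same.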
Decision tree learning is also capable of identifying the relevant variables from a
large set of independent variables. The resulting trees typically use only a few of
the variables available. This, however, can be a disadvantage in some situations:
if all the available variables contribute to the classification, it is very likely
that the tree will not use them all and will hence have lower performance.
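This variable selection is a side effect of tree induction: at each node, the learner keeps only the attribute that scores best on a split criterion. The following minimal sketch assumes information gain as the criterion (as used in ID3; Quinlan 1986) and a hypothetical four-example dataset in which presence depends on one attribute (moist) and not on the other (aspect). Other tree learners use different criteria, so this is an illustration rather than the procedure of any particular system.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on a discrete attribute."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Hypothetical toy data: presence depends on 'moist' only; 'aspect' is noise.
rows = [
    {"moist": "yes", "aspect": "N", "present": 1},
    {"moist": "yes", "aspect": "S", "present": 1},
    {"moist": "no", "aspect": "N", "present": 0},
    {"moist": "no", "aspect": "S", "present": 0},
]

gains = {a: information_gain(rows, a, "present") for a in ("moist", "aspect")}
best = max(gains, key=gains.get)
print(gains, best)
```

Here the split on moist removes all uncertainty about the class (a gain of 1 bit), while aspect yields no gain and would never be selected. When many attributes each carry a small part of the signal, the same greedy choice discards most of them, which is the disadvantage noted above.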
Other situations where decision trees may encounter problems are domains where
the variables are completely independent. In addition, small numbers of
examples (records) are quite problematic for decision trees. In both situations,
methods such as linear or logistic regression would be more appropriate.
Decision trees are derived from data only. No domain knowledge, or only a limited
amount of it, is used in the learning process. As such, they represent the data-driven
or empirical approach to ecological model construction, which is more appropriate
when we have plenty of high-quality (reliable and relevant) measured
data and little knowledge about the studied system. When only a few or low-quality
(unreliable or irrelevant) data are available and/or considerable knowledge about
the studied system exists, the classical knowledge-based paradigm of manual model
construction could be more appropriate.
References
Alpaydin E (2010) Introduction to machine learning, 2nd edition. MIT Press, Cambridge, MA
Appice A, Džeroski S (2007) Stepwise induction of multi-target model trees. In: Proc. 18th
European Conference on Machine Learning, LNCS 4701:502-509
Bell JF (1999) Tree based methods. In: Fielding AH (ed) Machine Learning Methods for Ecological
Applications, Kluwer Academic Publishers, Dordrecht
Blockeel H, Struyf J (2002) Efficient algorithms for decision tree cross-validation. Journal of
Machine Learning Research 3:621–650
Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: Proc. Fif-
teenth International Conference on Machine Learning, p. 55–63. San Mateo, CA, Morgan
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees.
Wadsworth, Belmont
Clark P, Boswell R (1991) Rule induction with CN2: Some recent improvements. Lecture
Notes in Computer Science 482:151-163
Debeljak M, Džeroski S, Adamič M (1999) Interactions among the red deer (Cervus elaphus, L.)
population, meteorological parameters and new growth of the natural regenerated forest in
Sneznik, Slovenia. Ecological Modelling 121:51–61
Debeljak M, Džeroski S, Jerina K, Kobler A, Adamič M (2001) Habitat suitability modelling of
red deer (Cervus elaphus, L.) in South-Central Slovenia. Ecological Modelling 138:321-330
Debeljak M, Cortet J, Demšar D, Krogh PH, Džeroski S (2007) Hierarchical classification of environmental
factors and agricultural practices affecting soil fauna under cropping systems using
Bt-maize. Pedobiologia 51:229-238
Debeljak M, Squire G, Demšar D, Young M, Džeroski S (2008) Relations between the oilseed
rape volunteer seedbank, and soil factors, weed functional groups and geographical location
in the UK. Ecological Modelling 212:138-146
Debeljak M, Kocev D, Towers W, Jones M, Griffiths B, Hallett P (2009) Potential of multi-objective
models for risk-based mapping of the resilience characteristics of soils: demonstration
at a national level. Soil Use and Management 25:66-77
Demšar D, Džeroski S, Larsen T, Struyf J, Axelsen J, Bruns-Pedersen M, Krogh PH (2006) Us-
ing multi-objective classification to model communities of soil microarthropods. Ecological
Modelling 191:131-143
Džeroski S, Grbović J (1995) Knowledge discovery in a water quality database. In: Proc. First International
Conference on Knowledge Discovery and Data Mining, pp. 81–86. AAAI Press,
Menlo Park, CA
Džeroski S, Todorovski L, Bratko I, Kompare B, Križman V (1999) Equation discovery with ecological
applications. In: Fielding AH (ed) Machine Learning Methods for Ecological Applications,
Kluwer Academic Publishers, Dordrecht
Džeroski S, Drumm D (2003) Using regression trees to identify the habitat preference of the sea
cucumber (Holothuria leucospilota) on Rarotonga, Cook Islands. Ecological Modelling
Džeroski S, Gjorgjioski V, Slavkov I, Struyf J (2007) Analysis of time series data with predictive
clustering trees. In: Proc. Fifth International Workshop on Knowledge Discovery in Inductive
Databases, LNCS 4747:63-80, Springer, Berlin
Fielding AH (1999) An introduction to machine learning methods. In: Fielding AH (ed) Machine
Learning Methods for Ecological Applications, Kluwer Academic Publishers,
Jeffers JNR (1999) Genetic algorithms I. In: Fielding AH (ed) Machine Learning Methods for
Ecological Applications, Kluwer Academic Publishers, Dordrecht
Jerina K, Debeljak M, Džeroski S, Kobler A, Adamič M (2003) Modelling the brown bear population
in Slovenia: A tool in the conservation management of a threatened species. Ecological
Modelling 170:453–469
Kampichler C, Džeroski S, Wieland R (2000) The application of machine learning techniques to
the analysis of soil ecological data bases: relationships between habitat features and Collembola
community characteristics. Soil Biology and Biochemistry 32:197-209
Kobler A, Adamič M (1999) Brown bears in Slovenia: identifying locations for construction of
wildlife bridges across highways. In: Proceedings of the 1999 International Conference on
Wildlife Ecology and Transportation,
Kompare B, Džeroski S (1995) Getting more out of data: Automated modelling of algal growth
with machine learning. In: Proc. International Conference on Coastal Ocean Space Utilization,
p. 209-220, University of Hawaii
Lek S, Guegan JF (1999) Application of Artificial Neural Networks in Ecological Modelling.
Ecological Modelling 120:2-3
Ogris N, Jurc M (2010) Sanitary felling of Norway spruce due to spruce bark beetles in Slovenia:
a model and projections for various climate change scenarios. Ecological Modelling 221:290-
Ogris N, Jurc M (2007) Potential changes in the distribution of maple species (Acer pseudoplatanus,
A. campestre, A. platanoides, A. obtusatum) due to climate change in Slovenia. In:
Proceedings of the Symposium on Climate Change Influences on Forests and Forestry. University
of Ljubljana, Slovenia
Pivard S, Demšar D, Lecomte J, Debeljak M, Džeroski S (2008) Characterizing the presence of
oilseed rape feral populations on field margins using machine learning. Ecological Modelling
RuleQuest (2009) Accessed 10 September 2010
Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81-106
Quinlan JR (1992) Learning with continuous classes. Proc. Fifth Australian Joint Conference on
Artificial Intelligence, pp. 343-348, World Scientific, Singapore.
Recknagel F, French M, Harkonen P, Yabunaka K (1997) Artificial neural network approach for
modelling and prediction of algal blooms. Ecological Modelling 96:11-28
Schleiter IM, Borchardt D, Wagner R, Dapper T, Schmidt KD, Schmidt HH, Werner H (1999)
Modelling water quality, bioindication and population dynamics in lotic ecosystems using
neural networks. Ecological Modelling 120: 271-286
Stankovski V, Debeljak M, Bratko I, Adamič M (1998) Modelling the population dynamics of
Red deer (Cervus elaphus L.) with regard to forest development. Ecological Modelling 108:145–
Struyf J, Džeroski S (2006) Constraint based induction of multi-objective regression trees. In:
Proc. Fourth International Workshop on Knowledge Discovery in Inductive Databases, Revised,
Selected and Invited Papers, LNCS 3933:222–233
Struyf J, Zenko B, Blockeel H, Džeroski S (2010) Clus: A Predictive Clustering System. Journal
of Machine Learning Research (Under review). Available for download from http://www.c- Accessed 10 September 2010
Todorovski L, Džeroski S, Kompare B (1998) Modelling and prediction of phytoplankton
growth with equation discovery. Ecological Modelling 113: 71-81
Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H (2008) Decision trees for hierarchical
multi-label classification. Machine Learning 73:185-214
Witten IH, Frank E (2005) Data mining: Practical machine learning tools and techniques. Morgan
Kaufmann, San Francisco
... Decision tree (DT) is rule based hierarchical technique used for data classification and predictive modelling (Debeljak and Džeroski, 2011;Murthy, 1998). DT able to handle data of different scale and capable to model relationship between objective and explanatory variables without any assumption about data distribution (BouKheir et al., 2010;Xu et al., 2005). ...
... CITs method is an unbiased binary recursive partitioning method and it follows statistical inference during each splitting step (Nembrini, 2019). DT technique used in groundwater (Naghibi et al., 2016;Pradhan, 2013;Youssef et al., 2016), ecological dynamics (Debeljak and Džeroski, 2011) and gully erosion (Gayen and Pourghasemi, 2019) predictive modelling. The technique also applied to predict trail width in protected area (Tomczyk and Ewertowski, 2013). ...
Full-text available
Trails have high conservation value which provides access to the protected area. But expansion of recreational activities along the trail has notably disturbed its environmental quality. The rapid increase of recreational activities along trails of Sikkim Himalayan region has become a major environmental concern. Therefore, modelling and mapping of sensitive trails are essential aspects for decision makers. The present study integrates RS-GIS with different machine learning algorithms to prepare trail susceptibility mapping. Furthermore, the study compares the predictive performance of logistic regression (LR), decision tree (DT) and random forest (RF) model for trail susceptibility mapping. Here we have considered seventeen trail susceptibility conditioning factors as model input. Thereafter, the dataset was randomly divided into two parts: training dataset (70%) and validation dataset (30%). Multicollinearity analysis carried using variance inflation factor (VIF) and tolerance (TOL) to reduce model biasness. Thereafter, trail susceptibility map prepared using LR, DT and RF models. Finally, Receiver operating curve (ROC)- area under curve (AUC) method, statistical overall accuracy (OA) and Kappa index were used to measure the predictive performance of the models. The study concluded that LR (AUC-0.948, OA-94.8% and Kappa Index- 0.897) gives better performance in overall accuracy assessment as compared to DT (AUC- 0.931, OA- 93% and Kappa Index- 0.862) and RF (AUC- 0.914, OA- 91.3% and Kappa Index- 0.828) model.
... Based on susceptibility levels, DT organizes and categorizes conditioning factors into hierarchical and homogeneous categories. The purpose of building a tree is to come up with a set of decision rules that can be used to predict the outcome based on a set of input factors (Debeljak and Džeroski, 2011;Tehrany et al., 2019). As a result, the rules are developed by examining a set of factors in order to predict an event from a similar set of data (Myles et al., 2004;Tehrany et al., 2019). ...
Conference Paper
Full-text available
In the field of hydrology, floods are a topic of study, and they pose a serious threat to agriculture as well as civil engineering and the public health sector. When used in the meaning of "flowing water," it can also refer to the tide's inflow. Flooding can be caused by the overflow of water from rivers, lakes, and oceans. Floods have wreaked havoc on people's lives and property in recent decades. Escaping quickly depends on an early warning system that can foresee floods. As a result of global flood predictions, a variety of technical methodologies have been used (such as AI machine learning, GIS, remote sensing). So, different methodologies necessitate distinct data sets. For example, a machine learning-based prediction system involves rainfall, humidity, temperature, water flow, and water level. Flow rates, typical temperature ranges, cloud visibility data, soil type, land cover categorization, and other variables are needed for remote sensing and GIS-based prediction. Flood prediction technology is examined in this review paper. The performance comparison of technological approaches provides a detailed grasp of the various strategies within the context of a thorough review and discussion. Using high-precision technology tools, it is hoped that the communities living near wetlands would be better served. Also, it's important to come up with concrete advice in the Bangladesh’s context on how to improve the precaution by using flood prediction. When picking the appropriate technological procedure for a given assignment forecast, hydrologists and civil engineers can utilize the results of this study as a reference. This study has also, by comparative analysis tried to show that why machine learning (ML) is better to use rather than other approaches. Initially stating that this model is suitable for any region around the world.
... A decision tree is a machine learning technique for decision-based analysis and interpretation in business, computer science, civil engineering, and ecology (Bel et al. 2009;Debeljak and Džeroski 2011;Krzywinski and Altman 2017). Decision tree provides solutions to varieties of regression data mining problems used for decision-making and good management practices (Maimon and Rokach 2005;Breiman et al. 2017). ...
... Simultaneously, the highest accuracy rates were mostly achieved for the algorithms built on the decision tree model: DT, RF, GBM and XGB. Thus, also within the analyzed groups of factors, the variables of the model are arranged in a hierarchical system with a root node and decision nodes (Debeljak and Džeroski, 2011). The highest accuracy rates for approximately every group were found for the RF model, which is important for the next stage of analysis with gradient boosting, in which the key parameter is the number of trees selected for boosting (Ferrario and Hämmerli, 2019). ...
The extent of changes at the ecosystem level due to external dynamics from year to year of meteorological conditions is one of the basic determinants of climate change. The object of the study is the sensitivity of a set of environmental and planktonic factors shaping the ecosystem of a temperate shallow lagoon (the Vistula Lagoon, South Baltic) to weather fluctuations from year to year. The specific question concerned the short-term advantage of the sequence of environmental changes related to the impact of wind or those caused by the air temperature. The average speed in the prevailing westerly winds in the first summer research season was 3.05 m/s, while in the second it was 3.84 m/s. Simultaneously, the air temperature changed on average from 16.8 to 17.3 °C. The accuracy of the division of the two-year data set of ecosystemic parameters into the first colder and less windy and the second, warmer and windier summer was analyzed. For this purpose, several machine learning models were used. Next, the model with the highest accuracy was selected for the explanatory modelling based on game theory metrics Shapley Value. The analysis based on the interactions among ecological factors shows that the dynamics of the climate on a year-to-year scale can bring significantly more environmental changes in the shallow lagoon by an increase in wind speed rather than by an increase in air temperature. More windy weather in the summer in the subsequent year caused higher wind action, suspended solids, silicates and diatoms concentration. Simultaneously, the same conditions resulted in lower concentrations of dissolved organic carbon and nitrogen forms in water, accompanied by a reduction of Cyanobacteria biomass. The procedure presented in the study can be used for environmental prognostics of the environmental effects of climate change.
... Decision tree methods such as classification and regression trees can also be used as alternatives to logistic regression. These work by iteratively splitting the data into distinct subsets, with the splits chosen in such a way that entropy in the resulting subsets is minimised (Debeljak and Džeroski, 2011). Decision tree outputs typically have high accuracy and stability and should be straightforward to understand even for people with non-statistical backgrounds. ...
One key component of any eutrophication management strategy is establishment of realistic thresholds above which negative impacts become significant and provision of ecosystem services is threatened. This paper introduces a toolkit of statistical approaches with which such thresholds can be set, explaining their rationale and situations under which each is effective. All methods assume a causal relationship between nutrients and biota, but we also recognise that nutrients rarely act in isolation. Many of the simpler methods have limited applicability when other stressors are present. Where relationships between nutrients and biota are strong, regression is recommended. Regression relationships can be extended to include additional stressors or variables responsible for variation between water bodies. However, when the relationship between nutrients and biota is weaker, categorical approaches are recommended. Of these, binomial regression and an approach based on classification mismatch are most effective although both will underestimate threshold concentrations if a second stressor is present. Whilst approaches such as changepoint analysis are not particularly useful for meeting the specific needs of EU legislation, other multivariate approaches (e.g. decision trees) may have a role to play. When other stressors are present quantile regression allows thresholds to be established which set limits above which nutrients are likely to influence the biota, irrespective of other pressures. The statistical methods in the toolkit may be useful as part of a management strategy, but more sophisticated approaches, often generating thresholds appropriate to individual water bodies rather than to broadly defined “types”, are likely to be necessary too. 
The importance of understanding underlying ecological processes as well as correct selection and application of methods is emphasised, along with the need to consider local regulatory and decision-making systems, and the ease with which outcomes can be communicated to non-technical audiences.
... DT is mainly aimed at exploring a set of decision rules applicable to the prediction of an outcome considering a set of inputs. In cases where the target variables are continuous or discrete, DT is termed regression tree or classification tree, respectively [60]. Numerous researchers have used DT effectively in different real conditions for prediction and/or classification purposes [61]. ...
Full-text available
The application of artificial neural networks in mapping the mechanical characteristics of the cement-based materials is underlined in previous investigations. However, this machine learning technique includes several major deficiencies highlighted in the literature, such as the overfitting problem and the inability to explain the decisions. Hence, the present study investigates the applicability of other common machine learning techniques, i.e., support vector machine, random forest (RF), decision tree, AdaBoost and k-nearest neighbors in mapping the behavior of the compressive strength (CS) of cement-based mortars. To this end, a big experimental database has been compiled based on experimental data available in the literature considering, namely the cement grade, which is an important parameter for the modeling of mortar’s CS. Other important parameters are namely the age, the water-to-binder ratio, the particle size distribution of the sand and the amount of plasticizer. Many models based on the influential factors affecting machine learning techniques have been developed, and their prediction capacities have been assessed using performance indexes. The present research highlights the potential of AdaBoost and RF models as useful tools which can assist in mortar design and/or optimization. In addition, mapping the development of mortar characteristics can assist in revealing the influence of the different mortar mix parameters on the compressive strength.
... DT is mainly aimed at exploring a set of decision rules applicable to the prediction of an outcome considering a set of inputs. In cases where the target variables are continuous or discrete, DT is termed regression tree or classification tree, respectively [60]. Numerous researchers have used DT effectively in different real conditions for prediction and/or classification purposes [61]. ...
This study aims to implement a hybrid ensemble surrogate machine learning technique in predicting the compressive strength (CS) of concrete, an important parameter used for durability design and service life prediction of concrete structures in civil engineering projects. For this purpose, an experimental database consisting of 1030 records has been compiled from the machine learning repository of the University of California, Irvine. The database was used to train and validate four conventional machine learning (CML) models, namely Artificial Neural Network (ANN), Linear and Non-Linear Multivariate Adaptive Regression Splines (MARS-L and MARS-C), Gaussian Process Regression (GPR), and Minimax Probability Machine Regression (MPMR). Subsequently, the predicted outputs of CML models were combined and trained using ANN to construct the Hybrid Ensemble Model (HENSM). It is observed that the proposed HENSM produces higher predictive accuracy compared to the CML models used in the present study. The predictive performance of all models for CS prediction was compared using the testing dataset and it is found that the HENSM model attained the highest predictive accuracy in both phases. Based on the experimental results, the newly constructed HENSM model is very potential to be a new alternative in handling the overfitting issues of CML models and hence, can be used to predict the concrete CS, including the design of less polluting and more sustainable concrete constructions.
... As such, it starts with all the training examples then selects the variable that fits best and makes some subsets. The tree branches are the result of a test performed at each step by the algorithm on the middle nodes, and predictions appear on tree leaves (Debeljak & Džeroski, 2011). The M5 tree model is able to predict numerical Page 7 of 15 162 continuous variables from the numerical attributes, and the predicted results appear as multivariate linear regression models on tree leaves (Wang & Witten, 1997). ...
Full-text available
Understanding the spatial distribution of soil nutrients and factors affecting their concentration and availability is crucial for soil fertility management and sustainable land utilization while quantifying factors affecting soil nitrogen distribution in Qorveh-Dehgolan plain is mostly lacking. This study, thus, aimed at digital modeling and mapping the spatial distribution of topsoil total nitrogen (TN) in Qorveh-Dehgolan plain with an area of 150,000 ha using random forest (RF), decision tree (DT), and cubist. (CB) algorithms. A total of 130 observation points were collected from a depth of 0 to 30 cm from topsoil surfaces based on a random sampling pattern. Then, soil physicochemical properties, calcium carbonate equivalent, organic carbon, and topsoil total nitrogen were measured. A number of 51 environmental variables including 31 geomorphometric attributes derived from a digital elevation model with 12.5-m spatial resolution, 13 spectral indices and reflectance from SENTINEL-2 satellite (MSI sensor), and five soil properties and two spatial variables of latitude and longitude were used as covariates for digital mapping of topsoil total nitrogen. The most appropriate covariates were then selected by the Boruta algorithm in the R software environment. A standard deviation map was produced to show model uncertainty. The covariate selection resulted in the separation of 14 effective covariates in the spatial prediction of topsoil total nitrogen by using the data mining algorithms. The validation of digital mapping of topsoil total nitrogen by RF, DT, and CB models using 20% of independent data showed a root mean square error (RMSE) of 0.032, 0.035, and 0.043%; mean absolute error (MAE) of 0.0008, 0.001, and 0.002%; and based on the coefficients of determination of 0.42, 0.38, 0.35, respectively. 
Relative importance (RI) of environmental covariates using the %IncMSE index indicated the importance of two geomorphometric variables of mid-slope position and normalized height along with SAVI and NDVI remote sensing variables in the spatial modeling and distribution of total nitrogen in the studied lands. The RF prediction and associated uncertainty maps, with show high accuracy and low standard deviation in most part of the study area, revealed low overfitting and overtraining in soil-landscape modeling; so, this model can lead to the development of a digital map of soil surface properties with acceptable accuracy for sustainable land utilization. Keywords: Digital soil mapping · Tree-based models · Soil nitrogen mapping · Boruta feature selection
Full-text available
The publication entitled “Determinants of success of small and medium-sized enterprises” presents an innovative approach to the issue of identifying favourble conditions for attaining success by enterprises. The publication focuses on a comprehensive assessment of the importance of selected determinants of success in the activities of enterprises, from the group of internal conditions (the enterprise’s internal environment) and the determinants of the local and institutional environment. This topic is generally not included in standard analyses and surveys of public statistics. An important stage of the presented analysis is the construction of a model of dependence between the proposed aggregate measure of success and the determinants of success of enterprises from the group of SMEs, using the technique of regression trees. Distinguishing family enterprises within the personal scope of this publication on the basis of a subjective criterion - respondents' self-recognition of a company as a family business - made it possible to diagnose the scale of family business in Poland, to determine its basic quantitative characteristics, as well as to analyse its basic behaviours and success factors. Analytical comments are supplemented with numerous charts and maps, which illustrate the results of the analyses in an accessible manner.
This study proposes an ecological grouping of woody species from the Araucaria Mixed Forest (AMF) of Paraná state based on a functional approach. For this purpose, functional diversity metrics were calculated for three AMF communities (Floresta Nacional de Irati – FNI, General Carneiro – GNC and São João do Triunfo – SJT) from the values of nine functional traits of the species present in each community: Specific Leaf Area (SLA), Seed Mass (MS), Maximum Potential Height (Hmax), Wood Density (WD), Periodic Annual Increment (PAI), Annual Mortality Rate (M%), Dispersal Mode (DM), Reproductive System (RS) and Leaf Renewal (LR). Seventy-eight species were selected from these communities and used as a general sample of the Paraná state AMF; they were grouped by hierarchical clustering on their trait values. The groups were compared statistically using Generalized Linear Models (GLM) and post hoc tests, and interpreted using Decision Trees (DT) and Canonical Correlation Analysis (CCA). The SJT community had the highest functional diversity, explained by the environmental heterogeneity of its fragments, while FNI had the lowest functional diversity and richness, but with low divergence in the functional roles of its species, indicating a more stable environmental condition. GNC, on the other hand, was the community with the greatest functional richness and divergence, indicating that the dominant species in this community play distinct functional roles. The clustering of the 78 species revealed nine groups of ecological strategies: long-lived pioneers, wind-dispersed secondaries, short-lived pioneers, pioneers, facultative secondaries, small-sized late trees, late trees, secondary gap opportunists and late secondaries. The characteristics of the clusters largely corroborate the theories of ecological resource allocation and compensation strategies, forming ecologically coherent groups that can be extrapolated to the Paraná state AMF.
Conference Paper
Constraint-based inductive systems are a key component of inductive databases and are responsible for building the models that satisfy the constraints in the inductive queries. In this paper, we propose a constraint-based system for building multi-objective regression trees. A multi-objective regression tree is a decision tree capable of predicting several numeric variables at once. We focus on size and accuracy constraints. By specifying either a maximum size or a minimum accuracy, the user can trade off size (and thus interpretability) for accuracy. Our approach is to first build a large tree based on the training data and to prune it in a second step to satisfy the user constraints. This has the advantage that the tree can be stored in the inductive database and used for answering inductive queries with different constraints. Besides size and accuracy constraints, we also briefly discuss syntactic constraints. We evaluate our system on a number of real-world data sets and measure the size versus accuracy trade-off.
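To make "predicting several numeric variables at once" concrete, the following is a minimal sketch of how one split of a multi-objective regression tree could be chosen. It assumes a single numeric attribute and uses the size-weighted sum of per-target variances as the impurity; that heuristic, and the function names `summed_variance` and `best_threshold`, are illustrative assumptions and not taken from the paper.

```python
from statistics import pvariance

def summed_variance(rows):
    """Sum of per-target population variances, weighted by the number of rows.

    rows is a list of target tuples, one tuple per example.
    In practice targets on different scales would usually be normalised first.
    """
    if not rows:
        return 0.0
    targets = list(zip(*rows))  # one tuple of values per target variable
    return len(rows) * sum(pvariance(t) for t in targets)

def best_threshold(xs, ys):
    """Return (threshold, impurity) minimising the summed variance of both children.

    xs: numeric attribute values; ys: target tuples aligned with xs.
    Returns (None, parent_impurity) if no split improves on the parent.
    """
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    best = (None, summed_variance(ys))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = summed_variance(left) + summed_variance(right)
        if impurity < best[1]:
            best = ((lo + hi) / 2, impurity)
    return best
```

Applying such a search recursively grows a large tree; a size constraint of the kind described above could then be met by pruning subtrees in a second pass.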
Hierarchical multi-label classification (HMC) is a variant of classification where instances may belong to multiple classes at the same time and these classes are organized in a hierarchy. This article presents several approaches to the induction of decision trees for HMC, as well as an empirical study of their use in functional genomics. We compare learning a single HMC tree (which makes predictions for all classes together) to two approaches that learn a set of regular classification trees (one for each class). The first approach defines an independent single-label classification task for each class (SC). Obviously, the hierarchy introduces dependencies between the classes. While they are ignored by the first approach, they are exploited by the second approach, named hierarchical single-label classification (HSC). Depending on the application at hand, the hierarchy of classes can be such that each class has at most one parent (tree structure) or such that classes may have multiple parents (DAG structure). The latter case has not been considered before and we show how the HMC and HSC approaches can be modified to support this setting. We compare the three approaches on 24 yeast data sets using as classification schemes MIPS’s FunCat (tree structure) and the Gene Ontology (DAG structure). We show that HMC trees outperform HSC and SC trees along three dimensions: predictive accuracy, model size, and induction time. We conclude that HMC trees should definitely be considered in HMC tasks where interpretable models are desired.
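The dependency that a class hierarchy introduces can be illustrated with a small sketch. It assumes the commonly used hierarchy constraint (an instance that belongs to a class also belongs to all of that class's ancestors) and represents the hierarchy as a mapping from each class to its direct parents, which covers both tree and DAG structures. The function name `with_ancestors` and the toy hierarchy are hypothetical.

```python
def with_ancestors(labels, parents):
    """Return the label set closed under the ancestor relation.

    parents maps each class to a list of its direct parents (empty for roots);
    a class may have several parents, so DAG hierarchies are handled too.
    """
    closed = set()
    stack = list(labels)
    while stack:
        c = stack.pop()
        if c in closed:
            continue  # already expanded via another path in the DAG
        closed.add(c)
        stack.extend(parents.get(c, []))
    return closed

# Toy DAG: class "d" has two parents, "b" and "c", which share the root "a".
toy_parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
expanded = with_ancestors({"d"}, toy_parents)
```

A single HMC tree predicts the whole expanded label vector at once, whereas the per-class approaches train one binary classifier for each entry of that vector.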
In contrast with traditional modeling methods, which are used to identify parameter values of a model with known structure, equation discovery systems also identify the structure of the model. The model generated with such systems can give experts a better insight into the measured data and can also be used for predicting future values of the measured variables. This paper presents lagramge, an equation discovery system that allows the user to define the space of possible model structures and to make use of domain-specific expert knowledge in the form of function definitions. We use lagramge to automate the modeling of phytoplankton growth in Lake Glumsoe, Denmark. The structure of the model constructed with lagramge agrees with human experts’ expectations. The model can be successfully used for long-term prediction of phytoplankton concentration during algal blooms.
Following a preliminary study (Stankovski et al., Ecol. Modelling, 108, 1998), we use machine learning techniques to conduct a more detailed analysis of the interactions among the red deer population, meteorological parameters and new forest growth. We use the machine learning program M5 (Quinlan, Proc. 10th Int. Conf. Machine Learning, Morgan Kaufmann, San Mateo CA, 1993) that learns regression trees to automate the modelling of dynamic interactions. An area of 40 000 hectares of naturally regenerated forest on the high Dinaric Karst of Notranjska, Slovenia, is studied. The analysis uses data collected during the period 1976–1993, which include several meteorological parameters, the degrees of browsing intensity of new growth of woody plants (beech and maple), and parameters about the population of red deer. Models of the degree of beech browsing and calf weight were studied earlier; here, we automatically induce models of the red deer population size, the degree of beech and maple browsing, calf weight for 1- and 2-year-olds, and hind weight. The induced models are evaluated in terms of predictive accuracy and interpreted for their explanatory power. The models show that the meteorological parameters, the parameters of the red deer population and the rates of the browsing intensity of the new growth form a complex system with closely related parameters. While these interactions can be mainly explained by our current knowledge, we still gain some new knowledge from the automatically induced models. The results emphasise the importance of a pluralistic approach and a holistic perception of the system formed by meteorological conditions, the red deer population and the new growth in a forest ecosystem.
The population dynamics of soil organisms under agricultural field conditions are influenced by many factors, such as pedology and climate, but also farming practices such as crop type, tillage and the use of pesticides. To assess the real effects of farming practices on soil organisms it is necessary to rank the influence of all of these parameters. Bt maize (Zea mays L.), as a crop recently introduced into farming practices, is a genetically modified maize with the Cry1Ab gene which produces a protein toxic to specific lepidopteran insect pests. To assess the effects of Bt maize on non-target soil organisms, we conducted research at a field site in Foulum (Denmark) with a loamy sand soil containing 6.4% organic matter. The study focused on populations of springtails (Collembola) and earthworms (Oligochaeta) from samples taken at the beginning and at the end of the maize crop-growing season during 2 consecutive years. Farming practices, soil parameters, the biological structure of soil communities, and the type and age of the crop at the time of sampling, were used as attributes to predict the total abundance of springtails and biomass of earthworms in general and the abundance or biomass for specific functional groups (epigeic, endogeic and anecic groups for earthworms, and eu-, eu to hemi-, hemi-, hemi to epi- and epiedaphic groups for Collembola). Predictive models were built with data mining tools, such as regression trees that predict the value of a dependent variable from a set of independent variables. Regression trees were constructed with the data mining system M5′. The models were evaluated by qualitative and quantitative measures of performance and two models were selected for further interpretation: anecic worms and hemi-epiedaphic Collembola. The anecic worms (r² = 0.83) showed preferences for less clay and more silt soil with medium pH but were not influenced directly by farming practices. The biomass of earthworms was greater in early autumn than in spring or late autumn. Biomass of hemi-epiedaphic Collembola (r² = 0.59) increased at the end of the maize growing season, while higher organic matter content and pH tended to increase their biomass in spring. Greater abundance of Collembola was also noted in early autumn if the crop was non-Bt maize. The models assessed by this research did not find any effects of the Bt maize cropping system on functional groups of soil fauna.
Data mining techniques were applied to model the presence and abundance of volunteer oilseed rape (OSR) (Brassica napus L.) in the seedbank at 257 arable fields used for baseline sampling in the UK's Farm Scale Evaluations of genetically modified herbicide tolerant (GMHT) crops. Constructed models were supported by statistical tests. Volunteer OSR was most likely present if a previous OSR crop had been grown in the same field, but it was also present at sites where it had not been grown in the previous 8 years (24% of all fields). In 136 fields where it was found, it showed a slow decline in abundance since the last crop. However, data mining indicated previously unfound correlations between oilseed rape abundance, total seedbank and several other factors, notably percent of nitrogen and percent of carbon in the soil, all of which were smallest in the centre of arable production in southern England and greatest in the surrounding south-west, west and north. In a separate analysis, its abundance was also associated with particular plant life history groups, which include broadleaf weeds such as Capsella and Matricaria species, having a similar phenology to oilseed rape, between rapidly developing annuals and the biennials and perennials. The findings are a reference point in the evolution of oilseed rape as a weed and potential GM impurity. Data mining approaches provide models that may be used to assess the status of volunteer OSR in other countries or at a later time in the UK.
In agricultural soil, a suite of anthropogenic events shape the ecosystem processes and populations. However, the impact from anthropogenic sources on the soil environment is almost exclusively assessed for chemicals, although other factors like crop and tillage practices have an important impact as well. Thus, the farming system as a whole should be evaluated and ranked according to its environmental benefits and impacts. Our starting point is a data set describing agricultural events and soil biological parameters. Using machine learning methods for inducing regression and model trees, we produce empirical models able to predict the soil quality from agricultural measures in terms of quantities describing the soil microarthropod community. We are also interested in discovering additional higher level knowledge. In particular, we have identified the most important factors influencing the population densities of springtails and mites and their biodiversity. We also identify to which agricultural actions different microarthropods react distinctly. To obtain this higher level knowledge, we employ multi-objective regression trees.
Policy makers rely on risk-based maps to make informed decisions on soil protection. Producing the maps, however, can often be confounded by a lack of data or appropriate methods to extrapolate using pedotransfer functions. In this paper, we applied multi-objective regression tree analysis to map the resistance and resilience characteristics of soils onto stress. The analysis used a machine learning technique of multiple regression tree induction that was applied to a data set on the resistance and resilience characteristics of a range of soils across Scotland. Data included both biological and physical perturbations. The response to biological stress was measured as changes in substrate mineralization over time following a transient (heat) or persistent (copper) stress. The response to physical stress was measured from the resistance and recovery of pore structure following either compaction or waterlogging. We first determined underlying relationships between soil properties and its resistance and resilience capacity. This showed that the explanatory power of such models with multiple dependent variables (multi-objective models) for the simultaneous prediction of interdependent resilience and resistance variables was much better than a piecewise approach using multiple regression analysis. We then used GIS techniques coupled with an existing, extensive soil data set to up-scale the results of the models with multiple dependent variables to a national level (Scotland). The resulting maps indicate areas with low, moderate and high resistance and resilience to a range of biological and physical perturbations applied to soil. More data would be required to validate the maps, but the modelling approach is shown to be extremely valuable for up-scaling soil processes for national-level mapping.
In the Pacific Islands, invertebrates including sea cucumbers are among the most valuable and vulnerable inshore fisheries resources. As human activities continue to force substantial impacts on coral reef ecosystems, the management of inshore fisheries has become an increasingly important priority. Knowledge of the distribution, biology and habitat requirements of a species can significantly enhance conservation efforts. The sea cucumber (Holothuria leucospilota) forms an important part of the traditional subsistence fishery on Rarotonga, Cook Islands, yet little is known of this species' present spatial distribution and abundance around the island. We apply two machine learning approaches and a classical statistical approach to predict the number of sea cucumber individuals from site characteristics. The machine learning methods used are induction of regression trees and instance-based learning. These are compared to the classical statistical approach of linear regression. The most accurate predictions are obtained using instance-based learning, while the most understandable descriptions are obtained using regression tree induction.
The assessment of properties and processes of running waters is a major issue in aquatic environmental management. Because system analysis and prediction with deterministic and stochastic models is often limited by the complexity and dynamic nature of these ecosystems, supplementary or alternative methods have to be developed. We tested the suitability of various types of artificial neural networks for system analysis and impact assessment in different fields: (1) temporal dynamics of water quality based on weather, urban storm-water run-off and waste-water effluents; (2) bioindication of chemical and hydromorphological properties using benthic macroinvertebrates; and (3) long-term population dynamics of aquatic insects. Specific pre-processing methods and neural models were developed to assess relations among complex variables with high levels of significance. For example, the diurnal variation of oxygen concentration (modelled from precipitation and oxygen of the preceding day; R² = 0.79), population dynamics of emerging aquatic insects (modelled from discharge, water temperature and abundance of the parental generation; R² = 0.93), and water quality and habitat characteristics as indicated by selected sensitive benthic organisms (e.g. R² = 0.83 for pH and R² = 0.82 for diversity of substrate, using five out of 248 species). Our results demonstrate that neural networks and modelling techniques can conveniently be applied to the above-mentioned fields because of their specific features compared with classical methods. Particularly, they can be used to reduce the complexity of data sets by identifying important (functional) inter-relationships and key variables. Thus, complex systems can be reasonably simplified in clear models with low measuring and computing effort. This allows new insights about functional relationships of ecosystems with the potential to improve the assessment of complex impact factors and ecological predictions.