
A study on the effects of unbalanced data when fitting logistic regression models in ecology

Abstract

Binary variables have two possible outcomes: occurrence or non-occurrence of an event (usually coded as 1 and 0, respectively). Binary data are common in ecology, including studies of presence/absence, alive/dead, and change/no-change. Logistic regression analysis has been widely used to model binary response variables. Unbalanced data (i.e., a much larger proportion of zeros than ones) are often found across a variety of ecological datasets. Sometimes the data are balanced (i.e., the same number of zeros and ones) before fitting the model; however, the statistical implications of balancing (or not) the data remain unclear. We assessed the statistical effects of balancing data when fitting a logistic regression model by studying both the statistical properties of the estimated parameters and the predictive capabilities of the fitted model. We used a base forest-mortality model as reference and, using stochastic simulations representing different configurations of 0/1 data in a sample (unbalanced data scenarios), fitted the logistic regression model by maximum likelihood. For each scenario we computed the bias and variance of the estimated parameters and several prediction indexes. We found that the variability of the estimated parameters is affected, with the balanced-data scenario having the lowest variability, thus affecting the statistical inference as well. Furthermore, the prediction capabilities of the model are altered by balancing the data, with the balanced-data scenario having a better sensitivity/specificity ratio. Balancing, or not, the data used for fitting a logistic regression model may affect the conclusions that can arise from the fitted model and its subsequent applications.
Ecological Indicators
journal homepage: www.elsevier.com/locate/ecolind
Research paper
Christian Salas-Eljatib (a), Andres Fuentes-Ramirez (a), Timothy G. Gregoire (b), Adison Altamirano (c), Valeska Yaitul (a)
(a) Laboratorio de Biometría, Departamento de Ciencias Forestales, Universidad de La Frontera, Temuco, Chile
(b) School of Forestry and Environmental Studies, Yale University, New Haven, CT 065111, USA
(c) Laboratorio de Ecología del Paisaje Forestal, Departamento de Ciencias Forestales, Universidad de La Frontera, Temuco, Chile
ARTICLE INFO
Keywords:
Statistical inference
Model prediction
Logit model
Binary variable
Bias
Precision
1. Introduction
Data on the occurrence/non-occurrence of a phenomenon of interest are widely found across several disciplines (Alberini, 1995; Arana and Leon, 2005; Bell et al., 1994). This type of variable is known as binary or dichotomous, and it represents whether an event occurs or not. The event is represented by the random variable Y, and we usually record occurrence by Y = 1 and non-occurrence by Y = 0. In ecology, binary variables arise when studying the presence of a species in a geographic area (Bastin and Thomas, 1999; Phillips and Elith, 2013; Hastie and Fithian, 2013) or the occurrence of mortality at the tree or forest level (Davies, 2001; Wunder et al., 2008; Chao et al., 2009; Young et al., 2017). In landscape ecology, binary variables are used to represent the occurrence of fire within a given area (Bigler et al., 2005; Mermoz et al., 2005; Dickson et al., 2006; Vega-García and Chuvieco, 2006; Palma et al., 2007; Bradstock et al., 2010); deforestation (Wilson et al., 2005; Schulz et al., 2011; Kumar et al., 2014; Hu et al., 2014); and, in general, the change from one land use category to another (Seto and Kaufmann, 2005; Leyk and Zimmermann, 2007; Lander et al., 2011).
http://dx.doi.org/10.1016/j.ecolind.2017.10.030
Received 7 April 2017; Received in revised form 9 August 2017; Accepted 16 October 2017
Corresponding author. Tel.: +56 45 2325652.
E-mail address: christia.salas@ufrontera.cl (C. Salas-Eljatib).

Logistic regression is the most frequently used approach in ecology for modelling a binary response variable, that is, for statistically relating it to predictor variables or covariates (Warton and Hui, 2011). These models belong to the group of generalized linear models (GLM). In a GLM, three components must be specified (Lindsey, 1997; Schabenberger and Pierce, 2002): a random component, a systematic component, and a link function. A logistic regression model uses a binomial probability distribution as the random component; a linear predictor function Xβ (where X is a matrix with the covariates and β is a vector with the parameters or coefficients) as the systematic component; and a logit link function (whose inverse is the logistic equation). One of the key advantages of using logistic regression models in ecology is that the probability of the binary response variable is directly modelled, thereby accounting explicitly for the random nature of the phenomenon of interest.
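The link between the linear predictor and the modelled probability can be illustrated with a minimal Python sketch. The coefficient and covariate values below are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def logit(p):
    """Link function: map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link (logistic equation): map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and covariates (illustrative values only).
beta = [-1.0, 0.8, -0.5]   # intercept and two slopes
x = [1.0, 2.0, 1.5]        # the leading 1 multiplies the intercept

eta = sum(b * xi for b, xi in zip(beta, x))  # systematic component, X * beta
pi = inv_logit(eta)                          # modelled Pr(Y = 1 | X)
print(round(eta, 4), round(pi, 4))
```

Applying `logit` to the resulting probability recovers the linear predictor, which is what makes the model linear on the logit scale.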
In many applications dealing with binary data in ecology, the number of observations with ones (Y = 1) is much smaller than the number of observations with zeros (Y = 0), or vice versa. We simply term this situation unbalanced data, but other terms have also been used, including disproportionate sampling (Maddala, 1992) and rare events (King and Zeng, 2001). Based on our review of scientific applications of logistic regression to model ecological phenomena, the proportion of zeros in datasets ranges between 80% and 95%. Therefore, having balanced data (i.e., equal numbers of observations of zeros and ones) is more the exception than the rule.
Both unbalanced and balanced data have been used for fitting logistic regression models. In ecological studies, some researchers have adopted the practice of balancing the data before carrying out the analyses (e.g., Vega-García et al., 1995; Vega-García et al., 1999; Lloret et al., 2002; Brook and Bowman, 2006; Vega-García and Chuvieco, 2006; Jones et al., 2010; Rueda, 2010). Balancing data means selecting, by some rule (usually at random), the same number of observations with ones and zeros from the originally available dataset. A balanced dataset or balanced sample is thus created, with a 50–50% proportion of zero and one values. After the balanced dataset is built, the logistic regression model is fitted (i.e., its parameters are estimated) by maximum likelihood (ML). An example of this practice in ecological applications is the option for balancing data before fitting a logit model when conducting analyses of land use changes in the software IDRISI (Eastman, 2006). On the other hand, unbalanced data have also been used in ecological studies (Wilson et al., 2005; Echeverria et al., 2008; Kumar et al., 2014; Young et al., 2017). In applied ecological studies, therefore, unbalanced data have been treated as having no important effects on the fitted models. Moreover, to date, no studies have addressed the effect of balancing data when fitting logistic regression models in ecological analyses, and just a handful have explored some statistical implications in ecological applications (Qi and Wu, 1996; Wu et al., 1997; Cailleret et al., 2016).
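The balancing practice described above can be sketched in a few lines of Python. This is only one possible rule (random undersampling of the more frequent class); the function name and the synthetic response vector are assumptions made for illustration and do not reproduce any particular study or software.

```python
import random

def balance_by_undersampling(y, seed=0):
    """Return indices of a balanced sample: an equal number of ones and zeros,
    each drawn at random without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    ones = [i for i, v in enumerate(y) if v == 1]
    zeros = [i for i, v in enumerate(y) if v == 0]
    k = min(len(ones), len(zeros))  # size of the rarer class
    return sorted(rng.sample(ones, k) + rng.sample(zeros, k))

# Synthetic unbalanced response: 90% zeros and 10% ones.
y = [0] * 900 + [1] * 100
idx = balance_by_undersampling(y)
n_ones = sum(y[i] for i in idx)
print(len(idx), n_ones)
```

In this synthetic case the balanced sample keeps only 200 of the 1000 observations, which illustrates the loss of sample size that balancing entails.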
The statistical implications of unbalanced data in logistic regression are neither well described for nor widely appreciated by applied researchers. Although balancing the data seems to be an accepted practice, the reasons that justify its use are not well explained. The most immediate effect of balancing the data is to greatly reduce the sample size available for fitting purposes, therefore decreasing the precision with which the parameters of the model are estimated. Among the statistical studies on logistic regression and unbalanced data, we highlight the following. Schaefer (1983) and Scott and Wild (1986) pointed out that the maximum likelihood estimates (MLE) of a logit model are biased only for small sample sizes. On the other hand, Xie and Manski (1989) stated that unbalanced data only affect the intercept parameter of a logit model, which is estimated with bias according to Maddala (1992). King and Zeng (2001) argued that all the MLE of the logit parameters are biased. Schaefer (1983) and Firth (1993) proposed corrections for the bias of the MLE of the logistic regression model parameters. McPherson et al. (2004) conducted one of the few related analyses when fitting presence-absence species distribution models in ecology, but focused only on the prediction capabilities of the fitted models. Maggini et al. (2006) assessed the effect of weighting absences when modelling forest communities by generalized additive models. Recently, Komori et al. (2016) indicated that logistic regression suffers from poor predictive performance and proposed an alternative model to improve it. Their approach adds a new parameter to the original structure of a logistic regression model and fits it in a mixed-effects modelling framework; therefore, it becomes a different type of statistical model.
From the above, we can infer that: (a) most of the statistical studies on logistic regression and unbalanced data have focused on the bias of the MLE parameters (a topic that has rarely been taken into account in ecological applications); (b) much less attention has been paid to prediction performance; and (c) no study has dealt with the effects of unbalanced data on the variance of the MLE parameters.
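The statement that unbalanced sampling mainly affects the intercept can be made concrete with a short derivation. As a sketch, assume that ones and zeros are retained in the sample with probabilities $s_1$ and $s_0$ that do not depend on the covariates, and write $\pi(X) = \Pr(Y = 1 \mid X)$ with $\operatorname{logit}[\pi(X)] = X\beta$. The symbols $s_1$, $s_0$, and the selection indicator $S$ are introduced here only for illustration.

$$
\Pr(Y = 1 \mid X, S = 1) = \frac{s_1\,\pi(X)}{s_1\,\pi(X) + s_0\,[1 - \pi(X)]}
\quad\Longrightarrow\quad
\operatorname{logit}\left[\Pr(Y = 1 \mid X, S = 1)\right] = \ln\frac{s_1}{s_0} + X\beta .
$$

Under this assumption only the intercept is offset, by $\ln(s_1/s_0)$. For instance, retaining 500 of 2985 ones and 500 of 8778 zeros would give an offset of $\ln(8778/2985) \approx 1.08$. This describes the large-sample target of the estimator only; it says nothing about finite-sample bias or variance.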
In this study we aimed to assess the effect of using unbalanced data when fitting logistic regression models by analyzing both the statistical properties (i.e., bias and variance) of the estimated parameters and the predictive capabilities of the fitted model.
2. Methods
2.1. Base model
The binary variable (Y) is the occurrence of a phenomenon of interest, where Y = 1 denotes occurrence and Y = 0 otherwise. In a modelling framework, we seek to model the probability of the response variable being Y = 1 given the values of the predictor variables, that is, Pr(Y = 1|X), which we represent more compactly by $\pi_{y|X}$.
We used a logistic regression equation with five predictor variables as the base model for our analysis. This model served as a reference for assessing the statistical effects of unbalanced data on fitting logistic regression models. The binary variable of forest mortality occurrence (Y), following the analyses of Young et al. (2017) in the state of California, USA, is modelled as a function of climate and biotic variables, as follows:
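$$
\ln\left[\frac{\pi_{y_i|x_i}}{1-\pi_{y_i|x_i}}\right] = \operatorname{logit}\left[\Pr(Y_i = 1)\right] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i}, \tag{1}
$$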
where $Y_i$ is the occurrence of forest mortality (i.e., 1 for occurrence, 0 for non-occurrence) at the ith pixel, and the predictor variables $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, and $X_{5i}$ represent, for the ith pixel, respectively: the mean climatic water deficit (CWD), or simply $\mathrm{Defnorm}_i$; the basal area of live trees ($\mathrm{BA}_i$); $\mathrm{BA}_i^2$; the CWD anomaly ($\mathrm{Defz0}_i$); and $\mathrm{Defnorm}_i \times \mathrm{BA}_i$. We have used the nomenclature for the variables as in the study of Young et al. (2017) and only the available data for the year 2012. Notice that model (1) can be represented more compactly as:
$$
\ln\left[\frac{\pi_{y|X}}{1-\pi_{y|X}}\right] = \operatorname{logit}\left[\Pr(y = 1)\right] = X\beta, \tag{2}
$$
where y is the vector with the binary variable, X is the matrix with the predictor variables (and a first column of 1s), and β is the vector of parameters $[\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5]$.
In what follows, we use Eq. (2) as the mean function in various scenarios of unbalanced data. It is important to point out that we are not interested in finding the best model, but rather in studying the effects of several unbalanced data scenarios on a reference model. Furthermore, we are not seeking to assess alternative statistical models for unbalanced data (e.g., as in Warton and Hui, 2011; Hastie and Fithian, 2013). We also note that zero-inflated models focus on modelling count variables (Schabenberger and Pierce, 2002; Zuur et al., 2010), such as the prediction of the amount of tree mortality (e.g., Affleck, 2006). These models are not part of our study, since we are modelling a binomial variable.
2.2. Unbalanced data scenarios
We used data on forest mortality occurrence in California during 2012 from Young et al. (2017) as our population, containing 11763 total observations (N), with 2985 cases of mortality occurrence ($N_1$) and 8778 cases of non-occurrence ($N_0$). In order to assess the effects of unbalanced data on the statistical properties of the logit model (Eq. (2)), we examined different sampling strategies from the population, each with a different proportion of occurrence and non-occurrence of mortality (1 and 0 values, respectively). We fixed the sample size at n = 1000 in all scenarios and varied the number of cases with zeros and ones that the sample should contain, with proportions ranging from 10% to 90% across scenarios. In this way, we constrained the samples to contain different numbers of cases with zeros ($n_0$) and ones ($n_1$), but the same sample size (n = 1000). To achieve each proportion of 0/1 values, with its fixed numbers of zeros and ones ($n_0$ and $n_1$, respectively), we (i) drew a random sample without replacement of size $n_0$ from the sub-population (of size $N_0$) of cases containing zeros in the response variable; (ii) drew a random sample without replacement of size $n_1$ from the sub-population (of size $N_1$) of cases containing ones in the response variable; and (iii) merged the randomly selected $n_0$ and $n_1$ cases into a sample of size n (i.e., $n = n_0 + n_1$).
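Steps (i) to (iii) can be sketched in Python as follows. The population here is a synthetic stand-in that only matches the reported sizes ($N_0$ = 8778, $N_1$ = 2985); the function name, the seed, and the five proportions of zeros (taken from the columns of Table 1) are choices made for this illustration.

```python
import random

def draw_scenario_sample(y, n, prop_zeros, rng):
    """Steps (i)-(iii): draw n0 zeros and n1 ones without replacement from
    their sub-populations and merge them into one sample of size n."""
    n0 = round(n * prop_zeros)
    n1 = n - n0
    zeros = [i for i, v in enumerate(y) if v == 0]
    ones = [i for i, v in enumerate(y) if v == 1]
    return rng.sample(zeros, n0) + rng.sample(ones, n1)

# Synthetic response vector with the population sizes reported in the text;
# the real covariates are not reproduced here.
population_y = [0] * 8778 + [1] * 2985
rng = random.Random(2017)
for prop in (0.1, 0.3, 0.5, 0.7, 0.9):
    sample = draw_scenario_sample(population_y, 1000, prop, rng)
    n1 = sum(population_y[i] for i in sample)
    print(prop, len(sample) - n1, n1)
```

Because each sub-population is sampled separately, the 0/1 proportion of every sample is fixed by design rather than left to chance.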
2.3. Statistical assessment
We assessed the statistical properties of the fitted logistic regression model by stochastic simulation (i.e., Monte Carlo simulation). We carried out S = 100,000 simulations so that the sampling error of the simulation itself is negligibly small. A similar analysis to justify the number of simulations was conducted by Gregoire and Schabenberger (1999), and our choice agrees with the number of simulations used in other statistical simulation studies (e.g., Gregoire and Salas, 2009; Salas and Gregoire, 2010). For each simulated sample, we fitted the logistic regression model (Eq. (2)) by maximum likelihood using the glm function implemented in R (R Development Core Team, 2016).
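The structure of such a simulation can be sketched in Python. This is not the code used in the study, which relied on the glm function in R: as a simplification, the sketch fits a model with a single covariate by Newton-Raphson, uses a synthetic population with assumed parameter values (-1.0 and 1.0), and runs only S = 200 replicates of size n = 200 to keep the run time short.

```python
import math
import random

def fit_logit(x, y, iters=15):
    """ML fit of logit[Pr(Y=1)] = b0 + b1*x by Newton-Raphson (one covariate)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # score for the intercept
            g1 += (yi - p) * xi    # score for the slope
            h00 += w               # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: inverse information
        b1 += (h00 * g1 - h01 * g0) / det   # times the score vector
    return b0, b1

rng = random.Random(1)
# Synthetic population (assumption): one covariate, parameters (-1.0, 1.0).
pop_x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
pop_y = [1 if rng.random() < 1.0 / (1.0 + math.exp(1.0 - xi)) else 0
         for xi in pop_x]
zeros = [i for i, v in enumerate(pop_y) if v == 0]
ones = [i for i, v in enumerate(pop_y) if v == 1]

def simulate(prop_zeros, n=200, S=200):
    """Fit the model to S samples that each contain a fixed share of zeros."""
    n0 = round(n * prop_zeros)
    estimates = []
    for _ in range(S):
        idx = rng.sample(zeros, n0) + rng.sample(ones, n - n0)
        estimates.append(fit_logit([pop_x[i] for i in idx],
                                   [pop_y[i] for i in idx]))
    return estimates

estimates = simulate(0.5)
mean_slope = sum(b1 for _, b1 in estimates) / len(estimates)
print(round(mean_slope, 3))
```

Calling `simulate` with other proportions of zeros yields one set of estimates per scenario, whose empirical distributions can then be compared.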
Based on the simulations, we examined the empirical distribution of the estimated parameters and prediction indexes. Our assessment focused on (a) the statistical properties of the estimated model parameters, and (b) the accuracy of predictions from the fitted model.
(a) Statistical properties of the estimated parameters. In order to assess how the accuracy of the estimated parameters is affected by unbalanced data, we computed the empirical bias ($B_{MC}$) of each estimated parameter, $\hat{\theta}$, as follows:

$$
B_{MC}[\hat{\theta}] = E[\hat{\theta}] - \theta, \tag{3}
$$

where $\theta$ is the respective parameter value and $E[\hat{\theta}]$ is the empirical expected value of the estimated parameter. The former was obtained from the maximum likelihood estimate (MLE) of $\theta$ using the available population, and the latter is approximated by the average of the S values of the estimated parameter $\hat{\theta}$. Notice that $\hat{\theta}$ in Eq. (3) is replaced by each parameter estimate of the model (i.e., $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$, $\hat{\beta}_4$, and $\hat{\beta}_5$).
In order to assess how the precision of the estimated parameters is affected by unbalanced data, we computed the empirical variance ($V_{MC}$) of each estimated parameter $\hat{\theta}$ as follows:

$$
V_{MC}[\hat{\theta}] = \frac{1}{S}\sum_{j=1}^{S}\left(\hat{\theta}_j - E[\hat{\theta}]\right)^2, \tag{4}
$$

where $\hat{\theta}_j$ is the MLE of $\theta$ for the jth simulation. Finally, we computed the empirical mean square error ($ECM_{MC}$) of each $\hat{\theta}$ by:

$$
ECM_{MC}[\hat{\theta}] = V_{MC}(\hat{\theta}) + [B_{MC}(\hat{\theta})]^2. \tag{5}
$$

We expressed the variance and mean square error in the same units as the estimated parameters by taking their square roots, thus obtaining the standard error ($SE[\hat{\theta}]$) and the root mean squared error ($RMSE[\hat{\theta}]$).
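The empirical bias, variance, and mean square error described in Eqs. (3) to (5) can be computed as in the following Python sketch. The function name and the five example estimates are hypothetical; in the study each summary is based on the S simulated estimates of one parameter.

```python
import math

def monte_carlo_summary(estimates, theta):
    """Empirical bias, variance, and mean square error of the estimates of
    theta, plus the standard error and RMSE on the parameter's scale."""
    S = len(estimates)
    expected = sum(estimates) / S                               # E[theta_hat]
    bias = expected - theta                                     # Eq. (3)
    variance = sum((t - expected) ** 2 for t in estimates) / S  # Eq. (4)
    mse = variance + bias ** 2                                  # Eq. (5)
    return {"bias": bias, "variance": variance, "mse": mse,
            "se": math.sqrt(variance), "rmse": math.sqrt(mse)}

# Hypothetical estimates of one parameter whose reference value is 2.0.
summary = monte_carlo_summary([1.8, 2.1, 2.4, 2.3, 1.9], theta=2.0)
print(summary)
```

Dividing the bias, standard error, and RMSE by the reference parameter value gives the percentage scale used when comparing parameters of different magnitude.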
(b) Prediction capabilities. For each simulation and unbalanced data scenario we computed prediction indexes of the logistic regression model. To do so, we calculated the predicted probability of mortality occurrence for the ith observation ($\hat{\pi}_{y_i=1|X_i}$) as follows:

$$
\hat{\pi}_{y_i=1|X_i} = \frac{1}{1 + e^{-X_i\hat{\beta}}}, \tag{6}
$$
where $X_i$ is the vector of predictor variables for the ith case and $\hat{\beta}$ is the vector of estimated parameters. We used a probability threshold of 0.5 for occurrence; that is, if $\hat{\pi}_{y_i=1|X_i} \geq 0.5$ we assume that the event occurs, and that it does not occur otherwise (Jones et al., 2010). Based on these predicted probabilities, we computed the following eight prediction indexes: commission error (proportion of n cases in which the model erroneously predicts occurrence); commission accuracy (proportion of n cases in which the model correctly predicts occurrence); omission error (proportion of n cases in which the model erroneously predicts non-occurrence); omission accuracy (proportion of n cases in which the model correctly predicts non-occurrence); sensitivity (proportion of the total cases of occurrence where the model correctly predicts occurrence); and specificity (proportion of the total cases of non-occurrence where the model correctly predicts non-occurrence).
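The six indexes defined above can be obtained from the four cells of a confusion matrix, as in this Python sketch. The function name and the toy data are hypothetical, the indexes follow the verbal definitions given in the text, and treating a probability exactly equal to the threshold as occurrence is an assumption.

```python
def prediction_indexes(y_true, prob, threshold=0.5):
    """Classify with a probability threshold and return the six indexes
    defined in the text, each as a proportion."""
    pred = [1 if p >= threshold else 0 for p in prob]
    pairs = list(zip(y_true, pred))
    n = len(pairs)
    tp = sum(1 for y, d in pairs if y == 1 and d == 1)  # occurrence, predicted
    fn = sum(1 for y, d in pairs if y == 1 and d == 0)  # occurrence, missed
    fp = sum(1 for y, d in pairs if y == 0 and d == 1)  # false occurrence
    tn = sum(1 for y, d in pairs if y == 0 and d == 0)  # non-occurrence, predicted
    return {
        "commission_error": fp / n,
        "commission_accuracy": tp / n,
        "omission_error": fn / n,
        "omission_accuracy": tn / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy data: 3 occurrences and 7 non-occurrences with made-up probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
prob = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.45, 0.05, 0.3]
indexes = prediction_indexes(y_true, prob)
print(indexes)
```

The first four indexes share the denominator n and therefore sum to one, whereas sensitivity and specificity are conditional on the observed class.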
We also carried out all the above analyses (i.e., simulation and assessment of statistical properties) for a different dataset. We used data on forest fire occurrence in central Chile to show how our findings could change in a forest fire model; the main results are shown in the Supplementary Material.
3. Results
The proportion of 0/1 in the data used for fitting a logistic regression model affects the distribution of the estimated parameters. The variability of the estimated parameters tends to increase with an extreme proportion of zeros (or ones) in the data (Fig. 1).
Unbalanced data affect the bias of the estimated parameters. All the parameter estimates were nearly unbiased for the assessed proportion of zeros that is closest to the proportion of zeros in the entire population (first row of panels in Fig. 1). However, for the other unbalanced data scenarios, all parameters are estimated with bias (Fig. 1). The bias increases as the proportion of zeros in the data decreases, both in nominal units (Fig. 1) and in percentage (Fig. 2a). The bias is larger for the estimated intercept parameter than for the other parameters, regardless of the unbalanced data scenario. The only exception to this trend is the estimate of the parameter $\beta_2$, which is also heavily biased, possibly as a result of its higher variability compared to the other parameter estimates (Fig. 2b). More importantly, the greatest precision of all estimated parameters occurs with balanced data (Fig. 2b), as does the lowest root mean squared error (Fig. 2c). The greater precision of the estimated parameters in the balanced-data scenario was even more pronounced for the forest fire model (Fig. 4). This may result from a stronger relationship between the response and the predictor variables than we found in the forest mortality model. Besides, the forest fire model (Eq. (7) in Supplementary Material) has fewer parameters, so multicollinearity should be less of a problem than in a model with two more parameters (Eq. (1)). In fact, the mortality model has two parameters associated with functions of variables already present in the model (i.e., $\mathrm{BA}_i^2$ and $\mathrm{Defnorm}_i \times \mathrm{BA}_i$); therefore, the model is affected by multicollinearity.
The prediction capabilities of the logit model are greatly affected by the different proportions of zeros and ones. Both overall error (i.e., the sum of omission and commission errors) and overall accuracy (i.e., the sum of omission and commission accuracy) tend to improve, decreasing and increasing, respectively, when extreme proportions of zeros (or ones) are used for fitting the model (Table 1). Moreover, the larger the proportion of zeros in the data, the better the prediction of non-occurrence (i.e., higher values of omission accuracy). A similar, but not completely linear, trend is found when the omission errors are used as reference. In contrast, the larger the proportion of ones in the data, the better the prediction of occurrence (i.e., higher values of commission accuracy). A similar trend is found when the commission errors are used as reference (Table 1). A clear pattern is observed if specificity or sensitivity is used as reference: specificity increases with a higher number of zeros, whereas sensitivity decreases as the number of zeros increases (Table 1).
4. Discussion
In this paper we demonstrated that the unbalanced proportion of zeros and ones commonly found in ecological data affects the statistical properties of fitted logistic regression models. Because the variances of the estimated parameters are affected by the proportion of 0/1 data, all the statistical inference (e.g., hypothesis testing) of the fitted model will be affected. Thus, if we are investigating the driver variables of an ecological phenomenon, such as species distribution across a geographic area, we could determine them erroneously, because the statistical significance of each parameter of the model is based upon its respective variance estimator. Therefore, the practice of balancing data must be carried out with caution, fully considering its implications for model performance. Some authors have argued that there is no major effect in having unbalanced binary data (except for the bias in the intercept parameter; Maddala, 1992), but our results indicate that all statistical properties of the MLE parameters are affected. Although all the parameter estimates are biased, the magnitude of the bias diminishes as the sample mimics the proportion of zeros found in the population (see the crossed lines in Fig. 2a). Notice that it has previously been stated that all the parameters would be biased for small sample sizes (Schaefer, 1983; King and Zeng, 2001), but that was not necessarily the case in the present study (where n = 1000).
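The dependence of statistical significance on the variance estimator can be illustrated with a Wald z-test, the usual large-sample test for a single logistic regression coefficient. The estimate and the two standard errors in this Python sketch are hypothetical values, not results from this study.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value of the Wald z-test of H0: parameter = 0."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

# Same hypothetical estimate under two hypothetical standard errors.
beta_hat = 0.5
for se in (0.2, 0.3):
    print(se, round(wald_p_value(beta_hat, se), 4))
```

With the smaller standard error the coefficient would be declared significant at the 5% level, and with the larger one it would not, even though the point estimate is identical.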
We also claim that the prediction capabilities of the logistic regression model are affected, as was also found by McPherson et al. (2004) and Maggini et al. (2006), although using slightly different statistical models. Thus, a given ecological binary phenomenon could be erroneously predicted to occur (or not) if the fitted model suffers from statistical issues derived from using unbalanced data. This is especially critical for predicting habitat suitability for endangered species (and their conservation) or for predicting the distribution range of exotic invasive species and their subsequent control plans. In either case it can result in allocating efforts and resources inefficiently. We encourage researchers to carefully examine the nature of the data they have available, and its 0/1 proportion, before fitting the logistic regression model, as this can greatly improve both the statistical properties of the estimated parameters of the model and the prediction capabilities applied to ecological phenomena.

Fig. 1. Empirical distribution of estimated parameters for the forest mortality model (Eq. (1)) given different scenarios of zeros in the data. The vertical solid line and the vertical dashed line within each histogram represent the parameter value and the Monte Carlo expected value of the estimated parameter, respectively.
We did not focus on analyzing alternatives for overcoming the effects of unbalanced data nor on finding the best fitted model. Some studies dealing with models in ecology have shown the need to effectively correct biased analyses for better interpretation and prediction capabilities (Lajeunesse, 2015; Ruffault et al., 2014), but we focused on pointing out the effects of unbalanced data when fitting logistic regression models. The bias in the intercept of a logistic model could be diminished by using the correction given by Manski and Lerman (1977), but this type of correction is better suited to disciplines where a higher proportion of ones in the population is more commonly found or sampled (e.g., social, economic, and political sciences) than to ecological populations.
The statistical inference of fitted logistic regression models is affected by the unbalanced nature of ecological data. Our results show that the largest standard error and root mean squared error of the estimated parameters are found with extreme proportions of zeros (or ones) in the data. More importantly, for the first time in the literature, as far as we are aware, we described that the variability of the maximum likelihood estimates (MLE) of the parameters decreases with a balanced sample. This finding may suggest that balancing data is an appropriate practice if statistical inference (e.g., hypothesis testing) is the researcher's main concern. Hence, by using unbalanced data, we might conclude that a predictor variable is statistically significant when in fact it is not, or vice versa. Furthermore, this finding refutes the claim of King and Zeng (2001) that adding ones to the data would decrease the variance of the MLE parameters.
We also found that unbalanced data heavily affect the prediction capabilities of a logistic regression model. Our study shows that the occurrence of the event is better predicted with larger proportions of ones in the data, whereas non-occurrence is better predicted with larger proportions of zeros. This trend is expected because the model is fitted by ML, where the parameter estimates are those that maximize the likelihood of the data at hand, and the predictions should therefore agree with those data (Schabenberger and Pierce, 2002). Also, if we take into account the trade-off of building a model that predicts occurrence and non-occurrence as well as possible, the balanced data scenario with 50% of zeros and ones offers a suitable way to proceed (Table 1). Overall, balancing the data seems to be an appropriate practice to improve some statistical properties and prediction capabilities of the fitted model. Regardless of whether the data are balanced before fitting a logistic regression model, we recommend using the remaining sample (i.e., the data not used for fitting the model) for validation purposes and behavior analyses.
5. Recommendations
Given that the proportion of 0/1 data affects the variance of the estimated parameters of the fitted logistic regression model, the selection of the statistically significant predictor variables may also be influenced, ultimately leading to a wrong conclusion. Our study provides evidence that using a balanced data scenario (i.e., 50% of zeros and ones) yields smaller variances for the maximum likelihood estimates of the parameters, therefore offering less uncertainty in the estimation process and, ultimately, in identifying the driver variables for modelling presence/absence response variables. This finding is highly relevant in ecological applications, as a considerable number of studies currently deal with niche modelling and species distribution based on presence/absence data, especially within the climate change context. Thus, we recommend that, when modelling binary response variables, researchers can safely use balanced datasets for fitting candidate models in order to choose the best model given the variables used for the analysis. Given our results, this procedure makes the analysis more certain, because the researcher can better distinguish between the effects of the predictor variables included in the model and whether the ecological phenomenon is really important (McPherson et al., 2004). Another issue to take into account is the drastic reduction of the sample for balancing purposes. We recommend using the remaining data (i.e., those not used for fitting purposes) for assessing the prediction capabilities of the models (using the indices and plots recommended by Jones et al., 2010), and assessing the models' behavior by plotting the prediction of the response variable as a function of the predictor variable(s).

Fig. 2. Statistical properties of the estimated parameters for the forest mortality model (Eq. (1)), given different scenarios of zeros in the data. (a) Bias, (b) standard error, and (c) root mean squared error are shown as a percentage of the real parameter value.

Table 1
Prediction indexes of the logistic regression model depending upon unbalanced data scenarios. Each value is the empirical expected value of the respective index.

                              Proportion of zeros in the sample
Index                         10%      30%      50%      70%      90%
Commission error (%)          0.41     4.44    13.09    25.12     9.99
Commission accuracy (%)      89.59    65.56    36.91     4.88     0.01
Omission error (%)            9.54    21.33    21.52     4.54     0.01
Omission accuracy (%)         0.46     8.67    28.48    65.46    89.99
Sensitivity (%)              99.54    93.65    73.81    16.28     0.02
Specificity (%)               4.58    28.91    56.95    93.52    99.91
Regarding the interpretation of predicted outcomes, we recommend not
extrapolating the model results into areas where predictor variables
were not measured. This implies that the presence/absence of a given
organism could be altered within certain ranges. In cases where
extrapolation is indeed necessary, researchers should distinguish
predictions for areas where no data were collected from predictions
within the range of the observed data, using, for instance,
color-coded results. This distinction will be especially advantageous
for modelling any ecological phenomenon that is a function of
spatially-recorded predictor variable(s).
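One simple way to separate the two kinds of prediction described above is to compare each new predictor value against the range observed in the fitting data. The sketch below assumes a single predictor; the function name `flag_extrapolation`, the labels, the example values and the color names are all hypothetical choices for illustration.

```python
def flag_extrapolation(observed_x, new_x):
    """Label each new predictor value by whether it lies inside the observed range."""
    lo, hi = min(observed_x), max(observed_x)
    return ["within range" if lo <= x <= hi else "extrapolated" for x in new_x]


# Hypothetical plotting colors for the two kinds of prediction.
COLORS = {"within range": "blue", "extrapolated": "orange"}

observed_x = [12.0, 18.5, 23.1, 30.4]   # predictor values used for fitting
new_x = [10.0, 15.0, 30.4, 35.2]        # predictor values to predict at

labels = flag_extrapolation(observed_x, new_x)
colors = [COLORS[label] for label in labels]
print(labels)
# ['extrapolated', 'within range', 'within range', 'extrapolated']
```

With several predictors, the same check could be applied to each variable separately, although a per-variable range does not detect unusual combinations of values.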
6. Concluding remarks
The proportion of zeros and ones in a dataset affects the statistical
inference and prediction capabilities of a fitted logistic regression
model. Unbalanced data affect not only the accuracy of the estimated
parameters but also their precision. More importantly, the statistical
inference (e.g., hypothesis testing) is influenced by the proportion
of zeros and ones in the data. In addition, the prediction capabilities
of the fitted logistic regression model are affected as well; therefore,
model performance depends greatly on the proportion of 0/1 data.
Overall, the 0/1 proportion might affect the conclusions that can arise
from the fitted model and its further application. Since unbalanced data
are fairly common in ecology, this can have great implications for
building models of the many ecological phenomena that scientists study.
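The dependence of parameter variability on the 0/1 proportion can be explored with a small Monte Carlo sketch. Everything in it is an assumption made for illustration rather than the study's simulation design: a hypothetical population with one predictor (true intercept -1, slope 1), samples of 200 observations, 100 repetitions per scenario, and two scenarios (5% ones and 50% ones). The function names are invented for the example.

```python
import math
import random
import statistics


def sigmoid(z):
    # Clip the linear predictor so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Maximum-likelihood fit of P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def slope_variance(pop_ones, pop_zeros, n, prop_ones, reps, rng):
    """Empirical variance of the fitted slope over samples with a fixed share of ones."""
    n1 = round(n * prop_ones)
    slopes = []
    for _ in range(reps):
        xs = rng.sample(pop_ones, n1) + rng.sample(pop_zeros, n - n1)
        ys = [1] * n1 + [0] * (n - n1)
        slopes.append(fit_logistic(xs, ys)[1])
    return statistics.variance(slopes)


rng = random.Random(1)
# Hypothetical population: P(y=1|x) = sigmoid(-1 + x), with x standard normal.
pop_x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
pop_y = [1 if rng.random() < sigmoid(-1.0 + x) else 0 for x in pop_x]
pop_ones = [x for x, y in zip(pop_x, pop_y) if y == 1]
pop_zeros = [x for x, y in zip(pop_x, pop_y) if y == 0]

results = {
    p: slope_variance(pop_ones, pop_zeros, n=200, prop_ones=p, reps=100, rng=rng)
    for p in (0.05, 0.5)
}
for p, v in results.items():
    print(f"proportion of ones = {p:.2f}: variance of slope estimates = {v:.4f}")
```

Adding further proportions to the tuple `(0.05, 0.5)` shows how the variability of the slope estimate changes as the sample moves toward or away from balance.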
Acknowledgements
This study was supported by the Chilean research grant Fondecyt
No. 1151495. AFR is supported by a Postdoctoral Scholarship from
Vicerrectoría de Investigación y Postgrado, Universidad de La Frontera,
Temuco, Chile.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the
online version, at http://dx.doi.org/10.1016/j.ecolind.2017.10.030.
References
Affleck, D.L.R., 2006. Poisson mixture models for regression analysis of stand level
mortality. Can. J. For. Res. 36, 2994–3006.
Alberini, A., 1995. Testing willingness-to-pay models of discrete-choice contingent
valuation survey data. Land Econ. 71 (1), 83–95.
Arana, J.E., Leon, C.J., 2005. Flexible mixture distribution modeling of dichotomous
choice contingent valuation with heterogenity. J. Environ. Econ. Manage. 50 (1),
170–188.
Bastin, L., Thomas, C.D., 1999. The distribution of plant species in urban vegetation
fragments. Landsc. Ecol. 14 (5), 493–507.
Bell, C.D., Roberts, R.K., English, B.C., Park, W.M., 1994. A logit analysis of participation
in Tennessee's forest stewardship program. J. Agric. Appl. Econ. 26 (2), 463–472.
Bigler, C., Kulakowski, D., Veblen, T.T., 2005. Multiple disturbance interactions and
drought influence fire severity in Rocky mountain subalpine forests. Ecology 86 (11),
3018–3029.
Bradstock, R.A., Hammill, K.A., Collins, L., Price, O., 2010. Effects of weather, fuel and
terrain on fire severity in topographically diverse landscapes of south-eastern
Australia. Landsc. Ecol. 25 (4), 607–619.
Brook, B.W., Bowman, D.M.J., 2006. Postcards from the past: charting the
landscape-scale conversion of tropical Australian savanna to closed forest during
the 20th century. Landsc. Ecol. 21, 1253–1266.
Cailleret, M., Bigler, C., Bugmann, H., Camarero, J.J., Cufar, K., Davi, H., Meszaros, I.,
Minunno, F., Peltoniemi, M., Robert, E.M.R., Suarez, M.L., Tognetti, R.,
Martinez-Vilalta, J., 2016. Towards a common methodology for developing logistic tree
mortality models based on ring-width data. Ecol. Appl. 26 (6), 1827–1841.
Chao, K.-J., Phillips, O.L., Monteagudo, A., Torres-Lezama, A., Vásquez, R., 2009. How do
trees die? Mode of death in northern Amazonia. J. Veg. Sci. 20, 260–268.
Davies, S.J., 2001. Tree mortality and growth in 11 sympatric Macaranga species in
Borneo. Ecology 82 (4), 920–932.
Dickson, B.G., Prather, J.W., Xu, Y.G., Hampton, H.M., Aumack, E.N., Sisk, T.D., 2006.
Mapping the probability of large fire occurrence in Northern Arizona. Landsc. Ecol.
21 (2), 747–761.
Eastman, J.R., 2006. Idrisi 15 andes, guide to GIS and Image Processing. Clark University,
Worcester, MA, USA.
Echeverria, C., Coomes, D.A., Newton, M.H.A.C., 2008. Spatially explicit models to
analyze forest loss and fragmentation between 1976 and 2020 in southern Chile.
Ecol. Model. 212, 439–449.
Firth, D., 1993. Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.
Gregoire, T.G., Salas, C., 2009. Ratio estimation with measurement error in the auxiliary
variate. Biometrics 65 (2), 590–598.
Gregoire, T.G., Schabenberger, O., 1999. Sampling-skewed biological populations:
behavior of confidence intervals for the population total. Ecology 80 (3), 1056–1065.
Hastie, T., Fithian, W., 2013. Inference from presence-only data; the ongoing controversy.
Ecography 36, 864–867.
Hu, X., Wu, C., Hong, W., Qiu, R., Li, J., Hong, T., 2014. Forest cover change and its
drivers in the upstream area of the Minjiang River, China. Ecol. Indic. 46, 121–128.
Jones, C.C., Acker, S.A., Halpern, C.B., 2010. Combining local- and large-scale models to
predict the distributions of invasive plant species. Ecol. Appl. 20 (2), 311–326.
King, G., Zeng, L., 2001. Logistic regression in rare events data. Polit. Anal. 9 (2),
137–163.
Komori, O., Eguchi, S., Ikeda, S., Okamura, H., Ichinokawa, M., Nakayama, S., 2016. An
asymmetric logistic regression model for ecological data. Methods Ecol. Evol. 7,
249–260.
Kumar, R., Nandy, S., Agarwal, R., Kushwaha, S.P.S., 2014. Forest cover dynamics
analysis and prediction modeling using logistic regression model. Ecol. Indic. 45,
444–455.
Lajeunesse, M.J., 2015. Bias and correction for the log response ratio in ecological
meta-analysis. Ecology 96 (8), 2056–2063.
Lander, T.A., Bebber, D.P., Choy, C.T., Harris, S.A., Boshier, D.H., 2011. The circe
principle explains how resource-rich land can waylay pollinators in fragmented
landscapes. Curr. Biol. 21, 1302–1307.
Leyk, S., Zimmermann, N.E., 2007. Improving land change detection based on uncertain
survey maps using fuzzy sets. Landsc. Ecol. 22 (2), 257–272.
Lindsey, J.K., 1997. Applying Generalized Linear Models. Springer, New York, USA, pp.
256.
Lloret, F., Calvo, E., Pons, X., Diaz-Delgado, R., 2002. Wildfires and landscape patterns in
the eastern Iberian peninsula. Landsc. Ecol. 17 (8), 745–759.
Maddala, G.S., 1992. Introduction to Econometrics, 2nd ed. Macmillan Publishing
Company, New York, NY, USA, pp. 631.
Maggini, R., Lehmann, A., Zimmermann, N.E., Guisan, A., 2006. Improving generalized
regression analysis for the spatial prediction of forest communities. J. Biogeogr. 33
(10), 1729–1749.
Manski, C.F., Lerman, S.R., 1977. The estimation of choice probabilities from choice
based samples. Econometrica 45 (8), 1977–1988.
McPherson, J.M., Jetz, W., Rogers, D.J., 2004. The effects of species range sizes on the
accuracy of distribution models: ecological phenomenon or statistical artefact? J.
Appl. Ecol. 41 (5), 811–823.
Mermoz, M., Kitzberger, T., Veblen, T.T., 2005. Landscape influences on occurrence and
spread of wildfires in Patagonian forests and shrublands. Ecology 86 (10),
2705–2715.
Palma, C., Cui, W., Martell, D., Robak, D., Weintraub, A., 2007. Assessing the impact of
stand-level harvests on the flammability of forest landscapes. Int. J. Wildl. Fire 16 (5),
584–592.
Phillips, S.J., Elith, J., 2013. On estimating probability of presence from use-availability
or presence-background data. Ecology 94 (6), 1409–1419.
Qi, Y., Wu, J., 1996. Effects of changing spatial resolution on the results of landscape
pattern analysis using spatial autocorrelation indices. Landsc. Ecol. 11 (1), 39–49.
R Development Core Team, 2016. R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria,
http://www.R-project.org.
Rueda, X., 2010. Understanding deforestation in the southern Yucatán: insights from a
sub-regional, multi-temporal analysis. Reg. Environ. Change 10 (3), 175–189.
Ruffault, J., Martin-StPaul, N.K., Duffet, C., Goge, F., Mouillot, F., 2014. Projecting future
drought in mediterranean forests: bias correction of climate models matters! Theor.
Appl. Climatol. 117 (1–2), 113–122.
Salas, C., Gregoire, T.G., 2010. Statistical analysis of ratio estimators and their estimators
of variances when the auxiliary variate is measured with error. Eur. J. For. Res. 129
(5), 847–861.
Schabenberger, O., Pierce, F.J., 2002. Contemporary Statistical Models for the Plant and
Soil Sciences. CRC Press, Boca Raton, FL, USA, pp. 738.
Schaefer, R.L., 1983. Bias correction in maximum likelihood logistic regression model.
Stat. Med. 2, 71–78.
Schulz, J.J., Cayuela, L., Rey-Benayas, J.M., Schröder, B., 2011. Factors influencing
vegetation cover change in Mediterranean Central Chile (1975–2008). Appl. Veg. Sci.
14, 571–582.
Scott, A.J., Wild, C.J., 1986. Fitting logistic models under case–control or choice based
sampling. J. R. Stat. Soc. B 78 (2), 170–182.
Seto, K.C., Kaufmann, R.F., 2005. Using logit models to classify land cover and land-cover
change from Landsat Thematic Mapper. Int. J. Rem. Sens. 25 (3), 563–577.
Vega-García, C., Chuvieco, E., 2006. Applying local measures of spatial heterogeneity to
Landsat-TM images for predicting wildfire occurrence in Mediterranean landscapes.
Landsc. Ecol. 21, 596–605.
Vega-García, C., Woodard, P., Titus, S., Adamowicz, W., Lee, B., 1995. A logit model for
predicting the daily occurrence of human caused forest fires. Int. J. Wildl. Fire 5 (2),
101–111.
Vega-García, C., Woodard, P.M., Lee, B.S., Adamowicz, W.L., Titus, S.J., 1999. Dos
modelos para la predicción de incendios forestales en Whitecourt Forest, Canadá.
Investigación Agraria: Sistemas y Recursos Forestales 8 (1), 5–24.
Warton, D.I., Hui, F.K.C., 2011. The arcsine is asinine: the analysis of proportions in
ecology. Ecology 92 (1), 3–10.
Wilson, K., Newton, A., Echeverría, C., Weston, C., Burgman, M., 2005. A vulnerability
analysis of the temperate forests of south central Chile. Biol. Conserv. 122, 9–21.
Wu, J., Gao, W., Tueller, P.T., 1997. Effects of changing spatial scale on the results of
statistical analysis with landscape data: a case study. Geogr. Inf. Sci. 3 (1-2), 30–41.
Wunder, J., Reineking, B., Bigler, C., Bugmann, H., 2008. Predicting tree mortality from
growth data: how virtual ecologists can help real ecologists. J. Ecol. 96 (1), 174–187.
Xie, Y., Manski, C.F., 1989. The logit model and response-based samples. Sociol. Methods
Res. 17 (3), 283–302.
Young, D.J.N., Stevens, J.T., Earles, J.M., Moore, J., Ellis, A., Jirka, A.L., Latimer, A.M.,
2017. Long-term climate and competition explain forest mortality patterns under
extreme drought. Ecol. Lett. 20 (1), 78–86.
Zuur, A.F., Ieno, E.N., Elphick, C.S., 2010. A protocol for data exploration to avoid
common statistical problems. Methods Ecol. Evol. 1 (1), 3–14.
... Shearman et al. (2019) applied logistic regression to examine tree mortality model performance by differentiating the proportion of the response classes in the sample by simulations and proved that increasing class imbalance had a crucial effect on model performance, showing low sensitivity and high specificity for low mortality rates and vice versa. Salas-Eljatib et al. (2018) used stochastic simulation (Monte Carlo simulation) to draw a random sample of the survival status (alive or dead) with a specific proportion of the response classes and predicted tree mortality following fires. In the imbalanced data scenario, all parameters in the model were biased, which increased as the proportion of either the alive or dead class deviated away from the middle (50-50). ...
... We applied a 60:40 training set to the test set split ratio, and empirical (nonparametric) bootstrapping, which is one of the tools frequently used to estimate measures of uncertainty in parameters associated with a given statistical method, was applied to both original (class-imbalanced) and class-balanced datasets to examine the effect of class imbalance. For each bootstrap sample, a multiple logistic regression was developed to predict the probability of post-fire tree mortality by maximum likelihood estimation (MLE) using the glm function in R. To make the sampling error marginally small, B = 100,000 repetitions were conducted, in agreement with the number of simulations used in the study by Salas-Eljatib et al. (2018). Six predictors (Slope, DBH, Height, BSI, Aspect, and CrownRatio), were determined to have significant effects on tree mortality based on the construction of univariate models with each variable in the preliminary study and were used to build models. ...
... That is, the estimated parameters are the most likely to reflect the data (Schabenberger and Pierce 2001). In addition, Salas-Eljatib et al. (2018) illustrated the effect of class imbalance using empirical bootstrap regression by differentiating the proportion of non-occurrences (coded as zero) and revealed the predictive capabilities of each coefficient in the fitted model. They noted that the variance in the MLE parameters decreased as the class discrepancy decreased, resulting in the lowest RMSE. ...
Article
Full-text available
We aimed to tackle a common problem in post-fire tree mortality where the number of trees that survived surpasses the number of dead trees. Here, we investigated the factors that affect Korean red pine ( Pinus densiflora Siebold & Zucc.) tree mortality following fires and assessed the statistical effects of class-balancing methods when fitting logistic regression models for predicting tree mortality using empirical bootstrapping ( B = 100,000). We found that Slope , Aspect , Height , and Crown Ratio potentially impacted tree mortality, whereas the bark scorch index ( BSI ) and diameter at breast height ( DBH ) significantly affected tree mortality when fitting a logistic regression with the original dataset. The same variables included in the fitted logistic regression model were observed using the class-balancing regimes. Unlike the imbalanced scenario, lower variabilities of the estimated parameters in the logistic models were found in balanced data. In addition, class-balancing scenarios increased the prediction capabilities, showing reduced root mean squared error (RMSE) and improved model accuracy. However, we observed various levels of effectiveness of the class-balancing scenarios on our post-fire tree mortality data. We still suggest a thorough investigation of the minority class, but class-balancing scenarios, especially oversampling strategies, are appropriate for developing parsimonious models to predict tree mortality following fires.
... An issue which frequently prevents the calibration of logistic regression models, and hence obstructs conclusions regarding variable importance, is that of imbalanced (also known as unbalanced) data sets, especially with regards to discrete classification problems (2)(3)(4)(5); we note that methods to classify continuous target variables in imbalanced data sets are also being developed Yang et al. (6). Imbalanced data refers to a highly skewed data set, which for simplicity we assume can be partitioned into majority (non-event) and minority (event) classes with respect to the classifier label (i.e. ...
... We note that despite significant research in cancer statistics, minimal work exists addressing the specific problem of imbalanced data in cancer data sets. Studies in fields such as ecology and credit scoring have shown that dealing with imbalanced data prior to fitting a model can improve model predictions for response variables (4,5,8). ...
Article
Full-text available
Imbalanced data, a common challenge encountered in statistical analyses of clinical trial datasets and disease modeling, refers to the scenario where one class significantly outnumbers the other in a binary classification problem. This imbalance can lead to biased model performance, favoring the majority class, and affecting the understanding of the relative importance of predictive variables. Despite its prevalence, the existing literature lacks comprehensive studies that elucidate methodologies to handle imbalanced data effectively. In this study, we discuss the binary logistic model and its limitations when dealing with imbalanced data, as model performance tends to be biased towards the majority class. We propose a novel approach to addressing imbalanced data and apply it to publicly available data from the VITAL trial, a large-scale clinical trial that examines the effects of vitamin D and Omega-3 fatty acid to investigate the relationship between vitamin D and cancer incidence in sub-populations based on race/ethnicity and demographic factors such as body mass index (BMI), age, and sex. Our results demonstrate a significant improvement in model performance after our undersampling method is applied to the data set with respect to cancer incidence prediction. Both epidemiological and laboratory studies have suggested that vitamin D may lower the occurrence and death rate of cancer, but inconsistent and conflicting findings have been reported due to the difficulty of conducting large-scale clinical trials. We also utilize logistic regression within each ethnic sub-population to determine the impact of demographic factors on cancer incidence, with a particular focus on the role of vitamin D. This study provides a framework for using classification models to understand relative variable importance when dealing with imbalanced data.
... These authors noted that even an EPV of 20 may still sometimes offer rather low statistical power and therefore, the original proposed threshold may not be sufficient to judge about the reliability of parameter (β) estimates. Salas-Eljatib et al. (2018) also reported that balanced binary data (i.e., a ratio of 0's and 1's close to unity) have a better sensitivity and specificity, yielding the lowest variability in β estimates when compared to unbalanced data. Their findings are likely transposable to the confidence or credible interval (hereafter CI) of the π's obtained, as well as the level of L 50 uncertainty. ...
... Basilone et al. (2021) reported similar findings in the European sardine (Sardina pilchardus) in which a low proportion of immature individuals sometimes resulted in wider bootstrapped CIs. Although these combined results are of analytical interest and corroborate earlier findings by Salas-Eljatib et al. (2018) about the benefits of balanced binary data in logistic regressions, a fisheries biologist has little control in practice over the composition of a randomly-selected sample of fish from which a L 50 estimate is derived. Attempting to reach more balanced data by gradually discarding the smallest fish with undeveloped gonads when they are over-represented for instance does not appear to constitute a way to circumvent such analytical situation, as most of the time this will not cause the L 50 estimate to vary by much since the inflexion point will mostly remain the same, but the associated uncertainty will increase as a result of progressively decreasing n. ...
Article
In fisheries management, the L50 is a commonly-monitored reproductive trait reflecting the sex-specific length (L) at which 50% of the fish are expected to exhibit developing gonads near the spawning period. Although different methods exist to estimate the L50, it is often obtained from a binomial generalized linear model (GLM) fitted with a logit link. To provide reliable parameter estimates for the production of the ogive from which the L50 is obtained, the logit link must be correctly specified, a prerequisite that is not always verified nor true. To identify the most-likely ogive, adequacy tests allowing to detect a lack-of-fit or any link misspecification should ideally be applied to competing binomial models that also consider alternative links, such as the probit and complementary log-log (cloglog). Because multiple estimates are often compared, their uncertainties should also be adequately quantified to detect any potential change, but no consensus exists about the method offering the best coverage probability. Here, we used a large-scale dataset of 23,681 walleye (Sander vitreus) females sampled in 90 harvested populations over two decades in Québec, Canada, to first estimate the L50 uncertainty under different simulation scenarios to identify the most accurate method regarding nominal coverage. Then, of 46 populations sampled in two different surveys, we assessed which link provided the best adequate estimation for each survey-specific L50 and used associated 84% CIs to determine if a change occurred based on whether they overlapped. The influence of gonad data characteristics on L50 uncertainty was also examined before assessing between-survey L50 variation with competing GLMs as an alternative approach. The analysis of the simulated data indicated that the Monte Carlo approach with bias-corrected and accelerated (BCa) interval offered the best coverage probability. 
The empirical analyses revealed that the probit was the most often adequate link to estimate the L50, followed by the cloglog, whereas the logit ranked last. When controlling for sample size, the level of L50 uncertainty increased with the percentage of overlap in length between walleye females with (1) and without (0) developing gonads, whereas it progressively decreased as the percentage of 1’s in the sample increased to then inversed, indicating that more balanced binary data yielded less L50 uncertainty. All analyses revealed that for a “true” or estimated effect size of ≤ 10%, the survey-specific 84% CIs often overlapped despite statistical support for a difference found in the best-supported GLM comparing both surveys. Overall, our study indicates that many analytical considerations, from gonad data structure characteristics, model adequacy assessment, up to how the L50 uncertainty is quantified and compared, need to be accounted for to achieve more reliable statistical inferences regarding the observed variation in this reproductive trait.
... Using a threshold in the form of a minimum volume of dead trees, on the other hand, could result in the mortality risk being overestimated in stands with the highest volume, i.e. older stands and those growing on more fertile sites. Based on the literature review (Woolley et al., 2012;Alenius et al., 2003;Salas-Eljatib et al., 2018), we concluded that a logistic model approach would be most appropriate for our study. However, the referred studies were mostly focused on modelling the mortality of individual trees. ...
... There is no consensus in the scientific literature on the optimal approach for developing predictive models using the logistic regression approach in cases of severe class imbalance (Crone and Finlay, 2012;Salas-Eljatib et al., 2018). We, therefore, compared two approaches: i) fitting the model with all available observations from 47,450 stands with the class imbalance (44,903 standsclass 0, 2547 -class 1) and ii) fitting the model with random subsampling of class 0 (no mortality) to 2547 observations so that the training data are balanced (50%/50%, Supplementary Fig. 3). ...
Article
Full-text available
Warmer and drier conditions increase forest mortality worldwide. At the same time, nitrogen deposition, longer growing seasons and higher atmospheric CO 2 concentrations may increase site productivity accelerating forest growth. However, tree physiological studies suggest that increased site productivity can also have adverse effects , reducing adaptation to drought. Understanding such intricate interactions that might foster tree mortality is essential for designing activities and policies aimed at preserving forests and the ecosystem services they provide. This study shows how site factors and stand features affect the susceptibility of Scots pine to drought-induced stand-level mortality. We use extensive forest data covering 750,000 ha, including 47,450 managed Scots pine stands, of which 2,547 were affected by mortality during the drought in 2015-2019. We found that the oldest and most dense stands growing on the most productive sites showed the highest susceptibility to enhanced mortality during drought. Our findings suggest that increasing site productivity may accelerate the intensity and prevalence of drought-induced forest mortality. Therefore, climate change may increase mortality, particularly in old and high-productive forests. Such exacerbated susceptibility to mortality should be considered in forest carbon sink projections, forest management, and policies designed to increase resilience and protect forest ecosystems.
... LR's lowest increase is likely due to the nature and assumptions of the LR algorithm itself. LR classifier performance can be impacted by the presence of imbalanced data(Salas-Eljatib et al., 2018). ...
Article
Full-text available
In this paper, a novel classification algorithm that is based on Data Importance (DI) reformatting and Genetic Algorithms (GA) named GADIC is proposed to overcome the issues related to the nature of data which may hinder the performance of the Machine Learning (ML) classifiers. GADIC comprises three phases which are data reformatting phase which depends on DI concept, training phase where GA is applied on the reformatted training dataset, and testing phase where the instances of the reformatted testing dataset are being averaged based on similar instances in the training dataset. GADIC is an approach that utilizes the exiting ML classifiers with involvement of data reformatting, using GA to tune the inputs, and averaging the similar instances to the unknown instance. The averaging of the instances becomes the unknown instance to be classified in the stage of testing. GADIC has been tested on five existing ML classifiers which are Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Logistic Regression (LR), Decision Tree (DT), and Naïve Bayes (NB). All were evaluated using seven open-source UCI ML repository and Kaggle datasets which are Cleveland heart disease, Indian liver patient, Pima Indian diabetes, employee future prediction, telecom churn prediction, bank customer churn, and tech students. In terms of accuracy, the results showed that, with the exception of approximately 1% decrease in the accuracy of NB classifier in Cleveland heart disease dataset, GADIC significantly enhanced the performance of most ML classifiers using various datasets. In addition, KNN with GADIC showed the greatest performance gain when compared with other ML classifiers with GADIC followed by SVM while LR had the lowest improvement. The lowest average improvement that GADIC could achieve is 5.96%, whereas the maximum average improvement reached 16.79%. Keywords: machine learning; genetic algorithms; classification; data importance
... Neither GLMMs nor threshold models formally require a balanced response distribution. However, the distribution can be so imbalanced such that the information content is too low to accurately estimate the model parameters, which in turn leads to a higher variability in the estimates (Salas-Eljatib et al. 2018). Less problems are expected with increasing amount of data. ...
Thesis
Full-text available
Health and reproductive traits are increasingly important in cattle breeding programms all around the world. In contrast to productivity traits, health and reproductive traits are often measured on a nominal or ordinal scale which makes classical breeding value estimation via linear mixed effects models (LMMs) inappropriate. Despite extensive litherature, application of generalized linear mixed effects models (GLMMs) and threshold models in practical breeding value estimation remains challenging due to limited availability of software implementation for this specific purpose. In this study we present available software packages, show their weaknesses and implement improvements. The implementations were tested on simulated data sets and compared with respect to computation time and accuracy of the estimated breeding values. The best implementations were applied to realworld data sets of some major Swiss cattle populations. Traits of interest were multiple birth, early-life calf survival and carcass confirmation scores. GLMMs and threshold models clearly improved the prediction of breeding values compared to LMMs when applied to simulated binary and ordinal traits. Bayesian implementations performed relatively slow for small data sets but returned trustworthy standard errors of the estimated breeding value by accounting for the uncertainty of variance component estimation. The improvements also came at a higher computational cost, however, the cost was largely reduced by assuming known variance components. A similar strategy was successfully applied to the much larger real world data sets by separately estimating variance components and animal breeding values. This study shows that GLMMs and threshold models can and should be applied for non-normal traits in order to improve the properties of estimated breeding value and obtain unbiased heritability estimates which allow for well-informed constructions of selection indices.
... To convert the outputs of the Maxent and logistic regression models to the binary prediction (i.e., 0 indicates that the habitat is not suitable, while 1 indicates that the habitat is suitable), a threshold of 0.5 was set. This threshold is often used as the default threshold for binary classification [37,38]. Given that both algorithms are probability-based methods, to assign samples to the class with relatively higher probability, the threshold was set at the middle point between 0% and 100%. ...
Article
Full-text available
Extensive occurrence of rice sheath blight has been observed in China in recent years due to agricultural practices and climatic conditions, posing a serious threat to rice production. Assessing habitat suitability for rice sheath blight at a regional scale can provide important information for disease forecasting. In this context, the present study aims to propose a regional-scale habitat suitability evaluation method for rice sheath blight in Yangzhou city using multisource data, including remote sensing data, meteorological data, and disease survey data. By combining the epidemiological characteristics of the crop disease and the Relief-F algorithm, some habitat variables from key stages were selected. The maximum entropy (Maxent) and logistic regression models were adopted and compared in constructing the disease habitat suitability assessment model. The results from the Relief-F algorithm showed that some remote sensing variables in specific temporal phases are particularly crucial for evaluating disease habitat suitability, including the MODIS products of LAI (4–20 August), FPAR (9–25 June), NDVI (12–20 August), and LST (11–27 July). Based on these remote sensing variables and meteorological features, the Maxent model yielded better accuracy than the logistic regression model, with an area under the curve (AUC) value of 0.90, overall accuracy (OA) of 0.75, and a true skill statistics (TSS) value of 0.76. Indeed, the results of the habitat suitability assessment models were consistent with the actual distribution of the disease in the study area, suggesting promising predictive capability. Therefore, it is feasible to utilize remotely sensed and meteorological variables for assessing disease habitat suitability at a regional scale. The proposed method is expected to facilitate prevention and control practices for rice sheath blight disease.
... Logistic regression classifier is a classical classification method in machine learning, and logistic regression models belong to log-linear models [16][17]. The most common logistic regression model is the binomial logistic regression classification model, and the conditional probability distribution of the classification is shown as follows: ...
Article
Full-text available
This paper analyzes the relationship between talent introduction policy and talent mobility based on a logistic breakpoint regression model. The logit function and likelihood function are analyzed to determine the probability values between the relationship of things. The variable selection is achieved by adding penalty terms in estimating the regression parameters. Focused on the analysis of the MM algorithm to select the optimal control function and iterative solution to separate the parameters in the optimization problem. The processing effect of breakpoint analysis is studied, and breakpoint regression is classified and used to optimize the objective function to the minimum using local linear regression. The feasibility of the breakpoint analysis method is demonstrated by analyzing the use of breakpoint regression in different fields. The results surface: the probability of implementing a talent introduction policy for individuals on one side of the breakpoint is 0, while the probability on the other side of the breakpoint is 1, which meets the condition of exact breakpoint regression. The most talent policies are mainly concentrated in the third and fourth tier cities, accounting for 49%, followed by the second tier cities, with 39% of talent introduction policies.
... Thus, this work compares the performance of classical machine learning techniques with deep learning techniques on a dataset collected from a mobile application for spirometric measurements [11]. Nevertheless, class imbalance is a problem that compromises the performance of artificial intelligence models [13]; hence, after contrasting multiple oversampling methods, a maximum accuracy higher than 99% is obtained for the prediction of lung ages. ...
Chapter
The use of artificial intelligence in the quest to contribute to human longevity is becoming increasingly common in medical settings, one of these being spirometry. Given the different factors that can deteriorate the pulmonary status, several works aim to establish ways to alert future patients of their potential pulmonary complications. Thus, we carry out a lung age prediction task from spirometry data extracted using a previously developed mobile application. Regarding the imbalanced classes, SMOTE, ADASYN, and Random Oversampling algorithms were compared with different classifier models. The SMOTE and Quadratic Discriminant Analysis combination achieves a 99.12% accuracy, 99.09% specificity, and 99.91% sensitivity. Additionally, we performed an exploratory analysis of deep learning models, demonstrating that multilayer perceptron models, along with feature fusion techniques, achieve higher performances than classical models such as K-Nearest Neighbors or Decision Trees.
Keywords: Artificial Intelligence, Spirometry, Machine Learning
Article
Early warning of increased algal activity is important to mitigate potential impacts on aquatic life and human health. While many methods have been developed to predict increased algal activity, an ongoing issue is that severe algal blooms often occur with low frequency in water bodies. This results in imbalanced data sets available for model specification, leading to poor predictions of the frequency of increased algal activity. One approach to address this is to resample data sets of increased algal activity to increase the prevalence of higher than normal algal activity in calibration data and ultimately improve model predictions. This study aims to investigate the use of resampling techniques to address the imbalanced dataset and determine if such methods can improve the prediction of increased algal activity. Three techniques were investigated, Kmeans under-sampling (US_Kmeans), synthetic minority over-sampling technique (SMOTE), and 'SMOTE and cluster-based under-sampling technique' (SCUT). The resampling methods were applied to a Bayesian network (BN) model of Lake Burragorang in New South Wales, Australia. The model was developed to predict chlorophyll-a (chl-a) using a range of water quality parameters as predictors. The original data and each of the balanced datasets were used for BN structures and parameter learning. The results showed that the best graphical structure was obtained by adding synthetic data from SMOTE with the highest true positive rate (TPR) and area under the curve (AUC). When compared using a fixed graphical structure for the BN, all resampling techniques increased the ability of the BN to detect events with higher probability of increased algal activity. The resampling model results can also be used to better understand the most important influences on high chl-a concentrations and suggest future data collection and model development priorities.
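Of the resampling techniques named in this abstract, SMOTE creates synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal NumPy sketch of that idea only. It is not the implementation used in the study, and the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    X_min : array of shape (n_samples, n_features) with minority-class rows only.
    k     : number of nearest minority neighbours considered per point.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its k neighbours for each new sample.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Because each synthetic row is a convex combination of two observed minority rows, it always lies within the range of the observed minority data. The synthetic rows would then be appended to the calibration data before model fitting.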
Article
Tree mortality is a key process shaping forest dynamics. Thus, there is a growing need for indicators of the likelihood of tree death. During the last decades, an increasing number of tree-ring based studies have aimed to derive growth–mortality functions, mostly using logistic models. The results of these studies, however, are difficult to compare and synthesize due to the diversity of approaches used for the sampling strategy (number and characteristics of alive and death observations), the type of explanatory growth variables included (level, trend, etc.), and the length of the time window (number of years preceding the alive/death observation) that maximized the discrimination ability of each growth variable. We assess the implications of key methodological decisions when developing tree-ring based growth–mortality relationships using logistic mixed-effects regression models. As examples, we use published tree-ring datasets from Abies alba (13 different sites), Nothofagus dombeyi (one site), and Quercus petraea (one site). Our approach is based on a constant sampling size and aims at (1) assessing the dependency of growth–mortality relationships on the statistical sampling scheme used, (2) determining the type of explanatory growth variables that should be considered, and (3) identifying the best length of the time window used to calculate them. The performance of tree-ring-based mortality models was reasonably high for all three species (area under the receiver operating characteristic curve, AUC > 0.7). Growth level variables were the most important predictors of mortality probability for two species (A. alba, N. dombeyi), while growth-trend variables need to be considered for Q. petraea. In addition, the length of the time window used to calculate each growth variable was highly uncertain and depended on the sampling scheme, as some growth–mortality relationships varied with tree age.
The present study accounts for the main sampling-related biases to determine reliable species-specific growth–mortality relationships. Our results highlight the importance of using a sampling strategy that is consistent with the research question. Moving towards a common methodology for developing reliable growth–mortality relationships is an important step towards improving our understanding of tree mortality across species and its representation in dynamic vegetation models.
Article
Ecologists widely use the log response ratio for summarizing the outcomes of studies for meta-analysis. However, little is known about the sampling distribution of this effect size estimator. Here I show with a Monte Carlo simulation that the log response ratio is biased when quantifying the outcome of studies with small sample sizes, and can yield erroneous variance estimates when the scale of study parameters are near zero. Given these challenges, I derive and compare two new estimators that help correct this small-sample bias, and update guidelines and diagnostics for assessing when the response ratio is appropriate for ecological meta-analysis. These new bias-corrected estimators retain much of the original utility of the response ratio and are aimed to improve the quality and reliability of inferences with effect sizes based on the log ratio of two means.
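The small-sample bias described in this abstract can be illustrated with a short Monte Carlo sketch. The gamma-distributed data, the sample sizes, and the function name `simulate_lnrr_bias` are assumptions chosen for illustration. The correction term is the standard second-order delta-method adjustment and is not claimed to be identical to either estimator derived in the article.

```python
import numpy as np

def simulate_lnrr_bias(n_t=4, n_c=4, reps=20000, seed=0):
    """Return (naive_bias, corrected_bias) of the log response ratio.

    Both groups have a true mean of 10, so the true log response ratio is 0.
    The treatment group is far more variable than the control group.
    """
    rng = np.random.default_rng(seed)
    # Assumed data-generating process: gamma draws, mean = shape * scale = 10.
    t = rng.gamma(shape=2.0, scale=5.0, size=(reps, n_t))
    c = rng.gamma(shape=20.0, scale=0.5, size=(reps, n_c))
    true_lnrr = 0.0

    mean_t, mean_c = t.mean(axis=1), c.mean(axis=1)
    var_t, var_c = t.var(axis=1, ddof=1), c.var(axis=1, ddof=1)

    # Naive estimator: log of the ratio of the two sample means.
    naive = np.log(mean_t / mean_c)
    # Delta-method adjustment: add back the approximate second-order bias.
    corrected = naive + 0.5 * (var_t / (n_t * mean_t**2)
                               - var_c / (n_c * mean_c**2))
    return naive.mean() - true_lnrr, corrected.mean() - true_lnrr

naive_bias, corrected_bias = simulate_lnrr_bias()
print(f"naive bias: {naive_bias:.4f}, corrected bias: {corrected_bias:.4f}")
```

With four observations per group, the naive estimator is biased downward in this setup because the noisier treatment mean contributes the larger negative term, and the adjustment removes most of that bias.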
Article
Global and regional climate models (GCM and RCM) are generally biased and cannot be used as forcing variables in ecological impact models without some form of prior bias correction. In this study, we investigated the influence of the bias correction method on drought projections in Mediterranean forests in southern France for the end of the twenty-first century (2071–2100). We used a water balance model with two different atmospheric climate forcings built from the same RCM simulations but using two different correction methods (quantile mapping or anomaly method). Drought, defined here as periods when vegetation functioning is affected by water deficit, was described in terms of intensity, duration and timing. Our results showed that the choice of the bias correction method had little effect on temperature and global radiation projections. However, although both methods led to similar predictions of precipitation amount, they induced strong differences in their temporal distribution, especially during summer. These differences were amplified when the climatic data were used to force the water balance model. On average, the choice of bias correction leads to 45% uncertainty in the predicted anomalies in drought intensity along with discrepancies in the spatial pattern of the predicted changes and changes in the year-to-year variability in drought characteristics. We conclude that the choice of a bias correction method might have a significant impact on the projections of forest response to climate change.
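As a rough illustration of one of the two correction methods named in this abstract, empirical quantile mapping passes each model value through the model's historical quantiles onto the observed quantiles at the same rank. The sketch below is a generic version under assumed inputs. The function name `quantile_map` and the choice of 101 quantile levels are hypothetical and do not reproduce the study's procedure.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_values, n_quantiles=101):
    """Bias-correct model_values by empirical quantile mapping.

    model_hist   : model output over the calibration (historical) period.
    obs_hist     : observations over the same calibration period.
    model_values : model output to correct (historical or future).
    """
    levels = np.linspace(0.0, 1.0, n_quantiles)
    model_q = np.quantile(model_hist, levels)
    obs_q = np.quantile(obs_hist, levels)
    # Map each value from the model quantile curve to the observed one.
    # Values outside the calibration range are clamped to the end points.
    return np.interp(model_values, model_q, obs_q)
```

Applied to the calibration period itself, the corrected series approximately reproduces the observed distribution. How values beyond the calibration range are handled (clamped here) is a design choice that matters for projections of extremes.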
Article
Rising temperatures are amplifying drought-induced stress and mortality in forests globally. It remains uncertain, however, whether tree mortality across drought-stricken landscapes will be concentrated in particular climatic and competitive environments. We investigated the effects of long-term average climate [i.e. 35-year mean annual climatic water deficit (CWD)] and competition (i.e. tree basal area) on tree mortality patterns, using extensive aerial mortality surveys conducted throughout the forests of California during a 4-year statewide extreme drought lasting from 2012 to 2015. During this period, tree mortality increased by an order of magnitude, typically from tens to hundreds of dead trees per km², rising dramatically during the fourth year of drought. Mortality rates increased independently with average CWD and with basal area, and they increased disproportionately in areas that were both dry and dense. These results can assist forest managers and policy-makers in identifying the most drought-vulnerable forests across broad geographic areas.
Article
Binary data are popular in ecological and environmental studies; however, due to various uncertainties and complexities present in data sets, the standard generalized linear model with a binomial error distribution often demonstrates insufficient predictive performance when analysing binary and proportional data. To address this difficulty, we propose an asymmetric logistic regression model that uses a new parameter to account for data complexity. We observe that this parameter controls the model's asymmetry and is important for adjusting the weights associated with observed data in order to improve model fitting. This model includes the ordinary logistic regression model as a special case. It is easily implemented using a slight modification of glm or glmer in statistical software R. Simulation studies suggest that our new approach outperforms a traditional approach in terms of both predictive accuracy and variable selection. In a case study involving fisheries data, we found that the annual catch amount had a greater impact on stock status prediction, and improved predictive capability was supported with a smaller AIC compared to a generalized linear model. In summary, our method can enhance the applicability of a generalized linear model to various ecological problems using a slight modification, and significantly improves model fitting and model selection.
Article
Presence-only data abounds in ecology, often accompanied by a background sample. Although many interesting aspects of the species’ distribution can be learned from such data, one cannot learn the overall species occurrence probability, or prevalence, without making unjustified simplifying assumptions. In this forum article we question the approach of Royle et al. (2012) that claims to be able to do this.
Book
Maintaining G.S. Maddala’s brilliant expository style of cutting through the technical superstructure to reveal only essential details, while retaining the nerve centre of the subject matter, Professor Kajal Lahiri has brought forward this new edition of one of the most important textbooks in its field. The new edition continues to provide a large number of worked examples, and some shorter data sets. Further data sets and additional supplementary material to assist both the student and lecturer are available on the companion website www.wileyeurope.com/college/maddala