A study on the effects of unbalanced data when fitting logistic regression models in ecology

Abstract

Binary variables have two possible outcomes: occurrence or non-occurrence of an event (usually with 1 and 0 values, respectively). Binary data are common in ecology, including studies of presence/absence, alive/dead, and change/no-change. Logistic regression analysis has been widely used to model binary response variables. Unbalanced data (i.e., a much larger proportion of zeros than ones) are often found across a variety of ecological datasets. Sometimes the data are balanced (i.e., same number of zeros and ones) before fitting the model; however, the statistical implications of balancing (or not) the data remain unclear. We assessed the statistical effects of balancing data when fitting a logistic regression model by studying both the statistical properties of the estimated parameters and the predictive capabilities of the model. We used a base forest-mortality model as reference and, using stochastic simulations representing different configurations of 0/1 data in a sample (unbalanced data scenarios), fitted the logistic regression model by maximum likelihood. For each scenario we computed the bias and variance of the estimated parameters and several prediction indexes. We found that the variability of the estimated parameters is affected, with the balanced-data scenario having the lowest variability, thus affecting the statistical inference as well. Furthermore, the prediction capabilities of the model are altered by balancing the data, with the balanced-data scenario having the better sensitivity/specificity ratio. Balancing, or not, the data to be used for fitting a logistic regression model may affect the conclusions that can arise from the fitted model and its subsequent applications.
Ecological Indicators
journal homepage: www.elsevier.com/locate/ecolind
Research paper
Christian Salas-Eljatib (a,*), Andres Fuentes-Ramirez (a), Timothy G. Gregoire (b), Adison Altamirano (c), Valeska Yaitul (a)
(a) Laboratorio de Biometría, Departamento de Ciencias Forestales, Universidad de La Frontera, Temuco, Chile
(b) School of Forestry and Environmental Studies, Yale University, New Haven, CT 065111, USA
(c) Laboratorio de Ecología del Paisaje Forestal, Departamento de Ciencias Forestales, Universidad de La Frontera, Temuco, Chile
ARTICLE INFO
Keywords:
Statistical inference
Model prediction
Logit model
Binary variable
Bias
Precision
1. Introduction
Data on the occurrence/non-occurrence of a phenomenon of interest are
widely found across several disciplines (Alberini, 1995; Arana and Leon,
2005; Bell et al., 1994). This type of variable is known as binary or
dichotomous, and it represents whether an event occurs or not. This
event is represented by the random variable Y, and we usually record
occurrence by Y = 1 and non-occurrence by Y = 0. In ecology, binary
variables arise when studying the presence of a species in a geographic
area (Bastin and Thomas, 1999; Phillips and Elith, 2013; Hastie and
Fithian, 2013) or the occurrence of mortality at the tree or forest level
(Davies, 2001; Wunder et al., 2008; Chao et al., 2009; Young et al.,
2017). Meanwhile in landscape ecology, binary variables are used to
represent the occurrence of fire within a given area (Bigler et al., 2005;
Mermoz et al., 2005; Dickson et al., 2006; Vega-García and Chuvieco,
2006; Palma et al., 2007; Bradstock et al., 2010); deforestation (Wilson
et al., 2005; Schulz et al., 2011; Kumar et al., 2014; Hu et al., 2014);
and in general the change from one land use category to another (Seto
and Kaufmann, 2005; Leyk and Zimmermann, 2007; Lander et al.,
2011).
Logistic regression analysis is the most frequently used modelling
approach for analyzing binary response variables. In ecology, it is one
of the most used approaches for statistically relating a binary variable
to predictor variable(s) or covariate(s) (Warton and Hui, 2011).
These models belong to the group of generalized linear models (GLM).
In a GLM, three components must be specified (Lindsey, 1997;
Schabenberger and Pierce, 2002): a random component, a systematic
component, and a link function. A logistic regression model uses a
binomial probability distribution as the random component; a
linear predictor function Xβ (where X is a matrix with the covariates
and β is a vector with the parameters or coefficients) as the systematic
component; and a logistic equation as the link function. One of the key
advantages of using logistic regression models in ecology is that the
probability of the binary response variable is directly modelled, thereby
accounting explicitly for the random nature of the phenomenon of interest.
http://dx.doi.org/10.1016/j.ecolind.2017.10.030
Received 7 April 2017; Received in revised form 9 August 2017; Accepted 16 October 2017
* Corresponding author. Tel.: +56 45 2325652.
E-mail address: christia.salas@ufrontera.cl (C. Salas-Eljatib).
In many applications dealing with binary data in ecology, the
number of observations with ones (Y = 1) is much
smaller than the number of observations with zeros (Y = 0), or vice
versa. We simply term this situation unbalanced data, but other terms
have also been used, including disproportionate
sampling (Maddala, 1992) and rare events (King and Zeng, 2001).
Based on our review of scientific applications of logistic regression to
model ecological phenomena, the proportion of zeros in datasets ranges
between 80% and 95%. Therefore, having balanced data (i.e., equal
numbers of observations of zeros and ones) is more the exception than
the rule.
Both unbalanced and balanced data have been used for fitting logistic
regression models. In ecological studies, some researchers have
adopted the practice of balancing the data before carrying out the
analyses (e.g., Vega-García et al., 1995; Vega-García et al., 1999; Lloret
et al., 2002; Brook and Bowman, 2006; Vega-García and Chuvieco,
2006; Jones et al., 2010; Rueda, 2010). Balancing data means selecting,
by some rule (usually at random), the same number of observations
with ones and zeros from the originally available dataset. A
balanced dataset or balanced sample is thus created, in which a 50–50%
proportion of zero and one values is met. After the balanced dataset is
built, the logistic regression model is fitted (i.e., its parameters are
estimated) by maximum likelihood (ML). An example of this practice in
ecological applications is the option for balancing data before fitting a
logit model when conducting analyses of land use changes in the software
IDRISI (Eastman, 2006). On the other hand, it is important to
point out that unbalanced data have also been used in ecological studies
(Wilson et al., 2005; Echeverria et al., 2008; Kumar et al., 2014; Young
et al., 2017). Therefore, unbalanced data in applied ecological studies
have been treated as not having important effects on the models
being fitted. Moreover, to date, no studies have addressed the effect of
balancing data when fitting logistic regression models in ecological
analyses, and just a handful have explored some statistical implications
in ecological applications (Qi and Wu, 1996; Wu et al., 1997; Cailleret
et al., 2016).
The applied statistical implications of unbalanced data in logistic
regression are neither well described nor well understood by applied researchers.
Although balancing the data seems to be an accepted practice, the
reasons that justify its use are not well explained. The most immediate
effect of balancing the data is to greatly reduce the sample size available
for fitting purposes, therefore decreasing the precision with which
the parameters of the model are estimated. Among the statistical studies
on logistic regression and unbalanced data, we highlight the following.
Schaefer (1983) and Scott and Wild (1986) pointed out that the maximum
likelihood estimates (MLE) of a logit model are biased only for
small sample sizes. On the other hand, Xie and Manski (1989) stated
that unbalanced data only affect the intercept parameter of a logit
model, which is estimated with bias according to Maddala
(1992). King and Zeng (2001) argued that all the MLE of the logit
parameters are biased. Schaefer (1983) and Firth (1993) proposed
corrections for the bias of the MLE of the logistic regression model
parameters. McPherson et al. (2004) conducted one of the few related
analyses when fitting presence-absence species distribution models in
ecology, but focused only on the prediction capabilities of the fitted
models. Maggini et al. (2006) assessed the effect of weighting absences
when modelling forest communities by generalized additive models.
Recently, Komori et al. (2016) indicated that logistic regression suffers
from poor predictive performance, and proposed an alternative model to
improve it. The approach of Komori et al. (2016) adds a new parameter
to the original structure of a logistic regression model and fits it in a
mixed-effects modelling framework; their approach therefore becomes a
different type of statistical model.
From the above, we can infer that: (a) most of the statistical studies on
logistic regression and unbalanced data have focused on the bias of the
MLE parameters (a topic that has rarely been taken into account in
ecological applications); (b) much less attention has been paid to
prediction performance; and (c) no study has dealt with the effects of
unbalanced data on the variance of the MLE parameters.
In this study we aimed to assess the effect of using unbalanced
data when fitting logistic regression models by analyzing both the
statistical properties (i.e., bias and variance) of the estimated parameters
and the predictive capabilities of the fitted model.
2. Methods
2.1. Base model
The binary variable (Y) is the occurrence of a phenomenon of interest,
where Y = 1 denotes occurrence and Y = 0 otherwise. In a
modelling framework, we seek to model the probability of the response
variable being Y = 1, given the values of the predictor variables, that is,
Pr(Y = 1|X), which we can more easily represent by $\pi_{y|X}$.
In our analysis we used a logistic regression equation with five
predictor variables as a base model. This
model served as a reference for assessing the statistical effects of unbalanced
data on fitting logistic regression models. The binary variable
of forest mortality occurrence (Y), following the analyses of Young et al.
(2017) in the state of California, USA, is modelled as a function of climate
and biotic variables, as follows:
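$$
\ln\left[\frac{\pi_{y_i|x_i}}{1-\pi_{y_i|x_i}}\right] = \operatorname{logit}[Y_i = 1] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i}, \qquad (1)
$$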
where $Y_i$ is the occurrence of forest mortality (i.e., 1 for occurrence, 0
for non-occurrence) at the ith pixel, and the predictor variables
$X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, and $X_{5i}$ represent, respectively: the mean
climatic water deficit (CWD), or simply $\text{Defnorm}_i$; the basal area of live
trees ($\text{BA}_i$); $\text{BA}_i^2$; the CWD anomaly ($\text{Defz0}_i$); and
$\text{Defnorm}_i \times \text{BA}_i$ for the ith pixel. We
have used the nomenclature for the variables as in the study of Young
et al. (2017) and only the available data for year 2012. Notice that we
could more easily represent model (1) as:
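$$
\ln\left[\frac{\pi_{y|X}}{1-\pi_{y|X}}\right] = \operatorname{logit}[y = 1] = X\beta, \qquad (2)
$$
where y is the vector with the binary variable, X is the matrix with the
predictor variables (and a first column of 1), and β is the vector of
parameters $[\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4, \hat{\beta}_5]$.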
In the sequel, we shall use Eq. (2) as the mean function in various
scenarios of unbalanced data. It is important to point out that we are
not interested in finding the best model, but rather in studying the
effects of using several unbalanced data scenarios on a reference model.
Furthermore, we do not seek to assess alternative statistical models
for unbalanced data (as in, e.g.,
Warton and Hui, 2011; Hastie and Fithian, 2013). We also note that
zero-inflated models focus on modelling
count variables (Schabenberger and Pierce, 2002; Zuur et al., 2010),
such as the prediction of the amount of tree mortality (e.g., Affleck,
2006). These models are not part of our study, since we are dealing with
modelling a binary variable.
2.2. Unbalanced data scenarios
We use data of forest mortality occurrence in California during 2012
from Young et al. (2017) as our population, containing 11763 total
observations (N), with 2985 cases of mortality occurrence ($N_1$) and
8778 cases of non-occurrence ($N_0$). In order to assess the effects of
unbalanced data on the statistical properties of the logit model (Eq.
(2)), we examined different sampling strategies from the population,
each with a different proportion of occurrence and non-occurrence
of mortality (1 and 0 values, respectively). We fixed the sample
size at n = 1000 in all scenarios and varied the proportion of zeros
that the sample should contain across scenarios, from 10% to 90%.
In this way, we constrained the sample to contain different numbers of
cases with zeros ($n_0$) and ones ($n_1$),
but the same sample size (n = 1000). In order to achieve each
proportion of 0/1 values, which has a fixed sample size of zeros and ones
(i.e., $n_0$ and $n_1$, respectively), we (i) drew a random sample without
replacement of size $n_0$ from the sub-population (with size $N_0$) of cases
containing zeros in the response variable; (ii) drew a random sample
without replacement of size $n_1$ from the sub-population (with size $N_1$)
of cases containing ones in the response variable; and (iii) merged the
randomly selected $n_0$ and $n_1$ cases into a sample of size n (i.e.,
$n = n_0 + n_1$).
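The three sampling steps can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function name `draw_scenario_sample`, the representation of each case as a `(y, covariates)` pair, and the toy population (which only mirrors the reported sizes of 8778 zeros and 2985 ones) are assumptions.

```python
import random


def draw_scenario_sample(population, prop_zeros, n=1000, rng=None):
    """Draw n cases with a fixed share of zeros, sampling each class without replacement.

    `population` is a list of (y, covariates) pairs with y in {0, 1};
    this layout is a hypothetical choice made for illustration.
    """
    rng = rng or random.Random()
    zeros = [case for case in population if case[0] == 0]
    ones = [case for case in population if case[0] == 1]
    n0 = round(n * prop_zeros)  # number of zeros required by the scenario
    n1 = n - n0                 # number of ones, so that n0 + n1 = n
    if n0 > len(zeros) or n1 > len(ones):
        raise ValueError("sub-population too small for this scenario")
    # Steps (i) and (ii): simple random sampling without replacement per class.
    sample = rng.sample(zeros, n0) + rng.sample(ones, n1)
    rng.shuffle(sample)         # step (iii): merge into a single sample of size n
    return sample


# Toy population with the same class sizes as reported (covariates omitted).
toy_population = [(0, None)] * 8778 + [(1, None)] * 2985
sample = draw_scenario_sample(toy_population, prop_zeros=0.9, rng=random.Random(1))
print(len(sample), sum(y for y, _ in sample))
```

Sampling each class separately fixes $n_0$ and $n_1$ exactly in every draw, which a single random draw from the whole population would not guarantee.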
2.3. Statistical assessment
We assessed the statistical properties of the fitted logistic regression
model by stochastic simulations (i.e., Monte Carlo simulation). We
carried out S = 100,000 simulations so that the sampling error of the
simulation itself is negligibly small. A similar analysis to justify the
number of simulations was conducted by Gregoire and
Schabenberger (1999), and our choice agrees with the number of simulations
conducted in other statistical simulation studies (e.g., Gregoire and
Salas, 2009; Salas and Gregoire, 2010). For each simulated sample, we
fitted the logistic regression model (Eq. (2)) by maximum likelihood
using the glm function implemented in R (R Development Core Team,
2016).
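To make the maximum likelihood step concrete, the sketch below fits a logit model with a single predictor by Newton-Raphson, which for the logit link coincides with Fisher scoring. It is a stand-in for illustration only and not the study's code, which used R's glm with five predictors; the function `fit_logit`, the simulated data, and the coefficient values are all hypothetical.

```python
import math
import random


def fit_logit(xs, ys, max_iter=25, tol=1e-10):
    """Maximum likelihood fit of logit[Pr(Y = 1)] = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # fitted probability
            w = p * (1.0 - p)                            # binomial variance weight
            g0 += y - p                                  # score, intercept
            g1 += (y - p) * x                            # score, slope
            h00 += w                                     # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: information^{-1} times score (2x2 inverse in closed form).
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1


# Toy data generated from hypothetical coefficients (not those of the study).
rng = random.Random(42)
true_b0, true_b1 = -1.0, 0.8
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(true_b0 + true_b1 * x))) else 0
      for x in xs]
b0_hat, b1_hat = fit_logit(xs, ys)
print(round(b0_hat, 2), round(b1_hat, 2))
```

In a Monte Carlo study such a fit would be repeated once per simulated sample, storing the estimates from each of the S runs.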
Based on the simulations, we examined the empirical distribution of
the estimated parameters and prediction indexes. Our assessment
focused on: (a) the statistical properties of the estimated model
parameters, and (b) the accuracy of predictions from the fitted model.
(a) Statistical properties of the estimated parameters. In order to assess
how the accuracy of the estimated parameters is affected by unbalanced
data, we computed the empirical bias ($B_{MC}$) of each estimated
parameter, $\hat{\theta}$, as follows:
$$
B_{MC}[\hat{\theta}] = E[\hat{\theta}] - \theta, \qquad (3)
$$
where $\theta$ is the respective parameter value and $E[\hat{\theta}]$ is the empirical
expected value of the estimated parameter. The former was obtained
from the maximum likelihood estimate (MLE) of $\theta$ using the population
available, and the latter is approximated by the average of the S
values of the estimated parameter $\hat{\theta}$. Notice that $\hat{\theta}$ in Eq. (3) is replaced
by each parameter of the model (i.e., $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$, $\hat{\beta}_4$, and $\hat{\beta}_5$).
In order to assess how the precision of the estimated parameters is
affected by unbalanced data, we computed the empirical variance
($V_{MC}$) of each estimated parameter $\hat{\theta}$ as follows:
$$
V_{MC}[\hat{\theta}] = \frac{1}{S}\sum_{j=1}^{S}\left(\hat{\theta}_j - E[\hat{\theta}]\right)^2, \qquad (4)
$$
where $\hat{\theta}_j$ is the MLE of $\theta$ for the jth simulation. Finally, we computed the
empirical mean square error ($ECM_{MC}$) of each $\hat{\theta}$ by:
$$
ECM_{MC}[\hat{\theta}] = V_{MC}(\hat{\theta}) + [B_{MC}(\hat{\theta})]^2. \qquad (5)
$$
We represented the variance and mean square error in the same units as
the estimated parameters by taking their square roots, thus obtaining the
standard error ($SE[\hat{\theta}]$) and the root mean squared error ($RMSE[\hat{\theta}]$).
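The empirical bias, variance, and mean square error of Eqs. (3) to (5) can be computed from the S simulated estimates as sketched below. The function name `monte_carlo_summary` and the three toy estimates are hypothetical; the sketch assumes the usual conventions of bias as the expected estimate minus the parameter value and a variance divisor of S.

```python
import math


def monte_carlo_summary(estimates, theta):
    """Empirical bias, variance, SE, and RMSE of S simulated estimates of theta."""
    S = len(estimates)
    expected = sum(estimates) / S                               # E[theta_hat]
    bias = expected - theta                                     # Eq. (3)
    variance = sum((e - expected) ** 2 for e in estimates) / S  # Eq. (4), divisor S
    mse = variance + bias ** 2                                  # Eq. (5)
    return {
        "bias": bias,
        "variance": variance,
        "se": math.sqrt(variance),   # standard error, same units as theta
        "rmse": math.sqrt(mse),      # root mean squared error
    }


# Toy example with three made-up estimates of a parameter whose value is 1.5.
summary = monte_carlo_summary([1.0, 2.0, 3.0], theta=1.5)
print(summary)
```

Expressing the bias as a percentage of the parameter value, as done for the figures, would only require dividing by `theta` and multiplying by 100.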
(b) Prediction capabilities. For each simulation and unbalanced data
scenario we computed prediction indexes of the logistic regression
model. To do so, we calculated the predicted probability of
mortality occurrence for the ith observation ($\hat{\pi}_{y_i=1|X_i}$), as follows:
$$
\hat{\pi}_{y_i=1|X_i} = \frac{1}{1+e^{-X_i\hat{\beta}}}, \qquad (6)
$$
where $X_i$ is the row vector of predictor variables for the ith case and $\hat{\beta}$ is the
vector of estimated parameters. We used a probability threshold of 0.5
for occurrence; that is, if $\hat{\pi}_{y_i=1|X_i} \geq 0.5$ we assume that the event
occurs, and that it does not occur otherwise (Jones et al., 2010). Based on
these predicted probabilities, we computed the following eight prediction
indexes: commission error (proportion of n cases in which the
model erroneously predicts occurrence); commission accuracy (proportion
of n cases in which the model correctly predicts occurrence);
omission error (proportion of n cases in which the model erroneously
predicts non-occurrence); omission accuracy (proportion of n cases in
which the model correctly predicts non-occurrence); sensitivity (proportion
of the total cases of occurrence where the model correctly
predicts occurrence); and specificity (proportion of the total cases of
non-occurrence where the model correctly predicts non-occurrence).
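The indexes defined above can be computed from observed outcomes and predicted probabilities as sketched below. The function name `prediction_indexes` and the five toy cases are hypothetical, the mapping of each index to a cell of the 2x2 classification table follows the wording of the definitions in the text, and treating a predicted probability of at least 0.5 as occurrence is an assumption of this sketch.

```python
def prediction_indexes(y_true, p_hat, threshold=0.5):
    """Classification indexes for binary outcomes given predicted probabilities."""
    n = len(y_true)
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p >= threshold else 0  # assumed rule: occurrence if p >= 0.5
        if predicted == 1 and y == 1:
            tp += 1   # occurrence correctly predicted
        elif predicted == 1 and y == 0:
            fp += 1   # occurrence erroneously predicted
        elif predicted == 0 and y == 1:
            fn += 1   # non-occurrence erroneously predicted
        else:
            tn += 1   # non-occurrence correctly predicted
    nan = float("nan")
    return {
        "commission_error": fp / n,
        "commission_accuracy": tp / n,
        "omission_error": fn / n,
        "omission_accuracy": tn / n,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Five made-up cases: observed outcomes and predicted probabilities.
indexes = prediction_indexes([1, 1, 0, 0, 1], [0.9, 0.4, 0.6, 0.2, 0.5])
print(indexes)
```

The four proportions over n always add up to one, whereas sensitivity and specificity are each conditioned on one observed class.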
We also carried out all the above analyses (i.e., simulation and
statistical properties assessment) for a different dataset. We used data on
forest fire occurrence in central Chile, as a way of representing how our
findings could change in a forest fire model; the main results are
shown in the Supplementary Material.
3. Results
The proportion of 0/1 in the data used for fitting a logistic regression
model affects the distribution of the estimated parameters. The
variability of the estimated parameters tends to increase with an extreme
proportion of zeros (or ones) in the data (Fig. 1).
Unbalanced data affect the bias of the estimated parameters. All
the parameter estimates were nearly unbiased for the assessed proportion of
zeros that is closest to the proportion of zeros in the entire
population (first row of panels in Fig. 1). However, for the other unbalanced
data scenarios, all parameters are estimated with bias (Fig. 1).
The bias increases as the proportion of zeros in the data decreases, both
in nominal units (Fig. 1) and in percentage (Fig. 2a). The bias is
larger for the estimated intercept parameter than for the other parameters,
regardless of the unbalanced data scenario. The only exception to
this trend is the estimate of the parameter $\beta_2$, which is also heavily biased,
possibly as a result of its higher variability compared to the other
parameter estimates (Fig. 2b). More importantly, the greatest precision
of all estimated parameters occurs with balanced data (Fig. 2b), as does
the lowest root mean squared error (Fig. 2c). The greatest
precision of the estimated parameters for the balanced-data scenario was
even more pronounced for the forest fire model (Fig. 4). This can be a
result of a stronger relationship between the response and the predictor
variables than we found in the forest mortality model. Besides, the
forest fire model (Eq. (7) in Supplementary Material) has a lower
number of parameters; multicollinearity should therefore be a lesser
problem than in a model with two more parameters (Eq. (1)). In fact, in
the mortality model there are two parameters representing functions of
variables already present in the model (i.e., $\text{BA}_i^2$ and $\text{Defnorm}_i \times \text{BA}_i$);
therefore, the model is affected by multicollinearity.
The prediction capabilities of the logit model are greatly affected by
the different proportions of zeros and ones. Both overall error (i.e., sum
of omission and commission errors) and overall accuracy (i.e., sum of
omission and commission accuracy) tend to improve, with a decreasing
and an increasing trend, respectively, when extreme proportions of zeros
(or ones) are used for fitting the model (Table 1). Moreover, the larger
the proportion of zeros in the data, the better the prediction of
non-occurrence (i.e., higher values of omission accuracy). A similar trend,
but not completely linear, is found when the omission errors are used as
reference. Conversely, the larger the proportion of ones in the
data, the better the prediction of occurrence (i.e., higher values of
commission accuracy). A similar trend is found when the commission
errors are used as reference (Table 1). A clear pattern is observed if
specificity or sensitivity is used as reference: specificity
increases with a higher number of zeros, whereas sensitivity decreases as the
number of zeros increases (Table 1).
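As a worked check of the overall error and overall accuracy described above, the sums can be computed directly from the values reported in Table 1 for the 10%, 50%, and 90% zeros scenarios (the subscripts denoting the scenario are notation introduced here for illustration):

```latex
\begin{aligned}
\text{Overall error}_{10\%} &= 0.41 + 9.54 = 9.95, &
\text{Overall accuracy}_{10\%} &= 89.59 + 0.46 = 90.05,\\
\text{Overall error}_{50\%} &= 13.09 + 21.52 = 34.61, &
\text{Overall accuracy}_{50\%} &= 36.91 + 28.48 = 65.39,\\
\text{Overall error}_{90\%} &= 9.99 + 0.01 = 10.00, &
\text{Overall accuracy}_{90\%} &= 0.01 + 89.99 = 90.00.
\end{aligned}
```

The two extreme scenarios reach an overall accuracy of about 90%, against about 65% for the balanced scenario, which matches the stated trend.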
4. Discussion
In this paper we demonstrated that the common unbalanced proportion
of zeros and ones found in ecological data affects the statistical
properties of the logistic regression models being fitted. Because the variances
of the estimated parameters are affected by the proportion of 0/1
data, all the statistical inference (e.g., hypothesis testing) of the fitted
model will be affected. Thus, if we are investigating the driver variables
of an ecological phenomenon, such as species distribution across a geographic
area, we could be determining them erroneously, because the
statistical significance of each parameter of the model is based upon
its respective variance estimator. Therefore, the practice of balancing
data must be carried out with caution, fully considering its
implications for model performance. Some authors have argued that
there is no major effect in having unbalanced binary data (except for
the bias in the intercept parameter; Maddala, 1992), but our results
indicate that all statistical properties of the MLE parameters are affected.
Although all the parameter estimates are biased, the magnitude
of the bias diminishes as the sample approaches the proportion of
zeros found in the population (see the crossed lines in Fig. 2a).
Notice that it has previously been stated that all the parameters would
be biased for small sample sizes (Schaefer, 1983; King and Zeng,
2001), but that was not necessarily the case in the present study (where
n = 1000).
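The link between the variance of an estimated parameter and its statistical significance can be made explicit with the Wald statistic, a test commonly reported for logistic regression coefficients. That the applications discussed here rely on this particular test is an assumption, and the index k is notation introduced for illustration:

```latex
z_k = \frac{\hat{\beta}_k}{\widehat{SE}[\hat{\beta}_k]},
\qquad \text{reject } H_0\colon \beta_k = 0 \text{ at the 5\% level when } |z_k| > 1.96 .
```

For the same estimate $\hat{\beta}_k$, a larger standard error gives a smaller $|z_k|$, so a 0/1 proportion that inflates the variance can change which predictor variables are declared significant.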
We also claim that the prediction capabilities of the logistic regression
model are affected, as was also found by McPherson et al.
(2004) and Maggini et al. (2006), although with slightly different statistical
models. Thus, a given ecological binary phenomenon could be erroneously
predicted to occur (or not) if the fitted model suffers from
statistical issues derived from using unbalanced data. This is especially
critical for predicting habitat suitability for endangered species (and their
conservation) or for predicting the distribution range of exotic invasive
species and their subsequent control plans. In either case it can result in
allocating efforts and resources in an inefficient manner. In this study
we encourage researchers to carefully examine the nature of the data
they have available and its 0/1 proportion before fitting the
logistic regression model, as this can greatly improve both the statistical
properties of the estimated parameters of the model and the prediction
capabilities applied to ecological phenomena.
Fig. 1. Empirical distribution of estimated parameters for the forest mortality model (Eq. (1)) given different scenarios of zeros in the data. The vertical solid line and the vertical dashed line, within each histogram, represent the parameter value and the Monte Carlo expected value of the estimated parameter, respectively.
We did not focus on analyzing alternatives for overcoming the effects
of unbalanced data nor on finding the best fitted model. Some
studies dealing with models in ecology have shown the necessity of
effectively correcting biased analyses for better interpretation and prediction
capabilities (Lajeunesse, 2015; Ruffault et al., 2014), but we
focused on pointing out the effects of unbalanced data when fitting
logistic regression models. The bias in the intercept of a logistic model
could be diminished by using the correction given by Manski and
Lerman (1977), but this type of correction is better suited for disciplines
where a higher proportion of ones in the population is more common to
find or sample (e.g., social, economic, and political sciences) than for
ecological populations.
The statistical inference of fitted logistic regression models is affected
by the unbalanced nature of ecological data. Our results show
that the largest standard error and root mean squared error of the estimated
parameters are found with extreme proportions of zeros
(or ones) in the data. More importantly, for the first time in the literature,
as far as we are aware, we described that the variability of
the maximum likelihood estimated (MLE) parameters decreases with
a balanced sample. This finding may suggest that balancing data
is an appropriate practice if statistical inference (e.g., hypothesis
testing) is what the researcher is concerned about. Hence, by using
unbalanced data, we might conclude that a predictor variable is statistically
significant when in fact it is not, or vice versa. Furthermore,
this finding contradicts the claim of King and Zeng (2001) that
the addition of ones to the data would decrease the variance of
the MLE parameters.
We also found that unbalanced data heavily affect the prediction
capabilities of a logistic regression model. Our study shows that the
occurrence of the event is better predicted with larger proportions
of ones in the data. On the other hand, non-occurrence of the
event is better predicted with larger proportions of zeros in the
data. This trend is expected, because the model is fitted by ML, where
the parameter estimates are those that maximize the likelihood of the
data at hand; therefore, we should predict them concordantly
(Schabenberger and Pierce, 2002). Also, if we take into account the
trade-off of building a model that predicts occurrence and non-occurrence
as well as possible, the balanced data scenario with 50% of zeros
and ones offers a suitable way to proceed (Table 1). Overall, balancing
the data seems to be an appropriate practice to improve some statistical
properties and prediction capabilities of the fitted model. Regardless of
whether the data are balanced before fitting a logistic regression
model, we recommend using the remaining sample (i.e., not used for
fitting the model) for validation purposes and behavior analyses.
5. Recommendations
Given that the proportion of 0/1 data affects the variance of the
estimated parameters of the fitted logistic regression model, the selection
of the statistically significant predictor variables to conduct the
analyses may also be influenced, ultimately leading to a wrong
conclusion. From our study, we have provided evidence that using a
balanced data scenario (i.e., 50% of zeros and ones) will yield smaller
variances for the maximum likelihood estimates of parameters, therefore
offering less uncertainty in the estimation process, and ultimately
in identifying the driver variables for modelling presence/absence response
variables. This finding is extremely relevant in ecological applications,
as a considerable number of studies currently deal with
niche modeling and species distribution based on presence/absence
data, especially within the climate change context. Thus, we recommend
that, when modelling binary response variables, researchers
can safely use balanced datasets for fitting candidate models, in order to
choose the best model given the variables used for the analysis. Given
our results, by performing this procedure, the analysis itself will gain
more certainty because the researcher could better distinguish between
the effects of the predictor variables included in the model and
whether the ecological phenomenon is really important (McPherson
et al., 2004). Another issue to take into account is the drastic reduction
of the sample for balancing purposes. We recommend using the remaining
data (i.e., those not used for fitting purposes) for assessing
the prediction capabilities of the models (using the indices and plots
recommended by Jones et al., 2010), and assessing the models' behavior
by plotting the prediction of the response variable as a function of the
predictor variable(s).
Fig. 2. Statistical properties of the estimated parameters for the forest mortality model (Eq. (1)), given different scenarios of zeros in the data. (a) Bias, (b) standard error, and (c) root mean squared error are shown as a percentage of the real parameter value.
Table 1
Prediction indexes of the logistic regression model depending upon unbalanced data
scenarios. Each value is the empirical expected value of the respective index.
                     Proportion of zeros in the sample
Index                10%     30%     50%     70%     90%
Commission
  Error (%)          0.41    4.44   13.09   25.12    9.99
  Accuracy (%)      89.59   65.56   36.91    4.88    0.01
Omission
  Error (%)          9.54   21.33   21.52    4.54    0.01
  Accuracy (%)       0.46    8.67   28.48   65.46   89.99
Sensitivity (%)     99.54   93.65   73.81   16.28    0.02
Specificity (%)      4.58   28.91   56.95   93.52   99.91
Regarding the interpretation of predicted outcomes, we recommend not
extrapolating the model results into areas where predictor variables
were not measured. This implies that the presence/absence of a given
organism could be altered within certain ranges. When
extrapolation is indeed necessary, researchers should differentiate their
predictions for those areas where no data were collected using, for
instance, color-coded results to distinguish them from predictions
made by the model using real data. This will be especially
advantageous for modelling any ecological phenomenon that is a function
of spatially recorded predictor variable(s).
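One simple way to keep extrapolated predictions apart is to flag each prediction according to whether its predictor value falls inside the range observed in the fitting data. The following is a minimal sketch for a single predictor; the function name, the fitting values, and the coefficients are hypothetical and serve only as an illustration. The flag could then drive the color coding of a map or plot.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_with_flag(x_new, x_fit, b0, b1):
    """Return (x, probability, flag) for each new predictor value.

    flag is "within range" when x lies inside the range of the predictor
    observed in the fitting data, and "extrapolated" otherwise.
    """
    lo, hi = min(x_fit), max(x_fit)
    results = []
    for xi in x_new:
        flag = "within range" if lo <= xi <= hi else "extrapolated"
        results.append((xi, sigmoid(b0 + b1 * xi), flag))
    return results


# Hypothetical fitting data and coefficients, for illustration only.
x_fit = [2.0, 3.5, 7.0]
b0, b1 = -2.0, 0.5
flagged = predict_with_flag([1.0, 5.0, 8.0], x_fit, b0, b1)
for xi, p, flag in flagged:
    print(f"x={xi:4.1f}  p={p:.3f}  {flag}")
```

With several predictors, a per-variable range check like this one is only a rough screen, since a combination of values can be unobserved even when each value lies within its own range.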
6. Concluding remarks
The proportion of zeros and ones in a dataset affects the statistical
inference and prediction capabilities of a fitted logistic regression
model. Unbalanced data affect not only the accuracy of the estimated
parameters but also their precision. More importantly, the
statistical inference (e.g., hypothesis testing) is influenced by the
proportion of zeros and ones in the data. In addition, the prediction
capabilities of the fitted logistic regression model are affected as well,
so model performance depends greatly on the
proportion of 0/1 data. Overall, the 0/1 proportion might affect the
conclusions that can arise from the fitted model and its further application.
Since unbalanced data are fairly common in ecology, this can have
great implications for model building across the many ecological phenomena
modelled by scientists.
Acknowledgements
This study was supported by the Chilean research grant Fondecyt
No. 1151495. AFR is supported by a Postdoctoral Scholarship from
Vicerrectoría de Investigación y Postgrado, Universidad de La Frontera,
Temuco, Chile.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the
online version, at http://dx.doi.org/10.1016/j.ecolind.2017.10.030.
References
Affleck, D.L.R., 2006. Poisson mixture models for regression analysis of stand level
mortality. Can. J. For. Res. 36, 2994–3006.
Alberini, A., 1995. Testing willingness-to-pay models of discrete-choice contingent
valuation survey data. Land Econ. 71 (1), 83–95.
Arana, J.E., Leon, C.J., 2005. Flexible mixture distribution modeling of dichotomous
choice contingent valuation with heterogeneity. J. Environ. Econ. Manage. 50 (1),
170–188.
Bastin, L., Thomas, C.D., 1999. The distribution of plant species in urban vegetation
fragments. Landsc. Ecol. 14 (5), 493–507.
Bell, C.D., Roberts, R.K., English, B.C., Park, W.M., 1994. A logit analysis of participation
in Tennessee's forest stewardship program. J. Agric. Appl. Econ. 26 (2), 463–472.
Bigler, C., Kulakowski, D., Veblen, T.T., 2005. Multiple disturbance interactions and
drought influence fire severity in Rocky mountain subalpine forests. Ecology 86 (11),
3018–3029.
Bradstock, R.A., Hammill, K.A., Collins, L., Price, O., 2010. Effects of weather, fuel and
terrain on fire severity in topographically diverse landscapes of south-eastern
Australia. Landsc. Ecol. 25 (4), 607–619.
Brook, B.W., Bowman, D.M.J., 2006. Postcards from the past: charting the
landscape-scale conversion of tropical Australian savanna to closed forest during the 20th
century. Landsc. Ecol. 21, 1253–1266.
Cailleret, M., Bigler, C., Bugmann, H., Camarero, J.J., Cufar, K., Davi, H., Meszaros, I.,
Minunno, F., Peltoniemi, M., Robert, E.M.R., Suarez, M.L., Tognetti, R.,
Martinez-Vilalta, J., 2016. Towards a common methodology for developing logistic tree
mortality models based on ring-width data. Ecol. Appl. 26 (6), 1827–1841.
Chao, K.-J., Phillips, O.L., Monteagudo, A., Torres-Lezama, A., Vásquez, R., 2009. How do
trees die? Mode of death in northern Amazonia. J. Veg. Sci. 20, 260–268.
Davies, S.J., 2001. Tree mortality and growth in 11 sympatric Macaranga species in
Borneo. Ecology 82 (4), 920–932.
Dickson, B.G., Prather, J.W., Xu, Y.G., Hampton, H.M., Aumack, E.N., Sisk, T.D., 2006.
Mapping the probability of large fire occurrence in Northern Arizona. Landsc. Ecol.
21 (2), 747–761.
Eastman, J.R., 2006. Idrisi 15 andes, guide to GIS and Image Processing. Clark University,
Worcester, MA, USA.
Echeverria, C., Coomes, D.A., Newton, M.H.A.C., 2008. Spatially explicit models to
analyze forest loss and fragmentation between 1976 and 2020 in southern Chile.
Ecol. Model. 212, 439–449.
Firth, D., 1993. Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.
Gregoire, T.G., Salas, C., 2009. Ratio estimation with measurement error in the auxiliary
variate. Biometrics 65 (2), 590–598.
Gregoire, T.G., Schabenberger, O., 1999. Sampling-skewed biological populations:
behavior of confidence intervals for the population total. Ecology 80 (3), 1056–1065.
Hastie, T., Fithian, W., 2013. Inference from presence-only data; the ongoing controversy.
Ecography 36, 864–867.
Hu, X., Wu, C., Hong, W., Qiu, R., Li, J., Hong, T., 2014. Forest cover change and its
drivers in the upstream area of the Minjiang River, China. Ecol. Indic. 46, 121–128.
Jones, C.C., Acker, S.A., Halpern, C.B., 2010. Combining local- and large-scale models to
predict the distributions of invasive plant species. Ecol. Appl. 20 (2), 311–326.
King, G., Zeng, L., 2001. Logistic regression in rare events data. Polit. Anal. 9 (2),
137–163.
Komori, O., Eguchi, S., Ikeda, S., Okamura, H., Ichinokawa, M., Nakayama, S., 2016. An
asymmetric logistic regression model for ecological data. Methods Ecol. Evol. 7,
249–260.
Kumar, R., Nandy, S., Agarwal, R., Kushwaha, S.P.S., 2014. Forest cover dynamics
analysis and prediction modeling using logistic regression model. Ecol. Indic. 45,
444–455.
Lajeunesse, M.J., 2015. Bias and correction for the log response ratio in ecological
meta-analysis. Ecology 96 (8), 2056–2063.
Lander, T.A., Bebber, D.P., Choy, C.T., Harris, S.A., Boshier, D.H., 2011. The circe
principle explains how resource-rich land can waylay pollinators in fragmented
landscapes. Curr. Biol. 21, 1302–1307.
Leyk, S., Zimmermann, N.E., 2007. Improving land change detection based on uncertain
survey maps using fuzzy sets. Landsc. Ecol. 22 (2), 257–272.
Lindsey, J.K., 1997. Applying Generalized Linear Models. Springer, New York, USA, pp.
256.
Lloret, F., Calvo, E., Pons, X., Diaz-Delgado, R., 2002. Wildfires and landscape patterns in
the eastern Iberian peninsula. Landsc. Ecol. 17 (8), 745–759.
Maddala, G.S., 1992. Introduction to Econometrics, 2nd ed. Macmillan Publishing
Company, New York, NY, USA, pp. 631.
Maggini, R., Lehmann, A., Zimmermann, N.E., Guisan, A., 2006. Improving generalized
regression analysis for the spatial prediction of forest communities. J. Biogeogr. 33
(10), 1729–1749.
Manski, C.F., Lerman, S.R., 1977. The estimation of choice probabilities from choice
based samples. Econometrica 45 (8), 1977–1988.
McPherson, J.M., Jetz, W., Rogers, D.J., 2004. The effects of species range sizes on the
accuracy of distribution models: ecological phenomenon or statistical artefact? J.
Appl. Ecol. 41 (5), 811–823.
Mermoz, M., Kitzberger, T., Veblen, T.T., 2005. Landscape influences on occurrence and
spread of wildfires in Patagonian forests and shrublands. Ecology 86 (10),
2705–2715.
Palma, C., Cui, W., Martell, D., Robak, D., Weintraub, A., 2007. Assessing the impact of
stand-level harvests on the flammability of forest landscapes. Int. J. Wildl. Fire 16 (5),
584–592.
Phillips, S.J., Elith, J., 2013. On estimating probability of presence from use-availability
or presence-background data. Ecology 94 (6), 1409–1419.
Qi, Y., Wu, J., 1996. Effects of changing spatial resolution on the results of landscape
pattern analysis using spatial autocorrelation indices. Landsc. Ecol. 11 (1), 39–49.
R Development Core Team, 2016. R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria,
http://www.R-project.org.
Rueda, X., 2010. Understanding deforestation in the southern Yucatán: insights from a
sub-regional, multi-temporal analysis. Reg. Environ. Change 10 (3), 175–189.
Ruffault, J., Martin-StPaul, N.K., Duffet, C., Goge, F., Mouillot, F., 2014. Projecting future
drought in mediterranean forests: bias correction of climate models matters! Theor.
Appl. Climatol. 117 (1–2), 113–122.
Salas, C., Gregoire, T.G., 2010. Statistical analysis of ratio estimators and their estimators
of variances when the auxiliary variate is measured with error. Eur. J. For. Res. 129
(5), 847–861.
Schabenberger, O., Pierce, F.J., 2002. Contemporary Statistical Models for the Plant and
Soil Sciences. CRC Press, Boca Raton, FL, USA, pp. 738.
Schaefer, R.L., 1983. Bias correction in maximum likelihood logistic regression model.
Stat. Med. 2, 71–78.
Schulz, J.J., Cayuela, L., Rey-Benayas, J.M., Schröder, B., 2011. Factors influencing
vegetation cover change in Mediterranean Central Chile (1975–2008). Appl. Veg. Sci.
14, 571–582.
Scott, A.J., Wild, C.J., 1986. Fitting logistic models under case–control or choice based
sampling. J. R. Stat. Soc. B 78 (2), 170–182.
Seto, K.C., Kaufmann, R.F., 2005. Using logit models to classify land cover and land-cover
change from Landsat Thematic Mapper. Int. J. Rem. Sens. 25 (3), 563–577.
Vega-García, C., Chuvieco, E., 2006. Applying local measures of spatial heterogeneity to
Landsat-TM images for predicting wildfire occurrence in Mediterranean landscapes.
Landsc. Ecol. 21, 596–605.
Vega-García, C., Woodard, P., Titus, S., Adamowicz, W., Lee, B., 1995. A logit model for
predicting the daily occurrence of human caused forest fires. Int. J. Wildl. Fire 5 (2),
101–111.
Vega-García, C., Woodard, P.M., Lee, B.S., Adamowicz, W.L., Titus, S.J., 1999. Dos
modelos para la predicción de incendios forestales en Whitecourt Forest, Canadá.
Investigación Agraria: Sistemas y Recursos Forestales 8 (1), 5–24.
Warton, D.I., Hui, F.K.C., 2011. The arcsine is asinine: the analysis of proportions in
ecology. Ecology 92 (1), 3–10.
Wilson, K., Newton, A., Echeverría, C., Weston, C., Burgman, M., 2005. A vulnerability
analysis of the temperate forests of south central Chile. Biol. Conserv. 122, 9–21.
Wu, J., Gao, W., Tueller, P.T., 1997. Effects of changing spatial scale on the results of
statistical analysis with landscape data: a case study. Geogr. Inf. Sci. 3 (1–2), 30–41.
Wunder, J., Reineking, B., Bigler, C., Bugmann, H., 2008. Predicting tree mortality from
growth data: how virtual ecologists can help real ecologists. J. Ecol. 96 (1), 174–187.
Xie, Y., Manski, C.F., 1989. The logit model and response-based samples. Sociol. Methods
Res. 17 (3), 283–302.
Young, D.J.N., Stevens, J.T., Earles, J.M., Moore, J., Ellis, A., Jirka, A.L., Latimer, A.M.,
2017. Long-term climate and competition explain forest mortality patterns under
extreme drought. Ecol. Lett. 20 (1), 78–86.
Zuur, A.F., Ieno, E.N., Elphick, C.S., 2010. A protocol for data exploration to avoid
common statistical problems. Methods Ecol. Evol. 1 (1), 3–14.
... Timber surface fuel models (T8 [n = 257], T9 [n = 329], T10 [n = 451]) and non-burnable fuel models (T99 [n = 111]) had limited observations and were not analyzed further. Improving the balance of success and failure reduces representation bias, which improves estimator consistency and model performance as assessed by the area under the receiver operating curve (AUC) (Zou et al., 2007;Salas-Eljatib et al., 2018). Using natural breaks in the distribution of success rates we defined a wildfire-fuel type combination as being adequately balanced if the success rate was between 20 % and 80 %. ...
... ( Table 3). The AUC metric assesses how well model estimates are ranked across a range of classification thresholds (Salas-Eljatib et al., 2018;Zou et al., 2007). Across both surface fuel types, AUC showed modest variation among the three models, with the largest improvements resulting from the removal of unbalanced samples in the grass system and the removal of docile failures from the shrub system. ...
Article
Pre-fire mitigation efforts that include the installation and maintenance of fuel breaks are integral to wildfire suppression in Southern California. Fuel breaks alter fire behavior and assist in fire suppression at strategic locations on the landscape. However, the combined effectiveness of fuel breaks and wildfire suppression is not well studied. Using daily firefighting personnel to proxy the quantity and diversity of potential fire suppression operations (i.e., operational complexity), we examined 15 wildfires from 2017 to 2020 in the Los Padres, Angeles, San Bernardino, and Cleveland National Forests to assess how weather and site-specific fuel break characteristics influenced wildfire containment when leveraged during suppression operations. After removing effects of fuel treatments, wildfire and aerial firefighting, we estimated that expanding fuel break width in grass-dominant systems from 10 to 100 m increased the average success rate against a heading fire from 31 % to 41 %. Likewise, recently cleared fuel breaks had higher success rates compared to poorly maintained fuel breaks in both grass (25 % to 45 %) and shrub systems (20 % to 45 %). Combined, grass and shrub systems exhibited an estimated success rate of 80 % under mild weather conditions (20th percentile) and 19 % under severe weather (80th percentile). Other significant determinants included forb and grass production, adjacent tree canopy cover and terrain. Consistent with complexity theory and previous suppression effectiveness research, our analysis showed signs of suppression effectiveness declining as firefighter personnel increased. Future work could better account for the role of suppression with improved data on firefighting resource types, actions, locations, and timing. https://authors.elsevier.com/a/1jtGp4y2D1kEi5
... This approach is particularly advantageous when the goal is to statistically relate deforestation occurrence to various predictor variables or covariates. Logistic regression's popularity in deforestation research stems from its robustness and ease of interpretation [79]. One of its main advantages is the ability to provide clear probabilities of deforestation based on the predictors, which aids in understanding the influence of different factors. ...
... While logistic regression offers simplicity and interpretability, it may not capture complex, non-linear relationships as effectively as some machine learning models. Nonetheless, its frequent use and proven reliability make it a standard choice for many deforestation studies [79]. ...
Article
Full-text available
The objective of this paper is to contribute to the understanding of forest cover loss patterns and the protection role of Indigenous peoples in the forests of Araucanía, Chile. Previous research indicated lower rates of forest cover loss in land managed by Indigenous peoples; however, this was primarily focused on tropical forests. This paper focuses on the temperate forests in the region of Araucanía and hypothesizes that there will be a similar trend, with lower rates of deforestation in areas owned by Indigenous peoples. A logistic regression model was used which included multiple underlying drivers that have shown to impact deforestation rates. The results of this study corroborated the hypothesis that lands owned by Indigenous peoples have lower rates of deforestation, and that protection status, agricultural function, and railway proximity have a strong influence on forest clearing, while slope, elevation, and proximity to urban areas demonstrated a minimal impact.
... Logistic regression analysis is the most frequently used modelling approach for analysing binary response variables however when dealing with binary data in ecology or CPP, it happens that the number of observations with ones (Y = 1) is much smaller than the number of observations with zeros (Y = 0) and both unbalanced as well as balanced data has been used in the literature for fitting logistic regression models (e.g. Salas-Eljatib et al., 2018;Young et al., 2017;Jones et al., 2010). Instead of balancing data by some rule (which may be debatable), we decided to weight data and perform weighted logistic regression, assigning different weights to each class based on their prevalence in the dataset. ...
Article
Full-text available
This paper investigates the relationship between the financial situation of local governments (LGs) and the awarding of circular public procurements (CPP) within the circular economy. The investigation was based on a population research sample of 200 LGs representative for all Polish LGs (district (powiat), city on district rights, urban, urban-rural, and rural municipality (gmina)) selected in stratified random sampling. The empirical research was conducted using CATI research. Logistic regression analysis was used to predict the binary outcome of awarding CPP (dependent variable) or not awarding CPP by analysing the relationship with the set of defined financial indicators (independent variables). The conducted study revealed that the higher the level of LG expenditure, the more eager LGs are to award CPP. Surprisingly, LGs with deficits, lack of operating surplus ratio or low level of financial independence were as active in awarding CPP as LGs with no deficit, high level of operating surplus ratio or high level of financial independence.
... Ordinal Logistic Regression (OLR) is well-suited for this type of data, as it models ordinal dependent variables effectively, even when categories are imbalanced. Research by Salas-Eljatib et al. (2018) supports the argument that adjusting for balance may alter statistical properties and distort meaningful insights, particularly in cases like this where imbalances reflect real-world distributions. Additionally, OLR has been shown to remain robust in handling imbalanced data (Fullerton, 2009), making it an appropriate choice for analyzing skewed data without sacrificing interpretability. ...
... LR's lowest increase is likely due to the nature and assumptions of the LR algorithm itself. LR classifier performance can be impacted by the presence of imbalanced data(Salas-Eljatib et al., 2018). ...
Preprint
Full-text available
In this paper, a novel classification algorithm that is based on Data Importance (DI) reformatting and Genetic Algorithms (GA) named GADIC is proposed to overcome the issues related to the nature of data which may hinder the performance of the Machine Learning (ML) classifiers. GADIC comprises three phases which are data reformatting phase which depends on DI concept, training phase where GA is applied on the reformatted training dataset, and testing phase where the instances of the reformatted testing dataset are being averaged based on similar instances in the training dataset. GADIC is an approach that utilizes the exiting ML classifiers with involvement of data reformatting, using GA to tune the inputs, and averaging the similar instances to the unknown instance. The averaging of the instances becomes the unknown instance to be classified in the stage of testing. GADIC has been tested on five existing ML classifiers which are Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Logistic Regression (LR), Decision Tree (DT), and Na\"ive Bayes (NB). All were evaluated using seven open-source UCI ML repository and Kaggle datasets which are Cleveland heart disease, Indian liver patient, Pima Indian diabetes, employee future prediction, telecom churn prediction, bank customer churn, and tech students. In terms of accuracy, the results showed that, with the exception of approximately 1% decrease in the accuracy of NB classifier in Cleveland heart disease dataset, GADIC significantly enhanced the performance of most ML classifiers using various datasets. In addition, KNN with GADIC showed the greatest performance gain when compared with other ML classifiers with GADIC followed by SVM while LR had the lowest improvement. The lowest average improvement that GADIC could achieve is 5.96%, whereas the maximum average improvement reached 16.79%.
... Unbalanced data may potentially affect SDM performance, and recently, correcting for the class imbalance (difference between numbers of presence and absence observations) has been suggested to improve model performance of machine learning methods (Benkendorf et al., 2023) and logistic regression (Salas-Eljatib et al., 2018). In our study, improvements on model calibration (Tjur's R 2 ) were observed when evaluated with 30 % marine data (interpolation), but the discrimination ability of the model (AUC) was unaffected. ...
Article
Full-text available
Species Distribution Models (SDMs) are frequently applied in ecological research, but geographic transferability of SDMs holds major uncertainties. Here, we assess the cross-realm (sea to lake) geographic transferability of four SDM methods: Generalized Linear Models (GLMs), Generalized Additive Models (GAMs), Boosted Regression Trees (BRTs), and Bayesian Additive Regression Trees (BARTs) predicting occurrences of freshwater macrophytes from brackish water sea area (Bothnian Bay) to a freshwater lake environment in Finland. We found that the SDM method applied did not affect model transferability, and majority of the variation in transferability performance was associated with species. For most species model transferability was low, but reasonably good on one third of the species modelled, which had similar prevalences in both marine and freshwater data. These were emergent species or species growing close to shoreline, which presumably share similar environmental niche in terms of growing depth and water turbidity between the two environments. Generally, models which had high interpolation performance, also had higher transferability, but this relationship was not dependent on the SDM method applied. Our results suggest that species prevalence and species-specific characteristics, such as growth form, life history traits and ecological niche, are main contributors to geographic transferability of SDMs.
... Logistic regression emerges as an effective approach for addressing data imbalance, facilitating the modeling of relationships between the dependent variable and multiple classes [11]. In contrast to traditional logistic regression, multiclass logistic regression can predict probabilities across diverse classes [12][13][14], with the objective being to estimate parameters that minimize prediction errors and furnish accurate predictions for each class [15,16]. This research is geared towards tackling the challenges presented by imbalanced multiclass medical data by harnessing logistic regression. ...
Article
Full-text available
Background and Aims One‐third of all child deaths in this country are caused by diarrhea. The burden of the disease appears to be increasing in recent years in Bangladesh. This study aimed to analyze the prevalence of diarrhea and identify the factors contributing to diarrheal diseases among children aged 0–5 years in Bangladesh from 2006 to 2019, to understand the recent increase in this serious health issue. Methods In this study, using the data from the Multiple Indicator Cluster Survey (MICS), a total of 31,566, 23,402, and 24,686 children under five were included from, 2006, 2012, and 2019, respectively. Logistic regressions were applied to analyze the changes in factors influencing childhood diarrhea. Results The results revealed a decline in diarrhea prevalence from MICS 2006 (7.1%) to MICS 2012 (3.9%). However, there was a sharp increase to 6.9% in MICS 2019. Notably, children aged 12–23 months exhibited consistently 2.22 times (adjusted odds ratio (AOR) = 2.22, 95% confidence interval (CI: 1.86–2.65), 5.24 times (CI: 2.51–10.95) and 3.36 times (CI: 2.67–4.22) higher likelihood of experiencing diarrhea compared to the older age group (48–59 months) in MICS 2006, 2012 and 2019, respectively. The mother's educational background also played a role, in MICS 2006, 2012, and 2019, children whose mothers had no or incomplete primary education had 1.48 (CI: 1.18–1.86), 1.07 (CI: 0.76–1.50), and 1.34 (CI: 1.06–1.69) times higher chances of diarrhea compared to children of mothers with secondary complete or higher education. Conclusion Underweight status, geographical division, household wealth status, and unimproved and shared toilet facilities emerged as contributing factors of diarrhea among children aged 0–5 years. The findings underscore the importance of child nutrition, basic hygiene practices, and special care during the rainy season to mitigate the under‐five mortality rate associated with diarrhea.
Article
Harnessing green energy from sources like the sun is important to meet increasing global energy demands while reducing dependence on fossil fuel and mitigating climate change. However, the potential negative effects of green energy, especially concerning local biodiversity, are frequently overlooked. We use a case study from Trans-Himalayan India to discuss how green energy development, in our case proposed large-scale solar parks containing 13 solar sites, can be reconciled with wildlife conservation. We use detection-non-detection data for snow leopards with bioclimatic covariates using single-season single-species occupancy model to build a habitat suitability model. We prioritise development scenarios, to ensure the aim of snow leopard conservation, by operationalizing the step-wise (avoid>minimise>remediate>offset) Mitigation Hierarchy (MH). All of the 13 proposed solar plant sites fall within areas of high snow leopard suitability (>0.5). Applying the sequential MH framework, to “avoid” any impact would require either complete halt of all construction of the solar parks, or identifying alternative sites where the suitability for snow leopards is lower. However, collaborative planning is needed to fully implement this framework such that both objectives - solar energy generation and snow leopard conservation - can be optimally decided. We acknowledge that such decisions need integration of local people's perspective, which needs further elucidation. We advocate for a nuanced, data-driven, approach to reconcile conservation aims with development.
Article
Guardianship is a multidimensional concept that influences the risk of victimization. However, the influence of the household—central to people’s routines—has been overlooked. Using data from the NCVS (2017–2022), this study employs a multilevel analysis of household guardianship to investigate residents’ risk of violent victimization and domestic violence, controlling for individual-level factors. Living in single-unit households, marriage, and owning a home reduce the risk of violent victimization. Conversely, owning a business and residing in urban areas increase this risk. For domestic violence, living in single-unit households increases the risk, whereas gated communities lower the risk. Implications for crime prevention and future directions on guardianship are discussed.
Article
Full-text available
Tree mortality is a key process shaping forest dynamics. Thus, there is a growing need for indicators of the likelihood of tree death. During the last decades, an increasing number of tree-ring based studies have aimed to derive growth–mortality functions, mostly using logistic models. The results of these studies, however, are difficult to compare and synthesize due to the diversity of approaches used for the sampling strategy (number and characteristics of alive and death observations), the type of explanatory growth variables included (level, trend, etc.), and the length of the time window (number of years preceding the alive/death observation) that maximized the discrimination ability of each growth variable. We assess the implications of key methodological decisions when developing tree-ring based growth–mortality relationships using logistic mixed-effects regression models. As examples, we use published tree-ring datasets from Abies alba (13 different sites), Nothofagus dombeyi (one site), and Quercus petraea (one site). Our approach is based on a constant sampling size and aims at (1) assessing the dependency of growth–mortality relationships on the statistical sampling scheme used, (2) determining the type of explanatory growth variables that should be considered, and (3) identifying the best length of the time window used to calculate them. The performance of tree-ring-based mortality models was reasonably high for all three species (area under the receiving operator characteristics curve, AUC > 0.7). Growth level variables were the most important predictors of mortality probability for two species (A. alba, N. dombeyi), while growth-trend variables need to be considered for Q. petraea. In addition, the length of the time window used to calculate each growth variable was highly uncertain and depended on the sampling scheme, as some growth–mortality relationships varied with tree age. 
The present study accounts for the main sampling-related biases to determine reliable species-specific growth–mortality relationships. Our results highlight the importance of using a sampling strategy that is consistent with the research question. Moving towards a common methodology for developing reliable growth–mortality relationships is an important step towards improving our understanding of tree mortality across species and its representation in dynamic vegetation models.
Article
Full-text available
Ecologists widely use the log response ratio for summarizing the outcomes of studies for meta-analysis. However, little is known about the sampling distribution of this effect size estimator. Here I show with a Monte Carlo simulation that the log response ratio is biased when quantifying the outcome of studies with small sample sizes, and can yield erroneous variance estimates when the scale of study parameters are near zero. Given these challenges, I derive and compare two new estimators that help correct this small-sample bias, and update guidelines and diagnostics for assessing when the response ratio is appropriate for ecological meta-analysis. These new bias-corrected estimators retain much of the original utility of the response ratio and are aimed to improve the quality and reliability of inferences with effect sizes based on the log ratio of two means.
Article
Full-text available
Global and regional climate models (GCM and RCM) are generally biased and cannot be used as forcing variables in ecological impact models without some form of prior bias correction. In this study, we investigated the influ-ence of the bias correction method on drought projections in Mediterranean forests in southern France for the end of the twenty-first century (2071–2100). We used a water balance model with two different atmospheric climate forcings built from the same RCM simulations but using two different correction methods (quantile mapping or anomaly method). Drought, defined here as periods when vegetation functioning is affected by water deficit, was described in terms of intensity, duration and timing. Our results showed that the choice of the bias correction method had little effects on temperature and global radiation projections. However, although both methods led to similar predictions of precipitation amount, they in-duced strong differences in their temporal distribution, espe-cially during summer. These differences were amplified when the climatic data were used to force the water balance model. On average, the choice of bias correction leads to 45 % uncertainty in the predicted anomalies in drought intensity along with discrepancies in the spatial pattern of the predicted changes and changes in the year-to-year variability in drought characteristics. We conclude that the choice of a bias correc-tion method might have a significant impact on the projections of forest response to climate change.
Article
Rising temperatures are amplifying drought-induced stress and mortality in forests globally. It remains uncertain, however, whether tree mortality across drought-stricken landscapes will be concentrated in particular climatic and competitive environments. We investigated the effects of long-term average climate [i.e. 35-year mean annual climatic water deficit (CWD)] and competition (i.e. tree basal area) on tree mortality patterns, using extensive aerial mortality surveys conducted throughout the forests of California during a 4-year statewide extreme drought lasting from 2012 to 2015. During this period, tree mortality increased by an order of magnitude, typically from tens to hundreds of dead trees per km 2 , rising dramatically during the fourth year of drought. Mortality rates increased independently with average CWD and with basal area, and they increased disproportionately in areas that were both dry and dense. These results can assist forest managers and policy-makers in identifying the most drought-vulnerable forests across broad geographic areas.
Article
Binary data are common in ecological and environmental studies; however, due to various uncertainties and complexities present in data sets, the standard generalized linear model with a binomial error distribution often demonstrates insufficient predictive performance when analysing binary and proportional data. To address this difficulty, we propose an asymmetric logistic regression model that uses a new parameter to account for data complexity. We observe that this parameter controls the model's asymmetry and is important for adjusting the weights associated with observed data in order to improve model fitting. This model includes the ordinary logistic regression model as a special case. It is easily implemented using a slight modification of glm or glmer in the statistical software R. Simulation studies suggest that our new approach outperforms a traditional approach in terms of both predictive accuracy and variable selection. In a case study involving fisheries data, we found that the annual catch amount had a greater impact on stock status prediction, and improved predictive capability was supported by a smaller AIC compared to a generalized linear model. In summary, our method can enhance the applicability of a generalized linear model to various ecological problems using a slight modification, and significantly improves model fitting and model selection.
Article
It is shown how, in regular parametric problems, the first-order term is removed from the asymptotic bias of maximum likelihood estimates by a suitable modification of the score function. In exponential families with canonical parameterization the effect is to penalize the likelihood by the Jeffreys invariant prior. In binomial logistic models, Poisson log linear models and certain other generalized linear models, the Jeffreys prior penalty function can be imposed in standard regression software using a scheme of iterative adjustments to the data.
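The bias-reduction idea in the abstract above can be illustrated for the binomial logistic case. What follows is a minimal sketch, not the authors' software: it assumes the standard form of the modified score for logistic regression, U*(β) = Σ xᵢ (yᵢ − πᵢ + hᵢ(1/2 − πᵢ)), where hᵢ are the diagonal elements of the weighted hat matrix, and it solves U*(β) = 0 by Fisher scoring. The function name `firth_logistic`, its parameters, and the four-point toy data set are all hypothetical, chosen only for illustration. The toy data are perfectly separated, a situation in which ordinary maximum likelihood estimates diverge while the penalized estimates remain finite.

```python
import numpy as np


def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Jeffreys-prior penalized logistic regression (illustrative sketch).

    X is an (n, p) design matrix that already includes an intercept column;
    y is a length-n vector of 0/1 responses. Names are hypothetical.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = pi * (1.0 - pi)                      # binomial variance weights
        info = X.T @ (X * w[:, None])            # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Modified score: the usual score plus the Jeffreys-prior adjustment
        score = X.T @ (y - pi + h * (0.5 - pi))
        step = info_inv @ score                  # Fisher scoring update
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy, perfectly separated data (invented for illustration only).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = firth_logistic(X, y)
print(beta)
```

The adjustment term hᵢ(1/2 − πᵢ) is what the abstract describes as an iterative adjustment to the data: at each iteration the responses are shifted slightly toward 1/2 in proportion to their leverage. This sketch omits step-halving and other safeguards that a production implementation would likely need.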
Article
Presence-only data abounds in ecology, often accompanied by a background sample. Although many interesting aspects of the species’ distribution can be learned from such data, one cannot learn the overall species occurrence probability, or prevalence, without making unjustified simplifying assumptions. In this forum article we question the approach of Royle et al. (2012) that claims to be able to do this.