1st Mae Fah Luang University International Conference 2012
SPECIES DISTRIBUTION MODELS OF WILD TEA IN THE UPPER NORTH OF
THAILAND
S. Nanglae* and R. Nilthong
School of Science, Mae Fah Luang University, 333 Moo 1, Tambon Tasud, Muang District,
Chiang Rai, 57100, Thailand
*e-mail: Snanglae@gmail.com
Abstract
This study aims to identify the factors that affect the distribution of wild tea in the upper
north of Thailand and to build mathematical models that predict that distribution. We obtained
wild tea data from the Tea Institute at Mae Fah Luang University. We used three climatic
factors (rainfall, humidity and temperature) and five geographic factors (soil, slope, digital
elevation model (DEM), distance from the main river and aspect) to build species distribution
models with generalized linear models (GLMs). We generated pseudo-absence points by random
selection outside circles of different radii around the presence points and used the area under
the curve (AUC) to evaluate the models statistically. The results showed that the radius size
had a major effect on the predicted distribution area.
Keywords: Species Distribution Models (SDMs), Generalized Linear Models (GLMs), Area
under the Curve (AUC).
Introduction
Wild tea is Assam tea (Camellia sinensis var. assamica). It originated in India and is naturally
distributed in highland forests. In Thailand, wild tea cultivation fields, known as “miang
gardens”, have been maintained in the forests and are recognized as a system of agroforestry.
Pornchai Preechapanya [1] reported that miang gardens support considerable biodiversity,
especially plants that are useful for health. Moreover, miang gardens act as a buffer zone that
protects the forest from encroachment and clearing for agriculture.
This study aims to identify the factors that affect the distribution of wild tea in the upper
north of Thailand and to build mathematical models known as “species distribution models
(SDMs)”. SDMs are an important tool for conservation and evolutionary studies. They describe
the relationship between species ranges and environmental parameters. SDMs fall into two
groups. The first group, methods that require presence-only data, is called “profile
techniques”; examples are the bioclimatic analysis and prediction system (BIOCLIM), a flexible
modeling procedure for mapping potential distributions of plants and animals (DOMAIN) and
ecological niche factor analysis (ENFA). The second group, methods that require both presence
and absence data, is called “group discrimination techniques”; examples are generalized linear
models (GLMs) and generalized additive models (GAMs). Both groups are based on statistical
models. Elith et al. [2] compared various SDMs and found that group discrimination techniques
tend to perform better than profile techniques. Thus, group discrimination techniques are
increasingly used.
The main problem with group discrimination techniques is that absence data are often
unavailable. Many studies have therefore used pseudo-absence data in place of real absence
data [3]. There are two main approaches to generating pseudo-absence data. The first selects
pseudo-absence points at random from the background across the whole study area [4-5]. The
second selects pseudo-absence points in two steps: first, a profile technique estimates the
suitable area; second, pseudo-absence points are selected outside that area [3, 5].
In this study, we chose a group discrimination technique, generalized linear models (GLMs), to
build the SDMs. We assumed that the area inside a circle around each presence point is
suitable for wild tea. Pseudo-absence points were therefore selected at random outside the
circles for each radius (5 km, 10 km, 15 km, 20 km, 25 km, 30 km, 35 km, 40 km, 45 km and
50 km), and the resulting SDMs were compared across radii.
Methodology
Species and environmental data
We used 41 tea data points with three climatic factors (rainfall, humidity and temperature)
and five geographic factors (soil series, slope, digital elevation model (DEM), distance from
the main river and aspect). The tea data were obtained from the Tea Institute at Mae Fah Luang
University. Rainfall, humidity, temperature and river data were obtained from Remote Sensing
and GIS at the Asian Institute of Technology; the DEM was obtained from the Land Development
Department; and soil series data were obtained from the Center for Information Technology
Services at Mae Fah Luang University. We calculated slope and aspect from the DEM and
calculated distance from the main river from the river data. The study area was the Upper
North of Thailand, which covers eight provinces: Chiang Mai, Chiang Rai, Mae Hong Son, Nan,
Phayao, Phrae, Lampang and Lamphun. The tea data points and the study area are shown in
Figure 1.
Figure 1 Locations of the tea data points in the Upper North of Thailand
Figure 2 Generating pseudo-absence points. The center of each circle is a presence point. The
red points are random points. The green points are centers of grid cells that lie inside the
circle. The blue points are centers of grid cells that lie outside the circle and contain no
red point. The black points are centers of grid cells that lie outside the circle and contain
at least one red point; these are the pseudo-absence points.
Generating pseudo-absence points
We generated pseudo-absence points by random selection outside the circles for each radius
(5 km, 10 km, 15 km, 20 km, 25 km, 30 km, 35 km, 40 km, 45 km and 50 km) around the presence
points. As shown in Figure 2, we overlaid two maps using R. On the first map, we drew a circle
of the given radius around each presence point and then randomly selected 410 points across
the study area (red points). On the second map, we built a grid of 5 km × 5 km cells across
the study area and marked the center point of each cell. After overlaying the two maps, we
selected as pseudo-absence points the centers of grid cells that lay outside the circles and
contained at least one random point (black points). The blue points were also outside the
circles, but they were not selected because their grid cells contained no random points (red
points).
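The overlay procedure described above can be sketched in a few lines. The authors worked in R; the Python version below is only an illustration under stated assumptions: coordinates are planar and in kilometres, the study area is approximated by a rectangle, and the function name `pseudo_absences` and its parameters are hypothetical rather than taken from the study.

```python
import numpy as np

def pseudo_absences(presence_xy, radius_km, bounds, n_random=410, cell_km=5.0, seed=0):
    """Centres of grid cells that hold >= 1 random point and lie outside all circles.

    presence_xy : (k, 2) array of presence coordinates in km (assumed planar).
    bounds      : (xmin, ymin, xmax, ymax) of an assumed rectangular study area.
    """
    presence_xy = np.asarray(presence_xy, dtype=float)
    rng = np.random.default_rng(seed)
    xmin, ymin, xmax, ymax = bounds
    origin = np.array([xmin, ymin])

    # Step 1: random points over the study area (the "red points").
    pts = np.column_stack([rng.uniform(xmin, xmax, n_random),
                           rng.uniform(ymin, ymax, n_random)])

    # Step 2: find the grid cell of each random point and take each
    # occupied cell once, represented by its centre point.
    cell_index = np.floor((pts - origin) / cell_km)
    centres = np.unique(cell_index, axis=0) * cell_km + origin + cell_km / 2.0

    # Step 3: keep centres farther than radius_km from every presence
    # point (the "black points"); centres inside a circle are dropped.
    dist = np.linalg.norm(centres[:, None, :] - presence_xy[None, :, :], axis=2)
    return centres[dist.min(axis=1) > radius_km]
```

Calling the function once per radius (5 km to 50 km) with a different `seed` for each repetition would give one pseudo-absence set per run.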
Modeling experiment
Generalized linear models (GLMs) were introduced by Nelder and Wedderburn [10] in 1972.
In GLMs, a response from the exponential family whose mean is a non-linear function of the
predictors is transformed to a linear form through a link function. Estimation and inference
are based on the theory of maximum likelihood estimation. We used logistic regression, a type
of GLM for binary (presence/absence) data, to predict species distribution. Let $Y = 1$ or
$0$ denote the presence or absence of a species, respectively, and let
$p(x) = P(Y = 1 \mid X = x)$ be the probability that the species is present when $X = x$. The
resulting presence-absence response curve is

$$p(x) = \frac{e^{g(u)}}{1 + e^{g(u)}}$$

where $g(u)$ is the link function, which is linear in the predictors:

$$g(u) = x^{T}\beta = \sum_{j=1}^{p} \beta_j x_j$$
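As a short check on the response curve, solving it for $g(u)$ shows that the link function is the logit (log-odds) of the presence probability. The steps use only the symbols defined above:

$$p(x)\left(1 + e^{g(u)}\right) = e^{g(u)} \;\Longrightarrow\; p(x) = e^{g(u)}\left(1 - p(x)\right) \;\Longrightarrow\; e^{g(u)} = \frac{p(x)}{1 - p(x)}$$

$$g(u) = \ln\frac{p(x)}{1 - p(x)}$$

A fitted coefficient can therefore be read as the change in the log-odds of presence per unit change in the corresponding factor, with the other factors held fixed.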
Results
In this study, we generated 20 random pseudo-absence data sets for each radius and used the
area under the curve (AUC) to evaluate each model statistically. We fitted GLMs and selected
variables by the stepwise method. In R, the criterion used for stepwise variable selection was
the Akaike information criterion (AIC).
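For readers unfamiliar with the evaluation statistic, the AUC can be computed directly from predicted probabilities as the share of (presence, pseudo-absence) pairs in which the presence point receives the higher score, counting ties as one half. The sketch below shows that definition; it is not the authors' R code, and the function name `auc` is hypothetical.

```python
import numpy as np

def auc(scores_presence, scores_absence):
    """AUC as the probability that a presence outscores a pseudo-absence."""
    p = np.asarray(scores_presence, dtype=float)[:, None]  # column vector
    a = np.asarray(scores_absence, dtype=float)[None, :]   # row vector
    # Compare every presence score with every absence score;
    # a tie contributes 0.5 to the pair count.
    wins = (p > a) + 0.5 * (p == a)
    return float(wins.mean())
```

A value of 0.5 corresponds to a model that ranks presences and pseudo-absences no better than chance, and 1.0 to perfect separation.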
Figure 3 Box plot of AUC and mean plot of AUC at each radius
Figure 3 compares the models by AUC using box plots. The box plots show the median, the spread
and the outliers of AUC at each radius. The line inside each box indicates the median AUC. The
median AUC increased from 5 to 10 km, was fairly constant from 15 to 30 km and fluctuated from
35 to 50 km. Points beyond the ends of the vertical lines are outliers; these were excluded
from further analysis. The mean plot shows the mean AUC and its 95% confidence interval at
each radius. Most mean values were between about 0.85 and 0.90, a range considered to indicate
excellent discrimination. After removing the outliers at each radius, we fitted a cubic curve
to compare AUC across radii and to find a suitable radius for generating pseudo-absence data
sets.
Figure 4 Curve fitting between AUC and radius size for cubic models
Figure 4 shows the cubic curve fitted to AUC against radius. AUC was fairly constant between
15 km and 40 km. We therefore selected the smallest radii in this range, 15 km and 20 km,
which still gave excellent AUC, to search for a suitable model.
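A cubic fit of AUC against radius can be obtained by ordinary least squares on a degree-3 polynomial. The sketch below assumes that approach; the paper does not say which fitting routine was used, the helper name `fit_cubic` is hypothetical, and the AUC values in the usage lines are synthetic placeholders, not the study's results.

```python
import numpy as np

def fit_cubic(radius_km, auc_values):
    """Least-squares cubic AUC = a*r^3 + b*r^2 + c*r + d, returned as a callable."""
    coeffs = np.polyfit(radius_km, auc_values, deg=3)
    return np.poly1d(coeffs)

# Synthetic placeholder data, for illustration only.
radii = np.arange(5, 55, 5, dtype=float)
fake_auc = 0.70 + 0.012 * radii - 3.0e-4 * radii**2 + 2.0e-6 * radii**3
curve = fit_cubic(radii, fake_auc)
fitted_at_15_km = curve(15.0)
```

Plotting `curve` over the range of radii and looking for the flat stretch is one way to pick candidate radii, as the text does for 15 km to 40 km.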
Table 1 Statistics of AUC at radius sizes of 15 km and 20 km
Radius size                          15 km       20 km
Mean                                 0.864565    0.881416
Standard deviation                   0.048561    0.067682
Coefficient of variation (CV)        0.056168    0.076788
Number of runs (outliers removed)    19          20
For Table 1, we generated 20 random pseudo-absence data sets at each radius and excluded the
AUC outliers. The mean AUC values at the two radii were close. Although the mean at 20 km was
higher than at 15 km, the coefficient of variation (CV) at 20 km was also higher. The AUC at
20 km therefore varied more than the AUC at 15 km, so we could not conclude that the model at
radius 20 km was better than the model at radius 15 km.
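The coefficient of variation is the standard deviation divided by the mean, and the Table 1 values can be checked with that formula. The check below assumes that 0.048561 and 0.067682 are the standard deviations, an interpretation the arithmetic supports; the function name is hypothetical.

```python
def coefficient_of_variation(mean, sd):
    """CV = standard deviation / mean (dimensionless)."""
    return sd / mean

# Values as reported in Table 1.
cv_15 = coefficient_of_variation(0.864565, 0.048561)
cv_20 = coefficient_of_variation(0.881416, 0.067682)

# The larger CV at 20 km indicates more run-to-run variability in AUC.
more_variable_radius = "20 km" if cv_20 > cv_15 else "15 km"
```

Because the CV scales the spread by the mean, it allows the two radii to be compared even though their mean AUC values differ slightly.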
We then selected, for each radius, the model whose AUC was closest to the mean value as the
representative model, and compared the models and their predictive maps.
Table 2 GLMs with the stepwise method for radius sizes of 15 km and 20 km
             Radius size = 15 km                          Radius size = 20 km
Factors      Coefficient  Std. Error  z value  Sig.       Coefficient  Std. Error  z value  Sig.
DEM           0.003243    0.000723     4.487   *           0.004121    0.00084      4.893   *
Rainfall      0.008095    0.001839     4.402   *           0.008751    0.00198      4.421   *
Humidity     -0.531300    0.142400    -3.73    *          -0.477800    0.13860     -3.447   *
Distance      0.000069    0.000037     1.882   .           0.000115    0.00004      2.766   *
Aspect        -           -            -                  -0.016060    0.00635     -2.528   *
Intercept    26.270000                                    23.680000
AUC           0.8674242                                    0.887931
“ * ” is significant at 0.05, “ . ” is significant at 0.10 and “ - ” marks a factor not selected
Table 2 shows the factors that the stepwise method selected for each model. At 15 km, four
factors entered the model (DEM, rainfall, humidity and distance). DEM had the strongest effect
on the model, with the largest absolute z value, followed by rainfall, humidity and distance,
in that order. At 20 km, five factors entered the model (DEM, rainfall, humidity, distance and
aspect). DEM again had the strongest effect, followed by rainfall, humidity, distance and
aspect, in that order.
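To show how the fitted coefficients turn into a predicted probability, the sketch below plugs the 15 km coefficients from Table 2 into the logistic response curve. The site values in the usage lines are hypothetical, and the units of each factor are assumptions, since the paper does not state them; the function name is also illustrative.

```python
import math

# Coefficients of the 15 km model as reported in Table 2.
COEF_15KM = {
    "intercept": 26.27,
    "dem": 0.003243,
    "rainfall": 0.008095,
    "humidity": -0.5313,
    "distance": 0.000069,
}

def presence_probability(dem, rainfall, humidity, distance, coef=COEF_15KM):
    """Logistic response p = e^g / (1 + e^g) with g the linear predictor."""
    g = (coef["intercept"]
         + coef["dem"] * dem
         + coef["rainfall"] * rainfall
         + coef["humidity"] * humidity
         + coef["distance"] * distance)
    # Numerically stable evaluation: never exponentiate a large positive number.
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)

# Hypothetical site (units assumed, not given in the paper).
p_site = presence_probability(dem=1000, rainfall=1500, humidity=75, distance=2000)
```

With positive coefficients for DEM, rainfall and distance and a negative coefficient for humidity, the predicted probability rises with the first three factors and falls as humidity increases.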
Figure 5 Estimated presence area (green) and absence area (gray) from the GLMs at radii of
15 km and 20 km
Figure 5 shows the estimated presence and absence areas for radii of 15 km and 20 km; the red
points on the maps are presence points. Both maps estimated broadly similar presence areas,
but the map at radius 20 km appears to have a larger presence area than the map at radius
15 km, and its AUC was also higher. However, a larger presence area does not make one map
better than the other, because the AUC at radius 20 km varied more than the AUC at radius
15 km. We therefore could not select a best model in this study, but we could analyze the
factors that affect wild tea distribution. In future work, we will analyze the factors with
GLMs using another method of generating pseudo-absence points and try to improve the model by
reducing the variance of AUC at each radius.
Conclusion
Many studies have demonstrated that the method of generating pseudo-absence points affects
the resulting models [6]. In this study, we generated pseudo-absence points by random
selection outside circles of a given radius around the presence points. The results showed
that species distribution models built with GLMs depended on the radius size. If the chosen
radius was too small (less than 15 km), the selected pseudo-absence points probably shared
the same geographical or climatic conditions as the presence points, whereas if the radius was
too large (more than 35 km), the selected pseudo-absence points would have very different
geographical conditions. Although our results were fairly effective, the radius size needs to
be tuned using other statistical techniques, together with sufficient data, to improve the
model.
Acknowledgements
We thank the Tea Institute and the Center for Information Technology Services at Mae Fah
Luang University, Remote Sensing and GIS at the Asian Institute of Technology and the Land
Development Department for the data used in this study.
References
1. Pornchai Preechapanya, Indigenous Highland Agroforestry Systems of Northern Thailand,
Chiang Dao Watershed Research Station, Chiang Mai, Thailand, 2000.
2. Elith et al., Novel methods improve prediction of species’ distributions from occurrence
data, Ecography, 2006, 29, 129-151.
3. Mary S. Wisz and Antoine Guisan, Do pseudo-absence selection strategies influence
species distribution models and their predictions? An information-theoretic approach
based on simulated data, BMC Ecology, 2009.
4. Morgane Barbet-Massin, Frédéric Jiguet, Cécile Hélène Albert and Wilfried Thuiller,
Selecting pseudo-absences for species distribution models: how, where and how many?,
Methods in Ecology and Evolution, 2012.
5. Jorge M. Lobo and Marcelo F. Tognelli, Exploring the effects of quantity and location of
pseudo-absences and sampling biases on the performance of distribution models with
limited point occurrence data, Journal for Nature Conservation, 2011, 19, 1-7.
6. Jogeir N. Stokland, Rune Halvorsen and Bente Støa, Species distribution modelling: Effect
of design and sample size of pseudo-absence observations, Ecological Modelling, 2011,
222, 1800-1809.