
Selecting predictors to form the most accurate predictive model for count data

Authors:
  • Data2action Australia


Abstract: Spatial predictive models have been increasingly employed to generate spatial predictions for environmental management and conservation. The accuracy of predictive models and their predictions is essential to support evidence-based decision making with high-quality information. However, developing predictive models with high predictive accuracy can be challenging in the environmental sciences because causal predictors are often unavailable and proxy predictors are used instead. It is therefore essential to select a subset from a set of potential proxy predictors to form a predictive model with high predictive accuracy. In this study, we use a dataset from a marine conservation area in northern Australia containing 77 samples, with sponge species richness (SSR) as the response variable and 49 seabed biophysical variables (i.e., proxy predictors). We use random forests (RF), generalised linear models (GLM), lasso and elastic-net regularized GLM (glmnet), and their hybrid methods with geostatistical techniques (i.e., ordinary kriging (OK) and inverse distance weighting (IDW)). For RF, five feature selection methods were used: 1) averaged variable importance (AVI); 2) Boruta; 3) knowledge informed AVI (KIAVI); 4) recursive feature selection (rfe); and 5) variable selection using RF (VSURF). For GLM, six variable selection (i.e., model selection in statistics) methods were employed to select GLM predictive models: 1) stepAIC; 2) dropterm; 3) anova; 4) regsubsets; 5) bestglm; and 6) RF. The accuracy of predictive models was validated using 10-fold cross-validation in terms of VEcv (i.e., variance explained by predictive models based on cross-validation).
In this paper, we discuss the research findings in relation to: 1) feature selection methods; 2) the influence of initial input predictors and highly correlated predictors for RF; 3) variable selection methods for GLM and glmnet; 4) goodness of fit and predictive models for GLM; and 5) the hybrid methods for count data. The case study provides an example of predicting the spatial distribution of count data using RF, RFOK and RFIDW, and offers guidelines for selecting predictors for RF, GLM and glmnet predictive models based on predictive accuracy instead of parsimony. This study also provides important environmental baseline information with improved accuracy for the informed monitoring of ecosystem health within the Oceanic Shoals Commonwealth Marine Reserve, northern Australia.
Jin Li1*, Belinda Alvarez2, Justy Siwabessy1, Maggie Tran1, Zhi Huang1, Rachel Przeslawski1, Lynda Radke1, Floyd Howard1 and Scott Nichol1
1 Geoscience Australia, GPO Box 378, Canberra, ACT 2601, Australia
2 Museum and Art Gallery of the Northern Territory, PO Box 4646, Darwin, NT 0801, Australia
MODSIM 2017 Hobart
The spatial distribution of sponge species richness and its relationship with environmental variables are important baseline information for marine ecosystem management, but they are either unavailable or unknown.
Predictive models for species richness could be used to address the spatial data gaps and investigate the relationship.
Machine learning methods were introduced to spatial statistics by combining them with commonly used geostatistical methods and have shown high predictive accuracy [1-10].
The hybrid methods, however, have not been applied to count data.
Introduction Methods Results & Discussion Summary Acknowledgements
Machine learning methods and their hybrids with geostatistical methods
1. Inverse distance weighting (IDW)
2. Generalised least squares trend estimation (GLS)
3. Kriging with an external drift (KED)
4. Ordinary cokriging (OCK)
5. Ordinary kriging (OK)
6. Universal kriging (UK)
7. Generalized boosted regression modelling (GBM)
8. General regression neural network (GRNN)
9. Random forest (RF)
10. Regression tree (RT)
11. Support vector machine (SVM)
12. Thin plate splines (TPS)
13. Linear models and OK (RKlm)
14. Generalised linear models and OK (RKglm or GLMOK)
15. Generalised linear models and IDW (GLMIDW)
16. GLS and OK (RKGLS)
17. GBM and OK (GBMOK)
18. GBM and IDS (GBMIDS)
19. GRNN and OK (GRNNOK)
20. GRNN and IDS (GRNNIDS)
21. RF and IDS (RFIDS)
22. RF and OK (RFOK)
23. RT and OK (RTOK)
24. RT and IDS (RTIDS)
25. SVM and OK (SVMOK)
26. SVM and IDS (SVMIDS)
In this study, we aim to select the most accurate predictive model for predicting the spatial distribution of sponge species richness (count data). Specifically, we:
1) test the accuracy of predictive models based on GLM, RF and their hybrid methods with OK or IDW, as well as GLMNET, for count data;
2) test the effects of various predictor sets on the accuracy of predictive models;
3) examine five feature selection (FS) methods for RF, six model selection methods for GLM and one model selection method for GLMNET; and
4) predict the spatial distribution of sponge species richness using the most accurate model and visually examine the predictions.
Study area
In total, 77 sponge species richness samples were available for this study [10].
Oceanic Shoals Commonwealth Marine Reserve (CMR).
Predictive variables

No  Predictive variable    No  Predictive variable
1   long                   26  tpi5
2   lat                    27  tpi7
3   sand                   28  lmi3
4   gravel                 29  plan_curv3
5   bs25                   30  plan_curv5
6   bs11                   31  plan_curv7
7   bs14                   32  relief3
8   bs34                   33  relief5
9   bs_o                   34  relief7
10  bs_homo_o              35  slope3
11  bs_entro_o             36  slope5
12  bs_var_o               37  prof_curv3
13  bs_lmi_o               38  prof_curv5
14  bathy_o                39  prof_curv7
15  tpi_o                  40  bs_entro3
16  slope_o                41  bs_entro5
17  plan_cur_o             42  bs_entro7
18  prof_cur_o             43  bs_homo3
19  relief_o               44  bs_homo5
20  rugosity_o             45  bs_homo7
21  dist.coast             46  bs_var3
22  rugosity3              47  bs_var5
23  rugosity5              48  bs_var7
24  rugosity7              49  bs_lmi5
25  tpi3                   50  geom

Categories of predictive variables: location information, bathymetry, sediments, backscatter, variables derived from acoustic multibeam data, and geomorphic features.
Predictive methods
1. RF
2. RFOK
3. RFIDW
4. Average of RFOK and RFIDW (RFOKRFIDW)
5. Average of RF, RFOK and RFIDW (RFRFOKRFIDW)
6. GLM
7. GLMOK
8. GLMIDW
9. Average of GLMOK and GLMIDW (GLMOKGLMIDW)
10. Average of GLM, GLMOK and GLMIDW (GLMGLMOKGLMIDW)
11. GLMNET
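The hybrid and averaged methods listed above can be illustrated with a minimal sketch. It assumes the usual construction of such hybrids, in which a trend model (RF or GLM) is fitted first, its residuals are interpolated by IDW (or OK), and the interpolated residual is added back to the trend prediction. The function names `idw`, `hybrid_idw` and `average_predictions` and the toy numbers are hypothetical, and the trend predictions are passed in as plain numbers rather than produced by a real RF or GLM.

```python
import math


def idw(coords, values, target, power=2.0):
    """Inverse distance weighted estimate of `values` at location `target`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in zip(coords, values):
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # target coincides with a sample location
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def hybrid_idw(trend_at_target, coords, observed, trend_at_samples, target, power=2.0):
    """Trend prediction plus IDW-interpolated residuals (assumed hybrid form)."""
    residuals = [obs - fit for obs, fit in zip(observed, trend_at_samples)]
    return trend_at_target + idw(coords, residuals, target, power)


def average_predictions(*predictions):
    """Simple average of predictions from several methods (e.g. RFOK and RFIDW)."""
    return sum(predictions) / len(predictions)


# Toy data: two sample locations, observed counts, and hypothetical trend fits.
coords = [(0.0, 0.0), (2.0, 0.0)]
observed = [10.0, 14.0]
trend_at_samples = [9.0, 15.0]

midpoint = hybrid_idw(12.0, coords, observed, trend_at_samples, (1.0, 0.0))
at_sample = hybrid_idw(9.0, coords, observed, trend_at_samples, (0.0, 0.0))
print(midpoint, at_sample)
```

At a sampled location the interpolated residual equals the observed residual, so the hybrid prediction reproduces the observation there; away from samples the correction shrinks towards a distance-weighted mean of the residuals.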
Model selection and validation

Model selection
  For RF [10, 11]:
    averaged variable importance (AVI)
    Boruta
    knowledge informed AVI (KIAVI)
    recursive feature selection (rfe)
    variable selection using RF (VSURF)
  For GLM [12-15]:
    stepAIC
    dropterm
    anova
    RF informed
    regsubsets
    bestglm
  For GLMNET [16]:
    bestglm

Model validation
  Cross-validation [17-19]:
    10-fold cross-validation
    repeated 100 times
  Cross-validation functions [16, 20]:
    functions in spm
    cv.glmnet
  Accuracy measures [21, 22]:
    relative mean absolute error (RMAE)
    relative root mean square error (RRMSE)
    variance explained by predictive models (VEcv)
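The validation scheme above can be sketched in a few lines. The sketch assumes the definitions used in the cited references [21, 22]: RMAE and RRMSE are the MAE and RMSE divided by the mean of the observations, expressed in percent, and VEcv = (1 - SSE/SST) x 100 computed from cross-validated predictions. The helper names, the mean-only stand-in model and the toy counts are hypothetical; in the study these computations are done with functions in the spm package and cv.glmnet.

```python
import math
import random


def rmae(obs, pred):
    """Relative mean absolute error (%)."""
    mean_obs = sum(obs) / len(obs)
    mae = sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)
    return mae / mean_obs * 100.0


def rrmse(obs, pred):
    """Relative root mean square error (%)."""
    mean_obs = sum(obs) / len(obs)
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)
    return math.sqrt(mse) / mean_obs * 100.0


def vecv(obs, pred):
    """Variance explained by cross-validated predictions (%)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return (1.0 - sse / sst) * 100.0


def repeated_kfold(n, k=10, repeats=100, seed=1):
    """Yield `repeats` random partitions of range(n) into k folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)
        yield [indices[i::k] for i in range(k)]


def cv_vecv(y, fit_predict, k=10, repeats=100, seed=1):
    """VEcv for each repetition; fit_predict(train_idx, test_idx) returns predictions."""
    scores = []
    for folds in repeated_kfold(len(y), k, repeats, seed):
        pred = [0.0] * len(y)
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
                pred[i] = p
        scores.append(vecv(y, pred))
    return scores


def mean_model(y):
    """Hypothetical stand-in model: predicts the training mean for every test sample."""
    def fit_predict(train_idx, test_idx):
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)
        return [train_mean] * len(test_idx)
    return fit_predict


counts = [3, 7, 1, 12, 5, 9, 4, 8, 2, 6, 10, 11, 0, 13, 5, 7, 3, 9, 6, 4]
scores = cv_vecv(counts, mean_model(counts), k=10, repeats=5)
print(min(scores), sum(scores) / len(scores), max(scores))
```

Summarising the per-repetition scores by their minimum, mean and maximum mirrors how accuracy is reported for the models in this presentation. VEcv can be negative when a model predicts worse than the mean of the observations.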
RRMSE (mean: black line; minimum and maximum: dashed red lines) of models for RF and its hybrid methods with different predictor sets for sponge species richness, and the model with the minimum mean RRMSE (circle) [10, 23]:
a) models 1-23, based on AVI using 49 variables;
b) models 24-33, based on KIAVI using the important variables (or predictors) based on predictive accuracy (IVPA) [11] identified from models 1-23, and also including geomorphic features;
c) models 34-40, based on KIAVI using model 24;
d) models 41-46, based on KIAVI by removing the unimportant variables (or predictors) based on predictive accuracy (UVPA) [11] identified from models 34-40.
All 49 variables were selected by rfe. Only three variables were selected using VSURF, and the resultant model was identical to model 22.
RF models 1-3 were developed based on Boruta (with maxRuns of 5000, 2000 and the default, i.e. 100) using 49 variables [10, 23].
RF model 47 is based on the most accurate predictive model identified so far (i.e. model 43), with sqrt(res).
The fitted values vs. observed values for RF models 1 (i.e.
the full model) and 43 for sponge species richness [10, 23].
A summary of the GLM modelling process

Predictors                                                  Selection method             Selection criteria  Resultant model
All 49 variables                                            stepAIC                      BIC                 GLM1
Predictors in GLM1                                          dropterm and anova           p-value             GLM2
Predictors in GLM1 & two-way interactions & second orders   stepAIC, dropterm and anova  BIC and p-value     GLM3
Predictors in RF43                                          -                            -                   GLM4
Predictors in GLM4                                          stepAIC                      BIC                 GLM5
Predictors in GLM5                                          dropterm and anova           p-value             GLM6
Predictors in GLM6 & two-way interactions & second orders   stepAIC                      BIC                 GLM7
Predictors in GLM6 & second orders                          stepAIC                      BIC                 GLM8
Predictors in GLM8 & two-way interactions                   stepAIC                      BIC                 GLM9
All 49 variables                                            regsubsets                   Mallows’ Cp         GLM10
All 49 variables                                            bestglm                      Cross-validation    GLM11
All 49 variables                                            bestglm                      BIC                 GLM12
The minimum, mean and maximum of RRMSE (%) of GLM models 1 to 12.

       GLM                           GLMok                         GLMidw                        GLMokGLMidw                   GLMGLMokGLMidw
Model  min       mean      max       min       mean      max       min       mean      max       min       mean      max       min       mean      max
1      94.87     1.69E+09  1.68E+11  93.65     1.69E+09  1.68E+11  94.09     1.69E+09  1.68E+11  93.81     1.69E+09  1.68E+11  94.07     1.69E+09  1.68E+11
2      175.56    2.57E+07  1.79E+09  174.92    2.57E+07  1.79E+09  179.19    2.57E+07  1.79E+09  176.63    2.57E+07  1.79E+09  175.19    2.57E+07  1.79E+09
3      103.27    4.89E+05  2.59E+07  95.91     4.89E+05  2.59E+07  99.27     4.89E+05  2.59E+07  97.14     4.89E+05  2.59E+07  96.97     4.89E+05  2.59E+07
4      91.96     97.59     142.36    81.88     89.02     133.17    82.33     91        135.57    80.24     88.16     133.27    81.84     87.95     134.24
5      91.09     96.36     143.24    81.4      88.12     135.42    80.61     90.15     137.84    79.12     87.34     135.59    80.86     86.92     136.11
6      91.14     93.72     100.16    80.59     87.95     99.62     78.59     89.21     97.24     77.38     86.45     96.39     79.1      85.54     94.37
7      92.04     113.41    195.8     92.51     114.28    197.85    95.59     116.23    200.2     92.29     113.67    198.16    90.96     112.19    196.5
8      90.64     94.39     101.19    83        89.19     100.74    81.74     91.1      99.29     80.35     87.93     97.93     81.15     86.72     96.66
9      93.1      135.72    194.53    93.94     137.51    199.33    92.12     138.35    202.21    91.04     136.45    199.78    90.2      135.12    197.23
10     68.68     938.63    32304.22  70.63     940.07    32307.48  76.63     944.92    32308.17  72.89     941.38    32307.82  70.86     939.72    32306.62
11     82.01     84.13     87.29     74.22     79.21     90.84     78.7      84.01     90.48     75.03     80.11     89.03     75.56     78.9      86.71
12     73.88     1626.86   73097.75  71.49     1627.03   73101.43  71.47     1629      73102.52  70.29     1627.15   97043     70.26     1625.89   73100.56
The BIC, deviance explained (%), adjusted deviance explained (%), VEcv (%) of GLM, and VEcv (%) of the most accurate method among GLM, GLMOK, GLMIDW, GLMOKGLMIDW and GLMGLMOKGLMIDW for models 1 to 12.

Model  BIC     Dev expl (%)  Dev expl adj (%)  VEcv (%) for GLM  VEcv (%) for the most accurate
1      468.63  93.16         88.45             -2.86E+16         -2.86E+16
2      622.86  62.91         55.26             -6.65E+12         -6.65E+12
3      587.78  66.35         60.65             -2.40E+09         -2.40E+09
4      779.1   39.75         32.66             4.42              22.37
5      776.15  39.56         33.43             6.82              24.18
6      783.27  37.51         33.11             11.85             26.57
7      702.79  51.38         43.15             -29.08            -26.31
8      781.94  37.68         33.29             10.59             24.53
9      722.58  47.1          40.88             -84.85            -83.22
10     523.21  74.74         70.48             -8741.62          -8741.62
11     787.04  35.88         33.24             28.97             37.53
12     546.1   71.2          66.84             -26460.93         -26429.27

Traditional model selection methods for regression models, such as AIC and BIC, attempt to select the most parsimonious models, which are not necessarily the most accurate predictive models, especially when proxy variables are used as predictors instead of causal variables.
To select GLM predictive models, predictive accuracy should be used instead of AIC or deviance explained (%).
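The contrast between goodness-of-fit criteria and predictive accuracy can be checked directly from the values reported for GLM models 1 to 12. The short sketch below only re-ranks those reported numbers; the dictionary layout and variable names are illustrative choices, not part of the original analysis.

```python
# Values copied from the table for GLM models 1 to 12:
# BIC, deviance explained (%) and VEcv (%) for GLM.
glm_models = {
    1: {"bic": 468.63, "dev_expl": 93.16, "vecv": -2.86e16},
    2: {"bic": 622.86, "dev_expl": 62.91, "vecv": -6.65e12},
    3: {"bic": 587.78, "dev_expl": 66.35, "vecv": -2.40e09},
    4: {"bic": 779.10, "dev_expl": 39.75, "vecv": 4.42},
    5: {"bic": 776.15, "dev_expl": 39.56, "vecv": 6.82},
    6: {"bic": 783.27, "dev_expl": 37.51, "vecv": 11.85},
    7: {"bic": 702.79, "dev_expl": 51.38, "vecv": -29.08},
    8: {"bic": 781.94, "dev_expl": 37.68, "vecv": 10.59},
    9: {"bic": 722.58, "dev_expl": 47.10, "vecv": -84.85},
    10: {"bic": 523.21, "dev_expl": 74.74, "vecv": -8741.62},
    11: {"bic": 787.04, "dev_expl": 35.88, "vecv": 28.97},
    12: {"bic": 546.10, "dev_expl": 71.20, "vecv": -26460.93},
}

# Lower BIC and higher deviance explained indicate a better fit;
# higher VEcv indicates better cross-validated predictive accuracy.
best_by_bic = min(glm_models, key=lambda m: glm_models[m]["bic"])
best_by_dev = max(glm_models, key=lambda m: glm_models[m]["dev_expl"])
best_by_vecv = max(glm_models, key=lambda m: glm_models[m]["vecv"])

print("Lowest BIC: model", best_by_bic)
print("Highest deviance explained: model", best_by_dev)
print("Highest VEcv: model", best_by_vecv)
```

Both fit-based criteria favour model 1, whose VEcv is the worst of the twelve, whereas the model with the highest VEcv (model 11) has the highest BIC and the lowest deviance explained.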
The predictors for GLMNET4-6 are the same as those for GLM11.
The most accurate model, GLMNET6, is still less accurate than GLM for model 11 (VEcv = 28.97).
GLMNET is much less accurate than RF for model 41 (VEcv = 42.14).

Model    Predictors      Penalty (α)        VEcv (%)
GLMNET1  All predictors  0.5 (elastic-net)  -0.57
GLMNET2  All predictors  1 (lasso)          -0.81
GLMNET3  All predictors  0 (ridge)          -0.78
GLMNET4  Selected model  0.5                28.21
GLMNET5  Selected model  1                  28.16
GLMNET6  Selected model  0                  28.33
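The role of α in the table can be made concrete with the elastic-net penalty described by Friedman et al. [16], λ[(1 - α)/2 · Σβ² + α · Σ|β|], which reduces to the ridge penalty at α = 0 and to the lasso penalty at α = 1. The sketch below only evaluates that penalty term; the function name and the example coefficients are hypothetical, and it does not fit a model.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic-net penalty: lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|))."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)


# Hypothetical coefficients, evaluated at the three alpha values used for GLMNET1-6.
coefs = [1.0, -2.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(coefs, lam=1.0, alpha=alpha))
```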
The fitted values of GLM models 1 to 9 vs. the observed values of sponge species richness [10, 23].
The VEcv (%) of the most accurate predictive models developed using various model selection methods (i.e., GLM11: glmglmokglmidw11; GLM6: glmglmokglmidw6; RF.Boruta2: rfokrfidw2; RF1: rfidw1; RF43: rfokrfidw43; RF47: rfokrfidw47) [10, 23].
Comparison of the predictive accuracy of the most accurate models with the average performance of previously published predictive models in the environmental sciences [10, 21, 23].
Partial plots of RF model 47, indicating the relationships of sponge species richness with the eight predictors in the model [10, 23]:
long: longitude;
lat: latitude;
tpi3: topographic position index using a 3×3 grid-cell window;
bs_var7: backscatter-based variance using a 7×7 grid-cell window;
bs_entro7: backscatter-based entropy using a 7×7 grid-cell window;
dist.coast: distance to coast;
bs34 and bs11: backscatter mosaics normalised to incidence angles of 34° and 11°.
Of these predictors, bs11 and bs34 are highly correlated, with ρ = 0.95 and r = 0.95.
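The two correlation coefficients quoted for bs11 and bs34 can be computed as in the sketch below. It assumes that r denotes Pearson's product-moment correlation and ρ denotes Spearman's rank correlation, which the slide does not state explicitly. The function names and the toy vectors are hypothetical and are not the bs11 and bs34 data.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy example: a monotonic but non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(x, y), spearman(x, y))
```

Reporting both coefficients is informative because they agree only when the monotonic relationship between the two variables is also close to linear.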
Spatial predictions of sponge species richness using model 47 for RFOKRFIDW in Oceanic Shoals area H: a) predictions, b) bs11, c) tpi3, d) bs_entro7, e) bs_var7 and f) geomorphic features overlaid on bathymetry [10, 23].
The importance of the eight predictors in the model: long > lat > dist.coast > bs11 > tpi3 > bs34 > bs_entro7 > bs_var7.
High sponge species richness is usually associated with hard seabed features.
Because the spatial pattern of bs34 was similar to that of bs11, bs34 is not presented.
The initial input predictors affect the importance status of variables;
KIAVI is recommended for selecting RF predictive models;
The criteria for assessing goodness of fit should not be used to select GLM predictive models;
Joint application of RF and AIC is a useful model selection approach for developing GLM predictive models;
bestglm is a useful model selection tool for developing GLM and GLMNET predictive models;
RF is much more accurate than GLM and GLMNET, and GLM is slightly more accurate than GLMNET;
The hybrid methods have significantly improved the predictive accuracy for both RF and GLM;
The hybrid methods of RF and geostatistical methods are considerably more accurate than other methods and are able to effectively model count data;
The relationships of sponge species richness with the predictors are non-linear;
High sponge species richness is usually associated with hard seabed features; and
This study also provides important information for future monitoring design, particularly on the areas where the management and conservation of sponge gardens should be focused.
This study was undertaken as an activity of the Marine Biodiversity Hub, the Australian Government’s National Environmental Research Program (NERP).
Funding for samples collected in 2009 and 2010 was provided through the Australian Government’s Offshore Energy Security Program.
We thank the Master and crew of the RV Solander and scientific staff at the Australian Institute of Marine Science (AIMS) for their support in conducting the surveys, and the Field and Engineering Support staff at GA.
References
[1] Li, J., 2011. Novel spatial interpolation methods for environmental properties: using point samples of mud content as an example. The Survey Statistician: The Newsletter of the International Association of Survey Statisticians No. 63: 15-16.
[2] Li, J., 2013. Predictive modelling using random forest and its hybrid methods with geostatistical techniques in marine environmental geosciences. In: The Proceedings of the Eleventh Australasian Data Mining Conference (AusDM 2013), Canberra, Australia, 13-15 November 2013.
[3] Li, J., Heap, A.D., 2014. Spatial interpolation methods applied in the environmental sciences: A review. Environmental Modelling & Software 53: 173-189.
[4] Li, J., Heap, A., Potter, A., Daniell, J.J., 2011. Predicting Seabed Mud Content across the Australian Margin II: Performance of Machine Learning Methods and Their Combination with Ordinary Kriging and Inverse Distance Squared. Geoscience Australia, Record 2011/07, 69.
[5] Li, J., Heap, A.D., Potter, A., Daniell, J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environmental Modelling & Software 26: 1647-1659.
[6] Li, J., Heap, A.D., Potter, A., Huang, Z., Daniell, J., 2011. Can we improve the spatial predictions of seabed sediments? A case study of spatial interpolation of mud content across the southwest Australian margin. Continental Shelf Research 31: 1365-1376.
[7] Li, J., Potter, A., Huang, Z., Daniell, J.J., Heap, A., 2010. Predicting Seabed Mud Content across the Australian Margin: Comparison of Statistical and Mathematical Techniques Using a Simulation Experiment. Geoscience Australia, Record 2010/11, 146 pp.
[8] Li, J., Potter, A., Huang, Z., Heap, A.D., 2012. Predicting Seabed Sand Content across the Australian Margin Using Machine Learning and Geostatistical Methods. Geoscience Australia, Record 2012/48, 115 pp.
[9] Sanabria, L.A., Qin, X., Li, J., Cechet, R.P., Lucas, C., 2013. Spatial interpolation of McArthur’s forest fire danger index across Australia: observational study. Environmental Modelling & Software 50: 37-50.
[10] Li, J., Alvarez, B., Siwabessy, J., Tran, M., Huang, Z., Przeslawski, R., Radke, L., Howard, F., Nichol, S., 2017. Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: Predicting sponge species richness. Environmental Modelling & Software 97: 112-129.
[11] Li, J., Tran, M., Siwabessy, J., 2016. Selecting optimal random forest predictive models: a case study on predicting the spatial distribution of seabed hardness. PLOS ONE 11(2): e0149089.
[12] Arthur, A.D., Li, J., Henry, S., Cunningham, S.A., 2010. Influence of woody vegetation on pollinator densities in oilseed Brassica fields in an Australian temperate landscape. Basic and Applied Ecology 11: 406-414.
[13] Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with S-Plus, 4th ed. Springer-Verlag, New York.
[14] Lumley, T., Miller, A., 2009. leaps: Regression Subset Selection. R package version 3.0.
[15] McLeod, A.I., Xu, C., 2017. bestglm: Best Subset GLM. R package version 0.36.
[16] Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1): 1-22. URL http://www.jstatsoft.org/v33/i01/.
[17] Li, J., Siwabessy, J., Tran, M., Huang, Z., Heap, A., 2013. Predicting seabed hardness using random forest in R. In: Zhao, Y., Cen, Y. (Eds.), Data Mining Applications with R. Elsevier, pp. 299-329.
[18] Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
[19] Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann, pp. 1137-1143.
[20] Li, J., 2017. spm: Spatial Predictive Modelling. R package version 1.0.0. https://CRAN.R-project.org/package=spm.
[21] Li, J., 2016. Assessing spatial predictive models in the environmental sciences: accuracy measures, data variation and variance explained. Environmental Modelling & Software 80: 1-8.
[22] Li, J., Heap, A., 2011. A review of comparative studies of spatial interpolation methods: performance and impact factors. Ecological Informatics 6: 228-241.
[23] Li, J., Alvarez, B., Siwabessy, J., Tran, M., Huang, Z., Przeslawski, R., Radke, L., Howard, F., Nichol, S., 2016. Predictive modelling of biological count data using random forest and generalised linear models and their hybrid methods with geostatistical techniques. In: Australian Statistical Conference 2016, Canberra, Dec. 2016.
Phone: +61 2 6249 9111
Web: www.ga.gov.au
Email: clientservices@ga.gov.au
Address: Cnr Jerrabomberra Avenue and Hindmarsh Drive, Symonston ACT 2609
Postal Address: GPO Box 378, Canberra ACT 2601
Thank you