David I Warton
UNSW Sydney | UNSW · School of Mathematics and Statistics
PhD Macquarie University
157 Publications
94,504 Reads
25,253 Citations
Introduction
Additional affiliations
January 2004 - December 2013
January 2000 - May 2003
January 1997 - December 1999
Publications (157)
1. A critical property of count data is its mean–variance relationship, yet this is rarely considered in multivariate analysis in ecology.
2. This study considers what is being implicitly assumed about the mean–variance relationship in distance-based analyses – multivariate analyses based on a matrix of pairwise distances – and what the effect is o...
Summary Modeling the spatial distribution of a species is a fundamental problem in ecology. A number of modeling methods have been developed, an extremely popular one being MAXENT, a maximum entropy modeling approach. In this article, we show that MAXENT is equivalent to a Poisson regression model and hence is related to a Poisson point process mod...
Technological advances have enabled a new class of multivariate models for ecology, with the potential now to specify a statistical model for abundances jointly across many taxa, to simultaneously explore interactions across taxa and the response of abundance to environmental variables. Joint models can be used for several purposes of interest to e...
Visualising data is a key step in data analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modelling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence–...
Multivariate random effects with unstructured variance-covariance matrices of large dimensions, $q$, can be a major challenge to estimate. In this paper, we introduce a new implementation of a reduced-rank approach to fit large dimensional multivariate random effects by writing them as a linear combination of $d < q$ latent variables. By adding red...
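The reduced-rank idea in this abstract can be written out as a short worked equation. This is a generic sketch of the standard construction, not the paper's own formulation: the symbols $u_i$, $\Lambda$ and $z_i$ are notation assumed here for illustration.

$$
u_i = \Lambda z_i, \qquad z_i \sim N(0, I_d), \qquad \operatorname{Cov}(u_i) = \Lambda \Lambda^\top ,
$$

where $u_i$ is the $q$-vector of random effects and $\Lambda$ is a $q \times d$ loading matrix, so the implied covariance matrix has rank at most $d$. An unstructured covariance matrix has $q(q+1)/2$ free parameters, whereas the reduced-rank form has $qd - d(d-1)/2$ once rotational redundancy in $\Lambda$ is removed. For example, with $q = 50$ and $d = 2$ this is $99$ parameters instead of $1275$.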
Sample size estimation through power analysis is a fundamental tool in planning an ecological study, yet there are currently no well-established procedures for when multivariate abundances are to be collected. A power analysis procedure would need to address three challenges: designing a parsimonious simulation model that captures key community dat...
Integrated distribution models (IDMs) predict where species might occur using data from multiple sources, a technique thought to be especially useful when data from any individual source are scarce. Recent advances allow us to fit such models with latent terms to account for dependence within and between data sources, but they are computationally c...
Muscle volume must increase substantially during childhood growth to generate the power required to propel the growing body. One unresolved but fundamental question about childhood muscle growth is whether muscles grow at equal rates; that is, if muscles grow in synchrony with each other. In this study, we used magnetic resonance imaging (MRI) and...
We introduce community‐level basis function models (CBFMs) as an approach for spatiotemporal joint distribution modelling. CBFMs can be viewed as related to spatiotemporal latent variable models, where the latent variables are replaced by a set of pre‐specified spatiotemporal basis functions which are common across species.
In a CBFM, the coefficie...
Sample size estimation through power analysis is a fundamental tool in planning an ecological study, yet there are currently no well‐established procedures for when multivariate abundances are to be collected. A power analysis procedure would need to address three challenges: designing a parsimonious simulation model that captures key community dat...
In regression modelling, measurement error models are often needed to correct for uncertainty arising from measurements of covariates/predictor variables. The literature on measurement error (or errors-in-variables) modelling is plentiful; however, general algorithms and software for maximum likelihood estimation of models with measurement error ar...
Log-Gaussian Cox processes (LGCPs) offer a framework for regression-style modeling of point patterns that can accommodate spatial latent effects. These latent effects can be used to account for missing predictors or other sources of clustering that could not be explained by a Poisson process. Fitting LGCP models can be challenging because the margi...
The life span of leaves increases with their mass per unit area (LMA). It is unclear why. Here, we show that this empirical generalization (the foundation of the worldwide leaf economics spectrum) is a consequence of natural selection, maximizing average net carbon gain over the leaf life cycle. Analyzing two large leaf trait datasets, we show that...
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Gen...
Residual plots are often used to interrogate regression model assumptions, but interpreting them requires an understanding of how much sampling variation to expect when assumptions are satisfied. In this paper, we propose constructing global envelopes around data (or around trends fitted to data) on residual plots, exploiting recent advances that e...
The accurate extraction of species‐abundance information from DNA‐based data (metabarcoding, metagenomics) could contribute usefully to diet analysis and food‐web reconstruction, the inference of species interactions, the modelling of population dynamics and species distributions, the biomonitoring of environmental state and change, and the inferen...
1. Sample size estimation through power analysis is a fundamental tool in planning an ecological study, yet there are currently no well-established procedures for when multivariate abundances are to be collected. A power analysis procedure would need to address three challenges: designing a parsimonious simulation model that captures key community...
In this chapter we will revise two of the most commonly used statistical tools—the two-sample t-test and simple linear regression. Then we will see a remarkable equivalence—that these are actually exactly the same thing! This is a very important result; it will give us some intuition for how we can write most of the statistical techniques you have...
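The equivalence mentioned in this chapter summary can be checked numerically. The following is a minimal sketch, assuming the equal-variance form of the two-sample t-test and a 0/1 indicator for group membership as the regression predictor. The data values and the function names `two_sample_t` and `slope_t` are made up for illustration and do not come from the book.

```python
import math


def two_sample_t(a, b):
    """Equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Pooled sum of squares around each group's own mean
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def slope_t(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


# Hypothetical measurements from two groups
group_a = [4.1, 5.0, 3.8, 4.6, 5.2]
group_b = [5.9, 6.3, 5.1, 6.8, 6.0, 5.5]

# Regression version: x is 0 for group A and 1 for group B
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_test = two_sample_t(group_a, group_b)
t_reg = slope_t(x, y)
print(t_test, t_reg)  # the two statistics agree up to rounding error
```

With a 0/1 predictor the fitted slope is the difference in group means and the residual variance is the pooled variance, which is why the two statistics coincide.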
Often it is not clear which model you should use for the data at hand—maybe because it is not known ahead of time which combination of variables should be used to predict the response, or maybe it is not obvious how the response should be modelled. In this chapter we will take a look at a few strategies for comparing different models and choosing b...
No doubt you’ve done some stats before, probably in high school and at university, even though it might have been some time ago. I’m not expecting you to remember all of it, and in this chapter you will find some important lessons to reinforce before we get cracking.
While the previous two chapters focused on studying community-environment associations, here we will focus on characterising associations between taxa within communities, as in Exercise 17.1.
Any factor that is not treated as random is referred to as fixed. To this point, we have treated everything as fixed (fixed effects models). A model with both fixed and random effects in it is called a mixed effects model.
Model-based inference: assume your model is correct (or nearly correct) and use theory, or sometimes simulation from your model, to make inferences. Most methods of inference we have discussed so far have been model-based: the confint function, summary, and anova as applied to lm or glm objects, Akaike information criterion, Bayesian information cri...
This book has been a long journey! But this is not the end. Something I’ve learned over my career so far is that the more I know, the more I realise that I don’t know. I am regularly finding out about methods and problems I wasn’t previously aware of and new techniques that have recently been developed.
A key step in any analysis is data visualisation. This is a challenging topic for multivariate data, because (if we have more than two responses) it is not possible to jointly visualise the data in a way that captures correlation across responses or how they relate to predictors. In this chapter, we will discuss a few key techniques to try.
Recall from Sect. 4.4.2 that a “linear model” does not need to be linear. Mathematically, we say a model is linear if it is a linear function of its coefficients (“something times β₁ plus something times β₂…”). But if we include non-linear functions of x as predictors, we can use this framework to fit a non-linear function of x to data.
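As an illustration of this point, the sketch below fits a quadratic curve using ordinary least squares, treating 1, x and x² as three predictors in a linear model. It is a minimal standard-library example under assumed, noise-free data. The helper names `solve` and `fit_linear_model` are hypothetical, and the normal-equations approach is chosen for brevity rather than numerical robustness.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def fit_linear_model(design, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Assumed data: an exact quadratic, y = 1 + 2x - 0.5x^2
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1 + 2 * xi - 0.5 * xi ** 2 for xi in x]

# The model is non-linear in x but linear in its coefficients
design = [[1.0, xi, xi ** 2] for xi in x]
beta = fit_linear_model(design, y)
print(beta)  # approximately [1, 2, -0.5]
```

The fitting routine never needs to know that the third column is x squared. It only sees a design matrix, which is the sense in which the model is still linear.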
Multiple regression is pretty much the same as simple linear regression, except you have more than one predictor variable. But effects should be interpreted as conditional not marginal, and multi-collinearity should be checked (if important).
Recall from Chap. 1 that a critical assumption we often make in statistics is independence—when this assumption is violated, you are stuffed. Well, unless you have a good idea of how this assumption is violated. This chapter describes a few such situations where we expect data to be dependent and can specify a model to describe this dependence, so...
In Exercise 1.13, David and Alistair looked at invertebrate epifauna settling on algal beds (seaweed) with different levels of isolation (0, 2, or 10 m buffer) from each other, at two sampling times (5 and 10 weeks). They observed the following presence (+)/absence (−) patterns for crabs (across 10 replicates).
Recall that the type of regression model you use is determined mostly by the properties of the response variable. Well what if you have more than one response variable?
The most common type of multivariate data collected in ecology is also one of the most challenging types to analyse—when some abundance-related measure (e.g. counts, presence–absence, biomass) is simultaneously collected for all taxa or species encountered in a sample, as in Exercises 14.1–14.3. The rest of the book will focus on the analysis of th...
Consider again Lena’s wind farm study (Exercise 14.3). She would like to predict what fish occur where.
When studying how a community responds to its environment, it is typically the case that different taxa will respond in different ways. An important challenge for the ecologist is to go deeper (Shipley, From plant traits to vegetation structure: chance and selection in the assembly of ecological communities. Cambridge University Press, 2010; McGill...
In this chapter we will look at some other common fixed effects designs, all of which can be understood as special cases of the linear model.
In ecological community studies it is often of interest to study the effect of species related trait variables on abundances or presence‐absences. Specifically, the interest may lie in the interactions between environmental and trait variables. An increasingly popular approach for studying such interactions is to use the so‐called fourth‐corner mod...
Visualising data is a vital part of analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modeling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence-abse...
Point process models are a natural approach for modelling data that arise as point events. In the case of Poisson counts, these may be fitted easily as a weighted Poisson regression. Point processes lack the notion of sample size. This is problematic for model selection, because various classical criteria such as the Bayesian information criterion...
Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered as being distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochast...
Urbanised estuaries, ports and harbours are often utilised for recreational purposes, notably recreational angling. Yet there has been little quantitative assessment of the footprint and intensity of these activities at scales suitable for spatial management. Urban and industrialised estuaries have previously been considered as having low conservat...
Unmeasured or latent variables are often the cause of correlations between multivariate measurements and are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Genera...
Aim
Tropical species are thought to be more susceptible to climate warming than are higher latitude species. This prediction is largely based on the assumption that tropical species can tolerate a narrower range of temperatures. While this prediction holds for some animal taxa, we do not yet know the latitudinal trends in temperature tolerance for...
There has been rapid development in tools for multivariate analysis based on fully specified statistical models or ‘joint models’. One approach attracting a lot of attention is generalized linear latent variable models (GLLVMs). However, software for fitting these models is typically slow and not practical for large datasets.
The R package gllvm of...
Ecologists often investigate co‐occurrence patterns in multi‐species data in order to gain insight into the ecological causes of observed co‐occurrences. Apart from direct associations between the two species of interest, they may co‐occur because of indirect effects, where both species respond to another variable, whether environmental or biotic (...
A large array of species distribution model (SDM) approaches has been developed for explaining and predicting the occurrences of individual species or species assemblages. Given the wealth of existing models, it is unclear which models perform best for interpolation or extrapolation of existing data sets, particularly when one is concerned with spe...
Generalized linear latent variable models (GLLVM) are popular tools for modeling multivariate, correlated responses. Such data are often encountered, for instance, in ecological studies, where presence-absences, counts, or biomass of interacting species are collected from a set of sites. Until very recently, the main challenge in fitting GLLVMs has...
Proof of the variational approximation of the likelihood of GLLVMs.
(PDF)
Full results for the starting value comparisons.
(PDF)
Additional simulation results.
Results of the negative binomial GLLVM simulation for the Indonesian birds data and the Bernoulli GLLVM simulation for the testate amoebae data.
(PDF)
Ecologists often investigate co-occurrence patterns in multi-species data in order to gain insight into the ecological causes of observed co-occurrences. Apart from direct associations between two species, two species may co-occur because they both respond in similar ways to environmental variables, or due to the presence of other (mediator) specie...
Attrition-corrosion is a dental wear process containing both mechanical (attrition) and chemical (corrosion) effects. As contact load is a critical parameter for wear, its effects on enamel wear were investigated in vitro in the present study. Enamel cusp-on-flat configuration reciprocating wear tests were performed with acetic acid (at pH 3.2 and...
In ecology, the true causal structure for a given problem is often not known, and several plausible models and thus model predictions exist. It has been claimed that using weighted averages of these models can reduce prediction error, as well as better reflect model selection uncertainty. These claims, however, are often demonstrated by isolated ex...
We propose an algorithm that generalizes to discrete data any given covariance modeling algorithm originally intended for Gaussian responses, via a Gaussian copula approach. Covariance modeling is a powerful tool for extracting meaning from multivariate data, and fast algorithms for Gaussian data, such as factor analysis and Gaussian graphical mode...
Multivariate adaptive regression splines (MARS) is a popular nonparametric regression tool often used for prediction and for uncovering important data patterns between the response and predictor variables. The standard MARS algorithm assumes responses are normally distributed and independent, but in this article we relax both of these assumptions b...
In this paper we consider generalized linear latent variable models that can handle overdispersed counts and continuous but non-negative data. Such data are common in ecological studies when modelling multivariate abundances or biomass. By extending the standard generalized linear modelling framework to include latent variables, we can account for...
The mean‐variance relationship is a central property of multivariate abundances – it has been shown that when not accounted for, potentially serious artifacts can be introduced to analyses. One such effect is the confounding of location and dispersion.
Roberts (in press) recently argued that mean‐variance relationships are not important in understa...
Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)—common examples including logistic or Pois...
While data transformation is a common strategy to satisfy linear modeling assumptions, a theoretical result is used to show that transformation cannot reasonably be expected to stabilize variances for small counts. Under broad assumptions, as counts get smaller, it is shown that the variance becomes proportional to the mean under monotonic transfor...
1. Restoration of degraded plant communities requires understanding of community assembly processes. Human land use can influence plant community assembly by altering environmental conditions and species’ dispersal patterns. Flooding, including from environmental flows, may counteract land use effects on wetland vegetation. We examined the influence...
Cross‐disciplinary research between ecologists and statisticians has considerable potential for significant new advances, capitalising on recent advances in technology for collecting and analysing data.
We introduce a Special Feature of five papers showcasing interdisciplinary collaboration, centred around problems estimating biodiversity and how i...
Occupancy‐detection models that account for imperfect detection have become widely used in many areas of ecology. As with any modelling exercise, it is important to assess whether the fitted model encapsulates the main sources of variation in the data, yet there have been few methods developed for occupancy‐detection models that would allow practit...
Generalized Linear Latent Variable Models (GLLVMs) are a powerful class of models for understanding the relationships among multiple, correlated responses. Estimation, however, presents a major challenge, as the marginal likelihood does not possess a closed form for non-normal responses. We propose a variational approximation (VA) method for estimati...
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor p...
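One way to respect such structure when cross-validating is to hold out whole groups (for example, all observations from a site) rather than individual rows. The sketch below is a minimal illustration of that idea and is not necessarily the procedure recommended in the paper. The function names and the `site` labels are hypothetical.

```python
def blocked_folds(groups, k):
    """Assign whole groups to folds so no group is split between train and test."""
    unique = sorted(set(groups))
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds


def train_test_splits(groups, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = blocked_folds(groups, k)
    n = len(groups)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


# Hypothetical grouping: ten observations taken at five sites
site = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "E"]
splits = list(train_test_splits(site, k=3))

for train, test in splits:
    print("test sites:", sorted({site[i] for i in test}))
```

Because every observation from a held-out site leaves the training set together, the model is always evaluated on sites it has not seen. Splitting rows at random would instead let correlated observations from the same site appear on both sides of the split.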
Question
How do contrasting influences of inundation and historical land uses affect restoration of soil propagule bank composition in floodplain wetlands?
Location
Northern Nature Reserve (large ephemeral floodplain), Macquarie Marshes, New South Wales, Australia.
Methods
We conducted germination assays on soil samples collected from fields with...
Beginning in the mid-1980s, the Laurentian Great Lakes underwent successive invasions by Ponto-Caspian species. We quantified major changes in the diversity and relative abundance of pre-invasion benthic macroinvertebrates at the same study site in southwestern Lake Ontario from 1983–2014. The zebra mussel Dreissena polymorpha Pallas arrived at the...
Aim
A ‘good’ classification should provide information about the composition and abundance of the species within communities, if it serves as an informative surrogate for biodiversity. A natural way to formalize this is with a predictive model, where group membership (clusters) is the predictor, and multivariate species data (site by species matrix...
The two most common approaches for analysing count data are to use a generalized linear model (GLM), or transform data, and use a linear model (LM). The latter has recently been advocated to more reliably maintain control of type I error rates in tests for no association, while seemingly losing little in power. We make three points on this issu...
Multi-species distribution modeling, which relates the occurrence of multiple species to environmental variables, is an important tool used by ecologists for both predicting the distribution of species in a community and identifying the important variables driving species co-occurrences. Recently, Dunstan, Foster and Darnell [Ecol. Model. 222 (2011...
Choosing the number of components in a finite mixture model is a challenging task. In this article, we study the behaviour of information criteria for selecting the mixture order, based on either the observed likelihood or the complete likelihood including component labels. We propose a new observed likelihood criterion called aicmix, which is show...
In this paper, a case is made for the use of model-based approaches for the analysis of community data. This involves the direct specification of a statistical model for the observed multivariate data. Recent advances in statistical modelling mean that it is now possible to build models that are appropriate for the data which address key ecological...
Cross‐disciplinary research between ecologists and statisticians has led to significant advances in many different aspects of biodiversity modelling – including species distribution modelling, multivariate analysis and the measurement of diversity.
We introduce a Special Feature of seven papers showcasing recent interdisciplinary collaboration betw...
Presence‐only data are widely used for species distribution modelling, and point process regression models are a flexible tool that has considerable potential for this problem, when data arise as point events.
In this paper, we review point process models, some of their advantages and some common methods of fitting them to presence‐only data.
Advan...
The adaptive Lasso is a commonly applied penalty for variable selection in regression modeling. Like all penalties though, its performance depends critically on the choice of the tuning parameter. One method for choosing the tuning parameter is via information criteria, such as those based on AIC and BIC. However, these criteria were developed for...
For speciose, but poorly known groups, such as terrestrial arthropods, functional traits present a potential avenue to assist in predicting responses to environmental change. Species turnover is common along environmental gradients, but it is unclear how this is reflected in species traits. Community-level change in arthropod traits, other than bod...
A functional traits-based theory of organismal communities is critical for understanding the principles underlying community assembly, and predicting responses to environmental change. This is particularly true for terrestrial arthropods, of which only 20 % are described. Using epigaeic ant assemblages, we asked: (1) can we use morphological variat...
Shipley, Vile & Garnier (Science 2006; 314: 812) proposed a maximum entropy approach to studying how species relative abundance is mediated by their traits, ‘community assembly via trait selection’ (CATS).
In this paper, we build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivale...
Unconstrained ordination is commonly used in ecology to visualize multivariate data, in particular, to visualize the main trends between different sites in terms of their species composition or relative abundance.
Methods of unconstrained ordination currently used, such as non‐metric multidimensional scaling, are algorithm‐based techniques develope...
Spatial climate variables are routinely used in species distribution models (SDMs) without accounting for the fact that they have been predicted with uncertainty, which can lead to biased estimates, erroneous inference and poor performances when predicting to new settings – for example under climate change scenarios.
We show how information on unce...
Aim
We analyse how and why ‘topoclimate’ mapping methodologies improve on macroclimatic variables in modelling the distribution of biodiversity. Further, we consider the implications for climate change projections.
Location
Greater Hunter Valley region (c. 60,000 km²), New South Wales, Australia.
Methods
We fitted generalised linear models to 295 speci...
An important problem encountered by ecologists in species distribution modelling (SDM) and in multivariate analysis is that of understanding why environmental responses differ across species, and how differences are mediated by functional traits.
We describe a simple, generic approach to this problem – the core idea being to fit a predictive mode...
Non‐random species loss and gain in local communities change the compositional heterogeneity between communities over time, which is traditionally quantified with dissimilarity‐based approaches. Yet, dissimilarities summarize the multivariate species data into a univariate index and obscure the species‐level patterns of change, which are central to...
We propose a new variable selection criterion designed for use with forward selection algorithms; the score information criterion (SIC). The proposed criterion is based on score statistics which incorporate correlated response data. The main advantage of the SIC is that it is much faster to compute than existing model selection criteria when the nu...