David I Warton

UNSW Sydney | UNSW · School of Mathematics and Statistics

PhD Macquarie University

About

157
Publications
94,504
Reads
25,253
Citations
Additional affiliations
January 2004 - December 2013
UNSW Sydney
January 2000 - May 2003
Macquarie University
January 1997 - December 1999
The University of Sydney

Publications (157)
Article
Full-text available
1. A critical property of count data is its mean–variance relationship, yet this is rarely considered in multivariate analysis in ecology. 2. This study considers what is being implicitly assumed about the mean–variance relationship in distance-based analyses – multivariate analyses based on a matrix of pairwise distances – and what the effect is o...
Article
Full-text available
Summary Modeling the spatial distribution of a species is a fundamental problem in ecology. A number of modeling methods have been developed, an extremely popular one being MAXENT, a maximum entropy modeling approach. In this article, we show that MAXENT is equivalent to a Poisson regression model and hence is related to a Poisson point process mod...
Article
Technological advances have enabled a new class of multivariate models for ecology, with the potential now to specify a statistical model for abundances jointly across many taxa, to simultaneously explore interactions across taxa and the response of abundance to environmental variables. Joint models can be used for several purposes of interest to e...
Article
Visualising data is a key step in data analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modelling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence–...
Preprint
Full-text available
Multivariate random effects with unstructured variance-covariance matrices of large dimensions, $q$, can be a major challenge to estimate. In this paper, we introduce a new implementation of a reduced-rank approach to fit large dimensional multivariate random effects by writing them as a linear combination of $d < q$ latent variables. By adding red...
Preprint
Full-text available
Sample size estimation through power analysis is a fundamental tool in planning an ecological study, yet there are currently no well-established procedures for when multivariate abundances are to be collected. A power analysis procedure would need to address three challenges: designing a parsimonious simulation model that captures key community dat...
Article
Integrated distribution models (IDMs) predict where species might occur using data from multiple sources, a technique thought to be especially useful when data from any individual source are scarce. Recent advances allow us to fit such models with latent terms to account for dependence within and between data sources, but they are computationally c...
Article
Muscle volume must increase substantially during childhood growth to generate the power required to propel the growing body. One unresolved but fundamental question about childhood muscle growth is whether muscles grow at equal rates; that is, if muscles grow in synchrony with each other. In this study, we used magnetic resonance imaging (MRI) and...
Article
Full-text available
We introduce community‐level basis function models (CBFMs) as an approach for spatiotemporal joint distribution modelling. CBFMs can be viewed as related to spatiotemporal latent variable models, where the latent variables are replaced by a set of pre‐specified spatiotemporal basis functions which are common across species. In a CBFM, the coefficie...
Article
Full-text available
Sample size estimation through power analysis is a fundamental tool in planning an ecological study, yet there are currently no well‐established procedures for when multivariate abundances are to be collected. A power analysis procedure would need to address three challenges: designing a parsimonious simulation model that captures key community dat...
Article
Full-text available
In regression modelling, measurement error models are often needed to correct for uncertainty arising from measurements of covariates/predictor variables. The literature on measurement error (or errors-in-variables) modelling is plentiful, however, general algorithms and software for maximum likelihood estimation of models with measurement error ar...
Article
Full-text available
Log-Gaussian Cox processes (LGCPs) offer a framework for regression-style modeling of point patterns that can accommodate spatial latent effects. These latent effects can be used to account for missing predictors or other sources of clustering that could not be explained by a Poisson process. Fitting LGCP models can be challenging because the margi...
Article
Full-text available
The life span of leaves increases with their mass per unit area (LMA). It is unclear why. Here, we show that this empirical generalization (the foundation of the worldwide leaf economics spectrum) is a consequence of natural selection, maximizing average net carbon gain over the leaf life cycle. Analyzing two large leaf trait datasets, we show that...
Article
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Gen...
Article
Full-text available
Residual plots are often used to interrogate regression model assumptions, but interpreting them requires an understanding of how much sampling variation to expect when assumptions are satisfied. In this paper, we propose constructing global envelopes around data (or around trends fitted to data) on residual plots, exploiting recent advances that e...
Preprint
Full-text available
Residual plots are often used to interrogate regression model assumptions, but interpreting them requires an understanding of how much sampling variation to expect when assumptions are satisfied. In this paper, we propose constructing global envelopes around data (or around trends fitted to data) on residual plots, exploiting recent advances that e...
Article
Full-text available
The accurate extraction of species‐abundance information from DNA‐based data (metabarcoding, metagenomics) could contribute usefully to diet analysis and food‐web reconstruction, the inference of species interactions, the modelling of population dynamics and species distributions, the biomonitoring of environmental state and change, and the inferen...
Preprint
Full-text available
1. Sample size estimation through power analysis is a fundamental tool in planning an ecological study, yet there are currently no well-established procedures for when multivariate abundances are to be collected. A power analysis procedure would need to address three challenges: designing a parsimonious simulation model that captures key community...
Chapter
In this chapter we will revise two of the most commonly used statistical tools—the two-sample t-test and simple linear regression. Then we will see a remarkable equivalence—that these are actually exactly the same thing! This is a very important result; it will give us some intuition for how we can write most of the statistical techniques you have...
Chapter
Often it is not clear which model you should use for the data at hand—maybe because it is not known ahead of time which combination of variables should be used to predict the response, or maybe it is not obvious how the response should be modelled. In this chapter we will take a look at a few strategies for comparing different models and choosing b...
Chapter
No doubt you’ve done some stats before, probably in high school and at university, even though it might have been some time ago. I’m not expecting you to remember all of it, and in this chapter you will find some important lessons to reinforce before we get cracking.
Chapter
While the previous two chapters focused on studying community-environment associations, here we will focus on characterising associations between taxa within communities, as in Exercise 17.1.
Chapter
Any factor that is not treated as random is referred to as fixed. To this point, we have treated everything as fixed (fixed effects models). A model with both fixed and random effects in it is called a mixed effects model.
Chapter
Model-based inference: assume your model is correct (or nearly correct) and use theory, or sometimes simulation from your model, to make inferences. Most methods of inference we have discussed so far have been model-based: the confint function, summary, and anova as applied to lm or glm objects, Akaike information criterion, Bayesian information cri...
Chapter
This book has been a long journey! But this is not the end. Something I’ve learned over my career so far is that the more I know, the more I realise that I don’t know. I am regularly finding out about methods and problems I wasn’t previously aware of and new techniques that have recently been developed.
Chapter
Body size and brain size were recorded for 28 mammals.
Chapter
A key step in any analysis is data visualisation. This is a challenging topic for multivariate data, because (if we have more than two responses) it is not possible to jointly visualise the data in a way that captures correlation across responses or how they relate to predictors. In this chapter, we will discuss a few key techniques to try.
Chapter
Recall from Sect. 4.4.2 that a “linear model” does not need to be linear. Mathematically, we say a model is linear if it is a linear function of its coefficients (“something times β₁ plus something times β₂…”). But if we include non-linear functions of x as predictors, we can use this framework to fit a non-linear function of x to data.
Chapter
Multiple regression is pretty much the same as simple linear regression, except you have more than one predictor variable. But effects should be interpreted as conditional not marginal, and multi-collinearity should be checked (if important).
Chapter
Recall from Chap. 1 that a critical assumption we often make in statistics is independence—when this assumption is violated, you are stuffed. Well, unless you have a good idea of how this assumption is violated. This chapter describes a few such situations where we expect data to be dependent and can specify a model to describe this dependence, so...
Chapter
In Exercise 1.13, David and Alistair looked at invertebrate epifauna settling on algal beds (seaweed) with different levels of isolation (0, 2, or 10 m buffer) from each other, at two sampling times (5 and 10 weeks). They observed the following presence (+)/absence (−) patterns for crabs (across 10 replicates).
Chapter
Recall that the type of regression model you use is determined mostly by the properties of the response variable. Well what if you have more than one response variable?
Chapter
The most common type of multivariate data collected in ecology is also one of the most challenging types to analyse—when some abundance-related measure (e.g. counts, presence–absence, biomass) is simultaneously collected for all taxa or species encountered in a sample, as in Exercises 14.1–14.3. The rest of the book will focus on the analysis of th...
Chapter
Consider again Lena’s wind farm study (Exercise 14.3). She would like to predict what fish occur where.
Chapter
When studying how a community responds to its environment, it is typically the case that different taxa will respond in different ways. An important challenge for the ecologist is to go deeper (Shipley, From plant traits to vegetation structure: chance and selection in the assembly of ecological communities. Cambridge University Press, 2010; McGill...
Chapter
In this chapter we will look at some other common fixed effects designs, all of which can be understood as special cases of the linear model.
Article
Full-text available
In ecological community studies it is often of interest to study the effect of species‐related trait variables on abundances or presence‐absences. Specifically, the interest may lie in the interactions between environmental and trait variables. An increasingly popular approach for studying such interactions is to use the so‐called fourth‐corner mod...
Preprint
Full-text available
Visualising data is a vital part of analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modeling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence-abse...
Article
Point process models are a natural approach for modelling data that arise as point events. In the case of Poisson counts, these may be fitted easily as a weighted Poisson regression. Point processes lack the notion of sample size. This is problematic for model selection, because various classical criteria such as the Bayesian information criterion...
Article
Full-text available
Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered as being distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochast...
Article
Urbanised estuaries, ports and harbours are often utilised for recreational purposes, notably recreational angling. Yet there has been little quantitative assessment of the footprint and intensity of these activities at scales suitable for spatial management. Urban and industrialised estuaries have previously been considered as having low conservat...
Preprint
Full-text available
Unmeasured or latent variables are often the cause of correlations between multivariate measurements and are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Genera...
Article
Aim Tropical species are thought to be more susceptible to climate warming than are higher latitude species. This prediction is largely based on the assumption that tropical species can tolerate a narrower range of temperatures. While this prediction holds for some animal taxa, we do not yet know the latitudinal trends in temperature tolerance for...
Article
There has been rapid development in tools for multivariate analysis based on fully specified statistical models or ‘joint models’. One approach attracting a lot of attention is generalized linear latent variable models (GLLVMs). However, software for fitting these models is typically slow and not practical for large datasets. The R package gllvm of...
Article
Full-text available
Ecologists often investigate co‐occurrence patterns in multi‐species data in order to gain insight into the ecological causes of observed co‐occurrences. Apart from direct associations between the two species of interest, they may co‐occur because of indirect effects, where both species respond to another variable, whether environmental or biotic (...
Article
Full-text available
A large array of species distribution model (SDM) approaches has been developed for explaining and predicting the occurrences of individual species or species assemblages. Given the wealth of existing models, it is unclear which models perform best for interpolation or extrapolation of existing data sets, particularly when one is concerned with spe...
Article
Full-text available
Generalized linear latent variable models (GLLVM) are popular tools for modeling multivariate, correlated responses. Such data are often encountered, for instance, in ecological studies, where presence-absences, counts, or biomass of interacting species are collected from a set of sites. Until very recently, the main challenge in fitting GLLVMs has...
Data
Proof of the variational approximation of the likelihood of GLLVMs. (PDF)
Data
Full results for the starting value comparisons. (PDF)
Data
Additional simulation results. Results of the negative binomial GLLVM simulation for the Indonesian birds data and the Bernoulli GLLVM simulation for the testate amoebae data. (PDF)
Preprint
Full-text available
Ecologists often investigate co-occurrence patterns in multi-species data in order to gain insight into the ecological causes of observed co-occurrences. Apart from direct associations between two species, two species may co-occur because they both respond in similar ways to environmental variables, or due to the presence of other (mediator) specie...
Article
Attrition-corrosion is a dental wear process containing both mechanical (attrition) and chemical (corrosion) effects. As contact load is a critical parameter for wear, its effects on enamel wear were investigated in vitro in the present study. Enamel cusp-on-flat configuration reciprocating wear tests were performed with acetic acid (at pH 3.2 and...
Article
In ecology, the true causal structure for a given problem is often not known, and several plausible models and thus model predictions exist. It has been claimed that using weighted averages of these models can reduce prediction error, as well as better reflect model selection uncertainty. These claims, however, are often demonstrated by isolated ex...
Article
We propose an algorithm that generalizes to discrete data any given covariance modeling algorithm originally intended for Gaussian responses, via a Gaussian copula approach. Covariance modeling is a powerful tool for extracting meaning from multivariate data, and fast algorithms for Gaussian data, such as factor analysis and Gaussian graphical mode...
Article
Full-text available
Multivariate adaptive regression splines (MARS) is a popular nonparametric regression tool often used for prediction and for uncovering important data patterns between the response and predictor variables. The standard MARS algorithm assumes responses are normally distributed and independent, but in this article we relax both of these assumptions b...
Article
In this paper we consider generalized linear latent variable models that can handle overdispersed counts and continuous but non-negative data. Such data are common in ecological studies when modelling multivariate abundances or biomass. By extending the standard generalized linear modelling framework to include latent variables, we can account for...
Article
Full-text available
The mean‐variance relationship is a central property of multivariate abundances – it has been shown that when not accounted for, potentially serious artifacts can be introduced to analyses. One such effect is the confounding of location and dispersion. Roberts (in press) recently argued that mean‐variance relationships are not important in understa...
Article
Full-text available
Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)—common examples including logistic or Pois...
Data
Proofs of theorems. (PDF)
Article
Full-text available
While data transformation is a common strategy to satisfy linear modeling assumptions, a theoretical result is used to show that transformation cannot reasonably be expected to stabilize variances for small counts. Under broad assumptions, as counts get smaller, it is shown that the variance becomes proportional to the mean under monotonic transfor...
Article
1. Restoration of degraded plant communities requires understanding of community assembly processes. Human land use can influence plant community assembly by altering environmental conditions and species’ dispersal patterns. Flooding, including from environmental flows, may counteract land use effects on wetland vegetation. We examined the influence...
Article
Cross‐disciplinary research between ecologists and statisticians has considerable potential for significant new advances, capitalising on recent advances in technology for collecting and analysing data. We introduce a Special Feature of five papers showcasing interdisciplinary collaboration, centred around problems estimating biodiversity and how i...
Article
Occupancy‐detection models that account for imperfect detection have become widely used in many areas of ecology. As with any modelling exercise, it is important to assess whether the fitted model encapsulates the main sources of variation in the data, yet there have been few methods developed for occupancy‐detection models that would allow practit...
Article
Full-text available
Generalized Linear Latent Variable Models (GLLVMs) are a powerful class of models for understanding the relationships among multiple, correlated responses. Estimation however presents a major challenge, as the marginal likelihood does not possess a closed form for non-normal responses. We propose a variational approximation (VA) method for estimati...
Article
Full-text available
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor p...
Article
Question How do contrasting influences of inundation and historical land uses affect restoration of soil propagule bank composition in floodplain wetlands? Location Northern Nature Reserve (large ephemeral floodplain), Macquarie Marshes, New South Wales, Australia. Methods We conducted germination assays on soil samples collected from fields with...
Article
Full-text available
Beginning in the mid-1980s, the Laurentian Great Lakes underwent successive invasions by Ponto-Caspian species. We quantified major changes in the diversity and relative abundance of pre-invasion benthic macroinvertebrates at the same study site in southwestern Lake Ontario from 1983–2014. The zebra mussel Dreissena polymorpha Pallas arrived at the...
Article
Aim A ‘good’ classification should provide information about the composition and abundance of the species within communities, if it serves as an informative surrogate for biodiversity. A natural way to formalize this is with a predictive model, where group membership (clusters) is the predictor, and multivariate species data (site by species matrix...
Article
The two most common approaches for analysing count data are to use a generalized linear model (GLM), or to transform data and use a linear model (LM). The latter has recently been advocated to more reliably maintain control of type I error rates in tests for no association, while seemingly losing little in power. We make three points on this issu...
Article
Full-text available
Multi-species distribution modeling, which relates the occurrence of multiple species to environmental variables, is an important tool used by ecologists for both predicting the distribution of species in a community and identifying the important variables driving species co-occurrences. Recently, Dunstan, Foster and Darnell [Ecol. Model. 222 (2011...
Article
Full-text available
Choosing the number of components in a finite mixture model is a challenging task. In this article, we study the behaviour of information criteria for selecting the mixture order, based on either the observed likelihood or the complete likelihood including component labels. We propose a new observed likelihood criterion called aicmix, which is show...
Article
Full-text available
In this paper, a case is made for the use of model-based approaches for the analysis of community data. This involves the direct specification of a statistical model for the observed multivariate data. Recent advances in statistical modelling mean that it is now possible to build models that are appropriate for the data which address key ecological...
Article
Cross‐disciplinary research between ecologists and statisticians has led to significant advances in many different aspects of biodiversity modelling – including species distribution modelling, multivariate analysis and the measurement of diversity. We introduce a Special Feature of seven papers showcasing recent interdisciplinary collaboration betw...
Article
Presence‐only data are widely used for species distribution modelling, and point process regression models are a flexible tool that has considerable potential for this problem, when data arise as point events. In this paper, we review point process models, some of their advantages and some common methods of fitting them to presence‐only data. Advan...
Article
Full-text available
The adaptive Lasso is a commonly applied penalty for variable selection in regression modeling. Like all penalties though, its performance depends critically on the choice of the tuning parameter. One method for choosing the tuning parameter is via information criteria, such as those based on AIC and BIC. However, these criteria were developed for...
Article
For speciose, but poorly known groups, such as terrestrial arthropods, functional traits present a potential avenue to assist in predicting responses to environmental change. Species turnover is common along environmental gradients, but it is unclear how this is reflected in species traits. Community-level change in arthropod traits, other than bod...
Article
A functional traits-based theory of organismal communities is critical for understanding the principles underlying community assembly, and predicting responses to environmental change. This is particularly true for terrestrial arthropods, of which only 20 % are described. Using epigaeic ant assemblages, we asked: (1) can we use morphological variat...
Article
Shipley, Vile & Garnier (Science 2006; 314: 812) proposed a maximum entropy approach to studying how species relative abundance is mediated by their traits, ‘community assembly via trait selection’ (CATS). In this paper, we build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivale...
Article
Unconstrained ordination is commonly used in ecology to visualize multivariate data, in particular, to visualize the main trends between different sites in terms of their species composition or relative abundance. Methods of unconstrained ordination currently used, such as non‐metric multidimensional scaling, are algorithm‐based techniques develope...
Article
Full-text available
Spatial climate variables are routinely used in species distribution models (SDMs) without accounting for the fact that they have been predicted with uncertainty, which can lead to biased estimates, erroneous inference and poor performances when predicting to new settings – for example under climate change scenarios. We show how information on unce...
Article
Aim We analyse how and why ‘topoclimate’ mapping methodologies improve on macroclimatic variables in modelling the distribution of biodiversity. Further, we consider the implications for climate change projections. Location Greater Hunter Valley region (c. 60,000 km²), New South Wales, Australia. Methods We fitted generalised linear models to 295 speci...
Article
An important problem encountered by ecologists in species distribution modelling (SDM) and in multivariate analysis is that of understanding why environmental responses differ across species, and how differences are mediated by functional traits. We describe a simple, generic approach to this problem – the core idea being to fit a predictive mode...
Article
Non‐random species loss and gain in local communities change the compositional heterogeneity between communities over time, which is traditionally quantified with dissimilarity‐based approaches. Yet, dissimilarities summarize the multivariate species data into a univariate index and obscure the species‐level patterns of change, which are central to...
Article
We propose a new variable selection criterion designed for use with forward selection algorithms; the score information criterion (SIC). The proposed criterion is based on score statistics which incorporate correlated response data. The main advantage of the SIC is that it is much faster to compute than existing model selection criteria when the nu...