# David I Warton

UNSW Sydney | UNSW · School of Mathematics and Statistics

PhD Macquarie University

## About

122

Publications

68,053

Reads

18,183

Citations

Additional affiliations

January 2004 - December 2013

January 2000 - May 2003

January 1997 - December 1999

## Publications

1. A critical property of count data is its mean–variance relationship, yet this is rarely considered in
multivariate analysis in ecology.
2. This study considers what is being implicitly assumed about the mean–variance relationship in
distance-based analyses – multivariate analyses based on a matrix of pairwise distances – and what
the effect is o...

Summary Modeling the spatial distribution of a species is a fundamental problem in ecology. A number of modeling methods have been developed, an extremely popular one being MAXENT, a maximum entropy modeling approach. In this article, we show that MAXENT is equivalent to a Poisson regression model and hence is related to a Poisson point process mod...

Technological advances have enabled a new class of multivariate models for ecology, with the potential now to specify a statistical model for abundances jointly across many taxa, to simultaneously explore interactions across taxa and the response of abundance to environmental variables. Joint models can be used for several purposes of interest to e...

While data transformation is a common strategy to satisfy linear modeling assumptions, a theoretical result is used to show that transformation cannot reasonably be expected to stabilize variances for small counts. Under broad assumptions, as counts get smaller, it is shown that the variance becomes proportional to the mean under monotonic transfor...
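The small-count behaviour described above can be checked numerically. The following is a minimal sketch, not the paper's derivation: it assumes Poisson-distributed counts and the log(y + 1) transformation (both are illustrative choices, as is the helper name `transformed_moments`), and computes the exact variance of the transformed count by summing over the Poisson probabilities.

```python
import math

def transformed_moments(mu, g=math.log1p, max_count=200):
    """Mean and variance of g(Y) for Y ~ Poisson(mu), by direct summation.

    Assumptions (illustrative only): Poisson counts, g(y) = log(y + 1),
    and a truncation point max_count that is ample for small mu.
    """
    pmf = math.exp(-mu)          # P(Y = 0)
    mean = 0.0
    mean_sq = 0.0
    for k in range(max_count + 1):
        val = g(k)
        mean += pmf * val
        mean_sq += pmf * val * val
        pmf *= mu / (k + 1)      # recurrence: P(Y = k + 1) from P(Y = k)
    return mean, mean_sq - mean ** 2

# As mu shrinks, var(g(Y)) / mu settles towards a constant,
# i.e. the variance is roughly proportional to the mean.
for mu in (1.0, 0.1, 0.01, 0.001):
    _, var = transformed_moments(mu)
    print(f"mu = {mu:<6} var / mu = {var / mu:.4f}")
```

For very small means the count is almost always 0 or 1, so the transformed variable behaves like a rescaled Bernoulli variable and the ratio approaches (g(1) − g(0))², which for log(y + 1) is (log 2)².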

1. Visualising data is a key step in data analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modeling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presenc...

In ecological community studies it is often of interest to study the effect of species related trait variables on abundances or presence‐absences. Specifically, the interest may lie in the interactions between environmental and trait variables. An increasingly popular approach for studying such interactions is to use the so‐called fourth‐corner mod...

Visualising data is a vital part of analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modeling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence-abse...

Point process models are a natural approach for modelling data that arise as point events. In the case of Poisson counts, these may be fitted easily as a weighted Poisson regression. Point processes lack the notion of sample size. This is problematic for model selection, because various classical criteria such as the Bayesian information criterion...

Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered as being distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochast...

Urbanised estuaries, ports and harbours are often utilised for recreational purposes, notably recreational angling. Yet there has been little quantitative assessment of the footprint and intensity of these activities at scales suitable for spatial management. Urban and industrialised estuaries have previously been considered as having low conservat...

Unmeasured or latent variables are often the cause of correlations between multivariate measurements and are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Genera...

Aim
Tropical species are thought to be more susceptible to climate warming than are higher latitude species. This prediction is largely based on the assumption that tropical species can tolerate a narrower range of temperatures. While this prediction holds for some animal taxa, we do not yet know the latitudinal trends in temperature tolerance for...

1. There has been rapid development in tools for multivariate analysis based on fully specified statistical models or “joint models”. One approach attracting a lot of attention is generalized linear latent variable models (GLLVMs). However, software for fitting these models is typically slow and not practical for large datasets.
2. The R package gllvm...

1. Ecologists often investigate co‐occurrence patterns in multi‐species data in order to gain insight into the ecological causes of observed co‐occurrences. Apart from direct associations between the two species of interest, they may co‐occur because of indirect effects, where both species respond to another variable, whether environmental or biotic...

A large array of species distribution model (SDM) approaches have been developed for explaining and predicting the occurrences of individual species or species assemblages. Given the wealth of existing models, it is unclear which models perform best for interpolation or extrapolation of existing data sets, particularly when one is concerned with sp...

Generalized linear latent variable models (GLLVM) are popular tools for modeling multivariate, correlated responses. Such data are often encountered, for instance, in ecological studies, where presence-absences, counts, or biomass of interacting species are collected from a set of sites. Until very recently, the main challenge in fitting GLLVMs has...

Proof of the variational approximation of the likelihood of GLLVMs.
(PDF)

Full results for the starting value comparisons.
(PDF)

Additional simulation results.
Results of the negative binomial GLLVM simulation for the Indonesian birds data and the Bernoulli GLLVM simulation for the testate amoebae data.
(PDF)

Ecologists often investigate co-occurrence patterns in multi-species data in order to gain insight into the ecological causes of observed co-occurrences. Apart from direct associations between two species, two species may co-occur because they both respond in similar ways to environmental variables, or due to the presence of other (mediator) specie...

Attrition-corrosion is a dental wear process containing both mechanical (attrition) and chemical (corrosion) effects. As contact load is a critical parameter for wear, its effects on enamel wear were investigated in vitro in the present study. Enamel cusp-on-flat configuration reciprocating wear tests were performed with acetic acid (at pH 3.2 and...

In ecology, the true causal structure for a given problem is often not known, and several plausible models and thus model predictions exist. It has been claimed that using weighted averages of these models can reduce prediction error, as well as better reflect model selection uncertainty. These claims, however, are often demonstrated by isolated ex...

We propose an algorithm that generalizes to discrete data any given covariance modeling algorithm originally intended for Gaussian responses, via a Gaussian copula approach. Covariance modeling is a powerful tool for extracting meaning from multivariate data, and fast algorithms for Gaussian data, such as factor analysis and Gaussian graphical mode...

Multivariate adaptive regression splines (MARS) is a popular nonparametric regression tool often used for prediction and for uncovering important data patterns between the response and predictor variables. The standard MARS algorithm assumes responses are normally distributed and independent, but in this article we relax both of these assumptions b...

In this paper we consider generalized linear latent variable models that can handle overdispersed counts and continuous but non-negative data. Such data are common in ecological studies when modelling multivariate abundances or biomass. By extending the standard generalized linear modelling framework to include latent variables, we can account for...

The mean-variance relationship is a central property of multivariate abundances – it has been shown that when not accounted for, potentially serious artifacts can be introduced to analyses. One such effect is the confounding of location and dispersion. Roberts (in press) recently argued that mean-variance relationships are not important in understa...

Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)—common examples including logistic or Pois...

1. Restoration of degraded plant communities requires understanding of community assembly processes. Human land use can influence plant community assembly by altering environmental conditions and species’ dispersal patterns. Flooding, including from environmental flows, may counteract land use effects on wetland vegetation. We examined the influence...

Cross-disciplinary research between ecologists and statisticians has considerable potential for significant new advances, capitalising on recent advances in technology for collecting and analysing data. We introduce a Special Feature of five papers showcasing interdisciplinary collaboration, centred around problems estimating biodiversity and how i...

Occupancy-detection models that account for imperfect detection have become widely used in many areas of ecology. As with any modelling exercise, it is important to assess whether the fitted model encapsulates the main sources of variation in the data, yet there have been few methods developed for occupancy-detection models that would allow practit...

Generalized Linear Latent Variable Models (GLLVMs) are a powerful class of models for understanding the relationships among multiple, correlated responses. Estimation however presents a major challenge, as the marginal likelihood does not possess a closed form for non-normal responses. We propose a variational approximation (VA) method for estimati...

Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor p...

How do contrasting influences of inundation and historical land uses affect restoration of soil propagule bank composition in floodplain wetlands? Northern Nature Reserve (large ephemeral floodplain), Macquarie Marshes, New South Wales, Australia. We conducted germination assays on soil samples collected from fields with different land use histories...

Beginning in the mid-1980s, the Laurentian Great Lakes underwent successive invasions by Ponto-Caspian species. We quantified major changes in the diversity and relative abundance of pre-invasion benthic macroinvertebrates at the same study site in southwestern Lake Ontario from 1983–2014. The zebra mussel Dreissena polymorpha Pallas arrived at the...

A ‘good’ classification should provide information about the composition and abundance of the species within communities, if it serves as an informative surrogate for biodiversity. A natural way to formalize this is with a predictive model, where group membership (clusters) is the predictor, and multivariate species data (site by species matrix) is...

The two most common approaches for analysing count data are to use a generalized linear model (GLM), or transform data, and use a linear model (LM). The latter has recently been advocated to more reliably maintain control of type I error rates in tests for no association, while seemingly losing little in power. We make three points on this issue. P...

Multi-species distribution modeling, which relates the occurrence of multiple
species to environmental variables, is an important tool used by ecologists for
both predicting the distribution of species in a community and identifying the
important variables driving species co-occurrences. Recently, Dunstan, Foster
and Darnell [Ecol. Model. 222 (2011...

Choosing the number of components in a finite mixture model is a challenging task. In this article, we study the behaviour
of information criteria for selecting the mixture order, based on either the observed likelihood or the complete likelihood
including component labels. We propose a new observed likelihood criterion called aicmix, which is show...

In this paper, a case is made for the use of model-based approaches for the analysis of community data. This involves the direct specification of a statistical model for the observed multivariate data. Recent advances in statistical modelling mean that it is now possible to build models that are appropriate for the data which address key ecological...

Cross-disciplinary research between ecologists and statisticians has led to significant advances in many different aspects of biodiversity modelling – including species distribution modelling, multivariate analysis and the measurement of diversity. We introduce a Special Feature of seven papers showcasing recent interdisciplinary collaboration betwe...

1. Presence-only data are widely used for species distribution modelling, and point process regression models are a flexible tool that has considerable potential for this problem, when data arise as point events.
2. In this paper we review point process models, some of their advantages, and some common methods of fitting them to presence-only data.
3. Adv...

The adaptive Lasso is a commonly applied penalty for variable selection in regression modeling. Like all penalties though, its performance depends critically on the choice of the tuning parameter. One method for choosing the tuning parameter is via information criteria, such as those based on AIC and BIC. However, these criteria were developed for...

For speciose, but poorly known groups, such as terrestrial arthropods, functional traits present a potential avenue to assist in predicting responses to environmental change. Species turnover is common along environmental gradients, but it is unclear how this is reflected in species traits. Community-level change in arthropod traits, other than bod...

A functional traits-based theory of organismal communities is critical for understanding the principles underlying community assembly, and predicting responses to environmental change. This is particularly true for terrestrial arthropods, of which only 20 % are described. Using epigaeic ant assemblages, we asked: (1) can we use morphological variat...

1. Shipley et al. (2006) proposed a maximum entropy approach to studying how species relative abundance is mediated by their traits, “community assembly via trait selection” (CATS).
2. In this paper we build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivalent to a generalised linear model...

1. Unconstrained ordination is commonly used in ecology to visualise multivariate data, in particular, to visualise the main trends between different sites in terms of their species composition or relative abundance.
2. Methods of unconstrained ordination currently used, such as non-metric multi-dimensional scaling, are algorithm-based techniques deve...

1. Spatial climate variables are routinely used in species distribution models (SDMs) without accounting for the fact that they have been predicted with uncertainty, which can lead to biased estimates, erroneous inference and poor performances when predicting to new settings - e.g. under climate change scenarios.
2. We show how information on uncertai...

Aim
We analyse how and why ‘topoclimate’ mapping methodologies improve on macroclimatic variables in modelling the distribution of biodiversity. Further, we consider the implications for climate change projections.
Location
Greater Hunter Valley region (c. 60,000 km²), New South Wales, Australia.
Methods
We fitted generalised linear models to 295 speci...

An important problem encountered by ecologists in species distribution modelling and in multivariate analysis is that of understanding why environmental responses differ across species, and how differences are mediated by functional traits. We describe a simple, generic approach to this problem – the core idea being to fit a predictive model for spe...

Non-random species loss and gain in local communities change the compositional heterogeneity between communities over time, which is traditionally quantified with dissimilarity-based approaches. Yet, dissimilarities summarize the multivariate species data into a univariate index and obscure the species-level patterns of change, which are central to...

We propose a new variable selection criterion designed for use with forward selection algorithms; the score information criterion (SIC). The proposed criterion is based on score statistics which incorporate correlated response data. The main advantage of the SIC is that it is much faster to compute than existing model selection criteria when the nu...

Presence-only data, where information is available concerning species presence but not species absence, are subject to bias due to observers being more likely to visit and record sightings at some locations than others (hereafter "observer bias"). In this paper, we describe and evaluate a model-based approach to accounting for observer bias directl...

A topic of particular current interest is community-level approaches to species distribution modelling (SDM), i.e. approaches that simultaneously analyse distributional data for multiple species. Previous studies have looked at the advantages of community-level approaches for parameter estimation, but not for model selection – the process of choo...

Understanding how species distributions respond as a function of environmental gradients is a key question in ecology, and will benefit from a multi-species approach. Multi-species data are often high dimensional, in that the number of species sampled is often large relative to the number of sites, and are commonly quantified as either presence–abs...

Species distribution models (SDMs) are an important tool for studying the patterns of species across environmental and geographic space. For community data, a common approach involves fitting an SDM to each species separately, although the large number of models makes interpretation difficult and fails to exploit any similarities between individual...

Background/Question/Methods
We aimed to test the widely held idea that introduced species have dispersal and recruitment advantages over native species. We compiled data on mean and maximum dispersal distance for 56 introduced and 367 native species, and data on survival through germination (seed to seedling survival), early seedling survival (su...

We provide the first global test of the idea that introduced species have greater seed dispersal distances than do native species, using data for 51 introduced and 360 native species from the global literature. Counter to our expectations, there was no significant difference in mean or maximum dispersal distance between introduced and native specie...

Graphs of relationships between dispersal distance of introduced vs. native species when accounting for seed mass, plant height or dispersal syndrome individually.
(DOC)

Details of analyses of dispersal distance of native vs. introduced species when accounting for seed mass, plant height or dispersal syndrome, individually.
(DOC)

Attributes of introduced and native species.
(DOC)

Total number of missing values for trait data.
(DOC)

Supporting Information S1.
Analyses including a random effect for site.
(DOC)

Comparisons of introduced and native species’ seed dispersal distances using a subset of data with no missing values.
(DOC)

In allometry, the study of how size variables scale against each other, it is often of interest to fit lines to bivariate data and test hypotheses about the slope and elevation of one or several lines. The nature of the problem suggests that bivariate techniques related to principal component analysis are more appropriate than linear regression. Inf...
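One widely used line-fitting technique of this kind is the standardised major axis (SMA), whose slope is the ratio of the two standard deviations, signed by the correlation. The sketch below is an illustration under assumptions, not code from the paper or from any package: the function name `sma_line` is hypothetical, and in allometry the inputs would typically be log-transformed size variables.

```python
import math

def sma_line(x, y):
    """Standardised major axis fit: returns (slope, intercept).

    slope     = sign(cov(x, y)) * sd(y) / sd(x)
    intercept = mean(y) - slope * mean(x)
    Hypothetical helper for illustration; if cov(x, y) is exactly zero
    the sign defaults to positive.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Usage on made-up (already log-scale) data
log_x = [0.1, 0.5, 0.9, 1.4, 2.0]
log_y = [0.3, 0.6, 1.2, 1.5, 2.4]
slope, intercept = sma_line(log_x, log_y)
print(f"SMA slope = {slope:.3f}, elevation = {intercept:.3f}")
```

Unlike the ordinary least-squares slope, which equals r · sd(y)/sd(x) and so shrinks towards zero as scatter increases, the SMA slope treats the two variables symmetrically.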

1. The problems of analysing used-available data and presence-only data are equivalent, and this paper uses this equivalence as a platform for exploring opportunities for advancing analysis methodology.
2. We suggest some potential methodological advances in used-available analysis, made possible via lessons learnt in the presence-only literature,...

Ecologists are increasingly recognizing the conservation significance of microrefugia, but it is inherently difficult to locate these small patches with unusual climates, and hence they are also referred to as cryptic refugia. Here we introduce a new methodology to quantify and locate potential microrefugia using fine-scale topoclimatic grids that...

1. The mvabund package for R provides tools for model-based analysis of multivariate abundance
data in ecology.
2. This includes methods for visualising data, fitting predictive models, checking model assumptions,
as well as testing hypotheses about the community–environment association.
3. This paper briefly introduces the package and demonstrates...

1. The Standardised Major Axis Tests and Routines (SMATR) software provides tools for estimation and inference about allometric lines, currently widely used in ecology and evolution.
2. This paper describes some significant improvements to the functionality of the package, now available on R in smatr version 3.
3. New inclusions in the package in...