Article · PDF available

Does model-free forecasting really outperform the true model?

Abstract and Figures

Estimating population models from uncertain observations is an important problem in ecology. Perretti et al. observed that standard Bayesian state-space solutions to this problem may provide biased parameter estimates when the underlying dynamics are chaotic. Consequently, forecasts based on these estimates showed poor predictive accuracy compared to simple "model-free" methods, which led Perretti et al. to conclude that "Model-free forecasting outperforms the correct mechanistic model for simulated and experimental data". However, a simple modification of the statistical methods suffices to remove the bias and reverse their results.
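The bias at issue is easy to reproduce in miniature: fitting a chaotic map naively to noise-corrupted observations is an errors-in-variables problem, which attenuates the parameter estimate. The sketch below (a toy illustration with made-up values, not the letter's actual estimator) assumes a logistic map observed with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_map(r, x0, n):
    """Iterate the chaotic logistic map x_{t+1} = r * x_t * (1 - x_t)."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = r * x[t] * (1 - x[t])
    return x

# True chaotic dynamics, observed with noise -- the setting of the debate.
true_r = 3.8
states = logistic_map(true_r, 0.4, 200)
obs = states + rng.normal(0.0, 0.1, size=states.shape)

# Naive least-squares fit of r to the observed one-step pairs: regressing
# obs[t+1] on obs[t] * (1 - obs[t]) puts noise on both sides of the
# regression (errors-in-variables), so the estimate is biased toward zero --
# the kind of bias that careful state-space fitting is meant to remove.
g = obs[:-1] * (1 - obs[:-1])
r_naive = np.sum(g * obs[1:]) / np.sum(g * g)
print(f"true r = {true_r}, naive estimate = {r_naive:.2f}")
```

Modelling the latent states explicitly (or otherwise correcting the likelihood for observation error) is what removes this attenuation.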
... We view this work as complementary to recent publications on forecasting [41,42,44]. The authors of [41,42] advocate nonparametric methods over parametric methods in general, while a letter [44] addressing the work of [41] showed that a more sophisticated method for model fitting results in better parameter estimates and therefore model-based predictions which outperform model-free methods. Our results support the view that no one method is uniformly better than the other. ...
Article
Full-text available
Scientific analysis often relies on the ability to make accurate predictions of a system's dynamics. Mechanistic models, parameterized by a number of unknown parameters, are often used for this purpose. Accurate estimation of the model state and parameters prior to prediction is necessary, but may be complicated by issues such as noisy data and uncertainty in parameters and initial conditions. At the other end of the spectrum exist nonparametric methods, which rely solely on data to build their predictions. While these nonparametric methods do not require a model of the system, their performance is strongly influenced by the amount and noisiness of the data. In this article, we consider a hybrid approach to modeling and prediction which merges recent advancements in nonparametric analysis with standard parametric methods. The general idea is to replace a subset of a mechanistic model's equations with their corresponding nonparametric representations, resulting in a hybrid modeling and prediction scheme. Overall, we find that this hybrid approach allows for more robust parameter estimation and improved short-term prediction in situations where there is a large uncertainty in model parameters. We demonstrate these advantages in the classical Lorenz-63 chaotic system and in networks of Hindmarsh-Rose neurons before application to experimentally collected structured population data.
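The equation-replacement idea above can be sketched on a toy map. Below, the Hénon map stands in for a mechanistic model (an illustrative choice, not the system used in the article): the y-equation is kept mechanistic while the x-equation is replaced by a nearest-neighbour estimate learned from data.

```python
import numpy as np

# Henon map: x' = 1 - 1.4 x^2 + y, y' = 0.3 x  (toy stand-in for a model).
def henon(x, y):
    return 1.0 - 1.4 * x**2 + y, 0.3 * x

# Generate a training trajectory on the attractor.
traj = [(0.1, 0.1)]
for _ in range(2000):
    traj.append(henon(*traj[-1]))
traj = np.array(traj)

def hybrid_step(x, y, lib=traj, k=3):
    """Hybrid update: the x-equation is replaced by a k-nearest-neighbour
    estimate from data, while the y-equation stays mechanistic."""
    d = np.hypot(lib[:-1, 0] - x, lib[:-1, 1] - y)
    nn = np.argsort(d)[:k]           # nearest historical states
    x_next = lib[nn + 1, 0].mean()   # nonparametric half
    y_next = 0.3 * x                 # known mechanistic half
    return x_next, y_next

# Two-step hybrid forecast versus the true map.
state = true_state = tuple(traj[500])
for _ in range(2):
    state = hybrid_step(*state)
    true_state = henon(*true_state)
print("hybrid:", np.round(state, 3), "true:", np.round(true_state, 3))
```

With enough library data the hybrid tracks the true map closely over short horizons, which is the regime the article targets.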
... In this section, we show that Gibbs-LAIS can be useful in complex inference scenarios where sophisticated MCMC techniques seem to fail [43,44]. We consider the problem of estimating the parameters of a chaotic system, which is considered a very challenging framework in the literature [22,43,44]. This is due to the very tight and sharp posteriors induced by this model. ...
Article
Full-text available
Monte Carlo sampling methods are the standard procedure for approximating complicated integrals of multidimensional posterior distributions in Bayesian inference. In this work, we focus on the class of layered adaptive importance sampling algorithms, which is a family of adaptive importance samplers where Markov chain Monte Carlo algorithms are employed to drive an underlying multiple importance sampling scheme. The modular nature of the layered adaptive importance sampling scheme allows for different possible implementations, yielding a variety of different performances and computational costs. In this work, we propose different enhancements of the classical layered adaptive importance sampling setting in order to increase the efficiency and reduce the computational cost, of both upper and lower layers. The different variants address computational challenges arising in real-world applications, for instance with highly concentrated posterior distributions. Furthermore, we introduce different strategies for designing cheaper schemes, for instance, recycling samples generated in the upper layer and using them in the final estimators in the lower layer. Different numerical experiments show the benefits of the proposed schemes, comparing with benchmark methods presented in the literature, and in several challenging scenarios.
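The two-layer idea is easy to sketch: an upper MCMC layer proposes locations, and a lower importance-sampling layer draws from a mixture of proposals centred at those locations. The following is a minimal, generic LAIS-style scheme on a toy 1-D target; all tuning values are illustrative, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(x):
    """Toy unnormalised log-posterior: N(3, 0.5^2)."""
    return -0.5 * ((x - 3.0) / 0.5) ** 2

# Upper layer: a random-walk Metropolis chain whose states become the
# location parameters of the lower layer's proposals.
means, x = [], 0.0
for _ in range(200):
    prop = x + rng.normal(0.0, 1.0)
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    means.append(x)
means = np.array(means)

# Lower layer: draw one sample per location and weight it against the
# full (deterministic) mixture of all proposals -- multiple importance
# sampling, which keeps the weights numerically stable.
sigma = 1.0
samples = rng.normal(means, sigma)
mix_pdf = np.array([np.mean(np.exp(-0.5 * ((s - means) / sigma) ** 2))
                    for s in samples]) / (sigma * np.sqrt(2.0 * np.pi))
log_w = log_target(samples) - np.log(mix_pdf)
w = np.exp(log_w - log_w.max())
est_mean = np.sum(w * samples) / np.sum(w)
print(f"self-normalised estimate of E[x] = {est_mean:.2f}")
```

The enhancements the paper proposes (recycling upper-layer samples, cheaper mixture evaluations) all operate on this basic two-layer skeleton.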
... Without prior information, observation error variances are often poorly identified, which then creates bias in other estimates (Knape, 2008; Lebreton & Gimenez, 2013; Auger-Méthé et al., 2016; Certain et al., 2018). Such biases are likely to be even stronger if the dynamics contain large fluctuations, especially chaotic ones, and more refined model-fitting techniques may have to be employed (Wood, 2010; Hartig & Dormann, 2013). ...
Article
Full-text available
Inferring interactions between populations of different species is a challenging statistical endeavour, which requires a large amount of data. There is therefore some incentive to combine all available sources of data into a single analysis to do so. In demography and single-population studies, Integrated Population Models combine population counts, capture-recapture and reproduction data to fit matrix population models. Here, we extend this approach to the community level in a stage-structured predator-prey context. We develop Integrated Community Models (ICMs), implemented in a Bayesian framework, to fit multispecies nonlinear matrix models to multiple data sources. We assessed the value of the different sources of data using simulations of ICMs under different scenarios contrasting data availability. We found that combining all data types (capture-recapture, counts, and reproduction) allows the estimation of both demographic and interaction parameters, unlike count-only data which typically generate high bias and low precision in interaction parameter estimates for short time series. Moreover, reproduction surveys informed the estimation of interactions particularly well when compared to capture-recapture programs, and have the advantage of being less costly. Overall, ICMs offer an accurate representation of stage structure in community dynamics, and foster the development of efficient observational study designs to monitor communities in the field.
... There are two reasons why this seems improbable. First, the complex nature of the microbial interactions we have described, even with such data, still presents difficult challenges for formulating all of the relationships involved mathematically (Hartig and Dormann 2013; Perretti et al. 2013a, b; De Angelis and Yurek 2015). Moreover, the time-dependency of species relationships also makes it difficult to describe the dynamics with a simple mathematical formulation, even with longer and more precise time series. ...
Article
1. Mapping the network of ecological interactions is key to understanding the composition, stability, function and dynamics of microbial communities. In recent years various approaches have been used to reveal microbial interaction networks from metagenomic sequencing data, such as time-series analysis, machine learning and statistical techniques. Despite these efforts it is still not possible to capture details of the ecological interactions behind complex microbial dynamics. 2. We developed the sparse S-map method (SSM), which generates a sparse interaction network from a multivariate ecological time-series without presuming any mathematical formulation for the underlying microbial processes. The advantage of the SSM over alternative methodologies is that it fully utilizes the observed data using a framework of empirical dynamic modelling. This makes the SSM robust to non-equilibrium dynamics and underlying complexity (nonlinearity) in microbial processes. 3. We showed that an increase in dataset size or a decrease in observational error improved the accuracy of the SSM, whereas the accuracy of a comparative equation-based method was almost unchanged in both cases and equivalent to the SSM at best. Hence, the SSM outperformed the comparative equation-based method when datasets were large and the magnitude of observational errors was small. The results were robust to the magnitude of process noise and the functional forms of inter-specific interactions that we tested. We applied the method to microbiome data from six mice and found that there were different microbial interaction regimes between young-to-middle-age (4-40 week-old) and middle-to-old-age (36-72 week-old) mice. 4. The complexity of microbial relationships impedes detailed equation-based modeling.
Our method provides a powerful alternative framework to infer ecological interaction networks of microbial communities in various environments, and will be improved by further developments in metagenomic sequencing technologies leading to increased dataset size and improved accuracy and precision.
... Strong autocorrelation can result in statistical models being relatively good predictors, even compared to models that contain covariates such as climate and other species (Bahn & McGill 2007). Furthermore, simple state-space reconstructions based on relatively little observed data can outperform more complex mechanistic models (though see Hartig & Dormann 2013; Perretti et al. 2013a,b) and can still distinguish causality from correlation (Sugihara et al. 2012). Similarly, the most accurate model of some wild animal population dynamics was the one that used the most recent observation as the forecast (Ward et al. 2014), and statistical models of species distributions have outperformed more mechanistic ones (Bahn & McGill 2007). ...
Article
Full-text available
Forecasts of ecological dynamics in changing environments are increasingly important, and are available for a plethora of variables, such as species abundance and distribution, community structure and ecosystem processes. There is, however, a general absence of knowledge about how far into the future, or other dimensions (space, temperature, phylogenetic distance), useful ecological forecasts can be made, and about how features of ecological systems relate to these distances. The ecological forecast horizon is the dimensional distance for which useful forecasts can be made. Five case studies illustrate the influence of various sources of uncertainty (e.g. parameter uncertainty, environmental variation, demographic stochasticity and evolution), level of ecological organisation (e.g. population or community), and organismal properties (e.g. body size or number of trophic links) on temporal, spatial and phylogenetic forecast horizons. Insights from these case studies demonstrate that the ecological forecast horizon is a flexible and powerful tool for researching and communicating ecological predictability. It also has potential for motivating and guiding agenda setting for ecological forecasting research and development. © 2015 The Authors Ecology Letters published by John Wiley & Sons Ltd and CNRS.
Article
Full-text available
Compositional multistability is widely observed in multispecies ecological communities. Since differences in community composition often lead to differences in community function, understanding compositional multistability is essential to comprehend the role of biodiversity in maintaining ecosystems. In community assembly studies, it has long been recognized that the order and timing of species migration and extinction influence structure and function of communities. The study of multistability in ecology has focused on the change in dynamical stability across environmental gradients, and was developed mainly for low‐dimensional systems. As a result, methodologies for studying the compositional stability of empirical multispecies communities are not well developed. Here, we show that models previously used in ecology can be analyzed from a new perspective, the energy landscape, to unveil compositional stability in observational data. To show that our method can be applicable to real‐world ecological communities, we simulated assembly dynamics driven by population‐level processes, and show that results were mostly robust to different simulation assumptions. Our method reliably captured the change in the overall compositional stability of multispecies communities over environmental change, and indicated a small fraction of community compositions that may be channels for transitions between stable states. When applied to murine gut microbiota, our method showed the presence of two alternative states whose relationship changes with age, and suggested mechanisms by which aging affects the compositional stability of the murine gut microbiota. Our method provides a practical tool to study the compositional stability of communities in a changing world, and will facilitate empirical studies that integrate the concept of multistability from different fields.
Article
In this paper, we present a new method for the prediction and uncertainty quantification of data-driven multivariate systems. Traditionally, either mechanistic or non-mechanistic modeling methodologies have been used for prediction; however, it is uncommon for the two to be incorporated together. We compare the forecast accuracy of mechanistic modeling using Bayesian inference, a non-mechanistic modeling approach based on state-space reconstruction, and a novel hybrid methodology composed of the two, for an age-structured population data set. The data come from cannibalistic flour beetles, in which the adults preying on the eggs and pupae results in non-equilibrium population dynamics. Uncertainty quantification methods for the hybrid models are outlined and illustrated for these data. We analyze the results from Bayesian inference for the mechanistic and hybrid models to suggest reasons why the hybrid modeling methodology may enable more accurate forecasts of multivariate systems than traditional approaches.
Chapter
Nonlinear dynamics is a huge field in mathematics and physics, and we will hardly be able to scratch the surface here. Nevertheless, this field is so tremendously important for our theoretical understanding of brain function and time series phenomena that I felt a book on statistical methods in neuroscience should not go without discussing at least some of its core concepts. Having some grasp of nonlinear dynamical systems may give important insights into how the observed time series were generated. In fact, nonlinear dynamics provides a kind of universal language for mathematically describing the deterministic part of the dynamical systems generating the observed time series—we will see later (Sect. 9.3) how to connect these ideas to stochastic processes and statistical inference. ARMA and state space models as discussed in Sects. 7.2 and 7.5 are examples of discrete-time, linear dynamical systems driven by noise. However, linear dynamical systems can only exhibit a limited repertoire of dynamical behaviors and typically do not capture a number of prominent and computationally important phenomena observed in physiological recordings. In the following, we will distinguish between models that are defined in discrete time (Sect. 9.1), as all the time series models discussed so far, and continuous-time models (Sect. 9.2).
Article
Full-text available
Bayesian inference often requires efficient numerical approximation algorithms, such as sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) methods. The Gibbs sampler is a well-known MCMC technique, widely applied in many signal processing problems. Drawing samples from univariate full-conditional distributions efficiently is essential for the practical application of the Gibbs sampler. In this work, we present a simple, self-tuned and extremely efficient MCMC algorithm which produces virtually independent samples from these univariate target densities. The proposal density used is self-tuned and tailored to the specific target, but it is not adaptive. Instead, the proposal is adjusted during an initial optimization stage, following a simple and extremely effective procedure. Hence, we have named the newly proposed approach FUSS (Fast Universal Self-tuned Sampler), as it can be used to sample from any bounded univariate distribution, and also from any bounded multivariate distribution, either directly or by embedding it within a Gibbs sampler. Numerical experiments on several synthetic data sets (including a challenging parameter estimation problem in a chaotic system) and a high-dimensional financial signal processing problem show its good performance in terms of speed and estimation accuracy.
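To fix ideas, here is the textbook setting that FUSS-type samplers plug into: a Gibbs sampler alternating draws from univariate full conditionals. The example uses a correlated bivariate normal, whose conditionals happen to be available in closed form (FUSS itself targets the harder case where they are not; the values below are illustrative).

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, burn=500, seed=4):
    """Gibbs sampler for a standard bivariate normal with correlation rho,
    drawing each coordinate from its univariate full conditional:
    x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho ** 2)
    x = y = 0.0
    out = []
    for i in range(n_iter):
        x = rng.normal(rho * y, s)   # draw from p(x | y)
        y = rng.normal(rho * x, s)   # draw from p(y | x)
        if i >= burn:
            out.append((x, y))
    return np.array(out)

samples = gibbs_bivariate_normal(0.8)
print(f"sample correlation ~ {np.corrcoef(samples.T)[0, 1]:.2f}")
```

When the full conditionals lack closed forms, each `rng.normal` draw is replaced by a univariate sampler such as FUSS.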
Article
Full-text available
Accurate predictions of species abundance remain one of the most vexing challenges in ecology. This observation is perhaps unsurprising, because population dynamics are often strongly forced and highly nonlinear. Recently, however, numerous statistical techniques have been proposed for fitting highly parameterized mechanistic models to complex time series, potentially providing the machinery necessary for generating useful predictions. Alternatively, there is a wide variety of comparatively simple model-free forecasting methods that could be used to predict abundance. Here we pose a rather conservative challenge and ask whether a correctly specified mechanistic model, fit with commonly used statistical techniques, can provide better forecasts than simple model-free methods for ecological systems with noisy nonlinear dynamics. Using four different control models and seven experimental time series of flour beetles, we found that Markov chain Monte Carlo procedures for fitting mechanistic models often converged on best-fit parameterizations far different from the known parameters. As a result, the correctly specified models provided inaccurate forecasts and incorrect inferences. In contrast, a model-free method based on state-space reconstruction gave the most accurate short-term forecasts, even while using only a single time series from the multivariate system. Considering the recent push for ecosystem-based management and the increasing call for ecological predictions, our results suggest that a flexible model-free approach may be the most promising way forward.
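The "model-free" forecasts referred to here come from state-space reconstruction. A minimal nearest-neighbour version, a simplified cousin of simplex projection with illustrative settings rather than the authors' exact implementation, looks like this:

```python
import numpy as np

def knn_forecast(series, E=2, k=3):
    """One-step model-free forecast: delay-embed the series in E lags,
    find the k nearest historical analogues of the current state, and
    average their observed successors."""
    series = np.asarray(series, dtype=float)
    # Library of embedding vectors [x_{t-E+1}, ..., x_t] with known successors.
    lib = np.array([series[t - E + 1:t + 1]
                    for t in range(E - 1, len(series) - 1)])
    succ = series[E:]                        # successor of each library vector
    query = series[-E:]                      # current state
    d = np.linalg.norm(lib - query, axis=1)  # distance to each analogue
    nn = np.argsort(d)[:k]                   # k nearest analogues
    return succ[nn].mean()

# Usage: forecast a noisy chaotic (logistic-map) series.
rng = np.random.default_rng(1)
x = [0.3]
for _ in range(300):
    x.append(3.8 * x[-1] * (1 - x[-1]))
y = np.array(x) + rng.normal(0.0, 0.01, len(x))
forecast = knn_forecast(y)
actual = 3.8 * x[-1] * (1 - x[-1])
print(f"forecast = {forecast:.3f}, actual next value = {actual:.3f}")
```

The appeal for noisy ecological data is clear from the code: no parameters of the generating model appear anywhere, only observed analogues.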
Article
Ecology Letters (2011) 14: 816–827 Statistical models are the traditional choice to test scientific theories when observations, processes or boundary conditions are subject to stochasticity. Many important systems in ecology and biology, however, are difficult to capture with statistical models. Stochastic simulation models offer an alternative, but they were hitherto associated with a major disadvantage: their likelihood functions can usually not be calculated explicitly, and thus it is difficult to couple them to well-established statistical theory such as maximum likelihood and Bayesian statistics. A number of new methods, among them Approximate Bayesian Computing and Pattern-Oriented Modelling, bypass this limitation. These methods share three main principles: aggregation of simulated and observed data via summary statistics, likelihood approximation based on the summary statistics, and efficient sampling. We discuss principles as well as advantages and caveats of these methods, and demonstrate their potential for integrating stochastic simulation models into a unified framework for statistical modelling.
Article
Chaotic ecological dynamic systems defy conventional statistical analysis. Systems with near-chaotic dynamics are little better. Such systems are almost invariably driven by endogenous dynamic processes plus demographic and environmental process noise, and are only observable with error. Their sensitivity to history means that minute changes in the driving noise realization, or the system parameters, will cause drastic changes in the system trajectory. This sensitivity is inherited and amplified by the joint probability density of the observable data and the process noise, rendering it useless as the basis for obtaining measures of statistical fit. Because the joint density is the basis for the fit measures used by all conventional statistical methods, this is a major theoretical shortcoming. The inability to make well-founded statistical inferences about biological dynamic models in the chaotic and near-chaotic regimes, other than on an ad hoc basis, leaves dynamic theory without the methods of quantitative validation that are essential tools in the rest of biological science. Here I show that this impasse can be resolved in a simple and general manner, using a method that requires only the ability to simulate the observed data on a system from the dynamic model about which inferences are required. The raw data series are reduced to phase-insensitive summary statistics, quantifying local dynamic structure and the distribution of observations. Simulation is used to obtain the mean and the covariance matrix of the statistics, given model parameters, allowing the construction of a 'synthetic likelihood' that assesses model fit. This likelihood can be explored using a straightforward Markov chain Monte Carlo sampler, but one further post-processing step returns pure likelihood-based inference. I apply the method to establish the dynamic nature of the fluctuations in Nicholson's classic blowfly experiments.
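The synthetic-likelihood recipe sketched in this abstract (reduce data to summary statistics, simulate to get their mean and covariance, score the observed summaries under a Gaussian) can be written in a few lines. The toy model, summaries, and tuning values below are illustrative stand-ins, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, n=100):
    """Toy stochastic Ricker-style map, standing in for a model whose
    likelihood is intractable (parameter values are illustrative)."""
    x = np.empty(n)
    x[0] = 1.0
    for t in range(n - 1):
        x[t + 1] = theta * x[t] * np.exp(-x[t]) * rng.lognormal(0.0, 0.1)
    return x

def summaries(x):
    """Phase-insensitive summary statistics of a series."""
    return np.array([x.mean(), x.std(), np.corrcoef(x[:-1], x[1:])[0, 1]])

def synthetic_loglik(theta, s_obs, n_sim=200):
    """Gaussian synthetic likelihood: simulate replicates at theta, fit a
    multivariate normal to their summaries, and score the observed ones."""
    S = np.array([summaries(simulate(theta)) for _ in range(n_sim)])
    mu, cov = S.mean(axis=0), np.cov(S.T) + 1e-9 * np.eye(S.shape[1])
    diff = s_obs - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet)

# The synthetic log-likelihood should peak near the data-generating theta.
s_obs = summaries(simulate(8.0))
grid = [2.0, 8.0, 32.0]
lls = [synthetic_loglik(th, s_obs) for th in grid]
for th, ll in zip(grid, lls):
    print(f"theta = {th:5.1f}  synthetic log-lik = {ll:8.2f}")
```

In the full method this synthetic likelihood would be explored with an MCMC sampler over theta rather than a coarse grid.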
Article
We discuss the possibility of applying some standard statistical methods (the least-squares method, the maximum likelihood method, and the method of statistical moments for estimation of parameters) to a deterministically chaotic low-dimensional dynamic system (the logistic map) containing an observational noise. A "segmentation fitting" maximum likelihood (ML) method is suggested to estimate the structural parameter of the logistic map along with the initial value x(1), considered as an additional unknown parameter. The segmentation fitting method, called "piece-wise" ML, is similar in spirit to, but simpler and with smaller bias than, the "multiple shooting" method previously proposed. Comparisons with different previously proposed techniques on simulated numerical examples give favorable results (at least for the investigated combinations of sample size N and noise level). Moreover, unlike some suggested techniques, our method does not require a priori knowledge of the noise variance. We also clarify the nature of the inherent difficulties in the statistical analysis of deterministically chaotic time series and the status of previously proposed Bayesian approaches. We note the trade-off between the need to use a large number of data points in the ML analysis to decrease the bias (to guarantee consistency of the estimation) and the unstable nature of dynamical trajectories, with exponentially fast loss of memory of the initial condition. The method of statistical moments for the estimation of the parameter of the logistic map is discussed. This method seems to be the only method whose consistency for deterministically chaotic time series has so far been proved theoretically (not only numerically).
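The segmentation idea is simple enough to sketch: fit the map's parameter over short segments, re-initialising each segment at the observed value so that chaotic divergence from the initial condition cannot accumulate across the whole series. The code below is a toy grid-search version for the logistic map; the segment length and grid are illustrative, and plain least squares stands in for the full ML treatment:

```python
import numpy as np

def segment_ls_fit(obs, seg_len=5, r_grid=np.linspace(3.5, 4.0, 501)):
    """Piece-wise least-squares fit of the logistic-map parameter r:
    within each short segment the trajectory is re-initialised from the
    observed value, limiting the exponential error growth of chaos."""
    obs = np.asarray(obs, dtype=float)
    n_seg = len(obs) // seg_len
    best_r, best_sse = None, np.inf
    for r in r_grid:
        sse = 0.0
        for s in range(n_seg):
            seg = obs[s * seg_len:(s + 1) * seg_len]
            x = seg[0]                    # re-initialise from the data
            for y in seg[1:]:
                x = r * x * (1 - x)       # one deterministic step
                sse += (y - x) ** 2
        if sse < best_sse:
            best_r, best_sse = r, sse
    return best_r

# Usage on a noisy chaotic series (true r = 3.8, noise sd = 0.01):
rng = np.random.default_rng(3)
x = [0.4]
for _ in range(199):
    x.append(3.8 * x[-1] * (1 - x[-1]))
obs = np.array(x) + rng.normal(0.0, 0.01, 200)
r_hat = segment_ls_fit(obs)
print(f"estimated r = {r_hat:.3f}")
```

Short segments keep each predicted trajectory close to the data, which is exactly why this scheme has a smaller bias than fitting one long trajectory.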
Article
The maximum likelihood method is a basic statistical technique for estimating parameters and variables, and is the starting point for many more sophisticated methods, such as Bayesian methods. This paper shows that maximum likelihood fails to identify the true trajectory of a chaotic dynamical system, because there are trajectories that appear to be far more (infinitely more) likely than the truth. This failure occurs for unbounded noise, and for bounded noise when it is sufficiently large, and will almost certainly have consequences for parameter estimation in such systems. The reason for the failure is rather simple: in chaotic dynamical systems there can be trajectories that are consistently closer to the observations than the true trajectory being observed, and hence their likelihood dominates that of the truth. The residuals of these truth-dominating trajectories are not consistent with the noise distribution; they would typically have too small a standard deviation and many outliers, and hence the situation may be remedied by using methods that examine the distribution of residuals and are not entirely maximum-likelihood based.