
Criticizing AERIUS/OPS Model Performance


Abstract

We investigate the predictive skill of the AERIUS/OPS model. It has none compared to a simple “mean” model. Skill is the demonstrated superiority of one model over another, given specific verification measures. OPS does not have skill compared to a simple “mean” model, which beats OPS on some measures. Further, the verification measures used are themselves weak and inadequate, leading to the judgment that AERIUS/OPS should be shelved until an adequate replacement can be found.
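As an illustration of the skill comparison described above, consider the following sketch. The data and variable names are hypothetical, not taken from the paper; any proper error measure could replace mean squared error.

    # Sketch: does a model have skill over a "mean" model?
    import numpy as np

    obs = np.array([3.1, 2.7, 4.0, 3.5, 2.9])       # observed depositions (hypothetical)
    ops_pred = np.array([4.5, 1.9, 5.2, 2.1, 4.4])  # model predictions (hypothetical)
    mean_pred = np.full_like(obs, obs.mean())        # "mean" model: always predict the average

    def mse(pred):
        return np.mean((pred - obs) ** 2)

    skill_score = 1.0 - mse(ops_pred) / mse(mean_pred)
    # skill_score > 0: the model beats the mean model; <= 0: it has no skill
    print(f"MSE model: {mse(ops_pred):.3f}, MSE mean: {mse(mean_pred):.3f}, skill: {skill_score:.3f}")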
... After we published on ResearchGate the article "Criticizing AERIUS/OPS model performance" [1], RIVM submitted a critique, or reaction, to our paper on their site [4]. ...
Article
Full-text available
The RIVM responded to our paper “Criticizing AERIUS/OPS model performance”. We rebut those critiques below. In brief: RIVM agrees that OPS performs poorly in the short term, but believes it performs well over long periods. Yet they failed to address our demonstration of how averaging over these longer terms can cause spurious “good results.” That is, this averaging causes OPS to produce bogus results. Alarmingly, RIVM does not understand skill: we prove a simple mean model beats OPS often. This simple model takes the observed averages as predictions. RIVM says these averages are not available, and so this simple model would be unavailable for policy, but that is false. Those averages are right there in OPS, as we show in model runs of OPS. And we show here that even in times and places where averages are not available, good guesses of them are adequate. It is still true that for the OPS model runs we made, one time with 400 cows at a farm, another with half that, and another still with one quarter, the nitrogen deposition (per ha, per year) shows trivial, unmeasurable differences. That being said, uncertainty in the OPS model is not being sufficiently considered, and is substantial, aggravating previous observations. Taking proper stock of uncertainty is crucial because this model is being used to make far-reaching policy decisions. The solution is to design experiments and independently test OPS on completely new data. Pending those experiments, OPS must be shelved so as to forestall extensive societal and economic damages resulting from its continued use in policy making.
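A minimal sketch of the averaging effect described above, with invented numbers rather than the paper's data: a model whose daily predictions are badly off can still produce a long-term average close to the observed average, so verifying only the long-term mean hides the short-term error.

    # Sketch: long-term averaging can mask poor short-term predictions (hypothetical data).
    import numpy as np

    rng = np.random.default_rng(0)
    obs = rng.gamma(shape=2.0, scale=1.5, size=365)          # daily observations
    pred = np.roll(obs, 30) + rng.normal(0, 2.0, size=365)   # mistimed, noisy predictions

    daily_mae = np.mean(np.abs(pred - obs))    # large: poor short-term performance
    annual_bias = abs(pred.mean() - obs.mean())  # small: the annual average looks "good"
    print(f"daily MAE: {daily_mae:.2f}, annual-mean error: {annual_bias:.2f}")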
Article
Full-text available
I briefly summarize prior research showing that tests of statistical significance are improperly used even in leading scholarly journals. Attempts to educate researchers to avoid pitfalls have had little success. Even when done properly, however, statistical significance tests are of no value. Other researchers have discussed reasons for these failures. I was unable to find empirical evidence to support the use of significance tests under any conditions. I then show that tests of statistical significance are harmful to the development of scientific knowledge because they distract the researcher from the use of proper methods. I illustrate the dangers of significance tests by examining a re-analysis of the M3-Competition. Although the authors of the re-analysis conducted a proper series of statistical tests, they suggested that the original M3-Competition was not justified in concluding that combined forecasts reduce errors, and that the selection of the best method is dependent on the selection of a proper error measure. I show that the original conclusions were correct. Authors should avoid tests of statistical significance; instead, they should report on effect sizes, confidence intervals, replications/extensions, and meta-analyses. Practitioners should ignore significance tests and journals should discourage them.
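As a hedged illustration of the recommendation to report effect sizes and confidence intervals rather than significance tests (the data and the particular interval are mine, not prescribed by the article):

    # Sketch: report an effect size with a confidence interval instead of a p-value.
    import numpy as np

    errors_a = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3])  # forecast errors, method A (hypothetical)
    errors_b = np.array([1.0, 0.7, 1.1, 0.9, 0.8, 1.0])  # forecast errors, method B (hypothetical)
    diff = errors_a - errors_b                            # paired differences in error

    effect = diff.mean()                                  # effect size: mean error reduction
    se = diff.std(ddof=1) / np.sqrt(len(diff))
    ci = (effect - 1.96 * se, effect + 1.96 * se)         # approximate 95% interval
    print(f"mean error reduction: {effect:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")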
Article
We present high-resolution model results of air pollution and deposition over the Netherlands with three models, the Eulerian grid model LOTOS-EUROS, the Gaussian plume model OPS and the hybrid model LEO. The latter combines results from LOTOS-EUROS and OPS using source apportionment techniques. The hybrid modelling combines the efficiency of calculating at high resolution around sources with the plume model, and the accuracy of taking into account long-range transport and chemistry with a Eulerian grid model. We compare calculations from all three models with measurements for the period 2009–2011 for ammonia, NOx, secondary inorganic aerosols, particulate matter (PM10) and wet deposition of acidifying and eutrophying components (ammonium, nitrate and sulfate). It is found that concentrations of ammonia, NOx and the wet deposition components are best represented by the Gaussian plume model OPS. Secondary inorganic aerosols are best modelled with the LOTOS-EUROS model, and PM10 is best described with the LEO model. Subsequently, for the year 2011, PM10 concentration and reduced nitrogen dry deposition maps are presented with the OPS and LEO models, respectively. Using the LEO calculations for the production of the PM10 map yields an overall better result than using the OPS calculations for this application. This is mainly due to the fact that the spatial distribution of the secondary inorganic aerosols is better described in the LEO model than in OPS, and because more (naturally induced) PM10 sources are included in LEO, i.e. the contribution to PM10 of sea-salt and wind-blown dust as calculated by the LOTOS-EUROS model. Finally, dry deposition maps of reduced nitrogen over the Netherlands are compared, as calculated by the OPS and LEO models, respectively. The differences between both models are overall small (±100 mol/ha) with respect to the peak values observed in the maps (>2000 mol/ha). This is due to the fact that the contribution of dry deposition of reduced nitrogen caused by emissions outside of the Netherlands is small, so effectively most of the calculated deposition in the LEO model comes from the model results from OPS.
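A rough sketch of the hybrid idea described above, under my simplifying assumption (not a statement of the LEO implementation) that the grid model's contribution from labeled local sources is replaced by the plume model's high-resolution contribution for the same sources:

    # Sketch of a hybrid combination: grid-model background plus plume-model local detail.
    # Purely illustrative arrays; not the LEO source code or its actual algorithm.
    import numpy as np

    grid_total = np.array([[12.0, 13.0], [14.0, 15.0]])  # Eulerian model, total concentration
    grid_local = np.array([[2.0, 3.0], [4.0, 5.0]])      # grid model's share from labeled local sources
    plume_local = np.array([[2.5, 3.8], [3.1, 6.2]])     # plume model, same sources at high resolution

    hybrid = grid_total - grid_local + plume_local        # keep long-range transport/chemistry, refine local detail
    print(hybrid)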
Article
A general framework for the problem of absolute verification (AV) is extended to the problem of comparative verification (CV). Absolute verification focuses on the performance of individual forecasting systems (or forecasters), and it is based on the bivariate distributions of forecasts and observations and its two possible factorizations into conditional and marginal distributions. Comparative verification compares the performance of two or more forecasting systems, which may produce forecasts under 1) identical conditions or 2) different conditions. Complexity can be defined in terms of the number of factorizations, the number of basic factors (conditional and marginal distributions) in each factorization, or the total number of basic factors associated with the respective frameworks. Dimensionality is defined as the number of probabilities that must be specified to reconstruct the basic distribution of forecasts and observations. Failure to take account of the complexity and dimensionality of verification problems may lead to an incomplete and inefficient body of verification methodology and, thereby, to erroneous conclusions regarding the absolute and relative quality and/or value of forecasting systems. -from Author
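As a small worked example of the dimensionality count defined above (my arithmetic, following the abstract's definition, not figures from the article): if forecasts take m distinct values and observations take n distinct values, the joint distribution has mn cells, of which mn - 1 probabilities must be specified:

    d = mn - 1, \qquad m = n = 2 \;\Rightarrow\; d = 3, \qquad m = 11,\ n = 2 \;\Rightarrow\; d = 21 .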
Article
A general framework for forecast verification based on the joint distribution of forecasts and observations is described: 1) the calibration-refinement factorization, which involves the conditional distributions of observations given forecasts and the marginal distribution of forecasts, and 2) the likelihood-base rate factorization, which involves the conditional distributions of forecasts given observations and the marginal distribution of observations. The names given to the factorizations reflect the fact that they relate to different attributes of the forecasts and/or observations. Some insight into the potential utility of the framework is provided by demonstrating that basic elements and summary measures of the joint, conditional, and marginal distributions play key roles in current verification methods. -from Authors
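Written out, the two factorizations named in the abstract are (with f a forecast and x an observation; the symbols are mine):

    p(f, x) = p(x \mid f)\, p(f) \quad \text{(calibration-refinement)}
    p(f, x) = p(f \mid x)\, p(x) \quad \text{(likelihood-base rate)}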
Article
  Probabilistic forecasts of continuous variables take the form of predictive densities or predictive cumulative distribution functions. We propose a diagnostic approach to the evaluation of predictive performance that is based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the distributional forecasts and the observations and is a joint property of the predictions and the events that materialize. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only. A simple theoretical framework allows us to distinguish between probabilistic calibration, exceedance calibration and marginal calibration. We propose and study tools for checking calibration and sharpness, among them the probability integral transform histogram, marginal calibration plots, the sharpness diagram and proper scoring rules. The diagnostic approach is illustrated by an assessment and ranking of probabilistic forecasts of wind speed at the Stateline wind energy centre in the US Pacific Northwest. In combination with cross-validation or in the time series context, our proposal provides very general, nonparametric alternatives to the use of information criteria for model diagnostics and model selection.
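A minimal sketch of the probability integral transform (PIT) check, a sharpness summary and a proper scoring rule mentioned above, assuming Gaussian predictive distributions; the data and the distributional form are hypothetical:

    # Sketch: PIT values for calibration, central-interval width for sharpness, log score.
    import numpy as np
    from scipy.stats import norm

    mu = np.array([2.0, 3.5, 1.0, 4.2, 2.8])     # predictive means (hypothetical)
    sigma = np.array([1.0, 0.8, 1.2, 0.9, 1.1])  # predictive standard deviations (hypothetical)
    obs = np.array([2.3, 4.1, 0.2, 4.0, 3.9])    # observations (hypothetical)

    pit = norm.cdf(obs, loc=mu, scale=sigma)                      # roughly uniform if calibrated
    sharpness = np.mean(2 * 1.96 * sigma)                         # mean width of central 95% intervals
    log_score = -np.mean(norm.logpdf(obs, loc=mu, scale=sigma))   # a proper scoring rule (lower is better)
    print(pit, sharpness, log_score)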
Article
A probability assessor or forecaster is a person who assigns subjective probabilities to events which will eventually occur or not occur. There are two purposes for which one might wish to compare two forecasters. The first is to see who has given better forecasts in the past. The second is to decide who will give better forecasts in the future. A method of comparison suitable for the first purpose may not be suitable for the second and vice versa. A criterion called calibration has been suggested for comparing the forecasts of different forecasters. Calibration, in a frequency sense, is a function of long run (future) properties of forecasts and hence is not suitable for making comparisons in the present. A method for comparing forecasters based on past performance is the use of scoring rules. In this paper a general method for comparing forecasters after a finite number of trials is introduced. The general method is proven to include calculating all proper scoring rules as special cases. It also includes comparison of forecasters in all simple two-decision problems as special cases. The relationship between the general method and calibration is also explored. The general method is also translated into a method for deciding who will give better forecasts in the future. An example is given using weather forecasts.
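A hedged sketch of comparing two probability forecasters over past trials with a proper scoring rule, here the Brier score; the forecasts and outcomes are invented for illustration:

    # Sketch: compare two forecasters on past trials with the Brier score (a proper scoring rule).
    import numpy as np

    outcomes = np.array([1, 0, 1, 1, 0, 0, 1, 0])                       # what actually happened
    forecaster_a = np.array([0.9, 0.2, 0.7, 0.8, 0.1, 0.3, 0.6, 0.4])   # probability forecasts
    forecaster_b = np.array([0.6, 0.5, 0.5, 0.6, 0.4, 0.5, 0.5, 0.5])

    def brier(p):
        return np.mean((p - outcomes) ** 2)   # lower is better

    print(f"Brier A: {brier(forecaster_a):.3f}, Brier B: {brier(forecaster_b):.3f}")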
Article
We introduce the Skill Plot, a method that is directly relevant to a decision maker who must use a diagnostic test. In contrast to ROC curves, the skill curve allows easy graphical inspection of the optimal cutoff or decision rule for a diagnostic test. The skill curve and test also determine whether diagnoses based on this cutoff improve upon a naive forecast (of always present or of always absent). The skill measure makes it easy to directly compare the predictive utility of two different classifiers in an analogy to the area under the curve statistic related to ROC analysis. Finally, this article shows that the skill-based cutoff inferred from the plot is equivalent to the cutoff indicated by optimizing the posterior odds in accordance with Bayesian decision theory. A method for constructing a confidence interval for this optimal point is presented and briefly discussed.
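A rough sketch of the idea behind a skill curve, under my simplifying assumption that skill at each cutoff is measured by the improvement in misclassification loss over the best naive forecast (always present or always absent); this is an illustration, not the paper's exact statistic:

    # Sketch: skill of a diagnostic test across cutoffs versus the best naive forecast.
    # Hypothetical scores and labels; the skill definition here is a simplified stand-in.
    import numpy as np

    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])  # diagnostic test output
    labels = np.array([0,   0,   1,    1,   1,   0,   1,   0])    # true status

    naive_loss = min(labels.mean(), 1 - labels.mean())   # best of always-absent / always-present
    for cutoff in np.linspace(0.1, 0.9, 9):
        loss = np.mean((scores >= cutoff).astype(int) != labels)
        skill = 1.0 - loss / naive_loss                  # > 0 means better than the naive forecast
        print(f"cutoff {cutoff:.1f}: skill {skill:+.2f}")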
J. Jaarsveld. Description and validation of OPS-Pro 4.1. Technical report.