Criticizing AERIUS/OPS Model Performance

William M. Briggs
matt@wmbriggs.com
Michigan

Jaap C. Hanekamp
j.hanekamp@ucr.nl; hjaap@xs4all.nl
University College Roosevelt, Middelburg, the Netherlands
Environmental Health Sciences, University of Massachusetts, Amherst, MA, USA

Geesje Rotgers
grotgers@gmail.com
Stichting Agrifacts (STAF)

August 9, 2022
Abstract

We investigate the predictive skill of the AERIUS/OPS model. Skill is the demonstrated superiority of one model over another, given specific verification measures. OPS has no skill compared with a simple "mean" model, which beats OPS on some of those measures. Further, the verification measures used are themselves weak and inadequate, leading to the judgment that AERIUS/OPS should be shelved until an adequate replacement can be found.
Keywords: AERIUS/OPS, Atmospheric nitrogen, Model verification, Skill

1 Executive Summary
This paper severely criticizes the AERIUS/OPS model, used extensively in the Netherlands. This is a most important model. For example, the Dutch agricultural community is legally bound to this model with respect to all its activities, and notably in the protection of Natura 2000 habitats.

We show below that relying on this model is scientifically ill-advised: AERIUS/OPS should be abandoned without delay. Here we summarize our main findings.
It is disconcerting to note that the numerous previous reviews of the Dutch nitrogen policies, e.g. [12], including scientific reviews of AERIUS/OPS, have never audited the model's validation as we do below. That unfortunately includes the Adviescollege Meten en Berekenen Stikstof (2019–2020), which produced the latest scientific review, and of which one of the authors of this paper (Hanekamp) was a member [19, 18].

Although we are able to demonstrate various model flaws, we are hampered in that the research data underlying the validation studies (discussed below) still have not been made publicly available. Considering the political, regulatory and societal weight given to AERIUS/OPS, this must be amended immediately.

Even without these data, specific criticisms of the model's skill, validation procedures and the like can still be given, as we do in this paper. Our main points of critique are:
- Previous genuine validation studies concede the model performs poorly;

- AERIUS/OPS often does not perform as well as a simple background, or "mean", model; that is, a simple model beats AERIUS/OPS's predictions. AERIUS/OPS thus has no skill and should not be used;

- The verification measures used in previous studies are substandard and incomplete, resulting in poor performance being judged adequate;

- A model executed using straightforward scenarios produces results which are trivial and, for all practical purposes, not measurable. For instance, reducing a farm's cattle by 50% decreased nitrogen deposition by 0.1%, an "event" well within the margin of measurement and modeling error;

- Given its many failures, lack of skill, and general poor performance, AERIUS/OPS should not be used for any decision making. A full verification, and improvements flowing from those verifications, are needed before the model can be trusted;

- At this point, there exists no adequate substitute for AERIUS/OPS. Therefore, to avoid the fallacy of doing something for the sake of doing something, no model should be used until one proves itself in independent verifications.
2 Introduction

We here make general conceptual and initial critiques of the AERIUS/OPS model. Specific and detailed model verification can only occur with fresh analyses on sets of data with matched model field predictions and observations. We do not have these data, as they have not yet been made public.

Nevertheless, we can go a good way down the road to this ideal. We reveal many shortcomings in previous verification analyses, and build a case that the OPS model should not be used to make decisions unless a full and proper verification can be accomplished.
We use visual depictions of model predictions and observations from when the model was used in several experiments. We examine previous attempts at model verification. However, these used a set of verification measures that have several deficiencies, and as such are not particularly informative. So weak are these measures that applying them to a very simple replacement model, the "model" of always predicting the average, defeats AERIUS/OPS.

Before investigating previous validation studies, we first demonstrate how AERIUS/OPS works, using several common and simple scenarios. We find that even if the model were flawless, its results are not persuasive.
The only validation material available as of this writing comes in the form of four documents in which the OPS model was used to make predictions matched to observations. These are:

1. The Kincaid case: Comparing results of the OPS model with measurements around a high source, from May 2015. Short name: Kincaid.

The period of this study was 1980–1981, around a 1,100 MW coal-fired electricity generating plant near Kincaid, Illinois. SO2 and SF6 (as a tracer) were measured within a circle 20 km from the plant's stack. Two versions of the model made predictions, the short-term (OPS-ST) and long-term (OPS-LT).

As Kincaid says (p. 6), "Taking distance into account, the correlation between modelled and measured [SF6] concentration per hour is poor". It is also claimed that (p. 6) "Only comparison of the field maximum per hour, regardless of the distance and direction, results in an acceptable model performance." This maximum is plotted in Fig. 1 below (Kincaid's Fig. 5).

For SO2, Kincaid says "nearly no performance indicator is within an acceptable range" for many periods.

Kincaid concludes (p. 13), "In general, the correlation between OPS results, modelled for this 'high source', and observations is low in all test periods". We show below that we concur with this.
2. OPS Test Report Falster and North Carolina: Comparing results of the OPS model with measurements around two pig farms, in Falster and North Carolina, from December 2014. Short name: Falster.

The Falster pig farm is in Denmark. As with Kincaid, both short- and long-term OPS models were used. These were designated by the source of input data used by the model, e.g. OPS-ST 3.0.2, where the 3.0.2 indicates the input parameters. The long-term model was designated OPS-Pro 2013. NH3 was modeled and measured. The models were run with various parameterizations and weather conditions, as indicated below in the plots.

Falster claims (p. 7) that for the Denmark farm, "The predicted concentrations at the more distant points are systematically higher than the observations; the overestimation by OPS-LT is stronger."

For the North Carolina farm, Falster concludes (p. 13), among other things, that "[t]he predicted concentrations at relatively close receptor points are systematically higher than the observations, the overestimation by OPS-ST is stronger."

We agree that the model at both locations performed more poorly for larger, and more important, values of NH3.
3. Prairie grass experiment, from March 2017. Short name: Prairie.

This is not the same kind of study as the previous two reports, which were "live" model predictions and observations, and therefore apt sources for model verification. Here, the data were many years old, published on paper in 1958. They were used to "tune" model parameters (of which there are many) so that model predictions matched these data well.

This is, of course, a legitimate form of model investigation, but it is not adequate to investigate model "real-life" performance, for reasons we discuss below. The danger of over-fitting is ever-present in these tuning procedures, and the only way to discover whether this has happened is to test the model independently, as in the first two experiments.

What tuning and fitting studies do is give an idea of the ideal model performance. How close that ideal is in independent predictions can only be ascertained outside these tuning studies.
4. The Operational Priority Substances Model: Description and validation of OPS-Pro 4.1, from 2004, by J. A. van Jaarsveld, [20]. Short name: Jaarsveld.

This is a detailed look inside the OPS model and its properties. An entire chapter (8) is given over to model validation and uncertainty. This uses outside sources for independent verification, like Kincaid above. It concerns itself mostly with spatial and long-term averages of both observables and model predictions, which limits its use in a manner we describe below.

Much of the observation data provided is extracted from gridded sets, which are themselves therefore the output of other models (of the geographic distribution of the observables). These are taken in Jaarsveld as if they were measured directly, i.e. without uncertainty. Yet there must be some uncertainty in these numbers, which implies that the certainty we have in the model verification must be decreased to some important degree.

A work that is similar to Jaarsveld, but more limited in scope, is [25], which we discuss below.
3 Validation Measures

It is a well-known statistical truism that a model may always be built that predicts the data used to build that model to any desired level of accuracy. Even perfect predictions can be had if the model is complex enough (such as by increasing its tunable parameters). For that reason, it is never a reliable procedure to test models on the data used to build them. The only acceptable test is on data never used (or seen) by the model builders. Independent predictions of entirely new observations must be used. This is the manner in which, for instance, meteorological models are built and tested, and they are forced to meet at least daily tests of their goodness.
Model verification is a large, well-developed field, with a host of standard references and procedures. See for example [1, 4, 22, 17]. There is a large variety of verification measures, measuring various aspects of performance. There are also several key concepts, such as calibration and skill, which we discuss in part below. About calibration see [15, 14]. For skill see for example [9, 3, 10, 21].

The OPS documents use a handful of performance or verification measures. We detail these here, and show also that they are not especially useful. Even the crudest, simplest of models look good using them. Partly this is a result of relying on single-number summaries of model performance, which necessarily strips away useful information. And partly this is because the measures themselves are not informative.
All verifications are simple in concept: make a set of predictions and compare these with the corresponding observations. The comparison is usually in the form of a score, or scores, which measures the set's average prediction error. This error can be more than just distance, such as how far off a prediction was. One can also determine frequencies of sizes of errors, or how error depends on the size of the observable, and so on.
To understand the basics of this process, some notation helps.

Let $Y$ be the observable of interest (the atmospheric concentration of SO2, SF6, etc.), and let $X$ be the prediction of $Y$. Obviously, a perfect model has, for all predictions and observations $i$, $Y_i = X_i$, where $i = 1, 2, \ldots, n$, and $n$ is the number of prediction-observation pairs.

Verification measures or scores are functions of the set of prediction-observation pairs, i.e. $s = f(\{Y_i, X_i\})$. It is helpful to write $\bar{Y} = \frac{1}{n}\sum_i Y_i$, i.e. the mean of the observations, and similarly for $\bar{X}$. Ideally, scores should be proper. Briefly, proper scores are those that give the best score when the forecaster gives an honest prediction; that is, the scoring rule cannot be gamed by issuing predictions aimed at improving the score. See [23] for details.
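As a small illustration of propriety (ours, using a binary event for simplicity): under a proper score, honesty is the best policy; under a non-proper score, exaggeration pays.

```python
import numpy as np

# Illustration (ours): for a binary event with true probability q = 0.7, the
# expected Brier score (a proper score) is minimized by honestly reporting 0.7,
# while the expected absolute-error score is minimized by exaggerating to 1,
# i.e. it can be gamed.
q = 0.7
reported = np.linspace(0, 1, 101)

expected_brier = q * (1 - reported) ** 2 + (1 - q) * reported ** 2
expected_abs   = q * (1 - reported)      + (1 - q) * reported

print(f"Best report under the Brier score:    {reported[np.argmin(expected_brier)]:.2f}")  # 0.70
print(f"Best report under the absolute score: {reported[np.argmin(expected_abs)]:.2f}")    # 1.00
```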
The verification measures historically preferred by OPS modelers, using the same abbreviations found in the documents, are listed below. It turns out the OPS model gives point predictions: each $X_i$ is a single value, with no uncertainty attached to it. In other words, OPS says $Y_i$ will be exactly $X_i$, for all $i$. This is a severe limitation, because the best prediction is one that carries the full warranted uncertainty (see [26] for a discussion). It also limits the verification measures that can be used, which in turn limits what can be said about the model.

The measures used were these.
Fractional basis:192
F B = 2YX
Y+X.(1)
This is best when F B = 0, which happens when the mean of the predictions193
equals the mean of the observations. Matching means is, of course, a desirable194
trait, but it is the weakest of verification measures, because, experience shows195
and as we show below, even poor models may have equal means of predictions.196
Geometric mean bias:
$$MG = \exp\left(\overline{\ln Y} - \overline{\ln X}\right). \qquad (2)$$
This is best when $MG = 1$. This is another rendition of a mean-matching verification score, and it suffers the restriction that predictions and observations must be positive. A similar measure that may be as valuable is the simple difference in means, i.e. $\bar{Y} - \bar{X}$.
Normalized mean square error:
$$NMSE = \frac{\overline{(Y - X)^2}}{\bar{Y}\,\bar{X}}. \qquad (3)$$
This is best when $NMSE = 0$. The normalization aids in comparing models of observables with different variabilities. The denominator is usually eliminated and, after the square root is taken, the measure becomes the root mean square error (RMSE). The NMSE and RMSE are more useful measures because they penalize larger departures of predictions from observations more than simple differences do. They also have other attractive mathematical motivations relating to standard loss functions. However, these measures do not say where the departures took place, nor how often. They still give an on-average assessment.
Geometric variance:
$$VG = \exp\left(\overline{(\ln Y - \ln X)^2}\right). \qquad (4)$$
This is best when $VG = 1$. It has the flavor of the RMSE, but with different weights for large departures of the prediction from the observed; the logging gives less emphasis to these departures.
Fraction of close predictions:
$$FAC2 = \frac{1}{n}\sum_i I\left(Y_i/2 \le X_i \le 2Y_i\right), \qquad (5)$$
where $I(\cdot)$ is the indicator function, i.e. a function that takes the value 1 when its argument is true, else 0. In plain words, FAC2 is the fraction of model predictions within a factor of two of the observations. The best is $FAC2 = 1$. This is a wholly arbitrary, crude, overly generous, non-proper measure. It is only useful to those who do not care how far off they are with an error under 200% (i.e. a factor of 2). The size of this error can only be described as very large. FAC2 is used in studies of models similar to OPS, such as [13].
Falster cites [11] (but also see [16]), who give ranges on these measures which indicate "acceptable model performance". These limits are:
$$|FB| < 0.3, \quad 0.7 < MG < 1.3, \quad NMSE < 1.5, \quad VG < 4, \quad FAC2 > 0.5.$$
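For readers who want to compute these measures themselves, here is a straightforward sketch; the function and variable names are ours, and the final example uses made-up numbers purely to show the calling convention.

```python
import numpy as np

# Straightforward implementations of the verification measures defined above
# (equations 1-5). y holds the observations and x the point predictions.
# MG and VG require strictly positive values.

def fb(y, x):    # fractional bias; best value 0
    return 2 * (y.mean() - x.mean()) / (y.mean() + x.mean())

def mg(y, x):    # geometric mean bias; best value 1
    return np.exp(np.log(y).mean() - np.log(x).mean())

def nmse(y, x):  # normalized mean square error; best value 0
    return np.mean((y - x) ** 2) / (y.mean() * x.mean())

def vg(y, x):    # geometric variance; best value 1
    return np.exp(np.mean((np.log(y) - np.log(x)) ** 2))

def fac2(y, x):  # fraction of predictions within a factor of two; best value 1
    return np.mean((x >= y / 2) & (x <= 2 * y))

# Tiny made-up example, purely to show the calling convention.
y = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([1.5, 1.5, 2.0, 6.0])
print(fb(y, x), mg(y, x), nmse(y, x), vg(y, x), fac2(y, x))
```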
Lacking in these documents is any notion of calibration or of skill. Calibration can be in mean, as when the means match; in distribution, as when the empirical cumulative distributions of predictions and observations match; and in exceedance, which is related to calibration in distribution (the frequency at which certain values are exceeded). The idea is that not only single points or means should match, but the entire range and frequencies should, too; i.e. all measurable characteristics of the predictions and observations.

We cannot assess lack of calibration well without data, and indeed we are limited to the historical scores computed from the old studies, but even so, some remarks can be given.
Skill is a key requirement in model assessment. We cannot discover whether skill was ever assessed for AERIUS. Skill is the formal comparison of an "advanced" model, like OPS, with a "simpler" model, where both models are judged and compared against a proper score, or against some verification measure, such as those given above. In its most basic form, a verification measure is computed for both the advanced and the simple model, and the two are then compared. The model that gives the best performance is preferred.

Skill as a concept is widely used in meteorological, climate, and pollution dispersion models, and is considered a necessary criterion of model viability. An advanced model which performs worse than a simpler model is said to lack skill; if the advanced model performs better, it has skill, again conditional on the verification measure. Any advanced model without skill should not be used. More precisely, the simpler model should be used in its place.
In the case of the OPS model, a simple competitor model is the "mean model", in which the mean of the observables is the constant prediction ($X_i \equiv \bar{Y}$). That is, the mean model always predicts the mean of the observations. This is a sensible, albeit crude, model. It works well in many situations, too; say, for temperature within a month at your home. Obviously, we should require the OPS model to do a superior job compared against the mean model.

In spite of its crudeness, the mean model would score $FB = 0$ (the best), $MG = 1$ (the best), $RMSE = \sigma$, the standard deviation of the observables (this is simple to show), and VG would be something like the RMSE. NMSE would be close to the bounds set by [16], as long as the distribution of the observations was roughly symmetric. FAC2 would be close to 0.90–0.95, i.e. plus or minus twice the standard deviation, unless the observables are very highly skewed. Even then, something greater than 0.5 is likely.
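To spell out the RMSE claim: with the constant prediction $X_i = \bar{Y}$ for every $i$,
$$RMSE = \sqrt{\frac{1}{n}\sum_i (Y_i - X_i)^2} = \sqrt{\frac{1}{n}\sum_i (Y_i - \bar{Y})^2} = \sigma,$$
the standard deviation of the observations.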
Clearly, at least using these verification measures, the "mean model" would be an excellent choice, showing exceptional performance which would be very difficult, or even impossible, for OPS to beat, regardless of how the observations turned out. The point is that the trivial mean model scores perfectly on at least two of OPS's preferred measures, and very well on the others.
This shows both the deficiencies of the measures, and of the OPS model itself, as we shall now discover.
4 Model Run

Before examining model performance, it is worthwhile examining model output under some common scenarios. We cannot in the space here do more than show a typical use for the model. Even so, we learn some interesting limitations.

We wanted to determine nitrogen deposition using AERIUS/OPS for an ammonia source of different strengths, at a distance of approximately 2000 meters. We therefore ran the AERIUS/OPS model using the following settings:

- Cattle barn emissions: no animals, 1 cow, 100 cows, 200 cows and 400 cows;

- Emission estimated at 10 kg NH3 per year per cow;

- Default values for the height of the emission point and the heat content;

- At the Natura 2000 site Lieftinghsbroek.
Our method was this:

1. We used the publicly available sites calculator.aerius.nl and monitor.aerius.nl.

2. The background deposition per hexagon of the Natura 2000 site is checked in AERIUS/OPS Monitor for 2019. These background depositions apply to what we call the "zero cows" scenario.

3. In the AERIUS/OPS Calculator, a cattle barn is created at approximately 2000 meters from the Natura 2000 site. It is not possible to make a calculation with zero cows; therefore the calculation is based on the smallest unit of 1 cow (again, the "zero cows" scenario being just background). We assumed per-cow emissions of 10 kg N/year. The AERIUS/OPS Calculator finds the hexagon with the highest contribution and calculates its total deposition, background plus emissions. This hexagon is marked.

4. We ran calculations for the same barn for 1, 100, 200 and 400 cows.

5. The hexagon with the highest contribution is the same hexagon in all the calculations.
The results are in Table 1, which, for the hexagon with the highest concentration, shows the number of cows, the total deposition in mol N/ha/year, and the estimated deposition from the cows alone.

The background deposition for "0 cows" is 1938 mol N/ha/year (in 2019). Yet at 400 cows, a huge increase, nitrogen deposition increases by less than 6 mol N/ha/year, a trivial, and likely unmeasurable, number.

The estimated difference going from 200 to 400 cows is only about 3 mol N/ha/year. This is surely a negligible amount. At best, it would be extraordinarily difficult to measure this difference in situ, if not impossible. Also, any reference to the nitrogen critical loads (NCLs) being further exceeded by the modelled 3 mol N/ha/year, as an argument for deliberately sidestepping the multiple uncertainties in the NCLs and AERIUS/OPS, is not only beside the point but, worse, circular in nature. The modelled 3 mol cannot, by any stretch of the imagination, be anywhere near reality, next to the manifold and substantial imprecisions in the NCLs we have documented previously [8, 7].
But suppose the model is perfect, which is an impossibility. Then the effect of reducing the number of animals by 50% reduces total deposition by only a trivial amount, which could never, in practice, be measured or verified.

Therefore, for this plausible situation, the model is just about useless. It does nothing more in essence than report a background estimated mean deposition, given as a number with no uncertainty in it, which the estimated means have; see [7, 8]. It is because of this that it is practically impossible to verify its performance.

Table 1: Results of the AERIUS/OPS model for various scenarios. The "0 cows" row is the background deposition only. The total deposition and the estimated deposition from the cows alone are in mol N/ha/year.

Cows   Total deposition   Deposition from cows
  0       1938                  0
  1       1938.07               0.01
100       1939.52               1.47
200       1940.99               2.93
400       1943.92               5.87
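To put Table 1 in perspective: halving the herd from 400 to 200 cows changes the modelled total deposition by
$$\frac{1943.92 - 1940.99}{1943.92} \approx 0.0015,$$
i.e. on the order of a tenth of a percent, and even the full 400-cow contribution amounts to only $5.87/1943.92 \approx 0.3\%$ of the modelled total.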
We next examine how the model performed in actual situations.
5 General Critiques

5.1 Kincaid

From Kincaid (Fig. 4), Fig. 1 shows the OPS model predictions of maximum per-hour SF6 (y-axis) against observations (x-axis); the unit is µg/m2. (The AERIUS prefix is generally left off in these sources, a practice we follow here.) A red one-to-one line is over-plotted. Perfect predictions would fall on this line. Obviously, both model predictions and observations cannot fall below 0. Predictions range from 0 to just under 1.5, as do most observations, with the exception of about 10 or so observations that are larger than 1.5.

Visually, there does not appear to be much, if any, correlation between the model and the observations, as discussed above. However, the model at least reproduces the approximate range, which is one of the weakest forms of verification. To be sure, the measured range itself is small. This is due to the chemical tracked and its relatively meager atmospheric levels.
It is clear from Fig. 1 that the simple "mean model", which predicts the observation mean each time, would rival, or even be superior to, the OPS in this case. For instance, the mean of the OPS predictions is almost certainly not equal to the mean of the SF6 observations, whereas it is by definition equal in the mean model. Thus the OPS has no skill relative to, say, FB, where the mean model would be preferred.

Figure 1: OPS model predictions of maximum per-hour SF6 (y-axis) against observations (x-axis), with dates indicated, in µg/m2. A red one-to-one line is over-plotted. Perfect predictions would fall on this line.
Figs. 2 and 3 (Fig. 6 from Kincaid) show the OPS SO2 predictions for various periods and experimental runs (YR 3.0.2, etc.). As above, one-to-one lines are included. The letters in the upper left corners were added by us for easier identification.

Figure 2: OPS model predictions of SO2 (y-axis) against observations (x-axis), with dates indicated, for various periods and experimental runs. A red one-to-one line is over-plotted. Perfect predictions would fall on this line.

We first warn the reader against letting the one-to-one line interfere with interpreting model bias. For all runs, except possibly (d) and (f), the model does not have much visible correlation with the observables, except approximately in the mean and range, as above. Skill would likely be lacking for some, or maybe even all, verification measures, for the same reasons.
Figure 3: Extension of Fig. 2.

For (d) and (f), there is slight correlation, in the sense that a straight line (say, a regression) through the points might depart from a horizontal line (of no correlation, as is likely in Fig. 1). Even so, it is clear that (1) the model badly under-predicts the values of SO2, and (2) model performance becomes worse at larger values of SO2.
Figure 4: Verification statistics for various OPS model runs. The shaded regions represent "acceptable model performance", according to Kincaid.

Verification statistics for various OPS model runs are presented in Fig. 4 (Kincaid Table 4). The shaded regions represent "acceptable model performance", according to Kincaid. It is clear from these that the simple mean model bests OPS in FB and MG exactly, and likely in FAC2. It might be a rival to, or even better than, OPS in terms of NMSE and VG. If, for instance, the mean of the observations were larger than its standard deviation, which appears to be true from the figures, then the NMSE for the mean model would be less than 1. How much less we cannot tell from the figures alone (and the same goes for the more complicated VG).

In any case, there is nothing obvious in these figures to recommend the OPS model over the simple mean model. While a formal verification might change this judgement, and the OPS could be superior relative to some of these measures, at least some of the time, the OPS model cannot be said to have performed well at all.
Kincaid also presents results from Jaarsveld (below), which show 8 separate model prediction and observation means across "several models" for both April–May 1980 and May–June 1980, for both the OPS-PRO4 and the OPS v.120E. These are shown in Fig. 5.

At first glance, it appears that at least the OPS-PRO4 model might be skillful (and that OPS v.120E is much poorer). But this figure is a conglomerate, or rather an average across several model runs at separate points. It would be as if the points in panel (a) of Fig. 2 were averaged, the graph then consisting of three points, at $(\bar{Y}, \bar{X})$, one for each of the runs YR 3.0.2, YR 10.3.2, and LT-2013.

Figure 5: Quoting Jaarsveld: "South-to-north concentration profiles measured at the Kincaid power plant versus profiles modelled by several models."
That is, it would look something like the simulated data in Fig. 6.

Figure 6: Simulated data, n = 30 points within a year (left), and m = 12 years, each with n = 30 simulations, all of uncorrelated noise. Notice how much better the model appears after the averaging.

On the left is a figure simulated with n = 30 random normals with means of 7 and standard deviations of 2, which appear to match the actual data, though this is not crucial. They might represent the model within one year, making two to three predictions a month. There is by design no correlation between the predictions and observations, and indeed not much signal is seen in the figure on the left.

On the right are m = 12 years, each with n = 30 simulated uncorrelated predictions and observations. The points within each year are averaged. This greatly improves the model performance, at least visually. It is true that the means match in the simulations. In practice, however, they do not match exactly, and the naive mean model still beats the simulated model.
We chose to plot the simulations as a time series instead because, as we shall see below, this is how Jaarsveld portrayed the same data. However, a scatter plot (as above in the original plots) makes the point equally well.

Figure 7: Simulated data, as in the right panel of Fig. 6, but here as a scatter plot.

Fig. 7 shows the same data as the right panel, this time as a scatter plot. The story remains the same: the averaging deceptively improves the visual performance.
5.2 Falster

Fig. 8 (Falster Fig. 2.4) shows OPS model predictions and observations of NH3 for the Falster pig house experiment. Results from different versions of the same model with different parameterizations are shown. A one-to-one line is over-plotted.

Figure 8: OPS model predictions and observations of NH3 for the Falster pig house experiment for various model parameterizations. As above, a one-to-one line is over-plotted.

The model here does a better job than in Kincaid. But it over-predicts for small values of NH3, i.e. those less than 2 µg/m3. After this, the model has little visual correlation with NH3, or it under-predicts. There are only 5 data points above 2 µg/m3, so it is difficult to tell.

The same kind of results are found in Fig. 9 (Falster Fig. 2.5), but for different local conditions (pig house emissions including the slurry tank). This has the same criticisms as before: for NH3 values less than about 2.5 µg/m3, the model over-predicts, and after that it either under-predicts or breaks down.

Figure 9: Same as Fig. 8 but for different local conditions.

It is worth mentioning that the mean model still beats the OPS model here with respect to the verification measures FB and MG, since the mean model always scores perfectly on these. It might be that the OPS here finally has skill with respect to NMSE, or there may be insufficient data to say: only a formal analysis can tell.
More model runs, again for varying conditions, are found in Fig. 10 (Falster Fig. 3.5).

Figure 10: Same as Fig. 8 but again for different local conditions.

As before, for low levels of NH3, the model over-predicts. But it now shows very large over-predictions for large, and thereby important, values of NH3. At observed values around 60 µg/m3, the model predicts anywhere from about 220 to 230 µg/m3, a huge error.

The Falster report also includes values of the formal verification measures, but we do not show them here as they are much the same as in Kincaid, and in any case they are not especially valuable.
5.3 Prairie

Fig. 11 shows the first of many prediction-observation experiments. Many more were done with the Prairie data than in the previous two experiments, so we show only a few representative plots, the others being very similar. This experiment used old data from two sites to tune the model, to see how well the model could be made to fit the data.

Figure 11: OPS model predictions and observations of SO2 g/m2 (note the different scale) for the Prairie grass experiment.

Note the scale change. The model does reasonably well, but with some obvious biases at low values of SO2, and it becomes progressively worse for larger values, over-predicting SO2 levels.

Figure 12: As in Fig. 11 but for different model runs in the Prairie grass experiment.

Fig. 12 is much the same, though it is evident the settings of the model can lead to obvious biases (the red dots, for instance). The model generally over-predicts and becomes worse for larger values of SO2, though it is likely to have skill over the mean model conditional on most measures.

Finally, we present Fig. 13, which shows again various model runs, but here with optimal calibration factors calculated from the previous runs. The model appears best here, and it should. But it is an example of how models can always be made to fit already observed data. It is not the ideal test of model performance, which is found in Kincaid and Falster above.
Figure 13: OPS model predictions and observations of SO2 g/m2 after calibration for the Prairie grass experiment.

5.4 Jaarsveld

Jaarsveld (Chapter 8) represents a systematic approach to OPS model verification, though most of the (necessary) effort is spent examining model internals. The author does, however, present several graphics of averaged model performance. Two highly representative figures are reproduced below.

Figure 14: OPS model predictions and observations of NOy averaged over several sites and time points. Compare this to Fig. 7.

Fig. 14 at first glance looks impressive. But on consideration it is seen that the points are all averaged over many locations and model runs. The detail is lost. This is exactly like the Kincaid situation, in which the average across model runs gives the false appearance of model improvement. See Fig. 7 for a simulated example.

In any case, because the simpler mean model is just that, the mean, it would win for any of the verification measures. OPS has no skill in relation to it.
Next compare Fig. 15 with Figs. 2 and 3, which show the same experiment; while the former is averaged in a formal way, the latter are the original data. A completely different picture emerges: the average, as discussed, gives a false sense of model value.

Figure 15: OPS model predictions and observations of SO2 µg/m2 averaged for Kincaid.

Jaarsveld of course presents much rich information on the OPS model and its inner workings, and is an excellent reference. However, it is clear that much more work on verifying the model needs to be completed before it can be trusted. Averaging observations, and also model predictions, across time and space gives an overly optimistic idea of model performance.

As said in the Introduction, some of the averaged observation data used by Jaarsveld is itself the output of gridding models. The uncertainty in this process, which Jaarsveld does not report, should be considered in any future analysis.
A similar analysis to Jaarsveld is in [25], which compares OPS with two other similar models, LOTOS-EUROS and LEO, the latter being a hybrid between LOTOS-EUROS and OPS. This uses the Measuring Ammonia in Nature (MAN) data in the Netherlands, which have significant uncertainty in them (see [6]). The analysis is similar to Jaarsveld because the models were fit to previously observed data, from the years 2009–2011, and previously collected data were used to drive the model inputs. The models were not, however, tuned to the MAN data, as in the other studies mentioned above.

For example, for NH3, [25] found that the LEO results were almost identical to OPS (not surprisingly, since LEO is in part based on it), and that both LOTOS-EUROS and OPS "over-estimate concentrations of NH3 starting from observed levels of 2.5 µg/m3." The authors find that LOTOS-EUROS was inferior to the other two models.
The method of verification was regression analysis, with the atmospheric measurement (i.e. the "x" variable) predicting the prediction (the "y" variable). Perfect predictions would have a slope of 1, with no variability, indicating a straight line through the observation-prediction pairs. A slope less than 1 indicates over-prediction. For OPS (with LEO being very similar), the slopes were 0.82 ± 0.39 for NH4, 0.63 ± 0.60 for NH3 and 0.35 ± 0.12 for SO4. Over-prediction was thus prevalent.
Besides being for previously observed data, these results are also for yearly averages presented at different geographic points, and so are likely to be better than results for predictions made for individual grid squares and more frequent periods, for the reasons discussed in the Jaarsveld section.
6 Conclusion

A clear picture has emerged from examining verification studies of the AERIUS/OPS model, which is that this model is rather poor, in general and in detail, as indeed Kincaid and others concluded.

It seemingly does well in reproducing averages, although as we discussed (here and in previous work) the average itself is not known with certainty. The model does a reasonable job at reproducing ranges some of the time, but not always. Error is clearly a function of size, unfortunately becoming unacceptably large when atmospheric concentrations increase, which is of course when they are most important.

Previous verifications suffered substantially from using sub-par measures (like FB, etc.), and there was no consideration of skill. A simple "mean model", which always and everywhere predicts the mean of the observations, easily beats AERIUS/OPS using the accepted verification measures.
By this alone, AERIUS/OPS needs to be shelved until it is improved. The situation is worse for AERIUS/OPS because that mean model is itself not known with certainty. This is because the mean at each geographic point of interest must first be known. If it is not known, or is known only with significant uncertainty, then even the simple, and often superior, mean model is insufficient.

It should be recalled that all models, AERIUS/OPS included, only say what they are told to say; see [5] for proof. Therefore, AERIUS/OPS has been told to say the wrong things far too often. Improvements to the model are thus needed. After these are made, independent planned experiments are needed in which AERIUS/OPS is used to make predictions at various points and under various conditions. The observations taken at these points should be measured without error, to the extent that that is possible. Given its political importance, the verification of the model should be carried out by a group independent of those who created it.
At this point, there are no adequate replacements for AERIUS/OPS. This is also shown, albeit indirectly, in [24]. In a review of models, including the OPS, Theobald [24] shows that the OPS is very similar to other models (ADMS, AERMOD, and LADD; LADD is rejected as useful in that paper). Azouz et al. [2] also find that OPS produces output similar to the chemistry transport model CHIMERE (that work was not concerned with model verification).

The verification of these models in Theobald uses the same substandard measures as detailed above (FB, etc.). Because the models (other than LADD) are so similar, they share the same shortcomings as AERIUS/OPS. Indeed, the authors themselves only say model performance is "'adequate'" (the scare quotes are in the original), a judgement formed when analyzing the Falster data.
References

[1] J. S. Armstrong. Significance testing harms progress in forecasting (with discussion). International Journal of Forecasting, 23:321–327, 2007.

[2] N. Azouz, J.-L. Drouet, M. Beekmann, G. Siour, R. W. Kruit, and P. C. Comparison of spatial patterns of ammonia concentration and dry deposition flux between a regional Eulerian chemistry-transport model and a local Gaussian plume model. Air Quality, Atmosphere & Health, 12:719–729, 2019.

[3] W. M. Briggs. A general method of incorporating forecast cost and loss in value scores. Monthly Weather Review, 133(11):3393–3397, 2005.

[4] W. M. Briggs. Uncertainty: The Soul of Probability, Modeling & Statistics. Springer, New York, 2016.

[5] W. M. Briggs. Models only say what they're told to say. In N. N. Thach, D. T. Ha, N. D. Trung, and V. Kreinovich, editors, Prediction and Causality in Econometrics and Related Topics, pages 35–42. Springer, New York, 2022.

[6] W. M. Briggs and J. Hanekamp. Uncertainty in the MAN data calibration & trend estimates, 2019.

[7] W. M. Briggs and J. Hanekamp. Outlining a new method to quantify uncertainty in nitrogen critical loads, 2020.

[8] W. M. Briggs and J. C. Hanekamp. Nitrogen critical loads: Critical reflections on past experiments, ecological endpoints, and uncertainties. Dose-Response, 20(1):15593258221075513, 2022. PMID: 35185419.

[9] W. M. Briggs and D. Ruppert. Assessing the skill of yes/no predictions. Biometrics, 61(3):799–807, 2005.

[10] W. M. Briggs and R. A. Zaretzki. The skill plot: a graphical technique for evaluating continuous diagnostic tests. Biometrics, 64:250–263, 2008. (With discussion.)

[11] J. Chang and S. Hanna. Air quality model performance evaluation. Meteorology and Atmospheric Physics, 87:167–196, 2004.

[12] M. de Heer, F. Roozen, and R. Maas. The integrated approach to nitrogen in the Netherlands: A preliminary review from a societal, scientific, juridical and practical perspective. Journal for Nature Conservation, 35:101–111, 2017.

[13] A. Dore, D. Carslaw, C. Braban, M. Cain, C. Chemel, C. Conolly, R. Derwent, S. Griffiths, J. Hall, G. Hayman, S. Lawrence, S. Metcalfe, A. Redington, D. Simpson, M. Sutton, P. Sutton, Y. Tang, M. Vieno, M. Werner, and J. Whyatt. Evaluation of the performance of different atmospheric chemical transport models and inter-comparison of nitrogen and sulphur deposition estimates for the UK. Atmospheric Environment, 119:131–143, 2015.

[14] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. JASA, 102:359–378, 2007.

[15] T. Gneiting, A. E. Raftery, and F. Balabdaoui. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69:243–268, 2007.

[16] S. Hanna and J. Chang. Acceptance criteria for urban dispersion model evaluation. Meteorology and Atmospheric Physics, 116:133–146, 2012.

[17] H. Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15:559–570, 2000.

[18] L. Hordijk, J. W. Erisman, H. Eskes, J. Hanekamp, M. Krol, P. Levelt, M. Schaap, W. de Vries, and A. Visser. Meer meten, robuuster rekenen: Eindrapport van het Adviescollege Meten en Berekenen Stikstof. Technical Report, June 2020, Adviescollege Meten en Berekenen Stikstof, Netherlands, 2020.

[19] L. Hordijk, J. W. Erisman, H. Eskes, J. Hanekamp, M. Krol, P. Levelt, M. Schaap, W. de Vries, and A. Visser. Niet uit de lucht gegrepen: Eerste rapport van het Adviescollege Meten en Berekenen Stikstof. Technical Report, 2 March 2020, Adviescollege Meten en Berekenen Stikstof, Netherlands, 2020.

[20] J. A. van Jaarsveld. The Operational Priority Substances model: Description and validation of OPS-Pro 4.1. Technical Report RIVM report 500045001/2004, Rijksinstituut voor Volksgezondheid en Milieu, Bilthoven, the Netherlands, 2004.

[21] A. H. Murphy. Forecast verification: its complexity and dimensionality. Monthly Weather Review, 119:1590–1601, 1991.

[22] A. H. Murphy and R. L. Winkler. A general framework for forecast verification. Monthly Weather Review, 115:1330–1338, 1987.

[23] M. Schervish. A general method for comparing probability assessors. Annals of Statistics, 17:1856–1879, 1989.

[24] M. R. Theobald, P. Lofstrom, J. Walker, H. V. Andersen, P. Pedersen, A. Vallejo, and M. A. Sutton. An intercomparison of models used to simulate the short-range atmospheric dispersion of agricultural ammonia emissions. Environmental Modelling & Software, 37:90–102, 2012.

[25] E. van der Swaluw, W. de Vries, F. Sauter, J. Aben, G. Velders, and A. van Pul. High-resolution modelling of air pollution and deposition over the Netherlands with plume, grid and hybrid modelling. Atmospheric Environment, 155:140–153, 2017.

[26] D. S. Wilks. Statistical Methods in the Atmospheric Sciences. Academic Press, New York, second edition, 2006.