
Criticizing AERIUS/OPS Model Performance

William M. Briggs

matt@wmbriggs.com

Michigan


Jaap C. Hanekamp

j.hanekamp@ucr.nl; hjaap@xs4all.nl

University College Roosevelt, Middelburg, the Netherlands

Environmental Health Sciences

University of Massachusetts, Amherst, MA, USA


Geesje Rotgers

grotgers@gmail.com

Stichting Agrifacts (STAF)


August 9, 2022

Abstract

We investigate the predictive skill of the AERIUS/OPS model. Skill is the demonstrated superiority of one model over another, given specific verification measures. OPS has no skill compared against a simple "mean" model, which beats OPS on some measures. Further, the verification measures used are themselves weak and inadequate, leading to the judgment that AERIUS/OPS should be shelved until an adequate replacement can be found.

Keywords: AERIUS/OPS, Atmospheric nitrogen, Model verification, Skill

1 Executive Summary

This paper severely criticizes the AERIUS/OPS model, used extensively in the Netherlands. This is a most important model. For example, the Dutch agricultural community is legally bound to this model with respect to all its activities, and notably in the protection of Natura 2000 habitats.

We show below that relying on this model is scientifically ill-advised: AERIUS/OPS should be abandoned without delay. Here we summarize our main findings.

It is disconcerting to note that numerous previous reviews of the Dutch nitrogen policies, e.g. [12], including scientific reviews of AERIUS/OPS, have never audited the model's validation as we do below. That unfortunately includes the Adviescollege Meten en Berekenen Stikstof (2019–2020), which produced the latest scientific review, and of which one of the authors of this paper (Hanekamp) was a member [19, 18].

Although we are able to demonstrate various model flaws, we are hampered in that the research data underlying validation studies (discussed below) still have not been made publicly available. Considering the political, regulatory and societal weight given to AERIUS/OPS, this must be amended immediately.

Even without these data, specific criticisms of the model's skill, validation procedures and the like can still be given, as we do in this paper. Our main points of critique are:


• Previous genuine validation studies concede the model performs poorly;

• AERIUS/OPS often does not perform as well as a simple background, or "mean", model; that is, a simple model beats AERIUS/OPS's predictions. AERIUS/OPS thus has no skill and should not be used;

• The verification measures used in previous studies are substandard and incomplete, resulting in poor performance being seen as adequate;

• A model executed using straightforward scenarios produces results which are trivial, and for all practical purposes not measurable. For instance, reducing a farm's cattle by 50% decreased nitrogen deposition by 0.1%, an "event" well within the margin of measurement and modeling error;

• Given its many failures, lack of skill, and general poor performance, AERIUS/OPS should not be used for any decision making. A full verification, and improvements flowing from those verifications, are needed before the model can be trusted;

• At this point, there exists no adequate substitute for AERIUS/OPS. Therefore, to avoid the fallacy of doing something for the sake of doing something, no model should be used until one proves itself in independent verifications.


2 Introduction

We here make general conceptual and initial critiques of the AERIUS/OPS model. Specific and detailed model verification can only occur with fresh analyses on sets of data with matched model field predictions and observations. We do not have these data, as they have not yet been made public.

Nevertheless, we can go a good way down the road to this ideal. We reveal many shortcomings in previous verification analyses, and build a case that the OPS model should not be used to make decisions unless a full and proper verification can be accomplished.

We use visual depictions of model predictions and observations from the several experiments in which the model was used. We examine previous attempts at model verification. However, these used a set of verification measures that have several deficiencies, and as such are not particularly informative. So weak are these measures that applying them to a very simple replacement model, the "model" of always predicting the average, defeats AERIUS/OPS.

Before investigating previous validation studies, we first demonstrate how AERIUS/OPS works, using several common and simple scenarios. We find that even if the model is flawless, its results are not persuasive.

The only validation material available as of this writing comes in the form of four documents in which the OPS model was used to make predictions matched to observations. These are:

1. The Kincaid case: Comparing results of the OPS model with measurements around a high source, from May 2015. Short name: Kincaid.

The period of this study was 1980–1981, around a 1,100 MW coal-fired electricity generating plant near Kincaid, Illinois. SO2 and SF6 (as a tracer) were measured within a circle 20 km from the plant's stack. Two versions of the model made predictions, the short- (OPS-ST) and long-term (OPS-LT).

As Kincaid says (p. 6), "Taking distance into account, the correlation between modelled and measured [SF6] concentration per hour is poor". It is also claimed that (p. 6) "Only comparison of the field maximum per hour, regardless of the distance and direction, results in an acceptable model performance." This maximum is plotted in Fig. 1 below (Kincaid's Fig. 5).

For SO2, Kincaid says "nearly no performance indicator is within an acceptable range" for many periods.

Kincaid concludes (p. 13), "In general, the correlation between OPS results, modelled for this 'high source', and observations is low in all test periods". We show below that we concur with this.

2. OPS Test Report Falster and North Carolina: Comparing results of the OPS model with measurements around two pig farms, in Falster and North Carolina, from December 2014. Short name: Falster.

The Falster pig farm is in Denmark. As with Kincaid, both short- and long-term OPS models were used. These were designated by the source of input data used by the model, e.g. OPS-ST 3.0.2, where the 3.0.2 indicates the input parameters. The long-term model was designated OPS-Pro 2013. NH3 was modeled and measured. The models were run with various parameterizations and weather conditions, as indicated below in the plots.

Falster claims (p. 7) that for the Denmark farm, "The predicted concentrations at the more distant points are systematically higher than the observations; the overestimation by OPS-LT is stronger."

For the North Carolina farm, Falster concludes (p. 13), among other things, that "[t]he predicted concentrations at relatively close receptor points are systematically higher than the observations, the overestimation by OPS-ST is stronger."

We agree that the model at both locations performed more poorly for larger, and more important, values of NH3.


3. Prairie grass experiment, from March 2017. Short name: Prairie.

This isn't the same kind of study as the previous two reports, which were "live" model predictions and observations, and therefore apt sources for model verification. Here, the data were many years old, published on paper in 1958. They were used to "tune" model parameters (of which there are many) so that model predictions matched the data well.

This is, of course, a legitimate form of model investigation, but it is not adequate for investigating the model's "real-life" performance, for reasons we discuss below. The danger of over-fitting is ever-present in these tuning procedures, and the only way to discover whether this has happened is to test the model independently, as in the first two experiments.

What tuning and fitting studies do is give an idea of ideal model performance. How close to that ideal the model comes in independent predictions can only be ascertained outside these tuning studies.

4. The Operational Priority Substances Model: Description and validation of OPS-Pro 4.1, from 2004, by J.A. van Jaarsveld [20]. Short name: Jaarsveld.

This is a detailed look inside the OPS model and its properties. An entire chapter (8) is given over to model validation and uncertainty. This uses outside sources for independent verification, like Kincaid above. It concerns itself mostly with spatial and long-term averages of both observables and model predictions, which limits its use in a manner we describe below.

Much of the observation data provided is extracted from gridded sets, which are themselves the output of other models (of the geographic distribution of the observables). These are treated in Jaarsveld as if they were measured directly, i.e. without uncertainty. Yet there must be some uncertainty in these numbers, which implies the certainty we have in the model verification must be decreased to some important degree.

A work that is similar to Jaarsveld, but more limited in scope, is [25], which we discuss below.

3 Validation Measures

It is a well-known statistical truism that a model may always be built that predicts the data used to build that model to any desired level of accuracy. Even perfect predictions can be had if the model is complex enough (such as by increasing its tunable parameters). For that reason, it is never a reliable procedure to test models on the data used to build them. The only acceptable test is on data never used (or seen) by the model builders. Independent predictions of entirely new observations must be used. This is the manner in which, for instance, meteorological models are built and tested, and they are forced to meet at least daily tests of their goodness.

Model verification is a large, well-developed field, with a host of standard references and procedures; see for example [1, 4, 22, 17]. There is a large variety of verification measures, measuring various aspects of performance. There are also several key concepts, such as calibration and skill, which we discuss in part below. On calibration, see [15, 14]; for skill, see for example [9, 3, 10, 21].

The OPS documents use a handful of performance or verification measures. We detail these here, and show also that they are not especially useful. Even the crudest, simplest of models look good using them. Partly this is a result of relying on single-number summaries of model performance, which necessarily strips away useful information. And partly this is because the measures themselves aren't informative.

All verifications are simple in concept: make a set of predictions and compare these with the corresponding observations. The comparison is usually in the form of a score, or scores, which measures the set's average prediction error. This error can be more than just distance, such as how far off a prediction was. One can also determine the frequencies of sizes of errors, or how error depends on the size of the observable, and so on.

To understand the basics of this process, some notation helps.

Let Y be the observable of interest (the atmospheric concentration of SO2, SF6, etc.), and let X be the prediction of Y. Obviously, a perfect model has, for all predictions and observations i, Y_i = X_i, where i = 1, 2, ..., n, and n is the number of prediction–observation pairs.

Verification measures or scores are functions of the set of prediction–observation pairs, i.e. s = f({Y, X}_i). It is helpful to write Ȳ = (1/n) Σ_i Y_i, i.e. the mean of the observations, and similarly X̄ for the predictions. Ideally, scores should be proper. Briefly, proper scores are those that give the best score when the forecaster gives an honest prediction; that is, the scoring rule cannot be gamed by issuing predictions aimed toward improving the score. See [23] for details.

The verification measures historically preferred by OPS modelers, using the same abbreviations found in the documents, are these. It turns out the OPS model gives point predictions, i.e. X_i that are single values with no uncertainty attached to them. For example, OPS says Y_i will be exactly X_i, for all i. This is a severe limitation, because the best prediction is one that carries the full warranted uncertainty (see [26] for a discussion). It also limits the verification measures that can be used, which in turn limits what can be said about the model.

The measures used were these.

Fractional bias:

FB = 2 (Ȳ − X̄) / (Ȳ + X̄).    (1)

This is best when FB = 0, which happens when the mean of the predictions equals the mean of the observations. Matching means is, of course, a desirable trait, but it is the weakest of verification measures because, as experience shows and as we show below, even poor models may match the mean of the observations.

Geometric mean bias:

MG = exp( (1/n) Σ_i ln Y_i − (1/n) Σ_i ln X_i ).    (2)

This is best when MG = 1. This is another rendition of a mean-matching verification score, and suffers the restriction that predictions and observations must be positive. A similar measure that may be as valuable is the simple difference in means, i.e. Ȳ − X̄.


Normalized mean square error:

NMSE = (1/n) Σ_i (Y_i − X_i)² / (Ȳ X̄).    (3)

This is best when NMSE = 0. The normalization aids in comparing models of observables with different variabilities. The denominator is usually eliminated, and, after the square root is taken, the measure becomes the root mean square error (RMSE). The NMSE and RMSE are more useful measures because they penalize larger departures of predictions from observations more than simple differences do. They also have other attractive mathematical motivations relating to standard loss functions. However, these measures don't say where the departures took place, nor how often. They still give an on-average assessment.

Geometric variance:

VG = exp( (1/n) Σ_i (ln Y_i − ln X_i)² ).    (4)

This is best when VG = 1. This has the flavor of the RMSE, but with different weights for large departures of the prediction from the observed. The logging gives less emphasis to these departures.

Fraction of close predictions:

FAC2 = (1/n) Σ_i I( X_i ≥ Y_i/2 & X_i ≤ 2 Y_i ),    (5)

where I() is the indicator function, i.e. a function that takes the value 1 when its argument is true, else 0. In plain words, FAC2 is the fraction of model predictions within a factor of two of the observations. The best is FAC2 = 1. This is a wholly arbitrary, crude, overly generous, non-proper measure. It is only useful to those who do not care how far off they are with an error under 200% (i.e. a factor of 2). The size of this error can only be described as very large. FAC2 is used in studies of models similar to OPS, such as [13].

Falster cites [11] (but also see [16]), who give ranges on these measures which indicate "acceptable model performance". These limits are:

|FB| < 0.3
0.7 < MG < 1.3
NMSE < 1.5
VG < 4
FAC2 > 0.5.
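For concreteness, the five measures are straightforward to compute. The following is a minimal sketch in Python (the function names and the numpy implementation are ours, not taken from the OPS documents):

```python
import numpy as np

def verification_measures(y, x):
    """Compute the five OPS verification measures for observations y
    and point predictions x (equal-length arrays; positive values are
    required for MG and VG)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    fb = 2 * (y.mean() - x.mean()) / (y.mean() + x.mean())  # Eq. (1)
    mg = np.exp(np.log(y).mean() - np.log(x).mean())        # Eq. (2)
    nmse = np.mean((y - x) ** 2) / (y.mean() * x.mean())    # Eq. (3)
    vg = np.exp(np.mean((np.log(y) - np.log(x)) ** 2))      # Eq. (4)
    fac2 = np.mean((x >= y / 2) & (x <= 2 * y))             # Eq. (5)
    return {"FB": fb, "MG": mg, "NMSE": nmse, "VG": vg, "FAC2": fac2}

def acceptable(m):
    """The 'acceptable model performance' limits cited by Falster."""
    return (abs(m["FB"]) < 0.3 and 0.7 < m["MG"] < 1.3
            and m["NMSE"] < 1.5 and m["VG"] < 4 and m["FAC2"] > 0.5)
```

A perfect model, with X_i = Y_i for every i, returns FB = NMSE = 0 and MG = VG = FAC2 = 1, and so trivially passes the limits.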

Lacking in these documents are any notions of calibration or of skill. Calibration can be in mean, as when the means match; in distribution, as when the empirical cumulative distributions of predictions and observations match; and in exceedance, which is related to calibration in distribution (the frequency at which certain values are exceeded). The idea is that not only single points or means should match, but the entire range and frequencies should, too; i.e. all measurable characteristics of the predictions and observations.

We cannot assess lack of calibration well without data, and indeed we are limited to the historical scores computed from the old studies, but even so, some remarks can be given.

Skill is a key requirement in model assessment. We cannot discover whether skill was ever assessed for AERIUS. Skill is the formal comparison of an "advanced" model, like OPS, with a "simpler" model, where both models are judged and compared against a proper score, or against some verification measure, such as those given above. In its most basic form, a verification measure is computed for both the advanced and the simple model, and the results are compared. The model that gives the better performance is preferred.

Skill as a concept is widely used in meteorological, climate, and pollution dispersion models, and is considered a necessary criterion of model viability. An advanced model which performs worse than a simpler model is said to lack skill; if the advanced model performs better, it has skill, again conditional on the verification measure. Any advanced model without skill should not be used. More precisely, the simpler model should be used in its place.

In the case of the OPS model, a simple competitor model is the "mean model", in which the mean of the observables is the constant prediction (X ≡ Ȳ). That is, the mean model always predicts the mean of the observations. This is a sensible model, albeit crude, and it works well in many situations: say, temperature within a month at your home. Obviously, we should require the OPS model to do a superior job compared against the mean model.

In spite of its crudeness, the mean model would score FB = 0 (the best), MG = 1 (the best), and RMSE = σ, the standard deviation of the observables (this is simple to show); VG would be something like the RMSE. NMSE would be close to the bounds set by [16], as long as the distribution of the observations was roughly symmetric. FAC2 would be close to 0.90–0.95, i.e. the fraction within plus or minus twice the standard deviation, unless the observables are very highly skewed. Even then, something greater than 0.5 is likely.
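These claims are easy to check by simulation. The sketch below is our own; the gamma-distributed observable is an arbitrary stand-in for a positive, mildly skewed concentration. The constant mean model meets every one of the cited "acceptable performance" limits:

```python
import numpy as np

rng = np.random.default_rng(42)

# "Observations": positive and mildly skewed; the gamma choice is arbitrary.
y = rng.gamma(shape=4.0, scale=1.0, size=10_000)

# The mean model: the constant prediction X = mean(Y).
x = np.full_like(y, y.mean())

fb = 2 * (y.mean() - x.mean()) / (y.mean() + x.mean())  # exactly 0
mg = np.exp(np.log(y).mean() - np.log(x).mean())        # a little under 1
nmse = np.mean((y - x) ** 2) / (y.mean() * x.mean())    # = var(y)/mean(y)^2, here ~0.25
vg = np.exp(np.mean((np.log(y) - np.log(x)) ** 2))      # well under 4
fac2 = np.mean((x >= y / 2) & (x <= 2 * y))             # ~0.8 here

# Every cited limit is met: |FB| < 0.3, 0.7 < MG < 1.3,
# NMSE < 1.5, VG < 4, FAC2 > 0.5.
```

With this particular observable, MG comes out a little under 1 rather than exactly 1 (the observable is skewed), and FAC2 near 0.8 rather than 0.90–0.95, but both remain comfortably inside the "acceptable" limits.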

Clearly, at least using these verification measures, the "mean model" would be an excellent choice, showing exceptional performance which would be very difficult, or even impossible, for OPS to beat, regardless of how the observations turned out. The point is that the trivial mean model scores perfectly on at least two of OPS's preferred measures, and very well on the others.

This shows both the deficiencies of the measures, and of the OPS model itself, as we shall now discover.

4 Model Run

Before examining model performance, it is worthwhile examining model output under some common scenarios. We cannot in the space here do more than show a typical use for the model. Even so, we learn some interesting limitations.

We wanted to determine nitrogen deposition using AERIUS/OPS for an ammonia source of different strengths, at a distance of approximately 2000 meters. We therefore ran the AERIUS/OPS model using the following settings:

• Cattle barn emissions: no animals, 1 cow, 100 cows, 200 cows and 400 cows;

• Emission estimated at 10 kg NH3 per year per cow;

• Default values for emission point height and heat content;

• At the Natura 2000 site Lieftinghsbroek.

Our method was this:

1. We used the publicly available sites calculator.aerius.nl and monitor.aerius.nl.

2. The background deposition per hexagon of the Natura 2000 site was checked in AERIUS/OPS Monitor for 2019. These background depositions apply to what we call the "zero cows" scenario.

3. In the AERIUS/OPS Calculator, a cattle barn was created at approximately 2000 meters from the Natura 2000 site. It is not possible to make a calculation with zero cows; therefore the calculation is based on the smallest unit of 1 cow (again, the "zero cows" being just background). We assumed per-cow emissions of 10 kg N/year. The AERIUS/OPS Calculator finds the hexagon with the highest contribution and calculates its total deposition, background plus emissions. This hexagon is marked.

4. We ran calculations for the same barn for 1, 100, 200 and 400 cows.

5. The hexagon with the highest contribution was the same hexagon in all the calculations.

The results are in Table 1, which, for the hexagon with the highest concentration, shows the number of cows, the total deposition in mol N/ha/year, and the estimated deposition from cows alone.

The background deposition for "0 cows" is 1938 mol N/ha/year (in 2019). Yet at 400 cows, a huge increase, nitrogen deposition increases by less than 6 mol N/ha/year, a trivial, and likely unmeasurable, number.

The estimated difference going from 200 to 400 cows is only about 3 mol N/ha/year. This is surely a negligible amount. At best, it would be extraordinarily difficult to measure this difference in situ, if not impossible. Also, any reference to the nitrogen critical loads (NCLs) being further exceeded by the modelled 3 mol N/ha/year, as an argument for deliberately sidestepping the multiple uncertainties in the NCLs and AERIUS/OPS, is not only beside the point but, worse, circular in nature. The modelled 3 mols cannot, by any stretch of the imagination, be anywhere near reality, next to the manifold and substantial imprecisions in the NCLs we have documented previously [8, 7].

But suppose the model is perfect, which is an impossibility. Then the effect of reducing animals by 50% reduces total deposition by only a trivial amount, which could never, in practice, be measured or verified.

Therefore, for this plausible situation, the model is just about useless. It does nothing more in essence than report an estimated mean background deposition, given as a number with no uncertainty in it, which the estimated means do have; see [7, 8]. It is because of this that it is practically impossible to verify its performance.

Table 1: Results of the AERIUS/OPS model for various scenarios. The "0 cows" row is the background deposition only. Total depositions, and the estimated depositions from cows alone, are in mol N/ha/year.

Cows   Total deposition (mol N/ha/year)   Deposition from cows
0      1938                               0
1      1938.07                            0.01
100    1939.52                            1.47
200    1940.99                            2.93
400    1943.92                            5.87
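The percentages involved can be read straight off Table 1. A short sketch of the arithmetic (our own, using the Table 1 values; the roughly 0.15% halving effect is the "0.1%" order of magnitude quoted in the summary):

```python
# Total deposition (mol N/ha/year) by herd size, copied from Table 1.
total = {0: 1938.0, 1: 1938.07, 100: 1939.52, 200: 1940.99, 400: 1943.92}

# The entire 400-cow barn contributes about 0.3% of the total deposition.
frac_400 = (total[400] - total[0]) / total[400]

# Halving the herd from 400 to 200 cows changes the total by about 0.15%,
# well within measurement and modeling error.
halving = (total[400] - total[200]) / total[400]

print(f"400-cow share: {frac_400:.2%}; effect of halving: {halving:.2%}")
```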

We next examine how the model performed in actual situations.


5 General Critiques

5.1 Kincaid

From Kincaid (Fig. 4), Fig. 1 shows the OPS (the AERIUS prefix is generally left off in these sources, a practice we follow here) model predictions of maximum per-hour SF6 (y-axis) against observations (x-axis); the unit is µg/m2. A red one-to-one line is over-plotted. Perfect predictions would fall on this line. Obviously, both model predictions and observations cannot fall below 0. Predictions range from 0 to just under 1.5, as do most observations, with the exception of about 10 or so observations that are larger than 1.5.

Visually, there does not appear to be much, if any, correlation between the model and the observations, as discussed above. However, the model at least reproduces the approximate range, which is one of the weakest forms of verification. To be sure, the measured range itself is small. This is due to the chemical tracked and its relatively meager atmospheric levels.

It is clear from Fig. 1 that the simple "mean model", which predicts the observation mean each time, would rival, or even be superior to, the OPS in this case. For instance, the mean of the OPS predictions is almost certainly not equal to the mean of the SF6 observations, whereas it is by definition equal in the mean model. Thus, the OPS has no skill relative to, say, FB, where the mean model would be preferred.

Figure 1: OPS model predictions of maximum per-hour SF6 (y-axis) against observations (x-axis), with dates indicated, in µg/m2. A red one-to-one line is over-plotted. Perfect predictions would fall on this line.

Figs. 2 and 3 (Fig. 6 from Kincaid) show the OPS SO2 predictions for various periods and experimental runs (YR 3.0.2, etc.). As above, one-to-one lines are included. The letters in the upper left corners were added by us for easier identification.

Figure 2: OPS model predictions of SO2 (y-axis) against observations (x-axis), with dates indicated, for various periods and experimental runs. A red one-to-one line is over-plotted. Perfect predictions would fall on this line.

We first warn the reader against letting the one-to-one line interfere with interpreting model bias. For all runs, except possibly (d) and (f), the model does not have much visible correlation with the observables, except approximately the mean and range, as above. Skill would likely be lacking for some, or maybe even all, verification measures for the same reasons.


Figure 3: Extension of Fig. 2.


For (d) and (f), there is slight correlation, in the sense that a straight line (say, a regression) through the points might depart from a horizontal line (of no correlation, as is likely in Fig. 1). Even so, it's clear that (1) the model badly under-predicts the values of SO2, and that (2) model performance becomes worse at larger values of SO2.

Figure 4: Verification statistics for various OPS model runs. The shaded regions represent "acceptable model performance", according to Kincaid.

Verification statistics for various OPS model runs are presented in Fig. 4 (Kincaid Table 4). The shaded regions represent "acceptable model performance", according to Kincaid. It is clear from these that the simple mean model bests OPS in FB and MG exactly, and likely in FAC2. It might be a rival to, or even better than, OPS in terms of NMSE and VG. If, for instance, the mean of the observations was larger than its standard deviation, which appears to be true from the figures, then NMSE for the mean model would be less than 1. How much less we can't tell from the figures alone (and the same goes for the more complicated VG).

In any case, there is nothing obvious in these figures to recommend the OPS model over the simple mean model. While a formal verification might change this judgement, and the OPS could be superior relative to some of these measures, at least some of the time, the OPS model cannot be said to have performed well at all.

Kincaid also presents results from Jaarsveld (below), which show 8 separate model prediction and observation means across "several models" for both April–May 1980 and May–June 1980, for both the OPS-PRO4 and the OPS v.120E. These are shown in Fig. 5.

At first glance, it appears that at least the OPS-PRO4 model might be skillful (and that OPS v.120E is much poorer). But this figure is a conglomerate, or rather an average, across several model runs at separate points. It would be as if the points in (a) in Fig. 2 were averaged, the graph then consisting of three points, at (Ȳ, X̄) for each of the runs YR 3.0.2, YR 10.3.2, and LT-2013.

Figure 5: Quoting Jaarsveld: "South-to-north concentration profiles measured at the Kincaid power plant versus profiles modelled by several models."

That is, it would look something like the simulated data in Fig. 6.

On the left is a figure simulated with n = 30 random normals with means of 7 and standard deviations of 2, which appear to match the actual data, though this is not crucial. They might represent the model within one year, making two to three predictions a month. There is by design no correlation between the predictions and observations, and indeed not much signal is seen in the figure on the left.

On the right are m = 12 years, each with n = 30 simulated uncorrelated predictions and observations. The points within each year are averaged. This greatly improves the model performance, at least visually. It is true that the means match in the simulations. In practice, however, they do not match exactly, and the naive mean model still beats the simulation.

Figure 6: Simulated data, n = 30 points within a year (left), and m = 12 years, each with n = 30 simulations, all of uncorrelated noise. Notice how much better the model appears after the averaging.
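The effect of this averaging is easy to reproduce. A sketch of the simulation just described (our own code; the normal draws with mean 7 and standard deviation 2 follow the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 30  # 12 "years", each with 30 prediction-observation pairs

# Predictions and observations are, by design, uncorrelated noise
# around a common mean, as in Fig. 6.
pred = rng.normal(7, 2, size=(m, n))
obs = rng.normal(7, 2, size=(m, n))

# Pairwise error of the raw points: roughly sqrt(2) * 2, i.e. about 2.8.
rmse_raw = np.sqrt(np.mean((pred - obs) ** 2))

# Averaging within each year shrinks the apparent error by roughly sqrt(n),
# so the 12 yearly means cluster near the one-to-one line.
rmse_means = np.sqrt(np.mean((pred.mean(axis=1) - obs.mean(axis=1)) ** 2))
```

The averaged points look skillful even though, point for point, the "model" has no relationship to the observations at all.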

We chose to plot the simulations as a time series instead because, as we shall see below, this is how Jaarsveld portrayed the same data. However, a scatter plot (as above in the original plots) makes the point equally well.

Figure 7: Simulated data, as in the right panel of Fig. 6, but here as a scatter plot.

Here in Fig. 7 is the same data as in the right panel, this time as a scatter plot. The story remains the same: the averaging deceptively improves the visual performance.

5.2 Falster

Fig. 8 (Falster Fig. 2.4) shows OPS model predictions and observations of NH3 for the Falster pig house experiment. Results from different versions of the same model with different parameterizations are shown. A one-to-one line is over-plotted.

The model here does a better job than in Kincaid. But it over-predicts for small values of NH3, i.e. those less than 2 µg/m3. And after this, the model has little visual correlation with NH3, or it under-predicts. There are only 5 data points above 2 µg/m3, so it is difficult to tell.

The same kind of results are found in Fig. 9 (Falster Fig. 2.5), but for different local conditions (pig house emissions including the slurry tank). The same criticisms apply as before: with NH3 values less than about 2.5 µg/m3, the model over-predicts, and after that it either under-predicts or the model breaks down.

It's worth mentioning the mean model still beats the OPS model here with respect to the verification measures FB and MG, since the mean model always scores perfectly on these. It might be that the OPS here finally has skill with respect to NMSE, or there may be insufficient data to tell: only a formal analysis can say.

Figure 8: OPS model predictions and observations of NH3 for the Falster pig house experiment for various model parameterizations. As above, a one-to-one line is over-plotted.

Figure 9: Same as Fig. 8 but for different local conditions.

More model runs, again for varying conditions, are found in Fig. 10 (Falster Fig. 3.5).

Figure 10: Same as Fig. 8 but again for different local conditions.

As before, for low levels of NH3, the model over-predicts. But it now shows very large over-predictions for large, and thereby important, values of NH3. At observed values around 60 µg/m3, the model predicts anywhere from about 220 to 230 µg/m3, a huge error.

The Falster report also includes values of the formal verification measures, but we do not show them here as they are much the same as in Kincaid, and in any case they are not especially valuable.

5.3 Prairie

Fig. 11 shows the first of many prediction–observation experiments. Many more were done with the Prairie data than in the previous two experiments, so we only show a few representative plots, the others being very similar. This experiment took old data from two sites and used it to tune the model, to see how well the model could be made to fit the data.

Note the scale change. The model does reasonably well at low values of SO2, though with some obvious biases, but becomes progressively worse for larger values, over-predicting SO2 levels.

Fig. 12 is much the same, though it is evident that the settings of the model can lead to obvious biases (the red dots, for instance). The model generally over-predicts and becomes worse for larger values of SO2, though it is likely to have skill over the mean model conditional on most measures.

Figure 11: OPS model predictions and observations of SO2 g/m2 (note the different scale) for the Prairie grass experiment.

Figure 12: As in Fig. 11 but for different model runs in the Prairie grass experiment.

Finally, we present Fig. 13, which again shows various model runs, but here with optimal calibration factors calculated from the previous runs. The model appears best here, as it should. But this is an example of how models can always be made to fit already observed data. It is not an ideal test of model performance, unlike the Kincaid and Falster experiments above.
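Least-squares calibration of this sort is guaranteed to reduce in-sample error, which is why a calibrated fit looks best and why that is not evidence of predictive value. A minimal sketch with invented data (the observations, "model" outputs, and calibration factor below are all hypothetical, chosen only to illustrate the point):

```python
import numpy as np

rng = np.random.default_rng(7)
obs = rng.gamma(shape=2.0, scale=10.0, size=100)       # invented SO2-like observations
pred = 1.8 * obs + rng.normal(0, 8, size=100)          # a biased "model"

def mse(o, p):
    """Mean squared error between observations and predictions."""
    return ((o - p) ** 2).mean()

# Optimal multiplicative calibration factor (least squares): c = sum(o*p) / sum(p*p).
c = (obs * pred).sum() / (pred ** 2).sum()
calibrated = c * pred

# In-sample error can only shrink (c = 1 is always a candidate);
# this says nothing about performance on new data.
print(mse(obs, pred), "->", mse(obs, calibrated))
```

Because the uncalibrated model (c = 1) is one of the candidates the least-squares fit considers, the calibrated in-sample error can never be worse, regardless of whether the model has any predictive value.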

5.4 Jaarsveld

Jaarsveld (Chapter 8) represents a systematic approach to OPS model verification, though most of the (necessary) effort is spent examining model internals. The author does, however, present several graphics of averaged model performance. Two highly representative figures are reproduced below.

Fig. 14 at first glance looks impressive. But on consideration it is seen that the points are all averaged over many locations and model runs; the detail is lost. This is exactly like the Kincaid situation, in which the average across model runs gave the false appearance of model improvement. See Fig. 7 for a simulated example.
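The effect can be reproduced with a small simulation in the spirit of Fig. 7 (the numbers below are invented for illustration, not taken from OPS): predictions that are individually very noisy, once averaged over many runs and sites, land close to the mean of the observations and hide the point-by-point error.

```python
import numpy as np

rng = np.random.default_rng(42)
obs = rng.lognormal(0.5, 0.6, size=1000)            # invented observations
pred = obs + rng.normal(0, 2.0, size=1000)          # individually very noisy predictions
                                                    # (purely illustrative; sign not enforced)

# Point by point, the errors are large...
mean_abs_err = np.abs(pred - obs).mean()

# ...but the error of the averages is tiny, because the noise cancels.
avg_err = abs(pred.mean() - obs.mean())

print(mean_abs_err, avg_err)  # the averaged "error" is far smaller
```

A plot of averaged predictions against averaged observations would therefore hug the one-to-one line even though no individual prediction is any good.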

In any case, because the simpler mean model is just that, the mean, it would win for any of the verification measures. OPS has no skill in relation to it.

Figure 13: OPS model predictions and observations of SO2 g/m2 after calibration for the Prairie grass experiment.

Figure 14: OPS model predictions and observations of NOy averaged over several sites and time points. Compare this to Fig. 7.

Next compare Fig. 15 with Figs. 2 and 3, which show the same experiment; the former is averaged in a formal way, while the latter are the originals. A completely different picture emerges: the average, as discussed, gives a false sense of model value.

Jaarsveld of course presents much rich information on the OPS model and its inner workings, and is an excellent reference. However, it is clear that much more work on verifying the model needs to be completed before it can be trusted. Averaging observations across time and space, and also model predictions, gives an overly optimistic idea of model performance.

Figure 15: OPS model predictions and observations of SO2 µg/m2 averaged for Kincaid.

As said in the Introduction, some of the averaging of observations used by Jaarsveld is itself the output of gridding models. The uncertainty in this process, which Jaarsveld does not report, should be considered in any future analysis.

A similar analysis to Jaarsveld is in [25], which compares OPS with two other similar models, LOTOS-EUROS and LEO, the latter a hybrid of LOTOS-EUROS and OPS. This uses the Measuring Ammonia in Nature (MAN) data in the Netherlands, which carries significant uncertainty (see [6]). The analysis is similar to Jaarsveld in that the models were fit to previously observed data, from the years 2009–2011, and previously collected data were used to drive the model inputs. The models were not, however, tuned to the MAN data, as in the other studies mentioned above.

For example, for NH3, [25] found that the LEO results were almost identical to OPS (not surprisingly, since LEO is in part based on it), and that both LOTOS-EUROS and OPS "over-estimate concentrations of NH3 starting from observed levels of ∼2.5 µg/m3." The authors find that LOTOS-EUROS was inferior to the other two models.

The method of verification was regression analysis, with the atmospheric measurement (i.e. the "x" variable) predicting the prediction (the "y" variable). Perfect predictions would have a slope of 1, with no variability, indicating a straight line through the observation-prediction pairs. A slope less than 1 indicates over-prediction. For OPS (with LEO being very similar), the slopes were 0.82 ± 0.39 for NH4, 0.63 ± 0.60 for NH3, and 0.35 ± 0.12 for SO4. Over-prediction was thus prevalent.
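This regression check can be sketched as follows, with invented data mimicking a model that over-predicts low concentrations; `np.polyfit` stands in here for whatever regression routine [25] actually used, and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
obs = rng.uniform(0.5, 8.0, size=150)                  # invented NH3-like observations
pred = 2.0 + 0.6 * obs + rng.normal(0, 0.4, size=150)  # a model that over-predicts low values

# Regress the prediction on the observation; slope 1 with zero
# intercept would be a perfect forecast.
slope, intercept = np.polyfit(obs, pred, 1)
print(round(slope, 2))  # well below 1, flagging the bias
```

Note that the slope alone is a crude summary: it says nothing about scatter around the fitted line, which is why the large standard errors quoted above (e.g. 0.63 ± 0.60 for NH3) matter as much as the slopes themselves.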

Besides being for previously observed data, these results are also for yearly averages presented at different geographic points, and so are likely to be better than predictions made for individual grid squares and more frequent periods, for the reasons discussed in the Jaarsveld section.

6 Conclusion

A clear picture has emerged from examining verification studies of the AERIUS/OPS model: the model is rather poor, in general and in detail, as indeed Kincaid and others concluded.

It seemingly does well in reproducing averages, although, as we discussed (here and in previous work), the average itself is not known with certainty. The model does a reasonable job of reproducing ranges some of the time, but not always. Error is clearly a function of size, unfortunately becoming unacceptably large when atmospheric concentrations increase, which is of course when they are most important.

Previous verifications suffered substantially from using sub-par measures (like FB, etc.), and there was no consideration of skill. A simple "mean model", which always and everywhere predicts the mean of the observations, easily beats AERIUS/OPS on the accepted verification measures.


By this alone, AERIUS/OPS needs to be shelved until it is improved. The situation is worse for AERIUS/OPS because the mean model itself is not known with certainty: the mean at each geographic point of interest must first be known. If it is not known, or is known only with significant uncertainty, then even the simple, and often superior, mean model is insufficient.

It should be recalled that all models, AERIUS/OPS included, only say what they are told to say; see [5] for proof. AERIUS/OPS has therefore been told to say the wrong things far too often. Improvements to the model are thus needed. After these are made, independent planned experiments should be conducted in which AERIUS/OPS is used to make predictions at various points and under various conditions. The observations taken at these points should be measured without error, to the extent that that is possible. Given its political importance, the verification of the model should be carried out by a group independent of those who created it.

At this point, there are no adequate replacements for AERIUS/OPS. This is shown, albeit indirectly, in [24]. In a review of models, including OPS, Theobald [24] shows that OPS is very similar to other models (ADMS, AERMOD, and LADD; LADD is rejected as useful in that paper). Azouz et al. [2] also find that OPS produces output similar to the chemistry transport model CHIMERE (that work was not concerned with model verification).

The verification of these models in Theobald uses the same substandard measures as detailed above (FB, etc.). Because the models (other than LADD) are so similar, they share the same shortcomings as AERIUS/OPS. Indeed, the authors themselves only say model performance is "'adequate'" (the scare quotes are in the original), a judgement formed when analyzing the Falster data.

References

[1] J. S. Armstrong. Significance testing harms progress in forecasting (with discussion). International Journal of Forecasting, 23:321–327, 2007.

[2] N. Azouz, J.-L. Drouet, M. Beekmann, G. Siour, R. W. Kruit, and P. Cellier. Comparison of spatial patterns of ammonia concentration and dry deposition flux between a regional Eulerian chemistry-transport model and a local Gaussian plume model. Air Quality, Atmosphere & Health, 12:719–729, 2019.

[3] W. M. Briggs. A general method of incorporating forecast cost and loss in value scores. Monthly Weather Review, 133(11):3393–3397, 2005.

[4] W. M. Briggs. Uncertainty: The Soul of Probability, Modeling & Statistics. Springer, New York, 2016.

[5] W. M. Briggs. Models only say what they're told to say. In N. N. Thach, D. T. Ha, N. D. Trung, and V. Kreinovich, editors, Prediction and Causality in Econometrics and Related Topics, pages 35–42. Springer, New York, 2022.

[6] W. M. Briggs and J. Hanekamp. Uncertainty in the MAN data calibration & trend estimates, 2019.

[7] W. M. Briggs and J. Hanekamp. Outlining a new method to quantify uncertainty in nitrogen critical loads, 2020.

[8] W. M. Briggs and J. C. Hanekamp. Nitrogen critical loads: Critical reflections on past experiments, ecological endpoints, and uncertainties. Dose-Response, 20(1):15593258221075513, 2022. PMID: 35185419.

[9] W. M. Briggs and D. Ruppert. Assessing the skill of yes/no predictions. Biometrics, 61(3):799–807, 2005.

[10] W. M. Briggs and R. A. Zaretzki. The skill plot: a graphical technique for evaluating continuous diagnostic tests (with discussion). Biometrics, 64:250–263, 2008.

[11] J. Chang and S. Hanna. Air quality model performance evaluation. Meteorology and Atmospheric Physics, 87:167–196, 2004.

[12] M. de Heer, F. Roozen, and R. Maas. The integrated approach to nitrogen in the Netherlands: A preliminary review from a societal, scientific, juridical and practical perspective. Journal for Nature Conservation, 35:101–111, 2017.

[13] A. Dore, D. Carslaw, C. Braban, M. Cain, C. Chemel, C. Conolly, R. Derwent, S. Griffiths, J. Hall, G. Hayman, S. Lawrence, S. Metcalfe, A. Redington, D. Simpson, M. Sutton, P. Sutton, Y. Tang, M. Vieno, M. Werner, and J. Whyatt. Evaluation of the performance of different atmospheric chemical transport models and inter-comparison of nitrogen and sulphur deposition estimates for the UK. Atmospheric Environment, 119:131–143, 2015.

[14] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. JASA, 102:359–378, 2007.

[15] T. Gneiting, A. E. Raftery, and F. Balabdaoui. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69:243–268, 2007.

[16] S. Hanna and J. Chang. Acceptance criteria for urban dispersion model evaluation. Meteorology and Atmospheric Physics, 116:133–146, 2012.

[17] H. Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15:559–570, 2000.

[18] L. Hordijk, J. W. Erisman, H. Eskes, J. Hanekamp, M. Krol, P. Levelt, M. Schaap, W. de Vries, and A. Visser. Meer meten, robuuster rekenen: eindrapport van het Adviescollege Meten en Berekenen Stikstof [More measurement, more robust calculation: final report of the Advisory Committee on Nitrogen Measurement and Calculation]. Technical Report, June 2020, Adviescollege Meten en Berekenen Stikstof, Netherlands, 2020.

[19] L. Hordijk, J. W. Erisman, H. Eskes, J. Hanekamp, M. Krol, P. Levelt, M. Schaap, W. de Vries, and A. Visser. Niet uit de lucht gegrepen: eerste rapport van het Adviescollege Meten en Berekenen Stikstof [Not plucked out of thin air: first report of the Advisory Committee on Nitrogen Measurement and Calculation]. Technical Report, 2 March 2020, Adviescollege Meten en Berekenen Stikstof, Netherlands, 2020.

[20] J. Jaarsveld. Description and validation of OPS-Pro 4.1. Technical Report RIVM report 500045001/2004, Rijksinstituut voor Volksgezondheid en Milieu, Bilthoven, the Netherlands, 2004.

[21] A. H. Murphy. Forecast verification: its complexity and dimensionality. Monthly Weather Review, 119:1590–1601, 1991.

[22] A. H. Murphy and R. L. Winkler. A general framework for forecast verification. Monthly Weather Review, 115:1330–1338, 1987.

[23] M. Schervish. A general method for comparing probability assessors. Annals of Statistics, 17:1856–1879, 1989.

[24] M. R. Theobald, P. Lofstrom, J. Walker, H. V. Andersen, P. Pedersen, A. Vallejo, and M. A. Sutton. An intercomparison of models used to simulate the short-range atmospheric dispersion of agricultural ammonia emissions. Environmental Modelling & Software, 37:90–102, 2012.

[25] E. van der Swaluw, W. de Vries, F. Sauter, J. Aben, G. Velders, and A. van Pul. High-resolution modelling of air pollution and deposition over the Netherlands with plume, grid and hybrid modelling. Atmospheric Environment, 155:140–153, 2017.

[26] D. S. Wilks. Statistical Methods in the Atmospheric Sciences. Academic Press, New York, second edition, 2006.