Comput. & Ops Res., Vol. 3, pp. 209-216. Pergamon Press, 1976. Printed in Great Britain
A GENERAL PURPOSE UNIVARIATE PROBABILITY
MODEL FOR ENVIRONMENTAL
DATA ANALYSIS
WAYNE R. OTT* and DAVID T. MAGE†
Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C. 20460, U.S.A.
Scope and purpose: The purpose of this article is to discuss a probability model that is well suited to the
underlying physical and chemical processes responsible for environmental phenomena. The model should
assist data analysts who must translate environmental monitoring data into more refined probability
statements for decision making and policy assessment purposes.
Abstract: Analysis of environmental quality data for decision making purposes (evaluation of compliance
with standards, examination of environmental trends, determination of confidence intervals) generally
requires a suitable univariate probability model. When many probability models are available, it is
sometimes difficult to select the most appropriate one for a given data set. The underlying physical laws
which generate pollutant concentrations (diffusion processes) offer insight into which model may be most
appropriate for a variety of situations. When the diffusion equation is treated as a stochastic differential
equation, the time series of pollutant concentration data from diffusion phenomena is shown to have a
distribution that is best approximated by the censored, 3-parameter lognormal probability model (LN3C).
The model is applied to 10 air quality data sets (SO2, O3, CO, particulate, hydrocarbons, and NO2 from the
United States, France, West Germany, and Denmark) and 9 water quality data sets (BOD, coliform, chloride,
and sulfate from the Ohio River). The authors conclude that the LN3C probability model offers data analysts
a superior, general-purpose model suitable for a large variety of environmental phenomena.
1. INTRODUCTION
Univariate probability models are of considerable importance in the analysis and interpretation
of environmental monitoring data for decision making purposes. For a given environmental
variable, such as atmospheric pollutant concentration, analysis usually begins by examining the
frequency distribution of values at a particular location. When an environmental policy is to be
evaluated or an environmental decision is to be made using the data, we may require a probability
statement: “The concentration exceeds x ppm with probability P”, or, “Based on these data, a
concentration of x ppm was exceeded, on the average, not more than once a year”. The latter
implies that the probability of exceeding concentration x (for 1-h averaging periods) was
$P \le (365 \times 24)^{-1}$. Under the Clean Air Act [1], environmental control programs in the United
States are required to achieve this level of air quality by attainment of National Ambient Air
Quality Standards in designated Air Quality Control Regions.
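The once-a-year criterion above reduces to simple counting of averaging periods. The following is a minimal sketch of that arithmetic, assuming a 365-day year divided into non-overlapping averaging periods; the helper name `max_exceedance_probability` is hypothetical and introduced only for illustration.

```python
def max_exceedance_probability(averaging_hours, allowed_per_year=1):
    """Largest exceedance probability consistent with at most
    `allowed_per_year` exceedances among the non-overlapping
    averaging periods of a 365-day year (an assumed convention)."""
    periods_per_year = 365 * 24 / averaging_hours
    return allowed_per_year / periods_per_year


# 1-h periods reproduce the bound quoted in the text, 1/(365 x 24).
for hours in (1, 8, 24):
    p = max_exceedance_probability(hours)
    print(f"{hours:>2}-h averages: P <= {p:.6f}")
```

Longer averaging times leave fewer periods per year, so the same "once a year" wording corresponds to a larger per-period probability.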
Because of the probabilistic nature of environmental standards such as these, analysis of
monitoring data usually begins by examining the frequency distribution of measured
concentrations at a given location. Although the histogram of concentrations gives insight into the
frequency of occurrence of different values, it does not provide a general model from which to
calculate the probability associated with any concentration of interest. A formal probability
model offers a way to make probability calculations with greater convenience and accuracy.
2. UNIVARIATE PROBABILITY MODELS
A univariate probability model (an equation involving a single random variable) has many
uses: (a) it provides a convenient means for interpolating between observed values; (b) it gives a
simple and compact representation of the entire distribution; (c) it provides, when plotted, a
complete pictorial representation; (d) it allows inferences to be made about the nature of the
underlying physical processes; (e) it assists in carrying out further statistical analyses
(examination of trends, studies of correlations, and calculation of confidence intervals).
*Dr. Ott received his Ph.D. degree from Stanford University in 1971 and holds B.S., B.A., M.S., and M.A. degrees in civil
and electrical engineering, engineering science, communications, and economics. He currently is Senior Systems Analyst for
Monitoring Quality Assurance with the Office of Research and Development of the U.S. Environmental Protection Agency,
401 M. St., Washington, D.C. 20460. He has published various papers on environmental modeling, data analysis and
interpretation, environmental indices, and monitoring network design.
†Dr. Mage received his Ph.D. degree from the University of Michigan in 1964 in the field of chemical engineering. He
currently is Chief, Chemistry Section, Air Quality Branch, Environmental Monitoring and Support Laboratory, U.S.
Environmental Protection Agency, Las Vegas, NV 89104. He has published a variety of papers in the areas of thermodynamics,
air pollution data analysis, monitoring surveys, and environmental monitoring methodology.
A variety of probability models have been proposed for environmental data. Larsen [2,3] has
applied the 2-parameter lognormal model to an extensive body of air pollution measurements;
Singpurwalla [4] has suggested use of the extreme value distributions for environmental data;
Curran [5] has shown that, for the tail, the 2-parameter exponential distribution can be superior to
the 2-parameter lognormal model; Lynn [6] has compared the gamma, lognormal, normal, beta,
and Pearson models using air quality data. The literature does not reveal any one model to be
clearly preferable to others in all instances.
Choosing the most appropriate probability model for a given data set may be difficult because
several different models may appear to fit the data equally well. Generally, the test statistic used
for the fitting criterion is itself a random variable; the data set may not be a truly random sample
from the unknown distribution; and the empirical observations contain experimental error. One
always should prefer a model based upon the physical processes from which the data
arise, especially for environmental data, where extensive statistical verification that one model
is more suitable than another usually is lacking. We shall examine the underlying laws which
govern pollutant concentrations (diffusion processes) to gain insight into whether one model, or
a class of models, may be more appropriate than others.
3. BASIS FOR MODEL
Most environmental variables of interest arise from diffusion processes: pollutants enter the
air, water, or land carrier medium, and their molecules become mixed and diffused in the medium.
For a volume fixed in space, assuming that the concentration of pollutant $C_s$ entering from a
source within the volume is much greater than the concentration C, the diffusion equation can be
written as follows for any fluid (air or water) carrier medium:

$$\frac{\partial C}{\partial t} + \left[ u \frac{\partial C}{\partial x} + v \frac{\partial C}{\partial y} + w \frac{\partial C}{\partial z} - D \nabla^2 C + K_1 C \right] = U, \tag{1}$$
where

$$U = K_1 C^* + K_2 C_s - K_3$$

and
C = pollutant concentration
u, v, w = velocity of carrier medium in the x, y, z directions
$\nabla^2$ = Laplacian operator, $\partial^2/\partial x^2 + \partial^2/\partial y^2 + \partial^2/\partial z^2$
$K_1$ = homogeneous chemical reaction rate coefficient ($K_1 \ge 0$)
$K_2$ = dilution coefficient ($K_2 \ge 0$)
$K_3$ = zeroth order heterogeneous chemical reaction rate coefficient ($K_3 = 0$ if only one phase
is present or if C = 0; otherwise $K_3 > 0$)
D = isotropic diffusion coefficient (D > 0)
$C_s$ = initial concentration of pollutant emitted by source within the volume (air source
emission or water effluent stream)
$C^*$ = concentration of pollutant if in chemical equilibrium with all other species present
($C^* = 0$ for irreversible reactions).
Diffusion of pollutants in an environmental medium is complex, involving many variables that
exhibit random fluctuations. In air, random fluctuations can be attributed to variations in
meteorological variables (weather conditions such as temperature, humidity, atmospheric
stability, solar radiation, wind speed, wind direction, etc.); variation in the sources of the air
pollutant of interest due to changes in industrial and human activities; and variations in the
concentrations of other pollutants present in the atmosphere, which in turn affect chemical
reaction rates. In water, turbulence and variations in flow cause random mixing variations, and
other pollutants in the medium also affect chemical reaction rates. For the terms in brackets in
equation (1), the authors have shown elsewhere in the literature [7] that C is a common factor of
the first and second derivatives and that fluctuations in V do not depend on fluctuations in C but
depend directly on fluctuations in u, v, and w, the carrier medium velocities in the x, y, and
z directions. Thus, the model treats $K_1$, $K_2$, $K_3$, $C_s$, $C^*$, D, u, v, and w all as random variables,
independent of C, which are lumped into the terms U and V.
The authors do not have solid evidence to conclude that these diverse environmental factors
may be treated as independent random variables. However, with so many apparently unrelated
physical processes, independence appears to be a reasonable conjecture, and we shall make this
assumption and then examine the distributions which result. The authors currently are exploring
the theoretical development of this model further in the context of realistic environmental
conditions.
Treating U and V as independent random variables arising from (unknown) distributions with
finite variance, we examine the distribution of C that may result. As shown elsewhere [7],
equation (1) can be approximated as a discrete time series, where each $C_{n+1}$ is a successive
concentration in time:

$$C_{n+1} = C_1 \prod_{i=1}^{n} (1 - V_i \Delta t) + \sum_{i=1}^{n} U_i \Delta t \left[ \prod_{j=i+1}^{n} (1 - V_j \Delta t) \right] \quad (\text{for } n = 1, 2, 3, \ldots). \tag{2}$$
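Equation (2) has the form obtained by unrolling a one-step update, $C_{k+1} = C_k (1 - V_k \Delta t) + U_k \Delta t$. That one-step form is inferred here from the structure of equation (2) rather than quoted from the source. The sketch below checks numerically that stepping the update and evaluating the closed form agree; the uniform draws for U and V, the seed, and the function names are illustrative assumptions only.

```python
import math
import random


def simulate_recursion(c1, u, v, dt):
    """Step C_{k+1} = C_k (1 - V_k dt) + U_k dt through every k; return the last value."""
    c = c1
    for u_k, v_k in zip(u, v):
        c = c * (1.0 - v_k * dt) + u_k * dt
    return c


def closed_form(c1, u, v, dt):
    """Evaluate the right-hand side of equation (2) directly."""
    n = len(v)
    total = c1 * math.prod(1.0 - v_i * dt for v_i in v)
    for i in range(n):
        # Product over j = i+1 .. n in the paper's 1-based indexing;
        # it is an empty product (equal to 1) for the final term.
        tail = math.prod(1.0 - v[j] * dt for j in range(i + 1, n))
        total += u[i] * dt * tail
    return total


rng = random.Random(0)  # fixed seed; the values below are arbitrary
n, dt = 50, 0.1
u = [rng.uniform(0.0, 2.0) for _ in range(n)]
v = [rng.uniform(0.0, 1.0) for _ in range(n)]

print(simulate_recursion(5.0, u, v, dt))
print(closed_form(5.0, u, v, dt))
```

The two printed values should agree to rounding error, which is the sense in which equation (2) summarizes the whole time series in terms of the starting concentration and the sequences of U and V.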
Nothing has been said about the possible correlation between U and V. There is a physical
reason to suspect that U and V are correlated, especially for air pollution measurements.
Normally, monitoring sites are surrounded by a variety of sources, with wind direction varying
from hour to hour and from day to day. When the wind comes from one direction, source-receptor
relationships are different than when the wind comes from another direction. Thus, U
and V will arise from different distributions for wind direction $\theta_1$ than for wind direction $\theta_2$. The
observed concentrations will be the result of many wind directions over a long time period. To
the extent that U and V are correlated with $\theta$, they will appear correlated with each other. That
is, much of the variation in U and V will be explainable by variation in $\theta$. This correlation should
be greater for air pollutants than for water pollutants in streams, where directional effects are
limited due to physical boundaries. The authors do not, however, have strong evidence that U
and V are correlated. Instead, they treat the likelihood of correlation as a reasonable conjecture
and examine the distribution which results.
Mathematical treatment for the cases $-1 < r_{UV} < +1$ is complex, and so we shall focus on the
extreme case where U and V are linearly related as $U_i = K V_i$, and $r_{UV} = +1$. Then equation (2)
simplifies:

$$C_{n+1} = (C_1 - K) \prod_{i=1}^{n} (1 - V_i \Delta t) + K. \tag{3}$$
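The simplification is easiest to follow one time step at a time. The short derivation below assumes the one-step update that equation (2) unrolls, together with the linear relation $U_n = K V_n$ at every step; it is offered as a sketch of the reasoning, not as the authors' published derivation.

```latex
\begin{align*}
C_{n+1} &= C_n\,(1 - V_n\,\Delta t) + U_n\,\Delta t
        && \text{(one step of the series)} \\
        &= C_n\,(1 - V_n\,\Delta t) + K\,V_n\,\Delta t
        && \text{(substituting } U_n = K V_n\text{)} \\
C_{n+1} - K &= (C_n - K)\,(1 - V_n\,\Delta t)
        && \text{(subtracting } K \text{ from both sides)} \\
            &= (C_1 - K)\prod_{i=1}^{n} (1 - V_i\,\Delta t)
        && \text{(applying the step } n \text{ times)}
\end{align*}
```

Taking logarithms of the last line requires $C_1 > K$ and $1 - V_i \Delta t > 0$ for every i, so that each factor is positive.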
Taking logarithms, we find that $\log(C_{n+1} - K)$ is the sum of n independent random variables
involving V, plus a constant:

$$\log(C_{n+1} - K) = \log(C_1 - K) + \sum_{i=1}^{n} \log(1 - V_i \Delta t). \tag{4}$$
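By the additive form of the central limit theorem [8], the right-hand side of equation (4) is approximately
normally distributed for each n = β, n = 2β, n = 3β, ..., where β ≫ 0. The concentrations $C_{\beta+1}$, $C_{2\beta+1}$,
$C_{3\beta+1}$, ..., must therefore have a 3-parameter lognormal distribution, in which K is the third
parameter. The probability density function (PDF) for this distribution is as follows:

$$h(c) = \begin{cases} \dfrac{1}{\sigma (c - K) \sqrt{2\pi}} \exp\left\{ -\dfrac{1}{2} \left[ \dfrac{\ln(c - K) - \mu}{\sigma} \right]^2 \right\}, & \text{for } c > K \\[2ex] 0, & \text{elsewhere} \end{cases}$$

$$-\infty < \mu < \infty, \quad \sigma > 0, \quad -\infty < K < \infty.$$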
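As a concrete reading of the density above, the following minimal sketch evaluates h(c) and checks numerically that it integrates to approximately one over c > K. The function names, the midpoint integration rule, and the parameter values are illustrative assumptions and are not fitted to any data set in this paper.

```python
import math


def ln3_pdf(c, mu, sigma, k):
    """Density of the 3-parameter lognormal: ln(c - k) ~ Normal(mu, sigma^2)."""
    if c <= k:
        return 0.0
    z = (math.log(c - k) - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * (c - k) * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=200_000):
    """Plain midpoint rule; adequate for a smooth one-dimensional density."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# Arbitrary illustrative parameters; the upper limit is far into the tail.
mu, sigma, k = 1.0, 0.5, -1.5
area = integrate(lambda c: ln3_pdf(c, mu, sigma, k), k, k + 60.0)
print(f"area under h(c) for c > K: {area:.4f}")
```

Setting k = 0 recovers the ordinary 2-parameter lognormal density, which is why the third parameter can be read as a shift of the whole curve along the concentration axis.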
In Fig. 1, the PDF is plotted, with X denoting the random variable, and arbitrary constants
μ = 0.987, σ = 0.472, and K = -1.5. When K is negative, as here, the curve is shifted to the left.
Fig. 1. P.D.F. of censored 3-parameter lognormal probability model (LN3C) for K = -1.5, μ = 0.987, σ = 0.472.
To comply with physical reality, pollutant concentrations can never be negative; therefore, to
avoid negative concentrations, the portion of the curve to the left of the origin must be censored,
and its area represented by a discrete probability value at the origin. In Fig. 1, the magnitude of
the discrete probability is P(0) = 0.109, which is equal to the area under the dotted portion of the
curve and to the left of the origin. This censorship is a natural consequence of equation (1), which
contains a zeroth order heterogeneous chemical reaction rate coefficient, $K_3$, which is set to zero
whenever C = 0. This occurs because sink mechanisms occasionally deplete the carrier medium
of its pollutant. The discrete probability mass at the origin is a consequence of this depletion
phenomenon. When applied to real data, this probability mass also includes the inability of some
measurement techniques to accurately measure very low concentrations (those below the minimum
detectable limit), which are erroneously recorded as zero values.
In this model, the successive instantaneous concentrations $C_{\beta+1}$, $C_{2\beta+1}$, $C_{3\beta+1}$, ... are serially
correlated, just as observed air pollution concentrations are known to be. Because of this serial
correlation, the time averages of the instantaneous concentrations retain their lognormal
characteristics, provided that the averaging times are short (1 h, 8 h, etc.). As the averaging time
increases, however, the serial correlation becomes less pronounced, and, for very long averaging
periods (e.g. one month), the distribution approaches the normal distribution.
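The size of the discrete probability at the origin follows directly from the uncensored distribution: it is the probability that the shifted lognormal variable falls at or below zero. A minimal sketch, using the constants given in the Fig. 1 caption (K = -1.5, μ = 0.987, σ = 0.472); the function names are hypothetical.

```python
import math
from statistics import NormalDist


def ln3c_mass_at_zero(mu, sigma, k):
    """Discrete probability placed at C = 0 by censoring.

    Equals P(X <= 0) for the uncensored 3-parameter lognormal,
    which is nonzero only when K < 0.
    """
    if k >= 0:
        return 0.0
    z = (math.log(-k) - mu) / sigma
    return NormalDist().cdf(z)


def ln3c_cdf(c, mu, sigma, k):
    """CDF of the censored model: a jump at zero, lognormal shape above it."""
    if c < 0:
        return 0.0
    if c <= k:  # only reachable when K >= 0: no probability at or below K
        return 0.0
    return NormalDist().cdf((math.log(c - k) - mu) / sigma)


p0 = ln3c_mass_at_zero(mu=0.987, sigma=0.472, k=-1.5)
print(f"P(0) = {p0:.3f}")  # about 0.109, the value quoted for Fig. 1
```

With these constants the computed mass matches the P(0) = 0.109 quoted in the text, and for any non-negative K the function returns zero, so no censoring is involved.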
4. APPLICATION OF THE MODEL
To determine whether a given data set is suited to the LN3C probability model, standard
logarithmic probability paper can be used. If the data are lognormally distributed, the plot
appears as a straight line. If, however, the data are better suited to the LN3C model, a curved line
results. By adding the proper constant to the data values and replotting, it is possible to again
obtain a straight line, indicating that log(C - K) is normally distributed, with K as the third
parameter. By this graphical trial-and-error process, the magnitude of K can be estimated.
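The graphical trial-and-error search can also be imitated numerically. The sketch below is one possible automation, not the authors' procedure: it scores each candidate K by how straight the log-probability plot becomes, using the Pearson correlation between log(C - K) and the normal quantile of the cumulative frequency. The grid of candidates, the synthetic data, and the function names are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def straightness(bounds, cum_fracs, k):
    """Pearson correlation between log(c - k) and the normal quantile of the
    cumulative frequency; 1.0 means a perfectly straight log-probability plot."""
    xs = [math.log(c - k) for c in bounds]
    ys = [_STD.inv_cdf(p) for p in cum_fracs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_k(bounds, cum_fracs, candidates):
    """Return the candidate K giving the straightest plot."""
    usable = [k for k in candidates if k < min(bounds)]  # need c - k > 0
    return max(usable, key=lambda k: straightness(bounds, cum_fracs, k))


# Synthetic check: cumulative frequencies generated from a known model.
true_mu, true_sigma, true_k = -3.0, 0.6, -0.02
bounds = [0.01, 0.02, 0.04, 0.06, 0.08, 0.12, 0.16, 0.24]
cum_fracs = [_STD.cdf((math.log(c - true_k) - true_mu) / true_sigma) for c in bounds]

grid = [-0.001 * i for i in range(0, 91)]  # 0, -0.001, ..., -0.090
k_hat = best_k(bounds, cum_fracs, grid)
print(k_hat)
```

On noise-free synthetic data the search should return the K used to generate it. With real grouped data the maximum is flatter, which is consistent with the repeated replotting the text describes.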
Air quality data
Figure 2 illustrates the approach using atmospheric sulfur dioxide concentrations for
Washington, D.C. Here, 43,000 hourly values have been grouped into 11 intervals, and the
cumulative frequencies are plotted on standard (2 cycle) log-probability paper. The original data
(lower curve) plot as a curved line that is concave downward. When an arbitrary value, 0.09 ppm,
is added to each concentration (K = -0.09 ppm) and the result is replotted, a new curve that is
concave upward results (top curve). Thus, the optimum value for K must lie between 0 ppm and
-0.09 ppm; by repeating the process several times, we estimate that optimum K = -0.0245 ppm.
This gives a straight line, which represents a log-probability plot of C' = C + 0.0245. It has a point
of censorship at C' = 0.0245, which corresponds to C = 0 ppm, the minimum physically
realizable value of sulfur dioxide. The point of censorship corresponds to the 8.5 percentile,
indicating that the model predicts 8.5% of the SO2 concentrations are 0 ppm.
The advantages of the LN3C probability model are illustrated dramatically in the case of Los
Angeles atmospheric hydrocarbon data (Fig. 3). Here, a plot of the original data (lower curve)
shows marked curvature; inclusion of a third parameter (K = -7.4 ppm) “straightens out” the
curvature, suggesting that the LN3C model is very suitable for these data. It should be pointed
out that, had only the standard LN2 probability model been used, we would be faced with the
difficult problem of drawing a straight line through the considerable curvature of the original data
points.

Fig. 2. Log-probability plot of atmospheric sulfur dioxide concentrations for Washington, D.C. CAMP (1-h
values; 1 December 1961 to 1 December 1968) showing: (a) Original data; (b) Data transformed as
C' = C + 0.0245 [optimum transformation]; (c) Data transformed as C' = C + 0.09 [too large].

Fig. 3. Log-probability plot of atmospheric hydrocarbon concentrations for the Los Angeles continuous air
monitoring project (CAMP) station, 1963 (n = 7500) showing both the original data and the transformed data
plotted as C' = C + 7.4.

Table 1 compares the results of fitting various air quality data sets to the LN2 and LN3C
probability models. The values of K are all negative and, for SO2 data sets, range from
-0.0051 ppm (-14.7 µg/m³) to -0.0245 ppm, with the parameters on each iteration calculated
by least squares minimization. This corresponds to the maximum likelihood estimation approach
discussed in Reference [8]. The error values listed on the right-hand side of the table
represent the sum of squares of the deviations between the probits of the data and the probits of
the model. These error values are measures of the goodness-of-fit of the model to the data. The
error for the LN3C model is significantly lower than for the LN2 model in all cases, showing the
LN3C model's general applicability to a wide variety of air pollutants. The problems of selecting
parameters for the models, along with goodness-of-fit statistics, are discussed elsewhere [7] and are
treated in considerable detail in a report currently in preparation by the authors.

Table 1. Results of fitting 10 air quality data sets to LN2 and LN3C probability models

Location          Pollutant     Ref.   No. of   No. of      K              LN2      LN3C
                                       values   intervals                  error    error
Upland, CA        Oxidant       [10]   7810     16          -1.06 ppm      0.1530   0.0147
Cincinnati, OH    Oxidant       [9]    36,000   6           -0.04 ppm      0.0340   0.0046
Los Angeles, CA   Oxidant       [9]    18,100   6           -0.054 ppm     0.0220   0.0082
Copenhagen        Particulate   [11]   2000     9           -7.73 µg/m³    0.0459   0.0079
Frankfurt         CO            [12]   -        6           -23.3 ppm      0.0349   0.0043
Los Angeles, CA   Hydrocarbon   [9]    7500     7           -7.4 ppm       0.1060   0.0030
Lacq, France      NO2           [13]   -        7           -30 µg/m³      0.0524   0.0041
Copenhagen        SO2           [11]   2000     9           -14.7 µg/m³    0.0170   0.0019
Washington, D.C.  SO2           [9]    43,000   11          -0.0245 ppm    0.1268   0.0359
St. Louis, MO     SO2           [9]    36,000   7           -0.0125 ppm    0.0075   0.0027
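The error measure described for Table 1, a sum of squared deviations between the probits of the data and the probits of the model, can be sketched as follows. This is an illustrative reading of that description, not the authors' code: probits are taken here as standard normal quantiles (the classical probit adds 5, which cancels in the differences), and the data, parameters, and function name are assumptions.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def probit_sse(bounds, observed_cum_fracs, mu, sigma, k=0.0):
    """Sum of squared differences between observed and model normal quantiles
    at each interval bound. k = 0 gives the 2-parameter (LN2) case."""
    sse = 0.0
    for c, p_obs in zip(bounds, observed_cum_fracs):
        z_obs = _STD.inv_cdf(p_obs)             # probit of the data
        z_mod = (math.log(c - k) - mu) / sigma  # probit of the model
        sse += (z_obs - z_mod) ** 2
    return sse


# Synthetic cumulative frequencies generated from a known 3-parameter model.
true_mu, true_sigma, true_k = 0.5, 0.4, -0.8
bounds = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
observed = [_STD.cdf((math.log(c - true_k) - true_mu) / true_sigma) for c in bounds]

err_ln3c = probit_sse(bounds, observed, true_mu, true_sigma, true_k)
# Same mu and sigma with K forced to zero; a real LN2 comparison would refit them.
err_ln2 = probit_sse(bounds, observed, true_mu, true_sigma)
print(err_ln3c, err_ln2)
```

Because the synthetic data come from the 3-parameter model itself, the first error is zero to rounding, while forcing K = 0 leaves a visibly larger error.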
Water quality data
Figure 4 shows a log-probability plot of Biochemical Oxygen Demand (BOD) for five stations
on the Ohio River. In this instance, the original data plot as a straight line, and the optimum value
for the third parameter is K = 0, indicating that the LN2 probability model is suitable.
Figure 5 plots the cumulative frequencies of 188 measurements of total coliform per 100 ml
sample for the Ohio River in Cincinnati. Although the points appear relatively straight, some
curvature is evident. For these data, the optimum value for the LN3C constant is K = -2.8/100 ml.
If the data were replotted as C' = C + 2.8, a straight line would result as in the previous figures.
Rather than replot the data, however, we have plotted the actual LN3C model on the same paper
(dotted line).
Table 2 summarizes the result of analyzing a variety of water quality data sets using the LN2
and LN3C probability models. Again, the error for the LN3C model is less than for the LN2
model, although the improvement in fit is not as dramatic for the water quality variables as it was
for the air quality variables. Unlike the air quality data, K is positive for some water variables.
No censorship is required in these instances, since the probability density function is shifted to
the right and all concentrations are positive.
Fig. 4. Log-probability plot of BOD data for 5 stations on the Ohio River (n = 1150). In this instance, the LN2
probability model (straight line) fits the data satisfactorily.
Table 2. Results of fitting 9 water quality data sets to LN2 and LN3C probability models*

Location          Pollutant   No. of   No. of      K              LN2      LN3C
                              values   intervals                  error    error
Louisville, KY    BOD         144      12          +0.238 mg/l    0.0133   0.0046
Wheeling, WV      BOD         133      10          -22.85 mg/l    0.0296   0.0078
Ohio River I†     BOD         1150     10          0.0 mg/l       0.0118   0.0118
Cincinnati, OH    Coliform    188      10          -2.8/100 ml    0.0143   0.6977 [sic]
Addison, OH       Chloride    451      10          +8.12 mg/l     0.0225   0.0051
Cincinnati, OH    Chloride    249      10          -5.36 mg/l     0.0464   0.0441
Louisville, KY    Chloride    306      10          -10.22 mg/l    0.0633   0.0503
Ohio River II†    Chloride    1006     10          -1.66 mg/l     0.0039   0.0034
Ohio River II†    Sulfate     948      10          -1.72 mg/l     0.0401   0.0101

*Source of Data: EPA Region 5 laboratory, Chicago, IL.
†Ohio River I includes Cairo, Louisville, Wheeling, Cincinnati, Addison; Ohio River II includes
Cincinnati, Louisville, and Addison.
Fig. 5. Log-probability plot of total coliform on the Ohio River in Cincinnati, showing both the raw data and
the slight curvature of the LN3C probability model with K = -2.8/100 ml.
5. DISCUSSION
For positive K, observed for some water quality variables, concentrations between 0 and K
have zero probability. Possibly, this phenomenon can be explained by the existence of
background concentrations of the pollutant in rivers or streams being monitored. Because this
background concentration is never totally purged, concentrations less than K are not observed.
This situation might be expected to occur at downstream monitoring sites having continuous
effluent sources upstream. A finite quantity of pollutant is always present at the downstream site.
In the case of air pollutants, by contrast, a given monitoring station may receive air
trajectories from all points on the compass. Whenever an incoming wind trajectory originates in
areas where no pollutant sources exist (or where sinks such as rainfall and vegetation have an
opportunity to deplete the air mass of pollutants), values below the minimum detectable limits of
the measurement technique are observed, and these are recorded as zero.
At the moment, these explanations are purely conjecture, and it is hoped that future findings
will provide more insight into the processes involved.
6. CONCLUSIONS
This paper has attempted to show how basic physical processes such as diffusion can be
probed to gain insight into which probability model may be the most suitable for analyzing
environmental quality data. The censored, 3-parameter lognormal probability model (LN3C)
appears to be an excellent choice; a physical rationale can be provided for this model, and it
appears to fit environmental quality data very well, significantly reducing overall error. For the 19
air quality and water quality data sets examined in this paper, the LN3C model either was
equivalent to or performed significantly better than the LN2 model. The authors feel that this
improvement is attributable not just to the addition of another parameter, which improves the
goodness-of-fit for any probability model, but to the suitability of the LN3C model for
environmental processes in general.
The excellent fit is impressive and implies that, with minimal error, much of the information
contained in large and complex environmental data sets can be compactly represented by this
simple, 3-parameter model. Considerable economy, for example, is achieved when a data set
containing 43,000 atmospheric sulfur dioxide concentrations can be reduced to a simple equation
with three parameters.
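To illustrate what the compact representation provides, the sketch below recovers any percentile of the censored model from the three parameters alone. It is a minimal example under stated assumptions: the function name is hypothetical, and the constants are the arbitrary ones from the Fig. 1 caption, not values fitted to the sulfur dioxide data.

```python
import math
from statistics import NormalDist


def ln3c_quantile(p, mu, sigma, k):
    """Concentration not exceeded with probability p under the LN3C model.
    Quantiles that fall in the censored region are reported as zero."""
    z = NormalDist().inv_cdf(p)
    return max(0.0, k + math.exp(mu + sigma * z))


# Arbitrary constants from the Fig. 1 caption.
mu, sigma, k = 0.987, 0.472, -1.5
for p in (0.05, 0.50, 0.99):
    print(f"{100 * p:.0f}th percentile: {ln3c_quantile(p, mu, sigma, k):.3f}")
```

Percentiles below the censored mass at the origin come out as zero, and for positive K every percentile lies above K, in line with the discussion of water quality variables.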
Use of this model should greatly facilitate comparison and evaluation of environmental data
from different cities. It should also greatly assist those involved in analyzing and interpreting
environmental data for the purpose of establishing environmental policies and evaluating
environmental decisions.
REFERENCES
1. Public Law 88-206, as amended by Public Law 91-604, 84 Stat. 1676 (42 U.S.C. 1857 et seq.).
2. C. E. Zimmer and R. I. Larsen, Calculating air quality and its control, J. Air Pollut. Control Ass. 15, 565-572 (1965).
3. R. I. Larsen, A new mathematical model of air pollutant concentration averaging time and frequency, J. Air Pollut.
Control Ass. 19, 24-30 (1969).
4. Personal communication with Nozer Singpurwalla, George Washington University, Washington, D.C. (1974).
5. T. C. Curran and N. Frank, Assessing the validity of the lognormal model when predicting maximum air pollution
concentrations, Presented at the 68th annual meeting of the Air Pollution Control Association, 15-20 June, Paper No.
75-51.3 (1975).
6. D. A. Lynn, Fitting curves to urban suspended particulate data, Proceedings of the Symposium on Statistical Aspects of
Air Quality Data, pp. 13-1 to 13-28. Environmental Protection Agency, Research Triangle Park, NC, EPA-650/4-74-038
(1974).
7. D. T. Mage and W. R. Ott, An improved statistical model for analyzing air pollution concentration data, Presented at the
68th annual meeting of the Air Pollution Control Association, 15-20 June, Paper No. 75-51.4 (1975).
8. J. Aitchison and J. A. C. Brown, The Lognormal Distribution, Cambridge University Press, London (1957).
9. R. I. Larsen, A mathematical model for relating air quality measurements to air quality standards, Publication No. AP-89,
U.S. Environmental Protection Agency, Research Triangle Park, NC (1971).
10. Personal communication with Ralph I. Larsen, Environmental Protection Agency (1975).
11. W. Lund, Luftforurening i Storkøbenhavn 1967-72, Storkøbenhavns Luftforureningsudvalg, Copenhagen, Denmark
(1973).
12. H. W. Georgii, VDI-Berichte Nr. 180, pp. 32 (1972).
13. M. M. Benarie, The use of the relationship between wind velocity and ambient pollutant concentration distributions for
the estimation of average concentrations from gross meteorological data, Proceedings of the Symposium on Statistical
Aspects of Air Quality Data, pp. 5-1 to 5-17. Environmental Protection Agency, Research Triangle Park, NC,
EPA-650/4-74-038 (1974).