Conditional probability is the probability of observing
one event given that another event has occurred. In an
environmental context, conditional probability helps to assess
the association between an environmental contaminant (i.e.,
the stressor) and the ecological condition of a resource (i.e.,
the response). These analyses, when combined with controlled
experiments and other methodologies, show great promise in
evaluating ecological conditions from observational data and
in defining water quality and other environmental criteria.
Current applications of conditional probability analysis (CPA)
are largely done via scripts or cumbersome spreadsheet routines,
which may prove daunting to end-users and do not provide
access to the underlying scripts. Combining spreadsheets with
scripts eases computation through a familiar interface (i.e.,
Microsoft Excel) and creates a transparent process through full
accessibility to the scripts. With this in mind, we developed a
software application, CProb, as an Add-in for Microsoft Excel
with R, R(D)com Server, and Visual Basic for Applications.
CProb calculates and plots scatterplots, empirical cumulative
distribution functions, and conditional probability. In this
short communication, we describe CPA, our motivation for
developing a CPA tool, and our implementation of CPA as a
Microsoft Excel Add-in. Further, we illustrate the use of our
software with two examples: a water quality example and a
landscape example. CProb is freely available for download at
CProb: A Computational Tool for Conducting Conditional Probability Analysis
Jeffrey W. Hollister,* Henry A. Walker, and John F. Paul USEPA
A goal of many applied environmental science efforts is
to understand the association between environmental
pollutants (i.e., stressors) and ecological condition (i.e.,
response). Understanding this relationship helps to identify areas
of likely ecological impairment, explore potential causes and
consequences of those impairments, and set acceptable levels
of a given stressor so that the health of an ecosystem may be
protected. It is possible to use both field and laboratory methods
to explore stressor-response relationships and set criteria that
are protective of ecological condition. Inference based on field
survey data alone can be problematic because of confounding,
uncontrolled variation in field experiments. Conversely,
controlled laboratory experiments alone are unable to replicate
the full range of factors that affect ecosystems. Causal assessment
is an approach that combines field-survey information from
independent datasets with lab and field experimental results
to develop multiple lines of evidence that elucidate
stressor-response associations (Angradi, 1999; USEPA, 2000). An
example of the causal assessment strategy is found in a recent
document on developing water quality criteria for suspended
and bedded sediments (USEPA, 2006). A key component of
this approach is CPA, which was identified as having significant
promise and has been used elsewhere to develop possible water
quality criteria and explore associations between ecological
conditions and human health (Paul and McDonald, 2005; Paul
et al., 2008).
In our experience, causal assessment and CPA are well received
by many state environmental managers; however, a common
stumbling block for implementation is access to user-friendly software.
As such, we developed a computational tool that uses Microsoft
Excel (Microsoft Corp., 2003) as an interface and the R Language
(R Development Core Team, 2006) for calculating and plotting
conditional probabilities. In this short communication we describe
CPA and our motivation for developing a CPA software tool. We
further demonstrate the use of the tool with two examples: (i) a
water quality criterion example from Paul and McDonald (2005) and,
(ii) building on prior work, a landscape ecological example
(Comeleo et al., 1996; Paul et al., 2002; Hollister et al., 2008a, 2008b).
Abbreviations: CPA, Conditional Probability Analysis; NLCD, National Land Cover
Dataset; EMAP, Environmental Monitoring and Assessment Program.
J.W. Hollister, U.S. Environmental Protection Agency, Office of Research and
Development, National Health and Environmental Effects Research Lab., Atlantic
Ecology Div., 27 Tarzwell Drive, Narragansett, RI 02882. H.A. Walker, U.S. Environmental
Protection Agency, Office of Research and Development, National Health and
Environmental Effects Research Lab., Atlantic Ecology Div., 27 Tarzwell Drive,
Narragansett, RI 02882. J.F. Paul, U.S. Environmental Protection Agency, Office of
Research and Development, Mail Code B343-06, Research Triangle Park, NC 27711.
Copyright © 2008 by the American Society of Agronomy, Crop Science
Society of America, and Soil Science Society of America. All rights
reserved. No part of this periodical may be reproduced or transmitted
in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system,
without permission in writing from the publisher.
Published in J. Environ. Qual. 37:2392–2396 (2008).
Received 9 Oct. 2007.
*Corresponding author (Hollister.jeff@epa.gov).
© ASA, CSSA, SSSA
677 S. Segoe Rd., Madison, WI 53711 USA
Hollister et al.: CProb: A Computational Tool for Conducting Conditional Probability Analysis 2393
Conditional Probability Analysis
A conditional probability is the probability of an event Y
occurring given that some other event X also has occurred. It
is denoted P(Y | X). Thus, a conditional probability describes
the probability of observing an event of interest in a subset of
samples drawn from the original statistical population. These
subsets are defined by conditions when X has occurred, in
addition to those used to define the entire statistical population.
Conditional probability is calculated as the ratio of the
joint probability that Y and X occur simultaneously in a given
sample from the original statistical population, P(Y, X), to the
probability of X in the original population, P(X). The notation for
this is written:

$$P(Y \mid X) = \frac{P(Y, X)}{P(X)}$$
The possible applications of conditional probability are
very broad. We focus on the more limited context of
environmental condition, in which conditional probability
may be interpreted as the probability of environmental or
ecological impairment given that a pollutant, nutrient, or other
stressor is larger or smaller than a certain amount (e.g., the
probability of benthic community impact given that copper
concentration in sediments exceeds 10 ppm). Refer to Paul
and McDonald (2005) for additional information.
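As an illustration of the definition above, the following Python sketch estimates P(Y | X) from paired observations, where X is the condition that the stressor is at or above a threshold and Y is a 0/1 impairment flag. It is not CProb's R code; the function name, the threshold convention, and the data values are hypothetical and chosen only for illustration.

```python
def conditional_probability(stressor, impaired, threshold):
    """Estimate P(impaired | stressor >= threshold) from paired observations."""
    if len(stressor) != len(impaired):
        raise ValueError("stressor and impaired must be paired")
    # Subset of samples that satisfy the condition X (stressor at or above threshold).
    subset = [y for x, y in zip(stressor, impaired) if x >= threshold]
    if not subset:
        return None  # P(X) = 0, so the conditional probability is undefined
    # P(Y, X) / P(X) reduces to the impaired fraction within the subset.
    return sum(subset) / len(subset)


# Made-up illustrative data: copper (ppm) and a 0/1 impairment flag.
copper = [2.0, 5.0, 8.0, 12.0, 15.0, 20.0]
impact = [0, 0, 1, 0, 1, 1]
print(conditional_probability(copper, impact, 10.0))
```

Evaluating the function over a range of thresholds gives the kind of curve shown in a conditional probability plot; with a threshold at or below the minimum stressor value, the result is the unconditional probability P(Y).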
Motivation for Developing a Conditional Probability Analysis Tool
Conditional probability analysis usually involves informal
spreadsheet routines or scripts. The spreadsheet routines require
a high degree of user interaction, which increases the potential
for transcription errors and incorrect formulas and may reduce
the reproducibility of the analysis. Scripts
improve reproducibility and reliability but are much less accessible
because they require statistical programming languages such
as S-Plus (Insightful Corp., 2005) or R. While these software
applications are robust and reliable, they may be
expensive (e.g., S-Plus) or have a steep learning curve.
The anticipated end-users of CPA are local or state
environmental managers who have the technical expertise to
understand CPA but are unlikely to have the time to learn a
new application. Additionally, these users often prefer
spreadsheets, such as Microsoft Excel, for statistical analysis and data
management. Although the limitations of using spreadsheets
for these purposes are well documented (Su et al., 2003;
Knüsel, 2005; McCullough and Wilson, 2005), spreadsheets
provide both a familiar interface for managers and a useful
front-end for other computational resources and new
software tools (Su et al., 2003; Baier and Neuwirth, 2007). With
this in mind, we used the statistical programming language
R to build a script to calculate CPA, and we combined the
script with the familiar front-end of Microsoft Excel 2003.
We accomplished this with a CPA Excel Add-in, CProb, that
uses the R(D)COM server and RExcel (Baier and Neuwirth,
2006a, 2006b, 2007).
Materials and Methods
Software Requirements and Installation Information
Freely available for download from
version 1.0 is a Microsoft Excel Add-in developed with
Microsoft Office Excel 2003, Visual Basic for Applications, R
version 2.4.0, R(D)Com Server version 2.01, and RExcel
version 1.50. The Add-in uses Microsoft Excel as a front-end
interface and calculates CPA using R as the statistical
processor with R(D)Com providing the connection. Details about
acquiring, using, and installing these programs are available
from Microsoft, The R Project for Statistical Computing, and
the R(D)Com page at the University of Vienna, respectively
(Table 1). Detailed installation instructions for CProb are
in a README file (CProbREADME.txt) included with the
download. Since development, new versions of R, R(D)Com
Server, and RExcel have become available, and CProb is
compatible with these.
CProb Data Requirements
Data with paired stressor and response variables are
required for CPA. The stressor is either a discrete or continuous
variable, such as dissolved oxygen concentration, density of
suspended sediments, or number of landscape patches. The
response is either a dichotomous variable or a continuous or
discrete variable transformed to a dichotomous variable (e.g.,
with a threshold) that is thought to be responsive to the
stressor. For instance, Paul and McDonald (2005) compared
percent fines in bedded sediments to probability of taxa richness
of sensitive species less than a threshold. CProb also accepts
inclusion probabilities (i.e., from data acquired with a
probability sampling design) as an optional parameter. These
probabilities are used to calculate unbiased estimators from the
data for extrapolation to the statistical population from which
the sample was drawn.
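One common way to use inclusion probabilities is to weight each record by the inverse of its inclusion probability and take a ratio of weighted sums. The Python sketch below shows that idea under the stated assumption; how CProb itself applies the inclusion probabilities is not described here, and the function name and data values are hypothetical.

```python
def weighted_conditional_probability(stressor, impaired, weights, threshold):
    """Weighted estimate of P(impaired | stressor >= threshold).

    weights are assumed to be design weights, i.e., 1 / inclusion probability.
    """
    numerator = 0.0    # weighted count of records with X and Y
    denominator = 0.0  # weighted count of records with X
    for x, y, w in zip(stressor, impaired, weights):
        if x >= threshold:
            denominator += w
            if y:
                numerator += w
    return numerator / denominator if denominator > 0 else None


# Made-up illustrative data.
percent_fines = [10.0, 20.0, 30.0, 40.0]
impaired = [0, 1, 1, 0]
inclusion_prob = [0.5, 0.25, 0.25, 0.5]
weights = [1.0 / p for p in inclusion_prob]
print(weighted_conditional_probability(percent_fines, impaired, weights, 20.0))
```

With equal weights this reduces to the unweighted proportion of impaired records among those meeting the condition.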
CProb does not accept non-numeric values and returns
an error message if it encounters them. Non-numeric values must be corrected,
removed, or converted to a missing value before running CProb.
Missing values are allowed and are specified through the missing value
settings in RExcel (either blank cells or "#N/A"). CProb will
pass these to R and, since paired data are required, the records
with missing values will be omitted from all calculations.
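The handling of missing and non-numeric values described above can be sketched in Python as follows. This only mirrors the described behavior (drop records with a missing member of the pair, raise an error on non-numeric values); it is not the RExcel or CProb code, and the function name and marker set are hypothetical.

```python
MISSING_MARKERS = {"", "#N/A"}  # blank cell or "#N/A", as described in the text


def drop_incomplete(records):
    """Return (stressor, response) pairs with both values present, as floats.

    Records with a missing value are omitted; any other non-numeric
    value raises ValueError.
    """
    kept = []
    for stressor, response in records:
        if stressor is None or response is None:
            continue
        if stressor in MISSING_MARKERS or response in MISSING_MARKERS:
            continue
        # float() raises ValueError for non-numeric text such as "abc".
        kept.append((float(stressor), float(response)))
    return kept


rows = [(1.5, 0), ("", 1), (3, "#N/A"), (None, 1), ("4", "1")]
print(drop_incomplete(rows))
```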
Table 1. Software sources and links.
R(D)Com Server and RExcel
R Project for Statistical Computing
2394 Journal of Environmental Quality • Volume 37 • November–December 2008
Bootstrapped Confidence Intervals
CProb uses standard bootstrap resampling to estimate the
confidence intervals for the conditional probabilities (Manly,
2007). The raw data are resampled with replacement and a
conditional probability is calculated for each bootstrap
sample. The upper and lower confidence limits are extracted as
percentiles from the distribution of all bootstrapped
conditional probabilities. The interval is determined by the
"Confidence Interval Alpha" and the default value is 0.05. The
default value for the number of bootstrap iterations is 100. This
particular default value is provided to allow for an exploratory
analysis that balances the interpretability of the analysis against
the demand for computational resources. One hundred
iterations often result in unstable confidence intervals;
thus, the default value should only be used for initial,
exploratory calculations. As a rule of thumb, users of CProb should
consider at least 1000 iterations with an α of 0.05 and 5000
iterations with an α of 0.01 (Manly, 2007).
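The percentile bootstrap described above can be sketched in Python as follows. This is a minimal illustration, not CProb's R implementation: the function names, the simple truncated-index percentile rule, and the data are assumptions made for the example.

```python
import random


def cond_prob(pairs, threshold):
    """P(impaired | stressor >= threshold) for a list of (stressor, impaired) pairs."""
    subset = [y for x, y in pairs if x >= threshold]
    return sum(subset) / len(subset) if subset else None


def bootstrap_ci(pairs, threshold, iterations=1000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence limits for the conditional probability."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        # Resample the paired records with replacement.
        resample = rng.choices(pairs, k=len(pairs))
        p = cond_prob(resample, threshold)
        if p is not None:  # skip resamples with no record meeting the condition
            estimates.append(p)
    if not estimates:
        return None
    estimates.sort()
    # Lower and upper limits taken as percentiles (index truncated to an integer).
    lower = estimates[int((alpha / 2) * (len(estimates) - 1))]
    upper = estimates[int((1 - alpha / 2) * (len(estimates) - 1))]
    return lower, upper


# Made-up illustrative data: (stressor, impaired) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]
print(bootstrap_ci(data, 3.0, iterations=1000, alpha=0.05, seed=42))
```

Rerunning with more iterations and comparing the limits is a simple way to see the instability that small iteration counts can produce.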
Water Quality Criterion Example
To verify the output and test the functionality of CProb,
we repeated the analysis conducted by Paul and McDonald
(2005) for stream impairment and bedded sediments. In this
example, stream condition is defined by the total taxa richness
of benthic macroinvertebrates in the orders Ephemeroptera
(mayflies), Plecoptera (stoneflies), and Trichoptera (caddisflies;
referred to as EPT Taxa Richness). Bedded sediments are
characterized by percentage fines in the substrate. Data for this
example are from the USEPA's Environmental Monitoring
and Assessment Program's Mid-Atlantic Highlands Streams
Assessment. See Paul and McDonald (2005) for details and
additional references. The dataset for our example (included
with the software download as JEQData_wq.csv) includes
records for 99 stations and contains four columns: Station
Identifier (SITE.ID), Inclusion Probabilities (WGT.FS), EPT Taxa
Richness (EPT.RICH), and Percentage of Embedded Fines
(PCT.FN). We chose arguments to match the methodology in
Paul and McDonald (2005) as closely as possible.
Landscape Example
To test CProb independently, we conducted a CPA
between landscape metrics and the condition of estuarine
sediments. It is accepted that developed lands impact the
ecological condition of sediments in Mid-Atlantic estuaries
of the United States (Comeleo et al., 1996; Paul et al., 2002;
Hollister et al., 2008a, 2008b). We used estuarine data
(included with software download as JEQData_le.csv) from the
USEPA's Environmental Monitoring and Assessment Program
Estuaries program (EMAP) and total developed land cover
data from the 1992 National Land Cover Dataset (NLCD;
see http://www.epa.gov/emap/ and http://www.epa.gov/mrlc/nlcd.html
for more information). We used 112 stations.
Hollister et al. (2008a) identified spatial extents with the strongest
linear association with different sediment metals. Following
suggestions in Hollister et al. (2008a), we defined
contributing watersheds as all land area ≤15 km from a station that
also drains to that estuarine system. For this example, we
defined a site as having impaired ecological condition if it had
at least one metal exceeding the effects range median (ERM)
value, a benchmark for potential effects on benthic biota
(Long et al., 1995). A site with at least one metal exceeding
the ERM value has been previously used as a measure of
impairment (Paul et al., 1999). We used the metals examined in
Hollister et al. (2008a). We expected that the probability of
impairment should increase as the total amount of developed
lands in the contributing watershed increases and, conversely,
that the probability of being impaired would decline as total
amount of developed lands decreases.
Results and Discussion
Water Quality Criterion Example
To ensure that CProb works as expected, we compared
the CProb results to those obtained by Paul and McDonald
(2005). For the Mid-Atlantic Highlands Data, the resultant
CProb scatterplots (Fig. 1a) and cumulative distribution
function plots (Fig. 1b) match Fig. 3 and 4 of Paul and McDonald
(2005). The conditional probability plot from CProb (Fig. 1c)
very closely matches Paul and McDonald's (2005) Fig. 5. As
expected, the only differences are a product of the confidence
intervals resulting from the different bootstrapped samples.
Landscape Example
Of the 112 stations in the landscape example, 100 have no
metals exceeding an ERM value (Fig. 2a). On average, these sites
have 6169 ha of developed land with a standard deviation of
8041 ha. The average size of the contributing watersheds (i.e.,
drainage within 15 km of the station) is 62,749 ha with a
standard deviation of 7513 ha. The CPA results for the landscape
example corroborate past studies and suggest that higher amounts
of developed land in contributing watersheds lead to poor
condition in estuaries. Although separating developed land into finer
classifications (e.g., industrial and residential) might yield
stronger relationships, the accuracy of those finer classes is less than the
documented accuracy of the coarser developed land classification
and would lead to greater uncertainty (Yang et al., 2001; USGS,
2003; Hollister et al., 2004). The cumulative distribution
function indicates that the amount of developed land can be used to
discriminate between unimpaired sites (no metals greater than
an ERM) and impaired sites (at least one metal greater than an
ERM) as indicated by the separation between the two curves
(Fig. 2b). In fact, all observed impaired sites have at least
14,000 ha of developed land in the contributing watershed (i.e.,
drainage within 15 km of the sampling station). The unimpaired
sites have a range of developed land between 51 and 43,000 ha,
but no sites with less than 14,000 ha of developed land have any
metals exceeding an ERM value. Only 15% of the unimpaired
sites have total developed land amounts that overlap those of
the impaired sites (Fig. 2b).
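The comparison of cumulative distribution functions for impaired and unimpaired sites can be illustrated with a short Python sketch of an empirical cumulative distribution function. The function name and the data values are hypothetical and are not taken from the EMAP dataset; the sketch uses the "less than or equal to" convention, which may differ from the convention used in the CProb plots.

```python
def ecdf_at(values, x):
    """Empirical CDF: proportion of observations less than or equal to x."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v <= x) / len(values)


# Made-up developed-land amounts (ha) for two groups of sites.
unimpaired = [500.0, 2000.0, 6000.0, 9000.0, 16000.0]
impaired = [15000.0, 22000.0, 30000.0]

# Compare the two curves at a few developed-land amounts.
for x in (5000.0, 14000.0, 25000.0):
    print(x, ecdf_at(unimpaired, x), ecdf_at(impaired, x))
```

A large gap between the two proportions at the same developed-land amount corresponds to the separation between the curves discussed above.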
Fig. 1. (a) Scatterplot showing relationship between taxa richness
of Ephemeroptera (mayflies), Plecoptera (stoneflies), and
Trichoptera (caddisflies; referred to as EPT Taxa Richness)
and Bedded Sediments. (b) Cumulative distribution function
showing percentage of stations with a given amount, or more,
of percent fines. (c) Conditional probability plot showing
probability of impairment in streams with a given percentage
of fine material in the sediment. Solid lines indicate 95%
confidence intervals.
Fig. 2. (a) Scatterplot showing relationship between number of
metals exceeding the Effects Range Median (ERM) and total
developed land (ha) within 15 km of EMAP stations. (b)
Cumulative distribution function showing percentage of
stations with a given amount, or more, of total developed land
(ha) within 15 km of Environmental Monitoring and Assessment
Program (EMAP) stations. (c) Conditional probability plot
showing probability of at least one metal exceeding the ERM
given the total developed land (ha) is greater than or equal to Xc.
The conditional probability of impairment increases as total
developed area in a watershed increases (Fig. 2c). Non-overlapping
confidence intervals (i.e., where the conditional confidence
interval does not overlap the unconditional confidence
interval) may be used as a conservative test for differences. The
unconditional probability and confidence interval are
represented by the first point in the conditional probability plot. In
this case, non-overlapping confidence intervals suggest that the
conditional probability of impairment is significantly higher
between approximately 7500 and 37,500 ha than the
unconditional probability of impairment of approximately 10% [i.e.,
P(Y)]. This suggests that if development were limited to
approximately 7500 ha within 15 km of a sampling station, the
probability of metal contamination would not be significantly greater
than what would be normally expected for the area. There are
few stations with total developed land greater than 37,500 ha,
which results in substantial uncertainty about this part of the curve
and prevents us from drawing any conclusions there.
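The conservative test based on non-overlapping confidence intervals amounts to a simple interval comparison, sketched below in Python. The function name and the interval values are hypothetical and are not results from the landscape example.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up confidence intervals for illustration.
unconditional_ci = (0.05, 0.16)   # interval around P(Y)
conditional_ci = (0.22, 0.48)     # interval around P(Y | X >= some threshold)

# Non-overlap is treated as evidence that the conditional probability differs.
differs = not intervals_overlap(unconditional_ci, conditional_ci)
print(differs)
```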
Conclusions
Our goals for this short communication were to describe
a new tool for conducting CPA and illustrate its use with two
examples: one to validate the results and one to show the
versatility of the tool with an independent dataset. Our results
validate the calculations and analyses from CProb, as we
reproduced the results of Paul and McDonald (2005).
Furthermore, the power of CPA is that it provides a measure of the
association between any two variables, environmental or
otherwise, that explicitly incorporates uncertainty. Lastly, CProb
is proving to be user-friendly and is gaining acceptance with
a number of state environmental managers in the Northeast.
This tool, along with the analytical techniques described here
and in Paul and McDonald (2005), is helping to set
scientifically defensible water quality criteria and to describe
stressor-response relationships more clearly.
Acknowledgments
We would like to thank Sandra Benyi, Anita Morzillo,
and Patricia Shaw-Allen for their thoughtful reviews. They
greatly enhanced the quality of the manuscript and of CProb.
Also, we would like to thank the environmental managers
who have used and given feedback on CProb. The research
described in this paper has been funded by the USEPA. This
paper has not been subjected to USEPA review. Therefore, it
does not necessarily reflect the views of the USEPA. Mention
of trade names or commercial products does not constitute
endorsement or recommendation for use. This is contribution
number AED-07-095 of the Atlantic Ecology Division,
Office of Research and Development, National Health and
Environmental Effects Research Laboratory.
References
Angradi, T.R. 1999. Fine sediment and macroinvertebrate assemblages
in Appalachian streams: A field experiment with biomonitoring
applications. J. North Am. Benthol. Soc. 18:49–66.
Baier, T., and E. Neuwirth. 2006a. R(D)com server v 2.01. Available at
http://cran.r-project.org/contrib/extra/dcom/ (verified 8 July 2008).
Baier, T., and E. Neuwirth. 2006b. RExcel v 1.50. Available at
http://cran.r-project.org/contrib/extra/dcom/ (verified 8 July 2008).
Baier, T., and E. Neuwirth. 2007. Excel::Com::R. Comput. Stat. 22:91–108.
Comeleo, R.L., J.F. Paul, P.V. August, J.L. Copeland, C. Baker, S.S. Hale,
and R.W. Latimer. 1996. Relationships between watershed stressors and
sediment contamination in Chesapeake Bay estuaries. Landscape Ecol.
Hollister, J.W., P.V. August, and J.F. Paul. 2008a. Effects of spatial extent on
landscape structure and sediment metal concentration relationships
in small estuarine systems of the United States’ mid-Atlantic coast.
Landscape Ecol. 23(S1):91–106.
Hollister, J.W., P.V. August, J.F. Paul, and H.A. Walker. 2008b. Predicting
estuarine sediment metal concentrations and inferred ecological
conditions: An information theoretic approach. J. Environ. Qual.
Hollister, J.W., M.L. Gonzalez, J.F. Paul, P.V. August, and J.L. Copeland.
2004. Assessing the accuracy of the National Land Cover Dataset area
estimates at multiple spatial extents. Photogramm. Eng. Remote Sens.
Insightful Corp. 2005. S-PLUS 7.0 for Windows. Insightful Corp., Seattle, WA.
Knüsel, L. 2005. On the accuracy of statistical distributions in Microsoft
Excel 2003. Comput. Stat. Data Anal. 48:445–449.
Long, E.R., D.D. MacDonald, S.L. Smith, and F.D. Calder. 1995. Incidence
of adverse biological effect within ranges of chemical concentration in
marine and estuarine sediment. Environmental Management 19:81–97.
Manly, B.F.J. 2007. Randomization, bootstrap, and Monte Carlo methods in
biology. 3rd ed. Chapman Hall/CRC, Boca Raton.
McCullough, B.D., and B. Wilson. 2005. On the accuracy of statistical procedures
in Microsoft Excel 2003. Comput. Stat. Data Anal. 49:1244–1252.
Microsoft Corp. 2003. Microsoft Office Excel 2003 SP3. Microsoft Corp.,
Paul, J.F., R.L. Comeleo, and J.L. Copeland. 2002. Landscape metrics and
estuarine sediment contamination in the mid-Atlantic and southern
New England regions. J. Environ. Qual. 31:836–845.
Paul, J.F., J.H. Gentile, K.J. Scott, D.E. Campbell, and R.W. Latimer. 1999.
EMAP-Virginian province four-year assessment report (1990–1993),
119 p. U.S. Environmental Protection Agency, Atlantic Ecology Div.,
Paul, J.F., and M.E. McDonald. 2005. Development of empirical,
geographically specific water quality criteria: A conditional probability
analysis approach. J. Am. Water Resour. Assoc. 41:1211–1223.
Paul, J.F., M.E. McDonald, and S.F. Hedtke. 2008. Stream condition and
infant mortality in U.S. mid-Atlantic states. Hum. Ecol. Risk Assess.
R Development Core Team. 2006. R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
Su, Y., C.H. Langmuir, and P.D. Asimow. 2003. Petroplot: A plotting and
data management tool set for Microsoft Excel. Geochem. Geophys.
USEPA. 2000. Stressor identification guidance document. Office of Water,
Office of Research and Development, Washington, DC.
USEPA. 2006. Framework for developing suspended and bedded sediments
(SABS) water quality criteria. EPA-822-R-06-001. USEPA, Washington, DC.
USGS. 2003. Accuracy assessment of 1992 National Land Cover Data.
USGS, Washington, DC.
Yang, L., S.V. Stehman, J.H. Smith, and J.D. Wickham. 2001. Thematic
accuracy of MRLC land cover for the eastern United States. Remote
Sens. Environ. 76:418–422.