
TECHNICAL REPORTS

SHORT COMMUNICATIONS


Conditional probability is the probability of observing one event given
that another event has occurred. In an environmental context, conditional
probability helps to assess the association between an environmental
contaminant (i.e., the stressor) and the ecological condition of a
resource (i.e., the response). These analyses, when combined with
controlled experiments and other methodologies, show great promise in
evaluating ecological conditions from observational data and in defining
water quality and other environmental criteria. Current applications of
conditional probability analysis (CPA) are largely done via scripts or
cumbersome spreadsheet routines, which may prove daunting to end-users
and do not provide access to the underlying scripts. Combining
spreadsheets with scripts eases computation through a familiar interface
(i.e., Microsoft Excel) and creates a transparent process through full
accessibility to the scripts. With this in mind, we developed a software
application, CProb, as an Add-in for Microsoft Excel with R, R(D)com
Server, and Visual Basic for Applications. CProb calculates and plots
scatterplots, empirical cumulative distribution functions, and
conditional probability. In this short communication, we describe CPA,
our motivation for developing a CPA tool, and our implementation of CPA
as a Microsoft Excel Add-in. Further, we illustrate the use of our
software with two examples: a water quality example and a landscape
example. CProb is freely available for download at
http://www.epa.gov/emap/nca/html/regions/cprob.

CProb: A Computational Tool for Conducting Conditional Probability Analysis

Jeffrey W. Hollister,* Henry A. Walker, and John F. Paul USEPA

A goal of many applied environmental science efforts is to understand the
association between environmental pollutants (i.e., stressors) and
ecological condition (i.e., response). Understanding this relationship
helps to identify areas of likely ecological impairment, explore
potential causes and consequences of those impairments, and set
acceptable levels of a given stressor so that the health of an ecosystem
may be protected. It is possible to use both field and laboratory methods
to explore stressor-response relationships and set criteria that are
protective of ecological condition. Inference based on field survey data
alone can be problematic because of confounding, uncontrolled variation
in field experiments. Conversely, controlled laboratory experiments alone
are unable to replicate the full range of factors that affect ecosystems.
Causal assessment is an approach that combines field-survey information
from independent datasets with lab and field experimental results to
develop multiple lines of evidence that elucidate stressor-response
associations (Angradi, 1999; USEPA, 2000). An example of the causal
assessment strategy is found in a recent document on developing water
quality criteria for suspended and bedded sediments (USEPA, 2006). A key
component of this approach is CPA, which was identified as having
significant promise and has been used elsewhere to develop possible water
quality criteria and explore associations between ecological conditions
and human health (Paul and McDonald, 2005; Paul et al., 2008).

In our experience, causal assessment and CPA are well received by many
state environmental managers; however, a common stumbling block for
implementation is access to user-friendly software. As such, we developed
a computational tool that uses Microsoft Excel (Microsoft Corp., 2003) as
an interface and the R Language (R Development Core Team, 2006) for
calculating and plotting conditional probabilities. In this short
communication we describe CPA and our motivation for developing a CPA
software tool. We further demonstrate the use of the tool with two
examples: (i) a water quality criterion example from Paul and McDonald
(2005) and, (ii) building on prior work, a landscape ecological example
(Comeleo et al., 1996; Paul et al., 2002; Hollister et al., 2008a, 2008b).

Abbreviations: CPA, Conditional Probability Analysis; NLCD, National Land Cover

Dataset; EMAP, Environmental Monitoring and Assessment Program.

J.W. Hollister, U.S. Environmental Protection Agency, Office of Research
and Development, National Health and Environmental Effects Research Lab.,
Atlantic Ecology Div., 27 Tarzwell Drive, Narragansett, RI 02882. H.A.
Walker, U.S. Environmental Protection Agency, Office of Research and
Development, National Health and Environmental Effects Research Lab.,
Atlantic Ecology Div., 27 Tarzwell Drive, Narragansett, RI 02882. J.F.
Paul, U.S. Environmental Protection Agency, Office of Research and
Development, Mail Code B343-06, Research Triangle Park, NC 27711.

Copyright © 2008 by the American Society of Agronomy, Crop Science

Society of America, and Soil Science Society of America. All rights

reserved. No part of this periodical may be reproduced or transmitted

in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system,

without permission in writing from the publisher.

Published in J. Environ. Qual. 37:2392–2396 (2008).

doi:10.2134/jeq2007.0536

Received 9 Oct. 2007.

*Corresponding author (Hollister.jeff@epa.gov).

© ASA, CSSA, SSSA

677 S. Segoe Rd., Madison, WI 53711 USA


Conditional Probability Analysis

A conditional probability is the probability of an event Y occurring
given that some other event X also has occurred. It is denoted P(Y | X).
Thus, a conditional probability describes the probability of observing an
event of interest in a subset of samples drawn from the original
statistical population. These subsets are defined by conditions when X
has occurred, in addition to those used to define the entire statistical
population. Conditional probability is calculated as the ratio of the
joint probability that Y and X occur simultaneously in a given sample
from the original statistical population, P(Y, X), to the probability of
X in the original population. The notation for this is written:

\[ P(Y \mid X) = \frac{P(Y, X)}{P(X)} \]

The possible applications of conditional probability are very broad. We
focus on the more limited context of environmental condition, in which
conditional probability may be interpreted as the probability of
environmental or ecological impairment given that a pollutant, nutrient,
or other stressor is larger or smaller than a certain amount (e.g., the
probability of benthic community impact given that copper concentration
in sediments exceeds 10 ppm). Refer to Paul and McDonald (2005) for
additional information.
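
As a minimal illustration of the ratio defined above, the following Python
sketch estimates P(Y | X) from paired observations by simple counting. The
function name, variable names, and the six data values are invented for
illustration; they are not part of CProb, which performs its calculations
in R.

```python
def conditional_probability(stressor, impaired, threshold):
    """Estimate P(Y | X), where X is 'stressor exceeds threshold'
    and Y is 'site is impaired', from paired observations."""
    pairs = list(zip(stressor, impaired))
    n_total = len(pairs)
    n_x = sum(1 for s, _ in pairs if s > threshold)
    n_xy = sum(1 for s, y in pairs if s > threshold and y)
    if n_x == 0:
        return None  # P(X) is zero, so the ratio is undefined
    p_x = n_x / n_total    # P(X)
    p_xy = n_xy / n_total  # P(Y, X)
    return p_xy / p_x

# Hypothetical copper concentrations (ppm) and benthic impact flags.
copper_ppm = [2.0, 5.0, 8.0, 12.0, 15.0, 20.0]
benthic_impact = [False, False, True, False, True, True]

p = conditional_probability(copper_ppm, benthic_impact, 10.0)
```

Three of the six hypothetical samples exceed 10 ppm, and two of those three
are impacted, so the estimate here is two-thirds.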

Motivation for Developing a Conditional Probability

Analysis Tool

Conditional probability analysis usually involves informal spreadsheet
routines or scripts. The spreadsheet routines require a high degree of
user interaction, which increases the potential for transcription errors
and incorrect formulas and may lead to an overall decline in the
reproducibility of the analysis. Scripts improve reproducibility and
reliability but are much less accessible, as they require statistical
programming languages such as S-Plus (Insightful Corp., 2005) or R. While
these software applications are robust and reliable, they may be
expensive (e.g., S-Plus) or have a steep learning curve.

The anticipated end-users of CPA are local or state environmental
managers who have the technical expertise to understand CPA but are
unlikely to have the time to learn a new application. Additionally, these
users often prefer spreadsheets, such as Microsoft Excel, for statistical
analysis and data management. Although the limitations of using
spreadsheets for these purposes are well documented (Su et al., 2003;
Knüsel, 2005; McCullough and Wilson, 2005), spreadsheets provide both a
familiar interface for managers and a useful front-end for other
computational resources and new software tools (Su et al., 2003; Baier
and Neuwirth, 2007). With this in mind, we used the statistical
programming language R to build a script to calculate CPA, and we
combined the script with the familiar front-end of Microsoft Excel 2003.
We accomplished this with a CPA Excel Add-in, CProb, that uses the
R(D)COM server and RExcel (Baier and Neuwirth, 2006a, 2006b, 2007).

Materials and Methods

Software Requirements and Installation Information

Freely available for download from
http://www.epa.gov/emap/nca/html/regions/cprob, CProb version 1.0 is a
Microsoft Excel Add-in developed with Microsoft Office Excel 2003, Visual
Basic for Applications, R version 2.4.0, R(D)Com Server version 2.01, and
RExcel version 1.50. The Add-in uses Microsoft Excel as a front-end
interface and calculates CPA using R as the statistical processor with
R(D)Com providing the connection. Details about acquiring, using, and
installing these programs are available from Microsoft, The R Project for
Statistical Computing, and the R(D)Com page at the University of Vienna,
respectively (Table 1). Detailed installation instructions for CProb are
in a README file (CProbREADME.txt) included with the download. Since
development, new versions of R, R(D)Com Server, and RExcel have become
available, and CProb is compatible with these.

CProb Data Requirements

Data with paired stressor and response variables are required for CPA.
The stressor is either a discrete or continuous variable, such as
dissolved oxygen concentration, density of suspended sediments, or number
of landscape patches. The response is either a dichotomous variable or a
continuous or discrete variable transformed to a dichotomous variable
(e.g., with a threshold) that is thought to be responsive to the
stressor. For instance, Paul and McDonald (2005) compared percent fines
in bedded sediments to probability of taxa richness of sensitive species
less than a threshold. CProb also accepts inclusion probabilities (i.e.,
from data acquired with a probability sampling design) as an optional
parameter. These probabilities are used to calculate unbiased estimators
from the data for extrapolation to the statistical population from which
the sample was drawn.
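
One common way to use inclusion probabilities is to weight each station by
the reciprocal of its inclusion probability (a Horvitz-Thompson-style
weight) and form a weighted ratio. The Python sketch below illustrates
that idea under this assumption; the text does not spell out the exact
estimator CProb applies, and the function name and data values are
hypothetical.

```python
def weighted_conditional_probability(stressor, impaired, inclusion_prob, threshold):
    """Weighted estimate of P(impaired | stressor >= threshold).

    Each station is weighted by 1 / inclusion probability (an assumed,
    Horvitz-Thompson-style weighting; not confirmed as CProb's method).
    """
    weight_x = 0.0   # total weight of stations meeting the condition X
    weight_xy = 0.0  # total weight of those that are also impaired
    for s, y, pi in zip(stressor, impaired, inclusion_prob):
        if s >= threshold:
            w = 1.0 / pi
            weight_x += w
            if y:
                weight_xy += w
    if weight_x == 0.0:
        return None  # no station meets the condition
    return weight_xy / weight_x

# Hypothetical percent fines, impairment flags, and inclusion probabilities.
pct_fines = [5.0, 20.0, 40.0, 60.0]
is_impaired = [False, True, False, True]
incl_prob = [0.5, 0.25, 0.5, 0.25]

p_weighted = weighted_conditional_probability(pct_fines, is_impaired, incl_prob, 20.0)
```

With these made-up values the weighted estimate is 0.8, whereas an
unweighted count over the same three stations would give two-thirds,
which shows how unequal inclusion probabilities can shift the result.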

CProb does not accept non-numeric values and returns an error message if
it encounters them. Non-numeric values must be corrected, removed, or
converted to a missing value before running CProb. Missing values are
allowed in the forms permitted by the missing value settings in RExcel
(either blank cells or “#N/A”). CProb will pass these to R and, since
paired data are required, the records with missing values will be omitted
from all calculations.
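
The pairwise handling of missing values described above can be sketched as
follows. This is an illustration in Python rather than CProb's own R and
VBA code; the helper names and sample values are hypothetical, and only
blank cells and "#N/A" are treated as missing.

```python
MISSING_MARKERS = ("", "#N/A")

def is_missing(value):
    """True for None, blank cells, and the '#N/A' marker."""
    return value is None or (isinstance(value, str) and value.strip() in MISSING_MARKERS)

def complete_pairs(stressor, response):
    """Keep only records where both values are present, as numbers.

    float() raises ValueError on other non-numeric text, mirroring the
    requirement that such values be fixed before the analysis is run.
    """
    kept = []
    for s, r in zip(stressor, response):
        if is_missing(s) or is_missing(r):
            continue  # drop the whole record, since the data must be paired
        kept.append((float(s), float(r)))
    return kept

# Hypothetical spreadsheet columns read as text.
pct_fines_cells = ["12.5", "", "30", "#N/A", "55"]
ept_rich_cells = ["4", "7", "1", "2", None]

pairs = complete_pairs(pct_fines_cells, ept_rich_cells)
```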

Table 1. Software sources and links.

Software                     Source                               Link
Excel 2003                   Microsoft                            http://office.microsoft.com/excel
R                            R Project for Statistical Computing  http://www.r-project.org
R(D)Com Server and RExcel    Sunsite                              http://rcom.univie.ac.at


2394 Journal of Environmental Quality • Volume 37 • November–December 2008

Bootstrapped Confidence Intervals

CProb uses standard bootstrap resampling to estimate the confidence
intervals for the conditional probabilities (Manly, 2007). The raw data
are resampled with replacement and a conditional probability is
calculated for each bootstrap sample. The upper and lower confidence
limits are extracted as percentiles from the distribution of all
bootstrapped conditional probabilities. The interval is determined by the
“Confidence Interval Alpha” setting, and the default value is 0.05. The
default value for the number of bootstrap iterations is 100. This default
is provided to allow for an exploratory analysis that balances the
interpretability of the analysis against the demand for computational
resources. One hundred iterations often result in unstable confidence
intervals; thus, the default value should only be used for initial,
exploratory calculations. As a rule of thumb, users of CProb should
consider at least 1000 iterations with an α of 0.05 and 5000 iterations
with an α of 0.01 (Manly, 2007).
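
The percentile bootstrap described above can be sketched in a few lines of
Python. This is an illustrative reimplementation under simple assumptions
(unweighted data, approximate index-based percentiles, a fixed random seed
for repeatability), not CProb's R code; all names and data values are
hypothetical.

```python
import random

def conditional_probability(pairs, threshold):
    """P(response = 1 | stressor >= threshold) from (stressor, response) pairs."""
    responses = [y for x, y in pairs if x >= threshold]
    if not responses:
        return None
    return sum(responses) / len(responses)

def bootstrap_ci(pairs, threshold, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence limits for the conditional probability."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iter):
        # Resample whole records with replacement to keep the pairing intact.
        resample = [rng.choice(pairs) for _ in range(len(pairs))]
        p = conditional_probability(resample, threshold)
        if p is not None:
            estimates.append(p)
    if not estimates:
        return None
    estimates.sort()
    last = len(estimates) - 1
    # Approximate the alpha/2 and 1 - alpha/2 percentiles by index.
    lower = estimates[int((alpha / 2) * last)]
    upper = estimates[int((1 - alpha / 2) * last)]
    return lower, upper

# Hypothetical paired data: (stressor value, impaired as 0 or 1).
pairs = [(5, 0), (10, 0), (15, 0), (20, 1), (25, 0),
         (30, 1), (35, 1), (40, 1), (45, 1), (50, 1)]

point = conditional_probability(pairs, 20)
lower, upper = bootstrap_ci(pairs, 20, n_iter=1000, alpha=0.05)
```

Raising `n_iter` to 5000 when `alpha` is 0.01 follows the rule of thumb
given in the text.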

Water Quality Criterion Example

To verify the output and test the functionality of CProb, we repeated the
analysis conducted by Paul and McDonald (2005) for stream impairment and
bedded sediments. In this example, stream condition is defined by the
total taxa richness of benthic macroinvertebrates in the orders
Ephemeroptera (mayflies), Plecoptera (stoneflies), and Trichoptera
(caddisflies; referred to as EPT Taxa Richness). Bedded sediments are
characterized by percentage fines in the substrate. Data for this example
are from the USEPA’s Environmental Monitoring and Assessment Program’s
Mid-Atlantic Highlands Streams Assessment. See Paul and McDonald (2005)
for details and additional references. The dataset for our example
(included with the software download as JEQData_wq.csv) includes records
for 99 stations and contains four columns: Station Identifier (SITE.ID),
Inclusion Probabilities (WGT.FS), EPT Taxa Richness (EPT.RICH), and
Percentage of Embedded Fines (PCT.FN). We chose arguments to match the
methodology in Paul and McDonald (2005) as closely as possible.

Landscape Example

To test CProb independently, we conducted a CPA between landscape metrics
and the condition of estuarine sediments. It is accepted that developed
lands impact the ecological condition of sediments in Mid-Atlantic
estuaries of the United States (Comeleo et al., 1996; Paul et al., 2002;
Hollister et al., 2008a, 2008b). We used estuarine data (included with
software download as JEQData_le.csv) from the USEPA’s Environmental
Monitoring and Assessment Program Estuaries program (EMAP) and total
developed land cover data from the 1992 National Land Cover Dataset
(NLCD; see http://www.epa.gov/emap/ and
http://www.epa.gov/mrlc/nlcd.html for more information). We used 112
stations. Hollister et al. (2008a) identified spatial extents with the
strongest linear association with different sediment metals. Following
suggestions in Hollister et al. (2008a), we defined contributing
watersheds as all land area ≤15 km from a station that also drains to
that estuarine system. For this example, we defined impaired ecological
condition by any site that had at least one metal exceeding the effects
range median (ERM) value, a benchmark for potential effects on benthic
biota (Long et al., 1995). A site with at least one metal exceeding the
ERM value has been previously used as a measure of impairment (Paul et
al., 1999). We used the metals examined in Hollister et al. (2008a). We
expected that the probability of impairment should increase as the total
amount of developed lands in the contributing watershed increases and,
conversely, that the probability of being impaired would decline as total
amount of developed lands decreases.

Results and Discussion

Water Quality Criterion Example

To ensure that CProb works as expected, we compared the CProb results to
those obtained by Paul and McDonald (2005). For the Mid-Atlantic
Highlands Data, the resultant CProb scatterplots (Fig. 1a) and cumulative
distribution function plots (Fig. 1b) match Fig. 3 and 4 of Paul and
McDonald (2005). The conditional probability plot from CProb (Fig. 1c)
very closely matches Paul and McDonald’s (2005) Fig. 5. As expected, the
only differences are a product of the confidence intervals resulting from
the different bootstrapped samples.

Landscape Example

Of the 112 stations in the landscape example, 100 have no metals
exceeding an ERM value (Fig. 2a). On average, these sites have 6169 ha of
developed land with a standard deviation of 8041 ha. The average size of
the contributing watersheds (i.e., drainage within 15 km of the station)
is 62,749 ha with a standard deviation of 7513 ha. The CPA results for
the landscape example corroborate past studies and suggest that higher
amounts of developed land in contributing watersheds lead to poor
condition in estuaries. Although separating developed land into finer
classifications (e.g., industrial and residential) might yield stronger
relationships, the accuracy of those finer classes is less than the
documented accuracy of the coarser developed land classification and
would lead to greater uncertainty (Yang et al., 2001; USGS, 2003;
Hollister et al., 2004). The cumulative distribution function indicates
that the amount of developed land can be used to discriminate between
unimpaired sites (no metals greater than an ERM) and impaired sites (at
least one metal greater than an ERM), as indicated by the separation
between the two curves (Fig. 2b). In fact, 100% of the observed impaired
sites have at least 14,000 ha of developed land in the contributing
watershed (i.e., drainage within 15 km of the sampling station). The
unimpaired sites have a range of developed land between 51 and 43,000 ha,
but no sites with less than 14,000 ha of developed land have any metals
exceeding an ERM value. Only 15% of the unimpaired sites have total
developed land amounts that overlap those of the impaired sites (Fig.
2b).
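
The kind of curve used for this comparison (the fraction of stations with
a given amount, or more, of developed land) can be sketched as below. The
function name and the developed-land values are made up for illustration
and are not taken from JEQData_le.csv.

```python
def fraction_at_or_above(values, amount):
    """Fraction of stations whose value is greater than or equal to `amount`."""
    return sum(1 for v in values if v >= amount) / len(values)

# Hypothetical developed land (ha) for each group of stations.
unimpaired_ha = [100.0, 2000.0, 9000.0, 16000.0]
impaired_ha = [15000.0, 30000.0]

# Evaluate both curves at the same amount to see how well they separate.
share_unimpaired = fraction_at_or_above(unimpaired_ha, 14000.0)
share_impaired = fraction_at_or_above(impaired_ha, 14000.0)
```

In this toy case every impaired station, but only a quarter of the
unimpaired stations, has at least 14,000 ha of developed land; a wide gap
between the two curves is what indicates that the stressor discriminates
between the groups.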

Fig. 1. (a) Scatterplot showing relationship between taxa richness of
Ephemeroptera (mayflies), Plecoptera (stoneflies), and Trichoptera
(caddisflies; referred to as EPT Taxa Richness) and Bedded Sediments. (b)
Cumulative distribution function showing percentage of stations with a
given amount, or more, of percent fines. (c) Conditional probability plot
showing probability of impairment in streams with a given percentage of
fine material in the sediment. Solid lines indicate 95% confidence
intervals.

Fig. 2. (a) Scatterplot showing relationship between number of metals
exceeding the Effects Range Median (ERM) and total developed land (ha)
within 15 km of EMAP stations. (b) Cumulative distribution function
showing percentage of stations with a given amount, or more, of total
developed land (ha) within 15 km of Environmental Monitoring and
Assessment Program (EMAP) stations. (c) Conditional probability plot
showing probability of at least one metal exceeding the ERM given the
total developed land (ha) is greater than or equal to Xc.

The conditional probability of impairment increases as total developed
area in a watershed increases (Fig. 2c). Non-overlapping confidence
intervals (i.e., where the conditional confidence interval does not
overlap the unconditional confidence interval) may be used as a
conservative test for differences. The unconditional probability and its
confidence interval are represented by the first point in the conditional
probability plot. In this case, non-overlapping confidence intervals
suggest that the conditional probability of impairment is significantly
higher between approximately 7500 and 37,500 ha than the unconditional
probability of impairment of approximately 10% [i.e., P(Y)]. This
suggests that if development were limited to approximately 7500 ha within
15 km of a sampling station, the probability of metal contamination would
not be significantly greater than what would normally be expected for the
area. There are few stations with total developed land greater than
37,500 ha, resulting in significant uncertainty about this part of the
curve and prohibiting us from drawing any conclusions.
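
The non-overlap comparison described in this paragraph amounts to a simple
interval check at each threshold. The Python sketch below shows one way to
express it; the function names are hypothetical and the interval values
are invented to illustrate the logic, not read from Fig. 2c.

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def thresholds_differing(unconditional_ci, conditional_cis):
    """Return thresholds whose conditional CI does not overlap the unconditional CI."""
    return [xc for xc, ci in conditional_cis
            if not intervals_overlap(ci, unconditional_ci)]

# Made-up confidence limits for P(Y) and for P(Y | X >= Xc) at three thresholds (ha).
unconditional_ci = (0.05, 0.16)
conditional_cis = [
    (5000, (0.08, 0.22)),   # overlaps the unconditional interval
    (10000, (0.18, 0.40)),  # lies entirely above it
    (40000, (0.10, 0.90)),  # very wide because few stations remain
]

flagged = thresholds_differing(unconditional_ci, conditional_cis)
```

Only the middle threshold is flagged here. The wide interval at the
largest threshold overlaps the unconditional interval, which corresponds
to the situation in which too few stations remain to draw a conclusion.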

Conclusions

Our goals for this short communication were to describe a new tool for
conducting CPA and illustrate its use with two examples: one to validate
the results and one to show the versatility of the tool with an
independent dataset. Our results validate the calculations and analyses
from CProb, as we reproduced the results of Paul and McDonald (2005).
Furthermore, the power of CPA is that it provides a measure of the
association between any two variables, environmental or otherwise, that
explicitly incorporates uncertainty. Lastly, CProb is proving to be
user-friendly and is gaining acceptance with a number of state
environmental managers in the Northeast. This tool, along with the
analytical techniques described here and in Paul and McDonald (2005), is
helping to set scientifically defensible water quality criteria and to
describe stressor-response relationships more clearly.

Acknowledgments

We would like to thank Sandra Benyi, Anita Morzillo, and Patricia
Shaw-Allen for their thoughtful reviews. They greatly enhanced the
quality of the manuscript and of CProb. Also, we would like to thank the
environmental managers who have used and given feedback on CProb. The
research described in this paper has been funded by the USEPA. This paper
has not been subjected to USEPA review. Therefore, it does not
necessarily reflect the views of the USEPA. Mention of trade names or
commercial products does not constitute endorsement or recommendation for
use. This is contribution number AED-07-095 of the Atlantic Ecology
Division, Office of Research and Development, National Health and
Environmental Effects Research Laboratory.

References

Angradi, T.R. 1999. Fine sediment and macroinvertebrate assemblages
in Appalachian streams: A field experiment with biomonitoring
applications. J. North Am. Benthol. Soc. 18:49–66.

Baier, T., and E. Neuwirth. 2006a. R(D)com server v 2.01. Available at
http://cran.r-project.org/contrib/extra/dcom/ (verified 8 July 2008).

Baier, T., and E. Neuwirth. 2006b. RExcel v 1.50. Available at
http://cran.r-project.org/contrib/extra/dcom/ (verified 8 July 2008).

Baier, T., and E. Neuwirth. 2007. Excel::Com::R. Comput. Stat. 22:91–108.

Comeleo, R.L., J.F. Paul, P.V. August, J.L. Copeland, C. Baker, S.S. Hale,

and R.W. Latimer. 1996. Relationships between watershed stressors and

sediment contamination in Chesapeake Bay estuaries. Landscape Ecol.

11:307–319.

Hollister, J.W., P.V. August, and J.F. Paul. 2008a. Effects of spatial extent on
landscape structure and sediment metal concentration relationships
in small estuarine systems of the United States’ mid-Atlantic coast.
Landscape Ecol. 23(S1):91–106.

Hollister, J.W., P.V. August, J.F. Paul, and H.A. Walker. 2008b. Predicting

estuarine sediment metal concentrations and inferred ecological

conditions: An information theoretic approach. J. Environ. Qual.

37:234–244.

Hollister, J.W., M.L. Gonzalez, J.F. Paul, P.V. August, and J.L. Copeland.

2004. Assessing the accuracy of the National Land Cover Dataset area

estimates at multiple spatial extents. Photogramm. Eng. Remote Sens.

70:405–414.

Insightful Corp. 2005. S-PLUS 7.0 for Windows. Insightful Corp., Seattle, WA.

Knüsel, L. 2005. On the accuracy of statistical distributions in Microsoft

Excel 2003. Comput. Stat. Data Anal. 48:445–449.

Long, E.R., D.D. MacDonald, S.L. Smith, and F.D. Calder. 1995. Incidence
of adverse biological effect within ranges of chemical concentration in
marine and estuarine sediment. Environmental Management 19:81–97.

Manly, B.F.J. 2007. Randomization, bootstrap, and Monte Carlo methods in

biology. 3rd ed. Chapman Hall/CRC, Boca Raton.

McCullough, B.D., and B. Wilson. 2005. On the accuracy of statistical procedures

in Microsoft Excel 2003. Comput. Stat. Data Anal. 49:1244–1252.

Microsoft Corp. 2003. Microsoft Office Excel 2003 SP3. Microsoft Corp.,
Redmond, WA.

Paul, J.F., R.L. Comeleo, and J.L. Copeland. 2002. Landscape metrics and

estuarine sediment contamination in the mid-Atlantic and southern

New England regions. J. Environ. Qual. 31:836–845.

Paul, J.F., J.H. Gentile, K.J. Scott, D.E. Campbell, and R.W. Latimer. 1999.

EMAP-Virginian province four-year assessment report (1990–1993),

119 p. U.S. Environmental Protection Agency, Atlantic Ecology Div.,

Narragansett, RI.

Paul, J.F., and M.E. McDonald. 2005. Development of empirical,
geographically specific water quality criteria: A conditional probability
analysis approach. J. Am. Water Resour. Assoc. 41:1211–1223.

Paul, J.F., M.E. McDonald, and S.F. Hedtke. 2008. Stream condition and

infant mortality in U.S. mid-Atlantic states. Hum. Ecol. Risk Assess.

(in press).

R Development Core Team. 2006. R: A language and environment for statistical

computing. R Foundation for Statistical Computing, Vienna, Austria.

Su, Y., C.H. Langmuir, and P.D. Asimow. 2003. Petroplot: A plotting and

data management tool set for Microsoft Excel. Geochem. Geophys.

Geosyst. 4:1–14.

USEPA. 2000. Stressor identification guidance document. Office of Water,
Office of Research and Development, Washington, DC.

USEPA. 2006. Framework for developing suspended and bedded sediments

(sabs) water quality criteria EPA-822-R-06-001. USEPA, Washington, DC.

USGS. 2003. Accuracy assessment of 1992 National Land Cover Data.

USGS, Washington, DC.

Yang, L., S.V. Stehman, J.H. Smith, and J.D. Wickham. 2001. Thematic
accuracy of MRLC land cover for the eastern United States. Remote
Sens. Environ. 76:418–422.