
Using Zero-inflated Models to Analyze Environmental Data Sets with Many Zeroes¹

Lorena C.M. Viviano, Vito M.R. Muggeo, Gianfranco Lovison

Dipartimento di Scienze Statistiche e Matematiche “S.Vianelli”

Università degli Studi di Palermo - Viale delle Scienze - 90128 Palermo

email: (viviano,vmuggeo,lovison) @dssm.unipa.it

Summary: The analysis of count data can sometimes be complicated by a number of zeroes greater than that expected under the Poisson model, which is the standard assumption for modelling this type of data. The primary aim of this paper is to employ alternatives to the Poisson model that explicitly account for this excess of zeroes, in order to assess possible differences in goodness of fit and in the estimates of the regression parameters. Zero Inflated Poisson (ZIP), Zero Inflated Negative Binomial (ZINB) and Hurdle Poisson (HP) models are discussed and applied to two real environmental data sets with a large number of zeroes.

Keywords: Count data; Poisson; Negative Binomial; Zero-Inflated; Hurdle.

1. Zero inflated and Hurdle models: an overview

The Poisson distribution is the probability model usually assumed for count data; however, in many real applications the observed number of zeroes is greater (zero inflation) or smaller (zero deflation) than that expected under the Poisson model. Such situations can be dealt with through models that accommodate the excess of zeroes, such as Zero Inflated and Hurdle models. These models are encountered in the econometric, demographic and medical literature, and are characterized by a parametric structure that models the ‘zero’ and ‘non-zero’ responses separately. Zorn (1996) justifies this approach in terms of a ‘dual regime’ data generating process: in the first stage, a presence/absence model determines whether the count is zero or non-zero; in the second stage, a count model governs the actual magnitude of the count. Hence the underlying mixture model is:

$$\Pr(Y_i = y_i) = \pi_i f_1(y_i) + (1 - \pi_i) f_2(y_i), \qquad i = 1, 2, \ldots, n \tag{1}$$

where, for the $i$th unit, $y_i$ is the count, $\pi_i$ is the probability of a zero count in the presence/absence model, $f_1(y_i) = I_{\{0\}}(y_i)$ and $f_2(y_i)$ is the probability mass function (p.m.f.) of a count random variable. All models considered in this paper can be represented using (1), through appropriate choices of $\pi_i$ and $f_2(y_i)$. In particular, if $\pi_i = 0$ and $f_2(y_i)$ is the p.m.f. of the Poisson distribution, we obtain the standard Poisson model as a (degenerate) sub-case. The distinction between Zero-Inflated models and Hurdle models refers to the form of $f_2(y_i)$: while in Zero-Inflated models a zero can be contributed either by the presence/absence model or by the count model, in Hurdle models a zero can only come from the presence/absence model, and therefore a zero-truncated distribution caters for counts greater than zero. In both cases, the usual choices for $f_2(y_i)$ are the Poisson or Negative Binomial distributions, either in their standard (for Zero-Inflated) or truncated (for Hurdle) form; the Negative Binomial is usually preferred when the counts also exhibit over-dispersion. Therefore possible alternatives to the standard Poisson are the Zero Inflated Poisson (ZIP), Zero Inflated Negative Binomial (ZINB), Hurdle Poisson (HP) and Hurdle Negative Binomial (HNB) models. In practice, due to the computational difficulties met with the HNB model, we focus on the first three models, whose p.m.f.s are as follows:

ZIP:

$$\Pr(Y_i = y_i) = \begin{cases} \pi_i + (1 - \pi_i)\exp(-\mu_i) & \text{for } y_i = 0 \\[4pt] (1 - \pi_i)\,\dfrac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} & \text{for } y_i > 0 \end{cases}$$

ZINB:

$$\Pr(Y_i = y_i) = \begin{cases} \pi_i + (1 - \pi_i)\left(\dfrac{\phi}{\mu_i + \phi}\right)^{\phi} & \text{for } y_i = 0 \\[4pt] (1 - \pi_i)\,\dfrac{\Gamma(y_i + \phi)}{\Gamma(\phi)\, y_i!}\left(\dfrac{\phi}{\mu_i + \phi}\right)^{\phi}\left(\dfrac{\mu_i}{\mu_i + \phi}\right)^{y_i} & \text{for } y_i > 0 \end{cases}$$

HP:

$$\Pr(Y_i = y_i) = \begin{cases} \pi_i & \text{for } y_i = 0 \\[4pt] (1 - \pi_i)\,\dfrac{\exp(-\mu_i)\,\mu_i^{y_i}}{\left(1 - \exp(-\mu_i)\right) y_i!} & \text{for } y_i > 0 \end{cases}$$

where $\mu_i$ is the expected value of the count component $f_2$ (before truncation, in the Hurdle case) and $\phi^{-1}$ is the (over-)dispersion parameter. A more realistic context usually considers vectors of covariates, $x_i$ and $z_i$ say, related to $\mu_i$ and $\pi_i$ through proper link functions in the spirit of Generalized Linear Models: $\log(\mu_i) = x_i'\beta$ and $\mathrm{logit}(\pi_i) = z_i'\gamma$. Moreover, note that i) using the same set of covariates in both equations serves the purpose of identifying the possibly different roles of the same explanatory variable in each stage; ii) $\pi_i$ and $\mu_i$ can be unrelated or functions of each other. The inferential method usually applied to obtain maximum likelihood estimators for $\beta$ and $\gamma$ is the EM algorithm, and inference relies on the asymptotic theory of those estimators. Score tests can be useful to compare models. Lambert (1992) introduced ZIP models in a manufacturing context, while HP models were introduced by Mullahy (1986) and then modified by King (1989); see also Long (1997); Zorn (1996); Ridout et al. (1998); Tu (2002).

¹ Work carried out with funding from PRIN Cofin MIUR 2004, prot. 2004173478 003.
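As an illustration of the ZIP, ZINB and HP probability functions of this section and of the log and logit links, the following Python sketch evaluates them using only the standard library. The function names (`zip_pmf`, `zinb_pmf`, `hp_pmf`, `inv_logit`, `linear_predictor`) and the numerical values of the covariates and coefficients are hypothetical, chosen only for the example; they are not taken from the analyses in this paper.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def negbin_pmf(y, mu, phi):
    """Negative Binomial probability with mean mu > 0 and dispersion 1/phi."""
    log_p = (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
             + phi * math.log(phi / (mu + phi))
             + y * math.log(mu / (mu + phi)))
    return math.exp(log_p)


def zip_pmf(y, mu, pi):
    """ZIP: a zero comes from the point mass (pi) or from the Poisson part."""
    return (pi if y == 0 else 0.0) + (1.0 - pi) * poisson_pmf(y, mu)


def zinb_pmf(y, mu, pi, phi):
    """ZINB: as ZIP, with a Negative Binomial count component."""
    return (pi if y == 0 else 0.0) + (1.0 - pi) * negbin_pmf(y, mu, phi)


def hp_pmf(y, mu, pi):
    """HP: zeroes come only from the first stage; positives are zero-truncated Poisson."""
    if y == 0:
        return pi
    return (1.0 - pi) * poisson_pmf(y, mu) / (1.0 - math.exp(-mu))


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(covariates, coefficients):
    """Inner product x' beta (or z' gamma)."""
    return sum(x * b for x, b in zip(covariates, coefficients))


# Hypothetical unit: intercept plus one covariate, used in both stages.
x_i = z_i = (1.0, 0.5)
beta = (0.4, 0.6)     # illustrative count-stage coefficients
gamma = (-1.0, 0.8)   # illustrative zero-stage coefficients

mu_i = math.exp(linear_predictor(x_i, beta))    # log(mu_i) = x_i' beta
pi_i = inv_logit(linear_predictor(z_i, gamma))  # logit(pi_i) = z_i' gamma

for name, p0 in [("Poisson", poisson_pmf(0, mu_i)),
                 ("ZIP", zip_pmf(0, mu_i, pi_i)),
                 ("ZINB", zinb_pmf(0, mu_i, pi_i, phi=1.5)),
                 ("HP", hp_pmf(0, mu_i, pi_i))]:
    print(f"{name:8s} Pr(Y = 0) = {p0:.4f}")
```

Setting `pi` to 0 in `zip_pmf` returns the standard Poisson probabilities, which mirrors the degenerate sub-case of the mixture model (1).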

2. Comparisons of models

In this section, two real environmental data sets are analyzed. Comparisons, in terms of parameter estimates and AIC in particular, are carried out for four models: Poisson, ZIP, ZINB and HP. In both cases the fitted models are completely additive, and the levels of factors are coded as dummy variables through a corner point parameterization.

The first data set is a daily time series (1997-1999) collected to study the effect of air pollution on health in Palermo. The response is the number of deaths from respiratory complications, which presents a high number of zeroes (≈ 41%). Some covariates that can influence the response have been included in the final model: Influenza epidemics (binary variable where 0 corresponds to absence of influenza), Month (twelve-level categorical variable), Temperature (24-hour average in °C) and PM10 concentration (moving average of lag 0-3 in µg/m³). The interest lies in estimating the effect of PM10, which is one of the major causes of health problems in air pollution studies.

Table 1 reports estimates from the fitted models. Comparisons can be made at two levels: the first concerns the first stage of the ZIP, ZINB and HP models (columns ‘P(Yi = 0)’, in which zero versus non-zero outcomes are modelled), and the second concerns the standard Poisson model and the second stage of the ZIP, ZINB and HP models. Comparing the models globally, the AICs are close, but the best model is still the standard Poisson. This conclusion is also confirmed by a modified score test (van den Broek, 1995) comparing the Poisson with the ZIP model: for this data set the observed number of zeroes is entirely plausible under the standard distribution for count data (p = 1.00). In the first stage of HP, the estimate for PM10 has a negative sign, suggesting that higher PM10 concentrations lower the probability of a zero outcome. Results for PM10 are very similar (with a positive effect on expected counts) when comparing the second stage of ZIP and ZINB with the classical Poisson model. It is important to underline the presence of very large standard errors for the coefficients in the inflation equation, especially for the HP model. The effect of PM10 is not significant in the first stage for ZIP and significant for HP, and the opposite happens in the second stage of the two models. This could be explained by the different way zero counts are treated in the two models and/or by specific features of this data set, including a modest percentage of zeroes and a low expected value for non-zero counts. However, the findings concerning the health effect of PM10 are substantially unchanged.

Table 1: Results for the Poisson, ZIP, ZINB and HP models (air pollution data set)

              Poisson      ZIP                       ZINB                      HP
              P(Yi≥0)      P(Yi=0)      P(Yi≥0)      P(Yi=0)      P(Yi≥0)      P(Yi=0)      P(Yi>0)
PM10          .008         -2.457       .007         -5.59        .007         -.019        .005
s.e.          .002         1.782        .002         3.47         .002         .006         .004
p-value       0            .168         .005         .095         .005         .002         .208
AIC (n.par.)  2735.30 (16) 2737.78 (32)              2736.34 (33)              2756.42 (32)
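The AIC comparison for the air pollution data can be retraced from the values reported in Table 1. The Python sketch below ranks the four models and recovers the maximised log-likelihood implied by each AIC through the identity AIC = 2k − 2 log L, where k is the number of parameters; the helper names `implied_loglik` and `rank_by_aic` are illustrative.

```python
# AIC and number of parameters as reported in Table 1 (air pollution data).
table1 = {
    "Poisson": (2735.30, 16),
    "ZIP": (2737.78, 32),
    "ZINB": (2736.34, 33),
    "HP": (2756.42, 32),
}


def implied_loglik(aic, n_par):
    """Invert AIC = 2k - 2 logL to get the maximised log-likelihood."""
    return n_par - aic / 2.0


def rank_by_aic(models):
    """Return (name, AIC - min AIC) pairs, best model first."""
    best = min(aic for aic, _ in models.values())
    return sorted(((name, aic - best) for name, (aic, _) in models.items()),
                  key=lambda pair: pair[1])


for name, delta in rank_by_aic(table1):
    aic, k = table1[name]
    print(f"{name:8s} delta AIC = {delta:6.2f}   logL = {implied_loglik(aic, k):9.2f}")
```

With these values the three two-stage models reach a higher log-likelihood than the Poisson, but not by enough to offset their 16 or 17 additional parameters, which is why the Poisson keeps the smallest AIC.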

The second data set comes from a study of bathing water quality in the district of Palermo. The data were collected in 2001 and comprise n = 1386 observations. Our goal is to analyze the effect of some covariates (Month, Water Temperature, Oxygen, Sea Condition) on the response variable ‘Number of Fecal Streptococci’ (counts in 100 ml of water), which ranges from 0 to 200 and presents a large percentage of zeroes (≈ 54%). Results from the fitted models are displayed in Table 2, where the mo- variables are the dummies for the months of the April-September period (data are collected only in the bathing season), the sea- variables refer to the categories of ‘Sea Condition’ (respectively ‘calm’, ‘almost wavy’, ‘wavy’), and temp and oxy stand for the continuous ‘Water Temperature’ and ‘Oxygen’. The van den Broek test suggests that it is advisable to consider a ZIP model instead of the classical Poisson distribution (p < 0.0001); this is also confirmed by the AIC of the Poisson model (37701), which is dramatically larger than the AIC of the Zero Inflated models. Among the inflated models, however, there is a noticeable improvement in accounting for extra-variability: the ZINB (AIC 6792.54, against 21467.14 for the ZIP) is to be preferred by far, likely due to its capability to capture both excess of zeroes and overdispersion. As regards parameter estimates, it is worth noting that the sign of the coefficients is substantially unchanged among the different models (in both the logit and log-linear components); however, when zero-inflation and/or overdispersion are ignored, significance is heavily overstated. For instance, the significant effect of some variables (namely mo3, mo5, sea2 and temp) observed in the Poisson model disappears in the ZINB. From a biological standpoint it is worthwhile to stress the role of the months corresponding to the beginning and closing of the bathing season.
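To illustrate the idea behind the score test for zero inflation, the Python sketch below implements the statistic of van den Broek (1995) in its simplest form, for a Poisson model without covariates. This intercept-only setting, the function name and the toy counts are assumptions made for the example; the analyses in this paper use a modified version of the test with covariates. With $\hat p_0 = \exp(-\bar y)$ and $n_0$ observed zeroes, the statistic is $S = (n_0 - n\hat p_0)^2 / \{n\hat p_0(1 - \hat p_0) - n\bar y\,\hat p_0^2\}$, referred to a chi-squared distribution with one degree of freedom.

```python
import math


def zero_inflation_score_test(counts):
    """Score statistic and p-value for H0: Poisson against zero-inflated Poisson.

    Intercept-only version (no covariates); an assumption of this sketch.
    Requires at least one non-zero count, so that the sample mean is positive.
    """
    n = len(counts)
    mean = sum(counts) / n
    n_zero = sum(1 for y in counts if y == 0)
    p0 = math.exp(-mean)                      # Pr(Y = 0) under the fitted Poisson
    numerator = (n_zero - n * p0) ** 2
    denominator = n * p0 * (1.0 - p0) - n * mean * p0 ** 2
    statistic = numerator / denominator
    # Upper tail of a chi-squared(1): P(Z^2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Toy data (hypothetical): many more zeroes than a Poisson with this mean implies.
inflated_counts = [0] * 60 + [1] * 5 + [2] * 10 + [3] * 10 + [4] * 8 + [6] * 7
# Toy data (hypothetical): zero frequency close to the Poisson expectation.
plain_counts = [0] * 37 + [1] * 37 + [2] * 18 + [3] * 6 + [4] * 2

for label, data in [("inflated", inflated_counts), ("plain", plain_counts)]:
    stat, p = zero_inflation_score_test(data)
    print(f"{label:9s} S = {stat:8.3f}   p-value = {p:.3g}")
```

A small p-value, as in the first toy sample, points towards a zero-inflated model; a large one, as in the second, indicates that the observed zeroes are compatible with the Poisson model.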


Table 2: Results for the Poisson, ZIP, ZINB and HP models (bathing water quality data set)

              Poisson      ZIP                       ZINB                      HP
              P(Yi≥0)      P(Yi=0)      P(Yi≥0)      P(Yi=0)      P(Yi≥0)      P(Yi=0)      P(Yi>0)
mo2 (s.e.)    1.15(.04)    -.79(.21)    .76(.04)     -.79(.25)    .81(.21)     -.79(.21)    .76(.04)
p-value       .00          .00          .00          .00          .00          .00          .00
mo3 (s.e.)    .16(.05)     -.35(.28)    .12(.05)     -.45(.35)    .04(.27)     -.35(.28)    .12(.05)
p-value       .00          .22          .02          .20          .87          .21          .02
mo4 (s.e.)    .08(.06)     -.31(.37)    .08(.06)     -.49(.47)    -.21(.39)    -.31(.37)    .08(.06)
p-value       .22          .41          .20          .30          .60          .41          .20
mo5 (s.e.)    .69(.06)     -1.45(.38)   .21(.06)     -1.92(.53)   .01(.39)     -1.45(.38)   .21(.06)
p-value       .00          .00          .00          .00          .987         .00          .00
mo6 (s.e.)    1.06(.05)    -.79(.33)    .78(.06)     -.84(.42)    .70(.35)     -.79(.33)    .78(.06)
p-value       .00          .02          .00          .05          .05          .01          .00
sea2 (s.e.)   -.18(.02)    .27(.12)     .001(.02)    .30(.15)     .006(.13)    .27(.12)     .001(.02)
p-value       .00          .03          .95          .05          .96          .03          .95
sea3 (s.e.)   .03(.03)     -.60(.19)    -.20(.03)    -.88(.29)    -.31(.18)    -.60(.19)    -.20(.03)
p-value       .22          .00          .00          .00          .08          .00          .00
temp (s.e.)   -.03(.01)    .03(.04)     -.03(.01)    .05(.05)     -.000(.04)   .03(.04)     -.03(.01)
p-value       .00          .36          .00          .27          .99          .36          .00
oxy (s.e.)    -.06(.002)   .02(.01)     -.04(.18)    .02(.01)     -.04(.01)    .02(.01)     -.04(.001)
p-value       .00          .02          .00          .05          .00          .02          .00
AIC (n.par.)  37701 (10)   21467.14 (20)             6792.54 (21)              21459.24 (20)

3. Conclusions

Count data with a large mass at zero need particular care and should be properly modelled. Despite an apparent excess of zeroes, the standard Poisson model sometimes suffices once the covariates are taken into account. Otherwise wrong conclusions can be reached, and different models (ZIP, ZINB, HP) should be considered. In the mortality data set, the classical Poisson model is still the best choice, while in the second data set ZINB is preferable. A possible drawback in employing these alternative models is the difficulty of using standard software, as computational aspects are often non-negligible. Our analyses were conducted in R employing two libraries (zeroinfl - hurdle) created by S. Jackman (http://pscl/standford.edu/content.html). Other possible functions are yipp, zipbipp, zipoissonX (vgam), zicounts and fmr (gnlm) created by J. Lindsey (http://www.luc.ac.be/~jlindsey/rcode).

References

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics 34, 1–14.

Long, J. (1997). Regression models for categorical and limited dependent variables, Sage.

Mullahy, J. (1986). Specification and testing of some modified count data models, Journal of Econometrics 33, 341–365.

Ridout, M., Demetrio, C. and Hinde, J. (1998). Models for count data with many zeros, Proceedings of the XIXth International Biometric Conference, pp. 179–192.

Tu, W. (2002). Encyclopedia of Environmetrics (Zero-inflated data), Vol. 4, Wiley.

van den Broek, J. (1995). A score test for zero inflation in a Poisson distribution, Biometrics 51, 738–743.

Zorn, C. (1996). Evaluating zero-inflated and Hurdle Poisson specifications, Midwest Political Science Association, 18-20 April, 1–16.
