## All Answers (18)

- Maybe you can use the concept of a mixed-type random variable, i.e., a variable that can be decomposed into a discrete part (in your case, degenerate at 0) and a continuous part (in your case, the non-zero values).
- There are two-part models that define E[y|x] = P(y>0|x) * E[y|x, y>0], where the first part can be logistic and the second part is a GLM-based model in which you can choose any distribution in the exponential family; the link function is also user-defined.
- The field to check out is Tobit analysis. More generally, you can use censored regression. The topic arises very often in connection with pharmacokinetic and pharmacodynamic modelling. I have a paper with Nick Holford and Hans Hockey that is currently available pre-publication on the Statistics in Medicine website: "The ghosts of departed quantities: approaches to dealing with observations below the limit of quantitation," http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291097-0258/earlyview

- Yes, you need to look at the Tobit model, as Stephen said above. (I've been working on a Tobit model all this week.) The Tobit model uses the same explanatory variables for both the magnitude and the departure from zero. It is commonly available in statistical packages; for instance, there are at least three different implementations in R.

There is a second model, called the Cragg model, that may also be of interest. It's a two-step model that uses a probit model to determine the departure from zero and a Poisson model to estimate the magnitude. Its downside is that it is still a count model internally, which may not fit your dependent variable.

Please feel free to contact me if you need assistance.
- Not being a classically trained statistician, I lack the reservoir of knowledge to name the particular models mentioned by others... I'm keeping this link for reference (thanks, folks!). However, it seems to me that if you have the programming/statistics knowledge, you can combine any distributions that suit your situation; any exemplar is just a combination reflecting potential real-world processes. If you work in SAS, NLMIXED provides a nice framework, with fitting via built-in log-likelihood functions as well as via a general log-likelihood that you compose.
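To make the Tobit idea discussed above concrete, here is a minimal sketch of fitting a Tobit model left-censored at zero by direct maximum likelihood. The simulated data, variable names, and the numpy/scipy implementation are my own illustration, not code from any of the packages mentioned:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)  # latent outcome
y = np.maximum(y_star, 0.0)                             # observed: left-censored at 0

def tobit_negloglik(params, x, y):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)          # parameterize on the log scale to keep sigma > 0
    mu = b0 + b1 * x
    cens = y <= 0
    # Censored points contribute P(y* <= 0 | x); uncensored points the normal density
    ll = np.where(cens,
                  stats.norm.logcdf(-mu / sigma),
                  stats.norm.logpdf(y, loc=mu, scale=sigma))
    return -ll.sum()

res = optimize.minimize(tobit_negloglik, x0=[0.0, 0.0, 0.0], args=(x, y),
                        method="Nelder-Mead",
                        options={"maxiter": 5000, "maxfev": 5000})
b0_hat, b1_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
```

The same likelihood is what canned Tobit routines maximize; the point of writing it out is that the censored and uncensored observations enter through different terms.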
- Two basic approaches are the two-part model and the Tobit model. They make slightly different assumptions: the Tobit model assumes left censoring, while the two-part model treats the zeroes as true zeroes. The easiest way to do the analysis with the two-part model is to fit a logistic or probit model for the "Is it a zero or not?" question, and a log-normal model (these data are almost always right-skewed) for the intensity (non-zero) part. The likelihood for the two-part model is maximized by maximizing the likelihood for each part separately, and you can obtain the overall mean by multiplying the probability of being non-zero by the mean of the non-zero data. If you have repeated measures, I have a SAS macro you can use, although it fits a slightly different model.
- Tobit analysis, as Stephen mentioned, is a straightforward way: you need only one model for the continuous part as well as for the probability of zero. But a dataset might contain more (or fewer) zeros than the Tobit model can accommodate. If you are looking for an analogue to, say, the Poisson or NB model with inflated zeros, then the two-part model offers similar flexibility, at the cost of one additional model for the probability of zero.
- In addition to Tobit models, you might consider more general mixture modeling, of which ZIP and ZINB models are special cases. In the ZIP and ZINB models, we fit a mixture of a Poisson or NB population and a population with a degenerate distribution putting all mass at 0. In that framework, it's trivial to substitute a normal (or other continuous) distribution for the Poisson. I actually can't recall offhand if this is what Tobit does. The nice thing about a more general mixture is that instead of forcing the second component of the mixture to be degenerate (which is pretty artificial for continuous outcomes, I think), you can allow for two normal components, which may fit better.
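A small illustration of the degenerate-component case (my own simulated data and names): when one mixture component is a point mass at 0, the maximum-likelihood fit factorizes into the zero proportion and the continuous fit on the non-zero values, so no iterative mixture machinery is needed:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulate: 30% structural zeros, 70% drawn from a normal population
n = 2000
is_zero = rng.uniform(size=n) < 0.3
y = np.where(is_zero, 0.0, rng.normal(loc=5.0, scale=1.0, size=n))

# With a point-mass-at-0 component, the MLE separates cleanly:
pi_hat = np.mean(y == 0)        # mixing weight of the zero component
mu_hat = y[y != 0].mean()       # normal component, fit on non-zeros only
sd_hat = y[y != 0].std()
```

Replacing the point mass with a second normal component, as suggested above, no longer factorizes this way and is typically fit by EM (e.g., a two-component Gaussian mixture).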
- I agree with those who have mentioned that you need to decide whether the zeros are real or not. If not, then a censored GLM is what you want. If so, then a zero-inflated model is required. An additional point is that if the zeros are real, then you have at least one bound on the support of your continuous DV. If the only bound is at zero then the appropriate GLMs should be based on the "life-time" distributions (e.g., log-normal, gamma, Weibull). If the DV is bounded above as well as below, then you'd want a GLM whose distribution's support is appropriately double-bounded. A popular choice is beta regression (i.e., based on the beta distribution). There is a literature on zero- and one-inflated beta distribution models, and they can be estimated in R and Stata.
- Or you could consider the inverse hyperbolic sine transformation. See, e.g., http://worthwhile.typepad.com/worthwhile_canadian_initi/2011/07/a-rant-on-inverse-hyperbolic-sine-transformations.html
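A quick illustration of why asinh is attractive here (assuming numpy; the sample values are arbitrary): it behaves like a log transform in the tail but is defined and smooth at zero, and it is exactly invertible:

```python
import numpy as np

y = np.array([0.0, 0.5, 3.0, 150.0])
# asinh(y) = log(y + sqrt(y^2 + 1)); for large y this is approximately log(2y),
# but unlike log it is well-defined (and roughly linear) at and around zero
t = np.arcsinh(y)
back = np.sinh(t)   # exact inverse, unlike ad hoc log(y + c) shifts
```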
- You can still use zero-inflated models with a continuous distribution; much of the literature has used the gamma for the continuous part. You can also consider the Tweedie distribution, which puts a point mass at zero and a continuous distribution on the positive values. These models are most commonly used in actuarial statistics.
- This lecture should be very useful for you in deciding what is pertinent in your case: http://www.unc.edu/~enorton/DebManningNortonPresentation.pdf
- Another possibility is to explicitly model the phenomenon. I was toying with this about a year ago, if memory serves, to deal with RT-PCR data, in which a process theoretically doubles (a little less than this) the amount of some genetic material (deliberately vague) with cycles of heating and cooling. A reporter chemical is included whose intensity increases as the quantity of the genetic material increases, generating a curve (intensity as a function of cycle). The operator/software then chooses a common threshold in intensity across many such curves (generated simultaneously). The cycle at which each curve intersects this threshold then becomes the "outcome". Theoretically, curves which intersect at earlier cycles must have started out with more of the genetic material, and by virtue of the "doubling" with each cycle, we're operating on a log2 scale (approximately; that would be the subject of a different discussion). So, if one curve took one extra cycle to reach threshold compared with another, it must have had approximately half the starting genetic material. Okay... that's a long way to get back to your problem. The machine, by convention, is stopped after 40 cycles. If a curve has not yet intersected the threshold, then we don't REALLY know how much material was in there at the start; the quantity is "undetectable". If you omit these values, you likely bias the mean. If you impute zeros, you likely bias the mean, and if there are a lot of zeros you definitely artificially reduce the variance, inflating type I error. Ultimately, the methodology was never designed with zero quantities in mind... but that's how it's being used.

I've been treating these as time-to-event data with some success (censoring CT at 40), but sample size is always small. A more complicated and robust method that may bear some relevance is to directly model the "40" inflation using a model which explicitly truncates a pdf (cycles-to-threshold tend to be Gaussian in my experience), and then proceed as you would with any other linear model. The attached figure was designed to illustrate the impact of imputing 40s and then simply treating the data as if normal: it shows the degree of bias increasing as the central tendency approaches 40 (the red line is the arithmetic mean; the green is the best fit allowing for truncation). A key calculation is the difference between a target gene and a housekeeping gene, expressed as 2^deltaCT, the "fold" difference.
- The two-part model is the analogue for continuous outcomes (Med Care 2009;47:S109-S114; Med Care 2009;47:S104-S108). Interestingly, the negative binomial can also be useful for continuous outcomes (see references at http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/)
- Tobit models work wonderfully when you can assume that your zero value is a censored value and that the underlying distribution is Gaussian. If you assume that your zeros are truly zeros (mirroring the hurdle model), then the logit/linear or probit/linear two-part models specified by Han and Kronmal (2006, Comm in Stat) or the probit/log-skew-normal model (Chai and Bailey, 2008, Stat Med) may be more appropriate. These models can be implemented using MLE in SAS PROC NLMIXED.
- Hi Damian,

This has already been mentioned by @Vasudeva, but it's probably worth repeating that Tweedie models appear to be the most direct answer to your question. With parameter 1 < p < 2, they assume a point mass at zero and a gamma distribution on the positives. They have another parameter which specifies the dispersion, making them a pretty flexible and (at least to me) interesting model of the error distribution.

If you wish to use Tweedie models, you are probably best off using the R package 'tweedie'. Having used them myself, I did sometimes have convergence issues, and in some cases ended up using a composite logit / log-normal model, as mentioned by @Nathaniel.

One important practical difference between the two approaches is that composite models (whether done manually, or via the R functions 'hurdle' or 'zeroinfl') estimate separate linear predictors for the two parts of the model. In many cases, this makes perfect sense. In other situations, it might be more sensible to have a single set of coefficients associated with the common linear predictor employed by the Tweedie model.

One last thing to be aware of is that handy methods like stepAIC (stepwise variable selection based on information criteria) are not (to my knowledge) yet available for Tweedie models. So, if this kind of thing is required, it might be easier to go for the composite models.
- James Hardin and I have recently written full Stata commands for the zero-inflated binomial, the beta-binomial, and the zero-inflated beta-binomial. In our book, Generalized Linear Models and Extensions (2012), we have a generalized binomial command and a number of advanced count models, e.g., Poisson inverse Gaussian, NB-P, and so forth. These are all discussed in my book, Negative Binomial Regression, 2nd ed. (2012, Cambridge University Press). I will be adding the binomial models to the 2nd edition of Logistic Regression Models, which I will be writing this coming year.

The binomial commands will be in the Stata Journal, and I will be posting them shortly to my web site, http://works.bepress.com/joseph_hilbe/

Joseph M Hilbe

hilbe@asu.edu
- I have one query which fits well with this discussion, and I hope that someone here can give some advice.

Would you find it acceptable to use a negative binomial regression model to analyse a QUANTITATIVE DISCRETE score (ranging 0-10) that shows a marked excess of zeroes? This model is generally used for COUNT variables. I found this application for a quantitative discrete score (Byers AL et al., J Clin Epidemiol. 2003;56:559-64), but one could argue that this application violates the assumption of independence across trials (the items that are summed to form the score are neither Bernoulli trials nor independent within a subject).

Any reference and idea will be very much appreciated.

## Popular Answers

- Janet A. Tooze · Wake Forest School of Medicine
- Sebastian E. Baumeister · Universität Regensburg